acl acl2010 acl2010-262 knowledge-graph by maker-knowledge-mining

262 acl-2010-Word Alignment with Synonym Regularization


Source: pdf

Author: Hiroyuki Shindo ; Akinori Fujino ; Masaaki Nagata

Abstract: We present a novel framework for word alignment that incorporates synonym knowledge collected from monolingual linguistic resources in a bilingual probabilistic model. Synonym information is helpful for word alignment because we can expect a synonym to correspond to the same word in a different language. We design a generative model for word alignment that uses synonym information as a regularization term. The experimental results show that our proposed method significantly improves word alignment quality.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 2-4 Hikaridai Seika-cho Soraku-gun Kyoto 619-0237 Japan { shindo , a . [sent-2, score-0.055]

2 Abstract: We present a novel framework for word alignment that incorporates synonym knowledge collected from monolingual linguistic resources in a bilingual probabilistic model. [sent-11, score-1.474]

3 Synonym information is helpful for word alignment because we can expect a synonym to correspond to the same word in a different language. [sent-12, score-1.207]

4 We design a generative model for word alignment that uses synonym information as a regularization term. [sent-13, score-1.296]

5 The experimental results show that our proposed method significantly improves word alignment quality. [sent-14, score-0.484]

6 1 Introduction Word alignment is an essential step in most phrase and syntax based statistical machine translation (SMT). [sent-15, score-0.367]

7 It is an inference problem of word correspondences between different languages given parallel sentence pairs. [sent-16, score-0.117]

8 Accurate word alignment can induce high quality phrase detection and translation probability, which leads to a significant improvement in SMT performance. [sent-17, score-0.424]

9 Many word alignment approaches based on generative models have been proposed and they learn from bilingual sentences in an unsupervised manner (Vogel et al. [sent-18, score-0.799]

10 One way to improve word alignment quality is to add linguistic knowledge derived from a monolingual corpus. [sent-20, score-0.545]

11 This monolingual knowledge makes it easier to determine corresponding words correctly. [sent-21, score-0.121]

12 For instance, functional words in one language tend to correspond to functional words in another language (Deng and Gao, 2007), and the syntactic dependency of words in each language can help the alignment process (Ma et al. [sent-22, score-0.447]

13 It has been shown that such grammatical information works as a constraint in word alignment models and improves word alignment quality. [sent-24, score-0.848]

14 A large number of monolingual lexical semantic resources such as WordNet (Miller, 1995) have been constructed in more than fifty languages (Sagot and Fiser, 2008). [sent-25, score-0.121]

15 Synonym information is particularly helpful for word alignment because we can expect a synonym to correspond to the same word in a different language. [sent-27, score-1.207]

16 In this paper, we explore a method for using synonym information effectively to improve word alignment quality. [sent-28, score-1.108]

17 In general, synonym relations are defined in terms of word sense, not in terms of word form. [sent-29, score-0.75]

18 In other words, synonym relations are usually context or domain dependent. [sent-30, score-0.636]

19 For instance, ‘head’ and ‘chief’ are synonyms in contexts referring to working environment, while ‘head’ and ‘forefront’ are synonyms in contexts referring to physical positions. [sent-31, score-0.184]

20 Therefore, it is easy to imagine that simply replacing all occurrences of ‘chief’ and ‘forefront’ with ‘head’ sometimes harms word alignment accuracy, and we have to model either the context or the senses of words. [sent-33, score-0.491]

21 We propose a novel method that incorporates synonyms from monolingual resources in a bilingual word alignment model. [sent-34, score-0.917]

22 We formulate a synonym pair generative model with a topic variable and use this model as a regularization term with a bilingual word alignment model. [sent-35, score-1.713]

23 The topic variable in our synonym model is helpful for disambiguating the meanings of synonyms. [sent-36, score-0.868]

24 We extend HM-BiTAM, which is an HMM-based word alignment model with a latent topic, with a novel synonym pair generative model. [sent-37, score-1.326]

25 We applied the proposed method to an English-French word alignment task and successfully improved the word alignment quality. [sent-38, score-0.541]

26 Proceedings of the ACL 2010 Conference Short Papers, pages 137–141, Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics. Figure 1: Graphical model of HM-BiTAM. [sent-40, score-0.398]

27 2 Bilingual Word Alignment Model In this section, we review a conventional generative word alignment model, HM-BiTAM (Zhao and Xing, 2008). [sent-41, score-0.545]

28 HM-BiTAM is a bilingual generative model with topic z, alignment a and topic weight vector θ as latent variables. [sent-42, score-1.013]

29 Topic variables such as ‘science’ or ‘economy’ assigned to individual sentences help to disambiguate the meanings of words. [sent-43, score-0.055]

30 HM-BiTAM assumes that the nth bilingual sentence pair, (En, Fn), is generated under a given latent topic zn ∈ {1, . [sent-44, score-0.528]

31 In this framework, all of the bilingual sentence pairs {E, F} = {(En, Fn)} are generated as follows. [sent-54, score-0.344]

32 For each sentence pair (En, Fn): (a) zn ∼ Multinomial(θ): sample the topic; (b) en,i:In | zn ∼ p(En | zn; β): sample English words from a monolingual unigram model given topic zn; (c) for each position jn = 1, . [sent-57, score-0.561]

33 i. ajn ∼ p(ajn | ajn−1; T): sample an alignment link ajn from a first-order Markov process; [sent-61, score-0.357]

34 ii. fjn ∼ p(fjn | En, ajn, zn; B): sample a target word fjn given an aligned source word and topic, where alignment ajn = i denotes that source word ei and target word fjn are aligned. [sent-62, score-1.257]

35 α is a parameter over the topic weight vector θ, and β = {βk,e} is the source word probability given the kth topic: p(e | z = k). [sent-63, score-0.14]

36 B = {Bf,e,k} represents the word translation probability from e to f under the kth topic: p(f | e, z = k). [sent-64, score-0.057]
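The generative story above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sizes, the Dirichlet draw for θ, the randomly initialized parameter arrays, the per-sentence toy transition matrix, the uniform start position, and the name `sample_sentence_pair` are all made up for the example. Note that the array `B` is indexed as `B[k, e, f]` here, whereas the text writes Bf,e,k.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed): topics, English vocabulary, French vocabulary.
K, V_E, V_F = 2, 5, 6

alpha = np.ones(K)                              # parameter over the topic weights theta
beta = rng.dirichlet(np.ones(V_E), size=K)      # beta[k, e] = p(e | z = k)
B = rng.dirichlet(np.ones(V_F), size=(K, V_E))  # B[k, e, f] = p(f | e, z = k)


def sample_sentence_pair(I, J):
    """Sample one toy sentence pair with I English and J French words."""
    theta = rng.dirichlet(alpha)                # assumed Dirichlet draw for theta
    z = int(rng.choice(K, p=theta))             # (a) sample the topic
    E = rng.choice(V_E, size=I, p=beta[z])      # (b) sample English words given z
    T = rng.dirichlet(np.ones(I), size=I)       # toy transition matrix T[i_prev, i]
    a, F = [], []
    prev = int(rng.integers(I))                 # assumed uniform start position
    for _ in range(J):                          # (c) for each French position
        i = int(rng.choice(I, p=T[prev]))       # i. alignment link from the Markov chain
        f = int(rng.choice(V_F, p=B[z, E[i]]))  # ii. French word given aligned English word
        a.append(i)
        F.append(f)
        prev = i
    return z, E.tolist(), a, F


z, E, a, F = sample_sentence_pair(I=4, J=3)
```

In a real model the transition parameter T would be shared across sentences and learned; it is redrawn per call here only to keep the sketch self-contained.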

37 The total likelihood of bilingual sentence pairs {E, F} can be obtained by marginalizing out latent variables z, a, and θ: p(F,E;Ψ) = ∑z ∑a ∫ p(F,E,z,a,θ;Ψ) dθ, where Ψ = {α, β, T, B}. [sent-68, score-0.368]

38 In this model, we can infer word alignment a by maximizing the likelihood above. [sent-70, score-0.367]
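To make the inference step concrete, the following sketch finds the most probable alignment sequence under a first-order Markov alignment model with the Viterbi algorithm. It is a simplification and not the paper's procedure: it assumes a single fixed topic (so the emission table already encodes p(fj | ei, z)), a uniform initial distribution, strictly positive probabilities, and hypothetical names `viterbi_alignment`, `T`, and `emit`.

```python
import numpy as np


def viterbi_alignment(T, emit):
    """Return the best alignment a_1..a_J as a list of source positions.

    T[i_prev, i] : transition probability between source positions
    emit[i, j]   : probability of target word j given source word i
    """
    I, J = emit.shape
    log_T, log_emit = np.log(T), np.log(emit)
    delta = np.full((J, I), -np.inf)           # best log score ending in state i at step j
    back = np.zeros((J, I), dtype=int)         # back-pointers
    delta[0] = -np.log(I) + log_emit[:, 0]     # uniform initial distribution (assumed)
    for j in range(1, J):
        scores = delta[j - 1][:, None] + log_T  # scores[i_prev, i]
        back[j] = scores.argmax(axis=0)
        delta[j] = scores.max(axis=0) + log_emit[:, j]
    a = [int(delta[-1].argmax())]
    for j in range(J - 1, 0, -1):              # follow back-pointers
        a.append(int(back[j][a[-1]]))
    return a[::-1]


T = np.array([[0.5, 0.5], [0.5, 0.5]])
emit = np.array([[0.9, 0.1], [0.1, 0.9]])
best = viterbi_alignment(T, emit)
```

With uniform transitions the result simply follows the strongest emission in each column; a peaked `T` would trade that off against jump preferences.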

39 1 Synonym Pair Generative Model We design a generative model for synonym pairs {f, f′} in language F, which assumes that the synonyms are collected from monolingual linguistic resources. [sent-72, score-1.069]

40 We assume that each synonym pair (f, f′) is generated independently given the same ‘sense’ s. [sent-73, score-0.694]

41 Under this assumption, the probability of synonym pair (f, f′) can be formulated as p(f,f′) ∝ ∑s p(f|s) p(f′|s) p(s). (2) [sent-74, score-0.694]

42 We define a pair (e, k) as a representation of the sense s, where e and k are a word in a different language E and a latent topic, respectively. [sent-75, score-0.217]
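A small numeric sketch may help to see how marginalizing over senses in Eq. 2 separates true synonym pairs from spurious ones. The sense inventory, the probability values, and the function name `synonym_pair_score` below are invented for illustration only.

```python
import numpy as np


def synonym_pair_score(p_f_given_s, p_s, f, f_prime):
    """Unnormalized p(f, f') = sum_s p(f | s) p(f' | s) p(s)."""
    return float(np.sum(p_f_given_s[:, f] * p_f_given_s[:, f_prime] * p_s))


# Toy vocabulary: 0 = 'head', 1 = 'chief', 2 = 'forefront'.
# Two hypothetical senses: 0 = 'leader', 1 = 'front position'.
p_f_given_s = np.array([
    [0.5, 0.5, 0.0],   # sense 'leader' generates 'head' or 'chief'
    [0.5, 0.0, 0.5],   # sense 'front position' generates 'head' or 'forefront'
])
p_s = np.array([0.5, 0.5])

head_chief = synonym_pair_score(p_f_given_s, p_s, 0, 1)
chief_forefront = synonym_pair_score(p_f_given_s, p_s, 1, 2)
```

In this toy setting ‘head’ and ‘chief’ share a sense and receive a positive score, while ‘chief’ and ‘forefront’ never share a sense and receive a score of zero.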

43 It has been shown that a word e in a different language is an appropriate representation of s in synonym modeling (Bannard and Callison-Burch, 2005). [sent-76, score-0.693]

44 We assume that adding a latent topic k for the sense is very useful for disambiguating word meaning, and thus that (e, k) gives us a good approximation of s. [sent-77, score-0.301]

45 Under this assumption, the synonym pair generative model can be defined as follows. [sent-78, score-0.823]

46 2 Word Alignment with Synonym Regularization In this section, we extend the bilingual generative model (HM-BiTAM) with our synonym pair model. [sent-81, score-1.041]

47 Our expectation is that synonym pairs correspond to the same word in a different language, thus they make it easy to infer accurate word alignment. Figure 2: Graphical model of synonym pair generative process. [sent-82, score-1.69]

48 HM-BiTAM and the synonym model share parameters in order to incorporate monolingual synonym information into the bilingual word alignment model. [sent-83, score-2.066]

49 Overall, we re-define the synonym pair model with the HM-BiTAM parameter set Ψ: p({f,f′};Ψ) ∝ (1 / ∑k′ αk′) ∏(f,f′) ∑k,e αk βk,e Bf,e,k Bf′,e,k. [sent-91, score-0.755]
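The shared-parameter synonym likelihood can be evaluated directly from α, β and B. The sketch below is one reading of the formula, with the normalizer 1 / ∑k′ αk′ applied once in front of the product over pairs; applying it inside each pair term would be a different reading. The array layout `B[k, e, f]` and the name `log_synonym_likelihood` are assumptions for this example.

```python
import numpy as np


def log_synonym_likelihood(alpha, beta, B, pairs):
    """Unnormalized log p({f, f'}; Psi) for a list of (f, f') index pairs.

    alpha[k]    : topic weight parameter
    beta[k, e]  : p(e | z = k)
    B[k, e, f]  : p(f | e, z = k)
    """
    log_lik = -np.log(alpha.sum())             # the 1 / sum_k' alpha_k' factor
    for f, f2 in pairs:
        # sum over k and e of alpha_k * beta_ke * B_fek * B_f'ek
        inner = np.einsum("k,ke,ke,ke->", alpha, beta, B[:, :, f], B[:, :, f2])
        log_lik += np.log(inner)
    return float(log_lik)


# Tiny example: one topic, one English word, two French words.
alpha = np.array([2.0])
beta = np.array([[1.0]])
B = np.array([[[0.5, 0.5]]])
value = log_synonym_likelihood(alpha, beta, B, [(0, 1)])
```

Because the same `beta` and `B` arrays also parameterize the bilingual model, any gradient or EM update driven by this term moves the translation probabilities of both words in a pair toward a common English word and topic.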

50 2 shows a graphical model of the synonym pair generative process. [sent-93, score-0.871]

51 We estimate the parameter values to maximize the likelihood of HM-BiTAM with respect to bilingual sentences and that of the synonym model with respect to synonym pairs collected from monolingual resources. [sent-94, score-1.825]

52 Namely, the parameter estimate, Ψ̂, is computed as Ψ̂ = argmaxΨ {log p(F,E;Ψ) + ζ log p({f,f′};Ψ)}, (7) where ζ is a regularization weight that should be set for training. [sent-95, score-0.137]
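The objective in Eq. 7 is a weighted sum of two log-likelihoods. The following sketch only shows how the two terms combine; the function name and the example numbers are hypothetical, and the two log-likelihood values are assumed to be computed elsewhere.

```python
def regularized_objective(log_lik_bitext, log_lik_synonyms, zeta):
    """Eq. 7 objective: log p(F, E; Psi) + zeta * log p({f, f'}; Psi)."""
    return log_lik_bitext + zeta * log_lik_synonyms


# With zeta = 0 the synonym term vanishes and only the bilingual likelihood remains.
plain = regularized_objective(-10.0, -4.0, 0.0)
regularized = regularized_objective(-10.0, -4.0, 0.5)
```

A larger ζ gives the monolingual synonym pairs more influence on the shared parameters relative to the bilingual data.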

53 7 to constrain parameter set Ψ and avoid overfitting for the bilingual word alignment model. [sent-97, score-0.699]

54 1 Experimental Setting For an empirical evaluation of the proposed method, we used a bilingual parallel corpus of English-French Hansards (Mihalcea and Pedersen, 2003). [sent-102, score-0.278]

55 The corpus consists of over 1 million sentence pairs, which include 447 manually word-aligned sentences. [sent-103, score-0.054]

56 We selected 100 sentence pairs randomly from the manually word-aligned sentences as development data for tuning the regularization weight ζ, and used the 347 remaining sentence pairs as evaluation data. [sent-104, score-0.383]
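The split and the tuning step can be sketched as follows. The text does not say which criterion was used to pick ζ on the development data, so the `dev_error` callback (for example, AER on the 100 development pairs) is an assumption, as are the function names, the seed, and the candidate values.

```python
import random


def split_dev_eval(sentence_ids, n_dev=100, seed=0):
    """Randomly split manually aligned sentence pairs into dev and evaluation sets."""
    ids = list(sentence_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_dev], ids[n_dev:]


def select_zeta(candidates, dev_error):
    """Pick the candidate weight with the lowest error on the development set."""
    return min(candidates, key=dev_error)


dev_ids, eval_ids = split_dev_eval(range(447))

# Stand-in for "train with this zeta, then measure dev error" (purely illustrative).
best_zeta = select_zeta([0.0, 0.1, 1.0], lambda zeta: (zeta - 0.1) ** 2)
```

In practice `dev_error` would retrain the aligner for each candidate weight, which is why only a handful of candidates is usually tried.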

57 We also randomly selected 10k, 50k, and 100k sized sentence pairs from the corpus as additional training data. [sent-105, score-0.126]

58 We ran the unsupervised training of our proposed word alignment model on the additional training data and the 347 sentence pairs of the evaluation data. [sent-106, score-0.616]

59 Note that manual word alignment of the 347 sentence pairs was not used for the unsupervised training. [sent-107, score-0.55]

60 After the unsupervised training, we evaluated the word alignment performance of our proposed method by comparing the manual word alignment of the 347 sentence pairs with the prediction provided by the trained model. [sent-108, score-1.034]

61 We collected English and French synonym pairs from WordNet 2. [sent-109, score-0.765]

62 We selected synonym pairs where both words were included in the bilingual training set. [sent-114, score-0.945]
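This filtering step amounts to a vocabulary check on both members of each pair. A minimal sketch, with a hypothetical function name and made-up example words:

```python
def filter_synonym_pairs(pairs, vocabulary):
    """Keep only synonym pairs whose two words both occur in the training vocabulary."""
    vocab = set(vocabulary)
    return [(w1, w2) for w1, w2 in pairs if w1 in vocab and w2 in vocab]


training_vocab = {"head", "chief", "sick", "ill"}
candidate_pairs = [("head", "chief"), ("head", "forefront"), ("sick", "ill")]
kept = filter_synonym_pairs(candidate_pairs, training_vocab)
```

Pairs with an out-of-vocabulary word are dropped because they share no translation parameters with the bilingual model and so cannot act as a regularizer.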

63 We compared the word alignment performance of our model with that of GIZA++ 1. [sent-115, score-0.455]

64 We trained the word alignment in two directions: English to French, and French to English. [sent-121, score-0.424]

65 The alignment results for both directions were refined with ‘GROW’ heuristics to yield high precision and high recall in accordance with previous work (Och and Ney, 2003; Zhao and Xing, 2006). [sent-122, score-0.39]

66 We evaluated these results for precision, recall, F-measure and alignment error rate (AER), which are standard metrics for word alignment accuracy (Och and Ney, 2000). [sent-123, score-0.791]
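For reference, these metrics can be computed from sets of alignment links. The sketch below uses the common definitions with sure links S and possible links P from Och and Ney, and takes F-measure as the harmonic mean of precision and recall; whether the paper used exactly these variants is an assumption, and the function name is hypothetical. The inputs are assumed to be non-empty.

```python
def alignment_metrics(predicted, sure, possible):
    """Return (precision, recall, f_measure, aer) for sets of (i, j) links."""
    A, S = set(predicted), set(sure)
    P = set(possible) | S                      # sure links also count as possible
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    f_measure = 2 * precision * recall / (precision + recall)
    aer = 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return precision, recall, f_measure, aer


predicted = {(0, 0), (1, 1), (2, 2)}
sure = {(0, 0), (1, 1)}
possible = {(2, 3)}
precision, recall, f_measure, aer = alignment_metrics(predicted, sure, possible)
```

Lower AER is better, while higher precision, recall and F-measure are better.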

70 Table 1: Comparison of word alignment accuracy (Precision, Recall, F-measure, and AER for the 10k, 50k, and 100k settings). [sent-138, score-0.424]

71 2 Results and Discussion Table 1 shows the word alignment accuracy of the three methods trained with 10k, 50k, and 100k additional sentence pairs. [sent-142, score-0.459]

72 For all settings, our proposed method outperformed other conventional methods. [sent-143, score-0.115]

73 This result shows that synonym information is effective for improving word alignment quality as we expected. [sent-144, score-1.06]

74 1, the main idea of our proposed method is to introduce latent topics for modeling synonym pairs, and then to utilize the synonym pair model for the regularization of word alignment models. [sent-146, score-2.083]

75 We expect the latent topics to be useful for modeling polysemous words included in synonym pairs and to enable us to incorporate synonym information effectively into word alignment models. [sent-147, score-1.977]

76 To confirm the effect of the synonym pair model with latent topics, we also tested GIZA++ and HM-BiTAM with what we call Synonym Replacement Heuristics (SRH), where all of the synonym pairs in the bilingual training sentences were simply replaced with a representative word. [sent-148, score-1.8]

77 For instance, the words ‘sick’ and ‘ill’ in the bilingual sentences ... Table 2: The number of vocabularies in the 10k, 50k and 100k data sets. [sent-149, score-0.306]

78 As shown in Table 2, the number of vocabularies in the English and French data sets decreased as a result of employing the SRH. [sent-151, score-0.039]

79 We assume that the SRH mitigated the overfitting of these models into low-frequency word pairs in bilingual sentences, and then improved the word alignment performance. [sent-154, score-0.841]

80 The SRH regards all of the different words coupled with the same word in the synonym pairs as synonyms. [sent-155, score-0.784]

81 For instance, the words ‘head’, ‘chief’ and ‘forefront’ in the bilingual sentences are replaced with ‘chief’, since (‘head’, ‘chief’) and (‘head’, ‘forefront’) are synonyms. [sent-156, score-0.269]

82 Obviously, (‘chief’, ‘forefront’) are not synonyms, which is detrimental to word alignment. [sent-157, score-0.057]
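The over-merging described here is easy to reproduce. The sketch below implements a replacement heuristic with a union-find structure, so that every word linked through any chain of synonym pairs collapses to one representative. It illustrates the idea only: how the representative was actually chosen is not stated in the text, so picking the alphabetically smallest word is an assumption, as are the function names.

```python
def build_representatives(pairs):
    """Map each word to one representative of its connected group of synonym pairs."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]      # path compression
            w = parent[w]
        return w

    for w1, w2 in pairs:
        r1, r2 = find(w1), find(w2)
        if r1 != r2:
            # Assumed rule: the alphabetically smallest word represents the group.
            parent[max(r1, r2)] = min(r1, r2)
    return {w: find(w) for w in list(parent)}


def replace_synonyms(tokens, representatives):
    """Replace every token that has a representative; leave other tokens unchanged."""
    return [representatives.get(t, t) for t in tokens]


reps = build_representatives([("head", "chief"), ("head", "forefront")])
replaced = replace_synonyms(["the", "head", "of", "the", "forefront"], reps)
```

Because ‘head’ links ‘chief’ and ‘forefront’ transitively, all three words collapse to a single token, even though ‘chief’ and ‘forefront’ are not synonyms of each other.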

83 The proposed method consistently outperformed GIZA++ and HM-BiTAM with the SRH in 10k, 50k and 100k data sets in F-measure. [sent-158, score-0.092]

84 The synonym pair model in our proposed method can automatically learn that (‘head’, ‘chief’) and (‘head’, ‘forefront’) are individual synonyms with different meanings by assigning these pairs to different topics. [sent-159, score-0.999]

85 By sharing latent topics between the synonym pair model and the word alignment model, the synonym information incorporated in the synonym pair model is used directly for training word alignment model. [sent-160, score-3.065]

86 The experimental results show that our proposed method was effective in improving the performance of the word alignment model by using synonym pairs including such ambiguous synonym words. [sent-161, score-1.878]

87 As shown in Table 1, using a large number of additional sentence pairs improved the performance of all the models. [sent-163, score-0.126]

88 In all our experimental settings, all the additional sentence pairs and the evaluation data were selected from the Hansards data set. [sent-164, score-0.121]

89 These experimental results show that a larger number of sentence pairs was more effective in improving word alignment performance when the sentence pairs were collected from a homogeneous data source. [sent-165, score-0.758]

90 However, in practice, it might be difficult to collect a large number of such homogeneous sentence pairs for a specific target domain and language pair. [sent-166, score-0.17]

91 One direction for future work is to confirm the effect of the proposed method when training the word alignment model by using a large number of sentence pairs collected from various data sources including many topics for a specific language pair. [sent-167, score-0.731]

92 5 Conclusions and Future Work We proposed a novel framework that incorporates synonyms from monolingual linguistic resources in a word alignment generative model. [sent-168, score-0.807]

93 This approach utilizes both bilingual and monolingual synonym resources effectively for word alignment. [sent-169, score-1.055]

94 Our proposed method uses a latent topic for bilingual sentences and monolingual synonym pairs, which is helpful in terms of word sense disambiguation. [sent-170, score-1.356]

95 Our proposed method improved word alignment quality with both small and large data sets. [sent-171, score-0.484]

96 Future work will involve examining the proposed method for different language pairs such as English-Chinese and English-Japanese and evaluating the impact of our proposed method on SMT performance. [sent-172, score-0.211]

97 We will also apply our proposed method to larger data sets of multiple domains since we can expect a further improvement in word alignment accuracy if we use more bilingual sentences and more monolingual knowledge. [sent-173, score-0.883]

98 In Proceedings of the HLT-NAACL 2003 Workshop on building and using parallel texts: data driven machine translation and beyond-Volume 3, page 10. [sent-224, score-0.049]

99 A systematic comparison of various statistical alignment models. [sent-245, score-0.367]

100 complete data: with application to scoring graphical model structures. [sent-275, score-0.079]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('synonym', 0.636), ('alignment', 0.367), ('bilingual', 0.218), ('chief', 0.193), ('forefront', 0.193), ('ajn', 0.165), ('srh', 0.155), ('monolingual', 0.121), ('topic', 0.11), ('regularization', 0.107), ('generative', 0.098), ('synonyms', 0.092), ('pairs', 0.091), ('xing', 0.088), ('zn', 0.086), ('aromposw', 0.083), ('fjn', 0.083), ('rhd', 0.083), ('seitdathnd', 0.083), ('latent', 0.079), ('giza', 0.079), ('zhao', 0.074), ('bitam', 0.072), ('aer', 0.067), ('sagot', 0.066), ('jn', 0.066), ('sar', 0.062), ('french', 0.058), ('pair', 0.058), ('word', 0.057), ('fiser', 0.055), ('hmbitam', 0.055), ('shindo', 0.055), ('topics', 0.052), ('och', 0.052), ('head', 0.05), ('ney', 0.049), ('vogel', 0.048), ('sick', 0.048), ('bannard', 0.048), ('graphical', 0.048), ('homogeneous', 0.044), ('lat', 0.044), ('en', 0.042), ('wolf', 0.041), ('hansards', 0.041), ('fn', 0.04), ('vocabularies', 0.039), ('ntt', 0.039), ('wordnet', 0.038), ('collected', 0.038), ('incorporates', 0.037), ('expect', 0.036), ('imagine', 0.036), ('sentence', 0.035), ('proposed', 0.035), ('deng', 0.033), ('bernardo', 0.032), ('disambiguating', 0.032), ('variational', 0.032), ('outperformed', 0.032), ('meanings', 0.031), ('smt', 0.031), ('model', 0.031), ('parameter', 0.03), ('association', 0.03), ('logp', 0.03), ('lab', 0.03), ('fraser', 0.03), ('tence', 0.03), ('helpful', 0.028), ('sample', 0.027), ('overfitting', 0.027), ('functional', 0.027), ('replaced', 0.027), ('mihalcea', 0.027), ('correspond', 0.026), ('rd', 0.025), ('method', 0.025), ('morristown', 0.025), ('parallel', 0.025), ('sentences', 0.024), ('miller', 0.024), ('republic', 0.024), ('marginalizing', 0.024), ('ill', 0.024), ('admixture', 0.024), ('wordaligned', 0.024), ('entitled', 0.024), ('reparameterizing', 0.024), ('aaki', 0.024), ('nagat', 0.024), ('nagata', 0.024), ('mitigated', 0.024), ('dawid', 0.024), ('page', 0.024), ('sense', 0.023), ('effectively', 0.023), ('conventional', 0.023), ('heuristics', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 262 acl-2010-Word Alignment with Synonym Regularization

Author: Hiroyuki Shindo ; Akinori Fujino ; Masaaki Nagata

Abstract: We present a novel framework for word alignment that incorporates synonym knowledge collected from monolingual linguistic resources in a bilingual probabilistic model. Synonym information is helpful for word alignment because we can expect a synonym to correspond to the same word in a different language. We design a generative model for word alignment that uses synonym information as a regularization term. The experimental results show that our proposed method significantly improves word alignment quality.

2 0.23682754 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

Author: Vamshi Ambati ; Stephan Vogel ; Jaime Carbonell

Abstract: Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling we propose multiple query strategies for the alignment link selection task. Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semisupervised word aligner.

3 0.17949231 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation

Author: Zhanyi Liu ; Haifeng Wang ; Hua Wu ; Sheng Li

Abstract: This paper proposes to use monolingual collocations to improve Statistical Machine Translation (SMT). We make use of the collocation probabilities, which are estimated from monolingual corpora, in two aspects, namely improving word alignment for various kinds of SMT systems and improving phrase table for phrase-based SMT. The experimental results show that our method improves the performance of both word alignment and translation quality significantly. As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU score on a phrase-based SMT system and 1.76 BLEU score on a parsing-based SMT system.

4 0.17310935 133 acl-2010-Hierarchical Search for Word Alignment

Author: Jason Riesa ; Daniel Marcu

Abstract: We present a simple yet powerful hierarchical search algorithm for automatic word alignment. Our algorithm induces a forest of alignments from which we can efficiently extract a ranked k-best list. We score a given alignment within the forest with a flexible, linear discriminative model incorporating hundreds of features, and trained on a relatively small amount of annotated data. We report results on Arabic-English word alignment and translation tasks. Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.

5 0.16538343 170 acl-2010-Letter-Phoneme Alignment: An Exploration

Author: Sittichai Jiampojamarn ; Grzegorz Kondrak

Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.

6 0.16493978 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

7 0.16089009 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

8 0.15708953 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

9 0.14985582 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

10 0.12152546 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

11 0.1194073 79 acl-2010-Cross-Lingual Latent Topic Extraction

12 0.1097828 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

13 0.100233 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

14 0.090268359 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

15 0.089275144 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities

16 0.087029509 163 acl-2010-Learning Lexicalized Reordering Models from Reordering Graphs

17 0.084070571 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

18 0.078047007 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

19 0.075219706 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

20 0.071306929 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.193), (1, -0.193), (2, -0.086), (3, 0.021), (4, 0.125), (5, 0.098), (6, -0.117), (7, 0.059), (8, 0.127), (9, -0.108), (10, -0.088), (11, -0.093), (12, -0.063), (13, 0.103), (14, 0.007), (15, -0.048), (16, 0.031), (17, -0.081), (18, 0.031), (19, -0.147), (20, -0.028), (21, -0.109), (22, 0.036), (23, 0.017), (24, -0.047), (25, 0.0), (26, -0.033), (27, -0.01), (28, -0.05), (29, -0.065), (30, -0.069), (31, 0.036), (32, -0.02), (33, -0.069), (34, 0.065), (35, 0.048), (36, 0.053), (37, -0.02), (38, -0.002), (39, -0.066), (40, 0.03), (41, -0.093), (42, -0.001), (43, 0.096), (44, 0.006), (45, 0.035), (46, 0.008), (47, 0.006), (48, -0.018), (49, -0.101)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96100998 262 acl-2010-Word Alignment with Synonym Regularization

Author: Hiroyuki Shindo ; Akinori Fujino ; Masaaki Nagata

Abstract: We present a novel framework for word alignment that incorporates synonym knowledge collected from monolingual linguistic resources in a bilingual probabilistic model. Synonym information is helpful for word alignment because we can expect a synonym to correspond to the same word in a different language. We design a generative model for word alignment that uses synonym information as a regularization term. The experimental results show that our proposed method significantly improves word alignment quality.

2 0.70737493 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

Author: Bing Xiang ; Yonggang Deng ; Bowen Zhou

Abstract: We present a novel method to improve word alignment quality and eventually the translation performance by producing and combining complementary word alignments for low-resource languages. Instead of focusing on the improvement of a single set of word alignments, we generate multiple sets of diversified alignments based on different motivations, such as linguistic knowledge, morphology and heuristics. We demonstrate this approach on an English-to-Pashto translation task by combining the alignments obtained from syntactic reordering, stemming, and partial words. The combined alignment outperforms the baseline alignment, with significantly higher F-scores and better translation performance.

3 0.70224684 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

Author: Vamshi Ambati ; Stephan Vogel ; Jaime Carbonell

Abstract: Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling we propose multiple query strategies for the alignment link selection task. Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semisupervised word aligner.

4 0.69251597 170 acl-2010-Letter-Phoneme Alignment: An Exploration

Author: Sittichai Jiampojamarn ; Grzegorz Kondrak

Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.

5 0.65626645 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

Author: John DeNero ; Dan Klein

Abstract: We present a discriminative model that directly predicts which set ofphrasal translation rules should be extracted from a sentence pair. Our model scores extraction sets: nested collections of all the overlapping phrase pairs consistent with an underlying word alignment. Extraction set models provide two principle advantages over word-factored alignment models. First, we can incorporate features on phrase pairs, in addition to word links. Second, we can optimize for an extraction-based loss function that relates directly to the end task of generating translations. Our model gives improvements in alignment quality relative to state-of-the-art unsupervised and supervised baselines, as well as providing up to a 1.4 improvement in BLEU score in Chinese-to-English translation experiments.

6 0.6477738 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

7 0.62418342 133 acl-2010-Hierarchical Search for Word Alignment

8 0.617562 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation

9 0.61021721 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

10 0.60039496 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

11 0.55741578 79 acl-2010-Cross-Lingual Latent Topic Extraction

12 0.53881484 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities

13 0.52471691 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

14 0.48822555 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

15 0.46893731 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

16 0.43950668 195 acl-2010-Phylogenetic Grammar Induction

17 0.4212504 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

18 0.4102723 163 acl-2010-Learning Lexicalized Reordering Models from Reordering Graphs

19 0.39215016 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems

20 0.38657528 162 acl-2010-Learning Common Grammar from Multilingual Corpus


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.296), (14, 0.034), (25, 0.042), (39, 0.011), (42, 0.022), (59, 0.111), (73, 0.046), (76, 0.012), (78, 0.014), (83, 0.06), (84, 0.02), (98, 0.23)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82756376 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning

Author: Peter Prettenhofer ; Benno Stein

Abstract: We present a new approach to cross-language text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce task-specific, cross-lingual word correspondences. We report on analyses that reveal quantitative insights about the use of unlabeled data and the complexity of interlanguage correspondence modeling. We conduct experiments in the field of cross-language sentiment classification, employing English as source language, and German, French, and Japanese as target languages. The results are convincing; they demonstrate both the robustness and the competitiveness of the presented ideas.

same-paper 2 0.80053288 262 acl-2010-Word Alignment with Synonym Regularization

Author: Hiroyuki Shindo ; Akinori Fujino ; Masaaki Nagata

Abstract: We present a novel framework for word alignment that incorporates synonym knowledge collected from monolingual linguistic resources in a bilingual probabilistic model. Synonym information is helpful for word alignment because we can expect a synonym to correspond to the same word in a different language. We design a generative model for word alignment that uses synonym information as a regularization term. The experimental results show that our proposed method significantly improves word alignment quality.

3 0.70497596 230 acl-2010-The Manually Annotated Sub-Corpus: A Community Resource for and by the People

Author: Nancy Ide ; Collin Baker ; Christiane Fellbaum ; Rebecca Passonneau

Abstract: The Manually Annotated Sub-Corpus (MASC) project provides data and annotations to serve as the base for a community-wide annotation effort of a subset of the American National Corpus. The MASC infrastructure enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or ported to any of a variety of other formats. MASC includes data from a much wider variety of genres than existing multiply-annotated corpora of English, and the project is committed to a fully open model of distribution, without restriction, for all data and annotations produced or contributed. As such, MASC is the first large-scale, open, community-based effort to create much needed language resources for NLP. This paper describes the MASC project, its corpus and annotations, and serves as a call for contributions of data and annotations from the language processing community.

4 0.66600567 232 acl-2010-The S-Space Package: An Open Source Package for Word Space Models

Author: David Jurgens ; Keith Stevens

Abstract: We present the S-Space Package, an open source framework for developing and evaluating word space algorithms. The package implements well-known word space algorithms, such as LSA, and provides a comprehensive set of matrix utilities and data structures for extending new or existing models. The package also includes word space benchmarks for evaluation. Both algorithms and libraries are designed for high concurrency and scalability. We demonstrate the efficiency of the reference implementations and also provide their results on six benchmarks.

5 0.66520077 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification

Author: Wenbin Jiang ; Qun Liu

Abstract: In this paper we describe an intuitionistic method for dependency parsing, where a classifier is used to determine whether a pair of words forms a dependency edge. And we also propose an effective strategy for dependency projection, where the dependency relationships of the word pairs in the source language are projected to the word pairs of the target language, leading to a set of classification instances rather than a complete tree. Experiments show that the classifier trained on the projected classification instances significantly outperforms previous projected dependency parsers. More importantly, when this classifier is integrated into a maximum spanning tree (MST) dependency parser, obvious improvement is obtained over the MST baseline.

6 0.66400695 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

7 0.6621173 79 acl-2010-Cross-Lingual Latent Topic Extraction

8 0.65787667 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

9 0.65776384 253 acl-2010-Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing

10 0.65540826 133 acl-2010-Hierarchical Search for Word Alignment

11 0.65352052 170 acl-2010-Letter-Phoneme Alignment: An Exploration

12 0.65342784 20 acl-2010-A Transition-Based Parser for 2-Planar Dependency Structures

13 0.65341115 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

14 0.64937043 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

15 0.6479634 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

16 0.64604217 146 acl-2010-Improving Chinese Semantic Role Labeling with Rich Syntactic Features

17 0.64597374 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction

18 0.64576608 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

19 0.64570713 163 acl-2010-Learning Lexicalized Reordering Models from Reordering Graphs

20 0.64496458 164 acl-2010-Learning Phrase-Based Spelling Error Models from Clickthrough Data