acl acl2011 acl2011-43 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Graham Neubig ; Taro Watanabe ; Eiichiro Sumita ; Shinsuke Mori ; Tatsuya Kawahara
Abstract: We present an unsupervised model for joint phrase alignment and extraction using nonparametric Bayesian methods and inversion transduction grammars (ITGs). The key contribution is that phrases of many granularities are included directly in the model through the use of a novel formulation that memorizes phrases generated not only by terminal, but also non-terminal symbols. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs. Experiments on several language pairs demonstrate that the proposed model matches the accuracy of traditional two-step word alignment/phrase extraction approach while reducing the phrase table to a fraction of the original size.
Reference: text
sentIndex sentText sentNum sentScore
1 The key contribution is that phrases of many granularities are included directly in the model through the use of a novel formulation that memorizes phrases generated not only by terminal, but also non-terminal symbols. [sent-2, score-0.526]
2 This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs. [sent-3, score-0.609]
3 Experiments on several language pairs demonstrate that the proposed model matches the accuracy of traditional two-step word alignment/phrase extraction approach while reducing the phrase table to a fraction of the original size. [sent-4, score-0.703]
4 , 2003) takes unaligned bilingual training data as input, and outputs a scored table of phrase pairs. [sent-6, score-0.458]
5 This phrase table is traditionally generated by going through a pipeline of two steps, first generating word (or minimal phrase) alignments, then extracting a phrase table that is consistent with these alignments. [sent-7, score-0.977]
6 However, as DeNero and Klein (2010) note, this two-step approach results in word alignments that are not optimal for the final task of generating phrase tables that are used in translation. [sent-8, score-0.618]
7 As a solution to this, they proposed a supervised discriminative model that performs joint word alignment and phrase extraction, and found that joint estimation of word alignments and extraction sets improves both word alignment accuracy and translation results. [sent-9, score-1.186]
8 In this paper, we propose the first unsupervised approach to joint alignment and extraction of phrases at multiple granularities. [sent-10, score-0.461]
9 This is achieved by constructing a generative model that includes phrases at many levels of granularity, from minimal phrases all the way up to full sentences. [sent-11, score-0.501]
10 The model is similar to previously proposed phrase alignment models based on inversion transduction grammars (ITGs) (Cherry and Lin, 2007; Zhang et al. [sent-12, score-0.87]
11 , 2009), with one important change: ITG symbols and phrase pairs are generated in the opposite order. [sent-14, score-0.522]
12 In traditional ITG models, the branches of a biparse tree are generated from a nonterminal distribution, and each leaf is generated by a word or phrase pair distribution. [sent-15, score-0.638]
13 As a result, only minimal phrases are directly included in the model, while larger phrases must be generated by heuristic extraction methods. [sent-16, score-0.744]
14 In the proposed model, at each branch in the tree, we first attempt to generate a phrase pair from the phrase pair distribution, falling back to an ITG-based divide-and-conquer strategy to generate phrase pairs that do not exist (or are given low probability) in the phrase distribution. [sent-17, score-1.955]
15 This makes it possible to directly use probabilities of the phrase model as a replacement for the phrase table generated by heuristic extraction techniques. [sent-22, score-1.272]
16 We observe that the proposed joint phrase alignment and extraction approach is able to meet or exceed results attained by a combination of GIZA++ and heuristic phrase extraction, with a significantly smaller phrase table. [sent-24, score-1.844]
17 We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches. [sent-25, score-0.593]
18 If θ takes the form of a scored phrase table, we can use traditional methods for phrase-based SMT to find P(e|f, θ) and concentrate on creating a model for P(θ|⟨E, F⟩). [sent-29, score-0.503]
19 , 2009) have used the formalism of inversion transduction grammars (ITGs) (Wu, 1997) to learn phrase alignments. [sent-35, score-0.653]
20 The traditional flat ITG generative probability for a particular phrase (or sentence) pair Pflat(⟨e, f⟩; θx, θt) is parameterized by a phrase table θt and a symbol distribution θx. [sent-37, score-1.303]
21 (a) If x = TERM, generate a phrase pair from the phrase table Pt(⟨e, f⟩; θt). [sent-44, score-0.913]
22 (b) If x = REG, a regular ITG rule, generate phrase pairs ⟨e1, f1⟩ and ⟨e2, f2⟩ from Pflat, and concatenate them into a single phrase pair ⟨e1e2, f1f2⟩. [sent-45, score-0.965]
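To make the FLAT generative story quoted just above concrete, here is a minimal Python sketch (mine, not the authors' code): Px and Pt are assumed to be given as sampling functions, phrases are lists of words, and the inverted rule INV, standard in ITGs, is included even though only TERM and REG are quoted here.

```python
def gen_flat(sample_symbol, sample_phrase_pair):
    """Recursively generate one phrase (sentence) pair <e, f> under FLAT."""
    x = sample_symbol()                  # x ~ Px over {TERM, REG, INV}
    if x == "TERM":
        return sample_phrase_pair()      # <e, f> ~ Pt: the only memorized pairs
    e1, f1 = gen_flat(sample_symbol, sample_phrase_pair)
    e2, f2 = gen_flat(sample_symbol, sample_phrase_pair)
    if x == "REG":                       # straight rule: concatenate in order
        return e1 + e2, f1 + f2
    return e1 + e2, f2 + f1              # INV: concatenate the target side reversed
```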
23 1 Bayesian Modeling While the previous formulation can be used as-is in maximum likelihood training, this leads to a degenerate solution where every sentence is memorized as a single phrase pair. [sent-50, score-0.529]
24 We assign θx a Dirichlet prior, and assign the phrase table parameters θt a prior using the Pitman-Yor process (Pitman and Yor, 1997; Teh, 2006), which is a generalization of the Dirichlet process prior used in previous research. [sent-53, score-0.591]
25 The discount d is subtracted from observed counts, and when it is given a large value (close to one), less frequent phrase pairs will be given lower relative probability than more common phrase pairs. [sent-55, score-0.995]
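The role of the discount d can be seen in the standard Chinese-restaurant-style predictive rule of the Pitman-Yor process; the sketch below uses generic customer/table counts, and the variable names are mine rather than the paper's.

```python
def py_predictive(c_xy, t_xy, c_total, t_total, d, s, p_base):
    """P(pair) under a Pitman-Yor process: c_xy/t_xy are the customer/table
    counts for this pair, c_total/t_total the totals, d the discount,
    s the strength, p_base the base-measure probability of the pair."""
    reuse = max(c_xy - d * t_xy, 0.0)        # observed counts, discounted per table
    new = (s + d * t_total) * p_base         # mass reserved for unseen pairs
    return (reuse + new) / (c_total + s)
```

With d close to one, the reuse term of a rarely observed pair shrinks toward zero, which matches the behavior described above.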
26 Pbase is the prior probability of generating a particular phrase pair, which we describe in more detail in the following section. [sent-57, score-0.507]
27 Non-parametric priors are well suited for modeling the phrase distribution because every time a phrase is generated by the model, it is “memorized” and given higher probability. [sent-58, score-0.975]
28 Because of this, common phrase pairs are more likely to be re-used (the rich-get-richer effect), which results in the induction of phrase tables with fewer, but more helpful phrases. [sent-59, score-0.958]
29 It is important to note that only phrases generated by Pt are actually memorized and given higher probability by the model. [sent-60, score-0.39]
30 In FLAT, only minimal phrases generated after Px outputs the terminal symbol TERM are generated from Pt, and thus only minimal phrases are memorized by the model. [sent-61, score-0.796]
31 2 Base Measure Pbase in Equation (2) indicates the prior probability of phrase pairs according to the model. [sent-69, score-0.559]
32 We calculate Pbase by first choosing whether to generate an unaligned phrase pair (where |e| = 0 or |f| = 0) according to a fixed probability pu, then generating from Pba for aligned phrase pairs, or Pbu for unaligned phrase pairs. [sent-71, score-1.505]
33 , 1993) probability of one phrase given the other, which incorporates word-based alignment information as prior knowledge in the phrase translation probability. [sent-77, score-1.145]
34 It should be noted that while Model 1 probabilities are used, they are only soft constraints, compared with the hard constraint of choosing a single word alignment used in most previous phrase extraction approaches. [sent-80, score-0.788]
35 For Pbu, if g is the non-null phrase in e and f, we calculate the probability as follows: Pbu(⟨e, f⟩) = Puni(g) Ppois(|g|; λ)/2. [sent-81, score-0.498]
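A hedged sketch of the base measure follows: the unaligned branch mirrors the Pbu formula above, while the exact form of the aligned branch Pba (a Model 1 phrase probability combined with Poisson length priors) is an assumption beyond what is quoted here; p_m1 and p_uni are placeholder functions.

```python
import math

def poisson(k, lam):
    """Ppois(k; lambda)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def p_base(e, f, pu, lam, p_m1, p_uni):
    if len(e) == 0 or len(f) == 0:            # unaligned pair: use Pbu
        g = f if len(e) == 0 else e
        return pu * p_uni(g) * poisson(len(g), lam) / 2.0
    # aligned pair (Pba): Model 1 phrase probability with length priors (assumed form)
    return (1.0 - pu) * p_m1(e, f) * poisson(len(e), lam) * poisson(len(f), lam)
```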
36 4 Hierarchical ITG Model While in FLAT only minimal phrases were memorized by the model, as DeNero et al. [sent-83, score-0.372]
37 and we confirm in the experiments in Section 7, using only minimal phrases leads to inferior translation results for phrase-based SMT. [sent-89, score-0.337]
38 Because of this, previous research has combined FLAT with heuristic phrase extraction, which exhaustively combines all adjacent phrases permitted by the word alignments (Och et al. [sent-90, score-0.852]
39 By doing so, we are able to do away with heuristic phrase extraction, creating a fully probabilistic model for phrase probabilities that still yields competitive results. [sent-93, score-1.07]
40 Similarly to FLAT, HIER assigns a probability Phier(⟨e, f⟩; θx, θt) to phrase pairs, and is parameterized by a phrase table θt and a symbol distribution θx. [sent-94, score-0.983]
41 The main difference from the generative story of the traditional ITG model is that symbols and phrase pairs are generated in the opposite order. [sent-95, score-0.668]
42 While FLAT first generates branches of the derivation tree using Px, then generates leaves using the phrase distribution Pt, HIER first attempts to generate the full sentence as a single phrase from Pt, then falls back to ITG-style derivations to cope with sparsity. [sent-96, score-1.013]
43 θt ∼ PY(d, s, Pdac) (3) Pdac essentially breaks the generation of a single longer phrase into two generations of shorter phrases, allowing even phrase pairs for which c(⟨e, f⟩) = 0 to be given some probability. [sent-98, score-0.874]
44 (a) If x = BASE, generate a new phrase pair directly from Pbase of Section 3. [sent-105, score-0.535]
45 (b) If x = REG, generate ⟨e1, f1⟩ and ⟨e2, f2⟩ from Phier, and concatenate them into a single phrase pair ⟨e1e2, f1f2⟩. [sent-107, score-0.502]
46 As previously described, FLAT first generates from the symbol distribution Px, then from the phrase distribution Pt, while HIER generates directly from Pt, which falls back to divide-and-conquer based on Px when necessary. [sent-113, score-0.702]
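A deliberately simplified sketch of the HIER story just described: a memorized pair is reused with some probability, and otherwise the divide-and-conquer fall-back creates one. A faithful implementation would use the per-pair Pitman-Yor predictive probabilities rather than this uniform Polya-urn-style cache; the function names are assumptions.

```python
import random

def gen_hier(cache, p_new, sample_symbol, sample_base):
    if cache and random.random() > p_new():
        return random.choice(cache)              # reuse a memorized pair
    x = sample_symbol()                          # x ~ Px over {BASE, REG, INV}
    if x == "BASE":
        pair = sample_base()                     # brand-new pair from Pbase
    else:
        e1, f1 = gen_hier(cache, p_new, sample_symbol, sample_base)
        e2, f2 = gen_hier(cache, p_new, sample_symbol, sample_base)
        pair = (e1 + e2, f1 + f2) if x == "REG" else (e1 + e2, f2 + f1)
    cache.append(pair)                           # memorized at every granularity
    return pair
```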
47 It can be seen that while Pt in FLAT only generates minimal phrases, Pt in HIER generates (and thus memorizes) phrases at all levels of granularity. [sent-114, score-0.352]
48 Practically, while the Pitman-Yor process in HIER shares the parameters s and d over all phrase pairs in the model, long phrase pairs are much more sparse (Figure 2: Learned discount values by phrase pair length) [sent-119, score-1.52]
49 than short phrase pairs, and thus it is desirable to appropriately adjust the parameters of Equation (2) according to phrase pair length. [sent-120, score-0.906]
50 In order to solve these problems, we reformulate the model so that each phrase length l = |f| + |e| has its own phrase parameters θt,l and symbol parameters θx,l, which are given separate priors: θt,l ∼ PY(s, d, Pdac,l), θx,l ∼ Dirichlet(α). We will call this model HLEN. [sent-121, score-1.074]
51 We then generate a phrase pair from the probability Pt,l(⟨e, f⟩) for length l. [sent-124, score-0.592]
52 HIER, with one minor change: when we fall back to two shorter phrases, we choose the length of the left phrase from ll ∼ Uniform(1, l−1), set the length of the right phrase to lr = l − ll, and generate the smaller phrases from Pt,ll and Pt,lr respectively. [sent-126, score-1.089]
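A small sketch of the HLEN fall-back just described; sample_pt(length) stands in for the length-specific models Pt,l and is an assumed interface, and phrases are lists of words.

```python
import random

def hlen_split(l, sample_pt):
    """Split a pair of total length l >= 2 into left and right parts."""
    ll = random.randint(1, l - 1)     # ll ~ Uniform(1, l-1)
    lr = l - ll                       # lr = l - ll
    e1, f1 = sample_pt(ll)            # smaller phrase from Pt,ll
    e2, f2 = sample_pt(lr)            # smaller phrase from Pt,lr
    return e1 + e2, f1 + f2
```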
53 In particular, phrase pairs of length up to six (for example, |e| = 3, |f| = 3) are given discounts of nearly zero, while larger phrases are more heavily discounted. [sent-131, score-0.656]
54 2 Implementation Previous research has used a variety of sampling methods to learn Bayesian phrase based alignment models (DeNero et al. [sent-135, score-0.606]
55 One important implementation detail that is different from previous models is the management of phrase counts. [sent-141, score-0.411]
56 As a phrase pair ta may have been generated from two smaller component phrases tb and tc, when a sample containing ta is removed from the distribution, it may also be necessary to decrement the counts of tb and tc as well. [sent-142, score-0.935]
57 For each table representing a phrase pair ta, we maintain not only the number of customers sitting at the table, but also the identities of phrases tb and tc that were originally used when generating the table. [sent-144, score-0.718]
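The bookkeeping in the two sentences above can be sketched as follows; customer counts per table are omitted for brevity and the class and method names are mine, so this only illustrates the recursive release of the component phrases tb and tc.

```python
class TableTracker:
    def __init__(self):
        self.tables = {}   # phrase pair -> list of (tb, tc), one entry per table

    def open_table(self, ta, tb=None, tc=None):
        """Record a new table for ta together with the components that built it."""
        self.tables.setdefault(ta, []).append((tb, tc))

    def close_table(self, ta):
        """Remove one table for ta and recursively release its components."""
        tb, tc = self.tables[ta].pop()
        for component in (tb, tc):
            if component is not None:
                self.close_table(component)
```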
58 5 Phrase Extraction In this section, we describe both traditional heuristic phrase extraction, and the proposed model-based extraction method. [sent-146, score-0.746]
59 Figure 3: The phrase, block, and word alignments used in heuristic phrase extraction. [sent-147, score-0.664]
60 1 Heuristic Phrase Extraction The traditional method for heuristic phrase extraction from word alignments exhaustively enumerates all phrases up to a certain length consistent with the alignment (Och et al. [sent-149, score-1.198]
61 Five features are used in the phrase table: the conditional phrase probabilities in both directions estimated using maximum likelihood Pml (f|e) and Pml (e|f), lexical weighting probabilities (Koehn et al. [sent-151, score-0.988]
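For the first two of these features, the maximum-likelihood conditionals can be computed from extracted phrase pair counts as in the sketch below; lexical weighting and the phrase penalty are omitted, phrases are assumed to be hashable strings or tuples, and this is the standard pipeline computation rather than code from the paper.

```python
from collections import Counter

def ml_phrase_probs(extracted_pairs):
    """Return Pml(f|e) and Pml(e|f) from a list of extracted (e, f) pairs."""
    c_pair, c_e, c_f = Counter(), Counter(), Counter()
    for e, f in extracted_pairs:
        c_pair[(e, f)] += 1
        c_e[e] += 1
        c_f[f] += 1
    p_f_given_e = {(e, f): c / c_e[e] for (e, f), c in c_pair.items()}
    p_e_given_f = {(e, f): c / c_f[f] for (e, f), c in c_pair.items()}
    return p_f_given_e, p_e_given_f
```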
62 We will call this heuristic extraction from word alignments HEUR-W. [sent-153, score-0.394]
63 We use the combination of our ITG-based alignment with traditional heuristic phrase extraction as a second baseline. [sent-155, score-0.852]
64 In model HEUR-P, minimal phrases generated from Pt are treated as aligned, and we perform phrase extraction on these alignments. [sent-157, score-0.869]
65 It should be noted that forcing alignments smaller than the model suggests is only used for generating alignments for use in heuristic extraction, and does not affect the training process. [sent-160, score-0.481]
66 2 Model-Based Phrase Extraction We also propose a method for phrase table extraction that directly utilizes the phrase probabilities Pt(⟨e, f⟩). [sent-162, score-0.855]
67 Similarly to the heuristic phrase tables, we use conditional probabilities Pt(f|e) and Pt(e|f), lexical weighting probabilities, and a phrase penalty. [sent-163, score-1.035]
68 (summing over {ẽ : c(⟨ẽ, f⟩) ≥ 1}) To limit phrase table size, we include only phrase pairs that are aligned at least once in the sample. [sent-165, score-0.874]
69 We also include two more features: the phrase pair joint probability Pt(⟨e, f⟩), and the average posterior probability of each span that generated ⟨e, f⟩, as computed by the inside-outside algorithm during training. [sent-166, score-0.827]
70 We use the span probability as it gives a hint about the reliability of the phrase pair. [sent-167, score-0.466]
71 It will be high for common phrase pairs that are generated directly from the model, and also for phrases that, while not directly included in the model, are composed of two high-probability child phrases. [sent-168, score-0.742]
72 We do this by setting Pt(⟨e, f⟩) = Pt,l(⟨e, f⟩) · c(l) / Σ_{l̃=1..L} c(l̃) for every phrase pair, where l = |e| + |f| and c(l) is the number of phrases of length l in the sample. [sent-170, score-0.604]
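The combination just described can be sketched as below; pt_l(l, e, f) and the length-count dictionary are assumed interfaces to the length-specific models and to the sample statistics.

```python
def combine_length_models(pt_l, length_counts):
    """Weight each Pt,l by the fraction of sampled phrases of length l."""
    total = float(sum(length_counts.values()))
    def pt(e, f):
        l = len(e) + len(f)
        return pt_l(l, e, f) * length_counts.get(l, 0) / total
    return pt
```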
73 , 2006) exhaustive phrase extraction tends to outperform approaches that use syntax or generative models to limit phrase boundaries. [sent-175, score-0.986]
74 (2006) state that this is because generative models choose only a single phrase segmentation, and thus throw away many good phrase pairs that are in conflict with this segmentation. [sent-177, score-0.928]
75 Luckily, in the Bayesian framework it is simple to overcome this problem by combining phrase tables from multiple samples. [sent-178, score-0.495]
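One simple way to combine phrase tables from multiple samples, sketched below, is to average each pair's probability over the samples, treating pairs absent from a sample as zero; the exact combination scheme is not specified above, so this averaging is only an assumption.

```python
from collections import defaultdict

def average_phrase_tables(sampled_tables):
    """sampled_tables: list of {(e, f): probability} dicts, one per sample."""
    sums = defaultdict(float)
    for table in sampled_tables:
        for pair, prob in table.items():
            sums[pair] += prob
    n = float(len(sampled_tables))
    return {pair: total / n for pair, total in sums.items()}
```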
76 6 Related Work In addition to the previously mentioned phrase alignment techniques, there has also been a significant body of work on phrase extraction (Moore and Quirk (2007), Johnson et al. [sent-181, score-1.076]
77 DeNero and Klein (2010) presented the first work on joint phrase alignment and extraction at multiple levels. [sent-183, score-0.714]
78 We compare the accuracy of our proposed method of joint phrase alignment and extraction using the FLAT, HIER and HLEN models, with a baseline of using word alignments from GIZA++ and heuristic phrase extraction. [sent-206, score-1.416]
79 Decoding is performed using Moses (Koehn and others, 2007) using the phrase tables learned by each method under consideration, as well as standard bidirectional lexical reordering probabilities (Koehn et al. [sent-207, score-0.578]
80 Maximum phrase length is limited to 7 in all models, and for the LM we use an interpolated Kneser-Ney 5-gram model. [sent-209, score-0.446]
81 We also try averaging the phrase tables from the last ten samples as described in Section 5. [sent-215, score-0.495]
82 From these results we can see that when using a single sample, the combination of using HIER and model probabilities achieves results approximately equal to GIZA++ and heuristic phrase extraction. [sent-219, score-0.659]
83 This is the first reported result in which an unsupervised phrase alignment model has built a phrase table directly from model probabilities and achieved results that compare to heuristic phrase extraction. [sent-220, score-1.693]
84 It can also be seen that the phrase table created by the proposed method is approximately 5 times smaller than that obtained by the traditional pipeline. [sent-221, score-0.536]
85 This confirms that phrase tables containing only minimal phrases are not able to achieve results that compete with phrase tables that use multiple granularities. [sent-223, score-1.244]
86 In particular, we believe the necessity to combine probabilities from multiple Pt,l models into a single phrase table may have resulted in a distortion of the phrase probabilities. [sent-226, score-0.905]
87 In addition, the assumption that phrase lengths are generated from a uniform distribution is likely too strong, and further gains provided by Pbase. [sent-227, score-0.562]
88 could likely be achieved by more accurate modeling of phrase lengths. [sent-237, score-0.411]
89 It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER. [sent-239, score-0.495]
90 This suggests that for HIER, most of the useful phrase pairs discovered by the model are included in every iteration, and the increased recall obtained by combining multiple samples does not consistently outweigh the increased confusion caused by the larger phrase table. [sent-240, score-0.909]
91 We also evaluated the effectiveness of modelbased phrase extraction compared to heuristic phrase extraction. [sent-241, score-1.062]
92 Using the alignments from HIER, we created phrase tables using model probabilities (MOD), and heuristic extraction on words (HEUR-W), blocks (HEUR-B), and minimal phrases (HEUR-P) as described in Section 5. [sent-242, score-1.23]
93 It can be seen that model-based phrase extraction using HIER outperforms or insignificantly underperforms heuristic phrase extraction over all experimental settings, while keeping the phrase table to a fraction of the size of most heuristic extraction methods. [sent-244, score-1.729]
94 Finally, we varied the size of the parallel corpus for the Japanese-English task from 50k to 400k sentences. (Figure 4: The effect of corpus size on the accuracy (a) and phrase table size (b) for each method (Japanese-English).) [sent-245, score-0.411]
95 Figure 4 (b) shows the size of the phrase table induced by each method over the various corpus sizes. [sent-248, score-0.411]
96 8 Conclusion In this paper, we presented a novel approach to joint phrase alignment and extraction through a hierarchical model using non-parametric Bayesian methods and inversion transduction grammars. [sent-250, score-0.909]
97 Machine translation systems using phrase tables learned directly by the proposed model were able to achieve accuracy competitive with the traditional pipeline of word alignment and heuristic phrase extraction, the first such result for an unsupervised model. [sent-251, score-1.426]
98 For future work, we plan to refine HLEN to use a more appropriate model of phrase length than the uniform distribution, particularly by attempting to bias against phrase pairs where one of the two phrases is much longer than the other. [sent-252, score-1.14]
99 We will also examine the applicability of the proposed model in the context of hierarchical phrases (Chiang, 2007), or in alignment using syntactic structure (Galley et al. [sent-254, score-0.375]
100 An iteratively-trained segmentation-free phrase translation model for statistical machine translation. [sent-363, score-0.529]
wordName wordTfidf (topN-words)
[('phrase', 0.411), ('hier', 0.392), ('pt', 0.204), ('hlen', 0.178), ('phrases', 0.158), ('flat', 0.157), ('pbase', 0.156), ('itg', 0.148), ('fi', 0.146), ('alignment', 0.144), ('denero', 0.141), ('heuristic', 0.13), ('alignments', 0.123), ('memorized', 0.118), ('pflat', 0.111), ('extraction', 0.11), ('minimal', 0.096), ('adaptor', 0.092), ('blunsom', 0.092), ('pbu', 0.089), ('pdac', 0.089), ('ppois', 0.089), ('bayesian', 0.089), ('px', 0.086), ('tables', 0.084), ('transduction', 0.083), ('translation', 0.083), ('probabilities', 0.083), ('grammars', 0.082), ('itgs', 0.078), ('inversion', 0.077), ('pba', 0.067), ('puni', 0.067), ('discount', 0.066), ('koehn', 0.065), ('reg', 0.064), ('giza', 0.062), ('base', 0.06), ('generated', 0.059), ('traditional', 0.057), ('probability', 0.055), ('distribution', 0.054), ('generative', 0.054), ('symbol', 0.052), ('pairs', 0.052), ('pair', 0.052), ('pitman', 0.051), ('sampling', 0.051), ('tb', 0.05), ('generates', 0.049), ('dirichlet', 0.049), ('joint', 0.049), ('variational', 0.048), ('tc', 0.047), ('unaligned', 0.047), ('phier', 0.044), ('saers', 0.044), ('granularities', 0.044), ('kyoto', 0.042), ('prior', 0.041), ('cohen', 0.041), ('priors', 0.04), ('noted', 0.04), ('ta', 0.039), ('pitmanyor', 0.039), ('memorizes', 0.039), ('conquer', 0.039), ('inv', 0.039), ('generate', 0.039), ('uniform', 0.038), ('proposed', 0.038), ('py', 0.037), ('mdl', 0.036), ('pml', 0.036), ('underperforms', 0.036), ('yor', 0.036), ('moore', 0.036), ('length', 0.035), ('cherry', 0.035), ('model', 0.035), ('teh', 0.034), ('synchronous', 0.034), ('tm', 0.034), ('johnson', 0.034), ('process', 0.033), ('directly', 0.033), ('nonparametric', 0.033), ('parameters', 0.032), ('alia', 0.032), ('fujii', 0.032), ('patent', 0.032), ('calculate', 0.032), ('cohn', 0.031), ('ntcir', 0.031), ('concatenate', 0.031), ('call', 0.031), ('tuning', 0.031), ('smaller', 0.03), ('exhaustively', 0.03), ('parameter', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
Author: Graham Neubig ; Taro Watanabe ; Eiichiro Sumita ; Shinsuke Mori ; Tatsuya Kawahara
Abstract: We present an unsupervised model for joint phrase alignment and extraction using nonparametric Bayesian methods and inversion transduction grammars (ITGs). The key contribution is that phrases of many granularities are included directly in the model through the use of a novel formulation that memorizes phrases generated not only by terminal, but also non-terminal symbols. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs. Experiments on several language pairs demonstrate that the proposed model matches the accuracy of traditional two-step word alignment/phrase extraction approach while reducing the phrase table to a fraction of the original size.
2 0.25834697 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
Author: Qin Gao ; Stephan Vogel
Abstract: We present an approach of expanding parallel corpora for machine translation. By utilizing Semantic role labeling (SRL) on one side of the language pair, we extract SRL substitution rules from existing parallel corpus. The rules are then used for generating new sentence pairs. An SVM classifier is built to filter the generated sentence pairs. The filtered corpus is used for training phrase-based translation models, which can be used directly in translation tasks or combined with baseline models. Experimental results on ChineseEnglish machine translation tasks show an average improvement of 0.45 BLEU and 1.22 TER points across 5 different NIST test sets.
3 0.22559758 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
Author: Coskun Mermer ; Murat Saraclar
Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs.
4 0.19734371 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
Author: Andreas Zollmann ; Stephan Vogel
Abstract: In this work we propose methods to label probabilistic synchronous context-free grammar (PSCFG) rules using only word tags, generated by either part-of-speech analysis or unsupervised word class induction. The proposals range from simple tag-combination schemes to a phrase clustering model that can incorporate an arbitrary number of features. Our models improve translation quality over the single generic label approach of Chiang (2005) and perform on par with the syntactically motivated approach from Zollmann and Venugopal (2006) on the NIST large Chineseto-English translation task. These results persist when using automatically learned word tags, suggesting broad applicability of our technique across diverse language pairs for which syntactic resources are not available.
5 0.18364097 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
Author: Jinxi Xu ; Jinying Chen
Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state of the art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed the benefit of improved alignment becomes smaller with more training data, implying the above limit also holds for large training conditions.
6 0.16488506 141 acl-2011-Gappy Phrasal Alignment By Agreement
7 0.15909626 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition
8 0.15604578 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment
9 0.14537318 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
10 0.14160059 263 acl-2011-Reordering Constraint Based on Document-Level Context
11 0.13751481 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation
12 0.1374066 44 acl-2011-An exponential translation model for target language morphology
13 0.1354809 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices
14 0.13533597 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation
15 0.1343466 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
16 0.1326457 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars
17 0.12445045 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering
18 0.1233796 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
19 0.11348195 245 acl-2011-Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives
20 0.10784341 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
topicId topicWeight
[(0, 0.273), (1, -0.192), (2, 0.095), (3, 0.101), (4, 0.082), (5, 0.007), (6, -0.013), (7, 0.034), (8, -0.033), (9, 0.036), (10, 0.117), (11, 0.127), (12, 0.038), (13, 0.104), (14, -0.045), (15, 0.022), (16, 0.008), (17, 0.046), (18, -0.051), (19, 0.043), (20, -0.071), (21, 0.045), (22, -0.064), (23, -0.078), (24, 0.023), (25, 0.091), (26, 0.087), (27, -0.007), (28, -0.06), (29, -0.012), (30, -0.0), (31, 0.006), (32, -0.01), (33, -0.024), (34, 0.018), (35, -0.08), (36, 0.046), (37, -0.061), (38, 0.096), (39, 0.179), (40, -0.103), (41, 0.12), (42, -0.03), (43, 0.024), (44, -0.126), (45, -0.021), (46, 0.044), (47, 0.07), (48, 0.032), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.97920573 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
Author: Graham Neubig ; Taro Watanabe ; Eiichiro Sumita ; Shinsuke Mori ; Tatsuya Kawahara
Abstract: We present an unsupervised model for joint phrase alignment and extraction using nonparametric Bayesian methods and inversion transduction grammars (ITGs). The key contribution is that phrases of many granularities are included directly in the model through the use of a novel formulation that memorizes phrases generated not only by terminal, but also non-terminal symbols. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs. Experiments on several language pairs demonstrate that the proposed model matches the accuracy of traditional two-step word alignment/phrase extraction approach while reducing the phrase table to a fraction of the original size.
2 0.8238039 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices
Author: Wang Ling ; Tiago Luis ; Joao Graca ; Isabel Trancoso ; Luisa Coheur
Abstract: In most statistical machine translation systems, the phrase/rule extraction algorithm uses alignments in the 1-best form, which might contain spurious alignment points. The usage of weighted alignment matrices that encode all possible alignments has been shown to generate better phrase tables for phrase-based systems. We propose two algorithms to generate the well known MSD reordering model using weighted alignment matrices. Experiments on the IWSLT 2010 evaluation datasets for two language pairs with different alignment algorithms show that our methods produce more accurate reordering models, as can be shown by an increase over the regular MSD models of 0.4 BLEU points in the BTEC French to English test set, and of 1.5 BLEU points in the DIALOG Chinese to English test set.
3 0.80167878 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
Author: Qin Gao ; Stephan Vogel
Abstract: We present an approach of expanding parallel corpora for machine translation. By utilizing Semantic role labeling (SRL) on one side of the language pair, we extract SRL substitution rules from existing parallel corpus. The rules are then used for generating new sentence pairs. An SVM classifier is built to filter the generated sentence pairs. The filtered corpus is used for training phrase-based translation models, which can be used directly in translation tasks or combined with baseline models. Experimental results on ChineseEnglish machine translation tasks show an average improvement of 0.45 BLEU and 1.22 TER points across 5 different NIST test sets.
4 0.78625286 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
Author: Coskun Mermer ; Murat Saraclar
Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs.
5 0.78247452 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment
Author: Shujian Huang ; Stephan Vogel ; Jiajun Chen
Abstract: Word alignment has an exponentially large search space, which often makes exact inference infeasible. Recent studies have shown that inversion transduction grammars are reasonable constraints for word alignment, and that the constrained space could be efficiently searched using synchronous parsing algorithms. However, spurious ambiguity may occur in synchronous parsing and cause problems in both search efficiency and accuracy. In this paper, we conduct a detailed study of the causes of spurious ambiguity and how it effects parsing and discriminative learning. We also propose a variant of the grammar which eliminates those ambiguities. Our grammar shows advantages over previous grammars in both synthetic and real-world experiments.
6 0.7801916 141 acl-2011-Gappy Phrasal Alignment By Agreement
7 0.77380264 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
8 0.73188335 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
9 0.69054151 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition
10 0.67894393 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
11 0.63849282 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity
12 0.63649434 44 acl-2011-An exponential translation model for target language morphology
13 0.59595907 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering
14 0.58932495 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
15 0.58816499 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers
16 0.57690203 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
17 0.57667667 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars
18 0.56548488 263 acl-2011-Reordering Constraint Based on Document-Level Context
19 0.55962104 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach
20 0.55769295 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
topicId topicWeight
[(5, 0.017), (17, 0.413), (26, 0.015), (37, 0.087), (39, 0.042), (41, 0.067), (55, 0.03), (59, 0.026), (72, 0.03), (91, 0.033), (96, 0.158)]
simIndex simValue paperId paperTitle
1 0.98519588 107 acl-2011-Dynamic Programming Algorithms for Transition-Based Dependency Parsers
Author: Marco Kuhlmann ; Carlos Gomez-Rodriguez ; Giorgio Satta
Abstract: We develop a general dynamic programming technique for the tabulation of transition-based dependency parsers, and apply it to obtain novel, polynomial-time algorithms for parsing with the arc-standard and arc-eager models. We also show how to reverse our technique to obtain new transition-based dependency parsers from existing tabular methods. Additionally, we provide a detailed discussion of the conditions under which the feature models commonly used in transition-based parsing can be integrated into our algorithms.
2 0.95757151 109 acl-2011-Effective Measures of Domain Similarity for Parsing
Author: Barbara Plank ; Gertjan van Noord
Abstract: It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data to parse a given domain is data that matches the domain (Sekine, 1997; Gildea, 2001). Hence, an important task is to select appropriate domains. However, most previous work on domain adaptation relied on the implicit assumption that domains are somehow given. As more and more data becomes available, automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. This paper evaluates various ways to automatically acquire related training data for a given test set. The results show that an unsupervised technique based on topic models is effective – it outperforms random data selection on both languages examined, English and Dutch. Moreover, the technique works better than manually assigned labels gathered from meta-data that is available for English.
3 0.94913578 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
Author: Gunter Neumann ; Sven Schmeier
Abstract: We present a mobile touchable application for online topic graph extraction and exploration of web content. The system has been implemented for operation on an iPad. The topic graph is constructed from N web snippets which are determined by a standard search engine. We consider the extraction of a topic graph as a specific empirical collocation extraction task where collocations are extracted between chunks. Our measure of association strength is based on the pointwise mutual information between chunk pairs which explicitly takes their distance into account. An initial user evaluation shows that this system is especially helpful for finding new interesting information on topics about which the user has only a vague idea or even no idea at all.
4 0.92734963 118 acl-2011-Entrainment in Speech Preceding Backchannels.
Author: Rivka Levitan ; Agustin Gravano ; Julia Hirschberg
Abstract: In conversation, when speech is followed by a backchannel, evidence of continued engagement by one’s dialogue partner, that speech displays a combination of cues that appear to signal to one’s interlocutor that a backchannel is appropriate. We term these cues backchannel-preceding cues (BPCs), and examine the Columbia Games Corpus for evidence of entrainment on such cues. Entrainment, the phenomenon of dialogue partners becoming more similar to each other, is widely believed to be crucial to conversation quality and success. Our results show that speaking partners entrain on BPCs; that is, they tend to use similar sets of BPCs; this similarity increases over the course of a dialogue; and this similarity is associated with measures of dialogue coordination and task success.
5 0.92541897 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation
Author: Ashish Vaswani ; Haitao Mi ; Liang Huang ; David Chiang
Abstract: Most statistical machine translation systems rely on composed rules (rules that can be formed out of smaller rules in the grammar). Though this practice improves translation by weakening independence assumptions in the translation model, it nevertheless results in huge, redundant grammars, making both training and decoding inefficient. Here, we take the opposite approach, where we only use minimal rules (those that cannot be formed out of other rules), and instead rely on a rule Markov model of the derivation history to capture dependencies between minimal rules. Large-scale experiments on a state-of-the-art tree-to-string translation system show that our approach leads to a slimmer model, a faster decoder, yet the same translation quality (measured using BLEU) as composed rules.
same-paper 6 0.89892566 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
7 0.75758278 180 acl-2011-Issues Concerning Decoding with Synchronous Context-free Grammar
8 0.74860632 30 acl-2011-Adjoining Tree-to-String Translation
9 0.71679145 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
10 0.69209003 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
11 0.69120049 141 acl-2011-Gappy Phrasal Alignment By Agreement
12 0.67514789 296 acl-2011-Terminal-Aware Synchronous Binarization
13 0.67502642 154 acl-2011-How to train your multi bottom-up tree transducer
14 0.67470288 61 acl-2011-Binarized Forest to String Translation
15 0.65460145 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation
16 0.6542908 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
17 0.6524303 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering
18 0.65067041 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
19 0.64817059 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering