acl acl2013 acl2013-46 knowledge-graph by maker-knowledge-mining

46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation


Source: pdf

Author: Trevor Cohn ; Gholamreza Haffari

Abstract: Modern phrase-based machine translation systems make extensive use of word-based translation models for inducing alignments from parallel corpora. This is problematic, as the systems are incapable of accurately modelling many translation phenomena that do not decompose into word-for-word translation. This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior. Overall this leads to a model which learns translations of entire sentences, while also learning their decomposition into smaller units (phrase-pairs) recursively, terminating at word translations. Our experiments on Arabic, Urdu and Farsi to English demonstrate improvements over competitive baseline systems.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Modern phrase-based machine translation systems make extensive use of word-based translation models for inducing alignments from parallel corpora. [sent-4, score-0.457]

2 This is problematic, as the systems are incapable of accurately modelling many translation phenomena that do not decompose into word-for-word translation. [sent-5, score-0.177]

3 This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior. [sent-6, score-0.58]

4 Overall this leads to a model which learns translations of entire sentences, while also learning their decomposition into smaller units (phrase-pairs) recursively, terminating at word translations. [sent-7, score-0.227]

5 , 2003) to machine translation (MT) has transformed MT from a narrow research topic into a truly useful technology to end users. [sent-10, score-0.132]

6 , 2006) all use some kind of multi-word translation unit, which allows translations to be produced from large canned units of text from the training corpus. [sent-13, score-0.269]

7 , 1993) remain central to phrase-based model training, where they are used to infer word-level alignments from sentence aligned parallel data, from [sent-16, score-0.15]

8 which phrasal translation units are extracted using a heuristic. [sent-17, score-0.264]

9 This paper develops a phrase-based translation model which aims to address the above short- comings of the phrase-based translation pipeline. [sent-23, score-0.264]

10 Specifically, we formulate translation using inverse transduction grammar (ITG), and seek to learn an ITG from parallel corpora. [sent-24, score-0.418]

11 The novelty of our approach is that we develop a Bayesian prior over the grammar, such that a nonterminal becomes a ‘cache’ learning each production and its complete yield, which in turn is recursively composed of its child constituents. [sent-25, score-0.232]

12 This is closely related to adaptor grammars (Johnson et al. [sent-26, score-0.185]

13 Our model learns translations of entire sentences while also learning their decomposition into smaller units (phrase-pairs) recursively, terminating at word translations. [sent-28, score-0.227]

14 The model is richly parameterised, such that it can describe phrase-based phenomena while also explicitly modelling the relationships between phrasepairs and their component expansions, thus ameliorating the disconnect between the treatment of words versus phrases in the current MT pipeline. [sent-29, score-0.133]

15 We develop a Bayesian approach using a PitmanYor process prior, which is capable of modelling a diverse range of geometrically decaying distributions over infinite event spaces (here translation phrase-pairs), an approach shown to be state of the art for language modelling (Teh, 2006). [sent-30, score-0.267]

16 ’s work was flawed in a number of respects, most notably in terms of their heuristic beam sampling algorithm which does not meet either of the Markov Chain Monte Carlo criteria of ergodicity or detailed balance. [sent-36, score-0.162]

17 Moreover our approach results in consistent translation improvements across a number of translation tasks compared to Neubig et al. [sent-39, score-0.264]

18 2 Related Work Inversion transduction grammar (or ITG) (Wu, 1997) is a well studied synchronous grammar formalism. [sent-41, score-0.418]

19 Terminal productions of the form X → e/f generate a word in two languages, and nonterminal productions allow phrasal movement in the translation process. [sent-42, score-0.433]

20 Our paper fits into the recent line of work forjointly inducing the phrase table and word alignment (DeNero and Klein, 2010; Neubig et al. [sent-58, score-0.135]

21 Another strand of related research is in estimating a broader class of synchronous grammars than ITGs, such as SCFGs (Blunsom et al. [sent-65, score-0.142]

22 This work was inspired by adaptor grammars (Johnson et al. [sent-69, score-0.185]

23 , 2007a), a monolingual grammar formalism whereby a non-terminal rewrites in a single step as a complete subtree. [sent-70, score-0.206]

24 The model prior allows for trees to be generated as a mixture of a cache and a base adaptor grammar. [sent-71, score-0.223]

25 Additionally, we have extended the model to allow recursive nesting of adapted non-terminals, such that we end up with an infinitely recursive formulation where the top-level and base distributions are explicitly linked together. [sent-73, score-0.146]

26 As mentioned above, ours is not the first work attempting to generalise adaptor grammars for machine translation; (Neubig et al. [sent-74, score-0.185]

27 Our approach improves upon theirs in terms of the model and inference, and critically, this is borne out in our experiments where we show uniform improvements in translation quality over a baseline system, as compared to their almost entirely negative results. [sent-76, score-0.132]

28 We believe that their approach had a number of flaws: For inference they use a beam-search, which may speed up processing but means that they are no longer sampling from the true distribution, nor a distribution with the same support as the posterior. [sent-77, score-0.279]

29 Finally our approach models separately the three different types of ITG production (monotone, swap and lexical emission), allowing for a richer parameterisation which the model exploits by learning different hyper-parameter values. [sent-84, score-0.335]

30 3 Model The generative process of the model follows that of ITG with the following simple grammar X → [X X] | ⟨X X⟩, X → e/f | e/⊥ | ⊥/f, where [·] denotes monotone ordering and ⟨·⟩ denotes a swap in one language. [sent-85, score-0.57]

31 This grammar corresponds to a simple generative story, with each stage being a nonterminal rewrite starting with X and terminating when there are no frontier non-terminals. [sent-87, score-0.14]

32 A popular variant is a phrasal ITG, where the leaves of the ITG tree are phrase-pairs and the training seeks to learn a segmentation of the source and target which yields good phrases. [sent-88, score-0.172]

33 Our approach improves over the phrasal model by recursively generating complete phrases. [sent-90, score-0.219]

34 for r = mono (a) draw the complete subtree expansion, t = X → [t1 t2] ∼ TM. [sent-95, score-0.325]

35 for r = swap (a) draw the complete subtree expansion, t = X → ⟨t1 t2⟩ ∼ TS. [sent-99, score-0.46]

36 for r = emit (a) draw a pair of strings, (e, f) ∼ E (b) set t = X → e/f. Note that we split the problem of drawing a tree into two steps: first choosing the top-level rule type and then drawing a rule of that type. [sent-104, score-0.544]
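
To make this two-step generative story concrete, here is a minimal Python sketch of it. The rule-type probabilities, the phrase-pair stub and the depth cap are invented for illustration; in the actual model these draws come from the learned distribution over rule types, the PYP-distributed tree distributions TM and TS, and the emission distribution E.

```python
import random

# Illustrative stand-ins (not the authors' values): a fixed distribution over
# rule types and a single canned phrase-pair for the emission distribution E.
RULE_TYPE_PROBS = {"mono": 0.4, "swap": 0.2, "emit": 0.4}

def draw_phrase_pair():
    # Placeholder for (e, f) ~ E.
    return ("house", "maison")

def draw_tree(depth=0, max_depth=5):
    """Draw an ITG tree X -> [X X] | <X X> | e/f, choosing the rule type
    first and then drawing a rule of that type, as in the story above."""
    r = random.choices(list(RULE_TYPE_PROBS), weights=list(RULE_TYPE_PROBS.values()))[0]
    if r == "emit" or depth >= max_depth:   # depth cap only to keep the toy finite
        return ("emit", draw_phrase_pair())
    # mono/swap rules expand into two child trees; in the paper these complete
    # subtree expansions are draws from the PYP priors TM and TS.
    left = draw_tree(depth + 1, max_depth)
    right = draw_tree(depth + 1, max_depth)
    return (r, left, right)

print(draw_tree())
```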

37 This gives us greater control than simply drawing a tree of any type from one distribution, due to our parameterisation of the priors over the model parameters TM, TS and E. [sent-105, score-0.235]

38 We use Pitman-Yor Process priors for the TM and TS parameters: TM ∼ PYP(aM, bM, P1(·|r = mono)), TS ∼ PYP(aS, bS, P1(·|r = swap)), where P1(t1, t2|r) is a distribution over a pair of trees (the left and right children of a monotone or swap production). [sent-110, score-0.42]

39 set t = X → [t1 t2] or t = X → ⟨t1 t2⟩ depending on r. This generative process is mutually recursive: P2 makes draws from P1 and P1 makes draws from P2. [sent-114, score-0.146]

40 The recursion is terminated when the rule type r = emit is drawn. [sent-115, score-0.194]

41 We also experimented with using word translation probabilities from IBM model 1, based on the prior used by Levenberg et al. [sent-118, score-0.186]

42 In our case we can consider the process of drawing a tree from P2 as a customer entering a restaurant and choosing where to sit, from an infinite set of tables. [sent-123, score-0.365]

43 The seating decision is based on the number of other customers at each table, such that popular tables are more likely to be joined than unpopular or empty ones. [sent-124, score-0.158]

44 If the customer chooses an occupied table, the identity of the tree is then set to be the same as for the other customers also seated there. [sent-125, score-0.246]

45 For empty tables the tree must be sampled from the base distribution P1. [sent-126, score-0.179]

46 In the standard CRF analogy, this leads to another customer entering the restaurant one step up in the hierarchy, and this process can be chained many times. [sent-127, score-0.171]

47 In our case, however, every new table leads to new customers reentering the original restaurant; these correspond to the left and right child trees of a monotone or swap rule. [sent-128, score-0.631]
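
The seating metaphor described in sentences 42–47 can be sketched as follows. This is an illustrative toy (no discount parameter, and the base distribution P1 is a stub), not the authors' implementation.

```python
import random

class Restaurant:
    """Toy Chinese-restaurant seating over trees: join an occupied table with
    probability proportional to its customer count, or open a new table with
    probability proportional to the concentration b (discount omitted)."""
    def __init__(self, b, base_draw):
        self.b = b
        self.base_draw = base_draw      # stand-in for a draw from P1
        self.tables = []                # list of [tree, customer_count]

    def draw(self):
        weights = [count for _, count in self.tables] + [self.b]
        idx = random.choices(range(len(weights)), weights=weights)[0]
        if idx < len(self.tables):      # occupied table: reuse the same tree
            self.tables[idx][1] += 1
            return self.tables[idx][0]
        tree = self.base_draw()         # empty table: sample from the base P1;
        self.tables.append([tree, 1])   # in the model this spawns child customers
        return tree

r = Restaurant(b=1.0, base_draw=lambda: ("X", "e/f"))
print([r.draw() for _ in range(5)])
```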

48 The probability of a tree (i.e., a draw from P2) under the model is P2(t) = P(r) P2(t|r) (1), where r is the rule type, one of mono, swap or emit. [sent-133, score-0.426]

49 The distribution over types, P(r), is defined as P(r) = (n^T_{r,−} + b^T/3) / (n^T_− + b^T), where n^T_{r,−} are the counts over rules of each type. [sent-134, score-0.153]

50 For r = mono or r = swap rules, it is defined as P2(t|r) = (n^−_{t,r} − K^−_{t,r} a_r) / (n^−_r + b_r) + (K^−_r a_r + b_r) / (n^−_r + b_r) · P1(t1, t2|r), (2) where n^−_{t,r} is the count for tree t in the other training sentences, K^−_{t,r} is the table count for t, and n^−_r and K^−_r are the corresponding totals for rule type r (footnote 4: the conditioning is omitted for clarity). [sent-136, score-0.34]
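
As a concrete reading of equations (1) and (2), the following sketch evaluates them with invented counts; the variable names mirror the notation above and none of the numbers come from the paper.

```python
# Invented counts: n_type[r] plays the role of n^T_{r,-}; b_T is the
# concentration over rule types (the value here is an assumption).
n_type = {"mono": 50, "swap": 20, "emit": 30}
total_type = sum(n_type.values())
b_T = 3.0

def p_rule_type(r):
    """P(r) = (n^T_{r,-} + b^T/3) / (n^T_- + b^T), the type factor in (1)."""
    return (n_type[r] + b_T / 3.0) / (total_type + b_T)

def p2_tree_given_type(n_t, K_t, n_r, K_r, a_r, b_r, p1):
    """Pitman-Yor predictive probability of tree t under rule type r, eq. (2):
    (n_t - K_t*a_r)/(n_r + b_r) + (K_r*a_r + b_r)/(n_r + b_r) * P1(t1, t2 | r)."""
    return (n_t - K_t * a_r) / (n_r + b_r) \
         + (K_r * a_r + b_r) / (n_r + b_r) * p1

print(p_rule_type("mono"))
# A tree seen 4 times at 2 tables, out of 50 mono rules at 10 tables,
# with discount 0.5, concentration 1.0 and base probability 0.01 (all invented).
print(p2_tree_given_type(n_t=4, K_t=2, n_r=50, K_r=10, a_r=0.5, b_r=1.0, p1=0.01))
```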

51 To complete the derivation we still need to define P1, which is formulated as P1(t1, t2) = P2(t1) P2(t2|t1), where the conditioning of the second recursive call to P2 reflects that the counts n^− and K^− may be affected by the first draw from P2. [sent-140, score-0.178]

52 We construct an approximating ITG following the technique used for sampling trees from monolingual tree-substitution grammars (Cohn et al. [sent-146, score-0.331]

53 During inside inference we recover P2 as shown in (2). [sent-151, score-0.148]

54 The full grammar transform for inside inference is shown in Table 1. [sent-152, score-0.254]

55 The sampling algorithm closely follows the process for sampling derivations from Bayesian PCFGs (Johnson et al. [sent-153, score-0.324]

56 This involves first constructing the inside lattice using the productions in Table 1, and then performing a top-down sampling pass. [sent-156, score-0.319]
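
The two-pass structure (bottom-up chart construction, then a top-down sampling pass) can be illustrated on a toy chart; the uniform scores below are invented, and the real chart is built from the grammar transform in Table 1.

```python
import random

def sample_derivation(inside, i, j):
    """Top-down pass: recursively sample a split point for span (i, j) with
    probability proportional to the product of the child inside scores."""
    if j - i == 1:
        return (i, j)                                    # single-word span
    splits = list(range(i + 1, j))
    weights = [inside[(i, k)] * inside[(k, j)] for k in splits]
    k = random.choices(splits, weights=weights)[0]
    return ((i, j), sample_derivation(inside, i, k), sample_derivation(inside, k, j))

# Toy "inside lattice" over a 4-word sentence with uniform scores.
inside = {(i, j): 1.0 for i in range(4) for j in range(i + 1, 5)}
print(sample_derivation(inside, 0, 4))
```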

57 The function sig(t) returns a unique identifier for the complete tree t, and the function yield(t) returns the pair of terminal strings from the yield of t. [sent-160, score-0.229]

58 Accepted samples then replace the old tree (otherwise the old tree is retained) and the model counts are incremented. [sent-162, score-0.21]
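
The accept/reject step can be sketched in its generic Metropolis-Hastings form; the exact proposal probabilities in the paper come from the approximating ITG, which is only stubbed out here as a function argument.

```python
import math
import random

def mh_step(old_tree, new_tree, log_p_model, log_q_proposal):
    """Generic Metropolis-Hastings accept/reject for a proposed tree.
    log_p_model(t): log probability of t under the true model (P2);
    log_q_proposal(t): log probability of t under the approximating grammar."""
    log_ratio = (log_p_model(new_tree) - log_p_model(old_tree)) \
              + (log_q_proposal(old_tree) - log_q_proposal(new_tree))
    if math.log(1.0 - random.random()) < min(0.0, log_ratio):
        return new_tree    # accepted: the new tree replaces the old one
    return old_tree        # rejected: the old tree is retained
```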

59 The corpora statistics of these translation tasks are summarised in Table 2. [sent-165, score-0.132]

60 The UR-EN corpus comes from NIST 2009 translation evaluation. [sent-166, score-0.132]

61 Sampler configuration Samplers are initialised with trees created from GIZA++ alignments constructed using an SCFG factorisation method (Blunsom et al. [sent-186, score-0.208]

62 This algorithm represents the translation of a sentence as a large SCFG rule, which it then factorises into lower rank SCFG rules, a process akin to rule binarisation commonly used in SCFG decoding. [sent-188, score-0.203]

63 After each full sampling iteration, we resample all the hyper-parameters using slice-sampling, with the following priors: a ∼ Beta(1, 1), b ∼ Gamma(10, 0. [sent-191, score-0.162]
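
A minimal univariate slice-sampling update consistent with this description is sketched below. The model log-likelihood is a dummy, the Gamma rate is a placeholder (its value is truncated in the sentence above), and stepping-out is omitted, so this only shows the shape of the resampling step.

```python
import math
import random

def slice_sample(x0, log_density, width=1.0):
    """One slice-sampling update: pick a slice height under the density at x0,
    randomly place an interval of the given width around x0, then shrink it
    until a point under the slice is found (stepping-out omitted)."""
    log_y = log_density(x0) + math.log(1.0 - random.random())
    left = x0 - width * random.random()
    right = left + width
    while True:
        x1 = random.uniform(left, right)
        if log_density(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1      # shrink the bracket towards x0
        else:
            right = x1

dummy_loglik = lambda x: 0.0   # stand-in for the model log-likelihood term

def log_post_a(a):
    # Discount a with a flat Beta(1, 1) prior on (0, 1).
    return dummy_loglik(a) if 0.0 < a < 1.0 else float("-inf")

def log_post_b(b, shape=10.0, rate=0.1):
    # Concentration b with a Gamma prior; the rate here is assumed.
    if b <= 0.0:
        return float("-inf")
    return (shape - 1.0) * math.log(b) - rate * b + dummy_loglik(b)

a, b = 0.5, 1.0
for _ in range(20):
    a = slice_sample(a, log_post_a)
    b = slice_sample(b, log_post_b)
print(a, b)
```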

64 Figure 1 shows that the posterior probability improves with each full sampling iteration. [sent-193, score-0.162]

65 The sampling was repeated for 5 independent runs, and we present results where we combine the outputs of these runs. [sent-196, score-0.162]

66 The time complexity of our inference algorithm is O(n6), which can be prohibitive for large scale machine translation tasks. [sent-198, score-0.208]

67 We reduce the complexity by constraining the inside inference to consider only derivations which are compatible with the alignment constraints (footnote 9: hence the BLEU scores we get for the baselines may appear lower than those reported in the literature). [sent-199, score-0.148]
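
The kind of compatibility test implied here can be illustrated with the standard check that a bilingual span does not cut across any alignment link; this is a sketch of the general idea, not the authors' exact constraint.

```python
def span_consistent(links, src_span, tgt_span):
    """Return True if no alignment link crosses the boundary of the bilingual
    span src_span x tgt_span. links is a set of (src_idx, tgt_idx) pairs and
    spans are half-open [start, end) ranges."""
    (si, sj), (ti, tj) = src_span, tgt_span
    for s, t in links:
        if (si <= s < sj) != (ti <= t < tj):   # link crosses the span boundary
            return False
    return True

# Toy example: "the house" / "la maison", aligned monotonically.
links = {(0, 0), (1, 1)}
print(span_consistent(links, (0, 1), (0, 1)))   # True
print(span_consistent(links, (0, 1), (0, 2)))   # False: the target span covers a
                                                # word whose source lies outside it
```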

68 Using the factorised alignments directly in a translation system resulted in a slight loss in BLEU versus using the unfactorised alignments. [sent-200, score-0.288]

69 Figure 1: Training progress on the UR-EN corpus, showing the posterior probability improving with each full sampling iteration. [sent-202, score-0.228]

70 Figure 2: The runtime cost of bottom-up inside inference and top-down sampling as a function of sentence length (UR-EN), with time shown on a logarithmic scale. [sent-204, score-0.31]

71 Full ITG inference is shown with red circles, and restricted inference using the intersection constraints with blue triangles. [sent-205, score-0.152]

72 Figure 2 shows the sampling time with respect to the average sentence length, showing that our alignment-constrained sampling algorithm is better than the unconstrained algorithm with empirical complexity of n4. [sent-208, score-0.324]

73 Presumably other means of inference may be more efficient, such as Gibbs sampling (Levenberg et al. [sent-210, score-0.238]

74 , 2012) or auxiliary variable sampling (Blunsom and Cohn, 2010); we leave these extensions to future work. [sent-211, score-0.162]

75 , 2011), we evaluate our model by using its output word alignments to construct a phrase table. [sent-215, score-0.142]

76 This alignment is used as input to the rule factorisation algorithm, producing the ITG trees with which we initialise our sampler. [sent-218, score-0.229]

77 In the end-to-end MT pipeline we use a standard set of features: relative-frequency and lexical translation model probabilities in both directions; distance-based distortion model; language model and word count. [sent-221, score-0.132]

78 The baselines are GIZA++ alignments and those generated by the pialign (Neubig et al. [sent-242, score-0.22]

79 Figure 3: Fraction of rules with a given frequency, using a single sample grammar (UR-EN). [sent-244, score-0.251]

80 Results Table 3 shows the BLEU scores for the three translation tasks UR/AR/FA→EN based on our method against the baselines. [sent-246, score-0.132]

81 We believe this type of Monte Carlo model averaging should be considered in general when sampling techniques are employed for grammatical inference, e. [sent-250, score-0.162]

82 Figure 3 shows the fraction of rules with a given frequency for each of the three rule types. [sent-258, score-0.145]

83 As expected, there is a higher tendency to reuse high-frequency emissions (or single-word translation) compared to other rule types, which are the basic building blocks to compose larger rules (or phrases). [sent-260, score-0.145]

84 Table 4 lists the high frequency monotone and swap rules in the learned grammar. [sent-261, score-0.538]

85 We observe the high frequency swap rules capture reordering in verb clusters, preposition-noun inversions and adjective-noun reordering. [sent-262, score-0.362]

86 Similar patterns are seen in the monotone rules, along with some common canned phrases. [sent-263, score-0.223]

87 Note that “in Iraq” appears twice, once as an inversion in UR-EN and another time in monotone order for AR-EN. [sent-264, score-0.272]

88 There is very little spread in the inferred values, suggesting the sampling chains may have converged. [sent-267, score-0.162]

89 Furthermore, there is a large difference between the learned hyper-parameters for the monotone rules versus the swap rules. [sent-268, score-0.538]

90 Table 5: phrase pairs in the top-100 high frequency phrase pairs specific to our method vs that of pialign for FA-EN and AR-EN translation tasks. [sent-2508, score-0.33]

91 If the number of observed monotone and swap rules were equal, then there would be a higher chance of reusing the monotone rules. [sent-2511, score-0.538]

92 However, the number of observed monotone and swap rules are not equal, as plotted in Figure 4. [sent-2512, score-0.538]

93 Conclusions We have presented a novel method for learning a phrase-based model of translation directly from parallel data which we have framed as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior. [sent-2519, score-0.491]

94 This has led to a model which learns translations of entire sentences, while also learning their decomposition into smaller units (phrase-pairs) recursively, terminating at word translations. [sent-2520, score-0.227]

95 We have presented a Metropolis-Hastings sampling algorithm for blocked inference in our non-parametric ITG. [sent-2521, score-0.238]

96 [Figure 4(a): posterior histograms over the hyper-parameters aM and aS, bM and bS, bE, bT] [sent-2534, score-0.153]

97 Figure 4: (a) Posterior over the hyper-parameters, aM, aS, bM, bS, bE, bT, measured for UR-EN using samples 400–500 for 3 independent sampling chains, and the intersection constraints. [sent-2539, score-0.626]

98 (b) Posterior over the number of monotone and swap rules in the resultant grammars. [sent-2540, score-0.538]

99 The distribution for emission rules was also peaked about 147k rules. [sent-2541, score-0.164]

100 Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. [sent-3796, score-0.344]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('itg', 0.401), ('swap', 0.288), ('neubig', 0.186), ('monotone', 0.176), ('sampling', 0.162), ('mono', 0.153), ('translation', 0.132), ('transduction', 0.132), ('emit', 0.123), ('levenberg', 0.118), ('pialign', 0.118), ('adaptor', 0.117), ('blunsom', 0.115), ('grammar', 0.106), ('bayesian', 0.105), ('alignments', 0.102), ('inversion', 0.096), ('terminating', 0.095), ('tree', 0.086), ('phrasal', 0.086), ('productions', 0.085), ('ts', 0.083), ('scfg', 0.08), ('sig', 0.078), ('inference', 0.076), ('rules', 0.074), ('cohn', 0.074), ('synchronous', 0.074), ('phil', 0.073), ('draws', 0.073), ('recursive', 0.073), ('inside', 0.072), ('recursively', 0.072), ('tm', 0.071), ('rule', 0.071), ('giza', 0.07), ('monte', 0.069), ('grammars', 0.068), ('draw', 0.067), ('posterior', 0.066), ('monash', 0.066), ('scfgs', 0.066), ('drawing', 0.063), ('sampler', 0.063), ('restaurant', 0.063), ('complete', 0.061), ('denero', 0.059), ('cherry', 0.059), ('trevor', 0.058), ('bm', 0.056), ('johnson', 0.055), ('prior', 0.054), ('customer', 0.054), ('nt', 0.054), ('entering', 0.054), ('factorisation', 0.054), ('factorised', 0.054), ('pilevar', 0.054), ('seated', 0.054), ('seating', 0.054), ('customers', 0.052), ('tables', 0.052), ('alignment', 0.052), ('trees', 0.052), ('bs', 0.049), ('emission', 0.049), ('approximating', 0.049), ('bt', 0.048), ('parallel', 0.048), ('carlo', 0.048), ('parameterisation', 0.047), ('canned', 0.047), ('phrasepairs', 0.047), ('pyp', 0.047), ('tehran', 0.047), ('units', 0.046), ('modelling', 0.045), ('infinite', 0.045), ('nonterminal', 0.045), ('translations', 0.044), ('terminal', 0.044), ('subtree', 0.044), ('iid', 0.044), ('inducing', 0.043), ('decomposition', 0.042), ('marcu', 0.042), ('farsi', 0.041), ('pauls', 0.041), ('distribution', 0.041), ('bleu', 0.041), ('phrases', 0.041), ('phrase', 0.04), ('priors', 0.039), ('rewrites', 0.039), ('yield', 0.038), ('sharon', 0.038), ('counts', 0.038), ('koehn', 0.038), ('substructures', 0.037), ('urdu', 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

Author: Trevor Cohn ; Gholamreza Haffari

Abstract: Modern phrase-based machine translation systems make extensive use of wordbased translation models for inducing alignments from parallel corpora. This is problematic, as the systems are incapable of accurately modelling many translation phenomena that do not decompose into word-for-word translation. This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior. Overall this leads to a model which learns translations of entire sentences, while also learning their decomposition into smaller units (phrase-pairs) recursively, terminating at word translations. Our experiments on Arabic, Urdu and Farsi to English demonstrate improvements over competitive baseline systems.

2 0.34259322 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

Author: Conghui Zhu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao

Abstract: Typical statistical machine translation systems are batch trained with a given training data and their performances are largely influenced by the amount of data. With the growth of the available data across different domains, it is computationally demanding to perform batch training every time when new data comes. In face of the problem, we propose an efficient phrase table combination method. In particular, we train a Bayesian phrasal inversion transduction grammars for each domain separately. The learned phrase tables are hierarchically combined as if they are drawn from a hierarchical Pitman-Yor process. The performance measured by BLEU is at least as comparable to the traditional batch training method. Furthermore, each phrase table is trained separately in each domain, and while computational overhead is significantly reduced by training them in parallel.

3 0.1947455 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference

Author: Yang Feng ; Trevor Cohn

Abstract: Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phraseinternal translation and reordering. However phrase-based approaches are much less able to model sentence level effects between different phrase-pairs. We propose a new model to address this imbalance, based on a word-based Markov model of translation which generates target translations left-to-right. Our model encodes word and phrase level phenomena by conditioning translation decisions on previous decisions and uses a hierarchical Pitman-Yor Process prior to provide dynamic adaptive smoothing. This mechanism implicitly supports not only traditional phrase pairs, but also gapping phrases which are non-consecutive in the source. Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU.

4 0.18082234 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

Author: Sujith Ravi

Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocab- ulary/corpora sizes. We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report for the first time—BLEU score results for a largescale MT task using only non-parallel data (EMEA corpus).

5 0.16586111 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

Author: Graham Neubig

Abstract: In this paper we describe Travatar, a forest-to-string machine translation (MT) engine based on tree transducers. It provides an open-source C++ implementation for the entire forest-to-string MT pipeline, including rule extraction, tuning, decoding, and evaluation. There are a number of options for model training, and tuning includes advanced options such as hypergraph MERT, and training of sparse features through online learning. The training pipeline is modeled after that of the popular Moses decoder, so users familiar with Moses should be able to get started quickly. We perform a validation experiment of the decoder on EnglishJapanese machine translation, and find that it is possible to achieve greater accuracy than translation using phrase-based and hierarchical-phrase-based translation. As auxiliary results, we also compare different syntactic parsers and alignment techniques that we tested in the process of developing the decoder. Travatar is available under the LGPL at http : / /phont ron . com/t ravat ar

6 0.15433933 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

7 0.14743499 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

8 0.13907175 136 acl-2013-Enhanced and Portable Dependency Projection Algorithms Using Interlinear Glossed Text

9 0.13130337 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

10 0.12972029 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation

11 0.12650184 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

12 0.12346756 314 acl-2013-Semantic Roles for String to Tree Machine Translation

13 0.120777 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

14 0.11632863 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

15 0.11410104 312 acl-2013-Semantic Parsing as Machine Translation

16 0.10904408 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

17 0.10738189 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network

18 0.10631884 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

19 0.10526869 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk

20 0.10266726 200 acl-2013-Integrating Phrase-based Reordering Features into a Chart-based Decoder for Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.28), (1, -0.176), (2, 0.144), (3, 0.093), (4, -0.07), (5, 0.042), (6, 0.05), (7, 0.006), (8, -0.051), (9, -0.016), (10, 0.029), (11, -0.036), (12, 0.045), (13, -0.052), (14, -0.026), (15, -0.112), (16, 0.083), (17, 0.075), (18, 0.012), (19, -0.039), (20, -0.01), (21, -0.015), (22, 0.055), (23, 0.071), (24, -0.032), (25, -0.061), (26, -0.033), (27, 0.007), (28, 0.075), (29, 0.039), (30, 0.071), (31, 0.024), (32, -0.008), (33, -0.033), (34, 0.023), (35, 0.09), (36, 0.039), (37, -0.036), (38, -0.036), (39, 0.016), (40, 0.001), (41, 0.0), (42, -0.059), (43, 0.013), (44, 0.068), (45, -0.011), (46, -0.1), (47, 0.06), (48, -0.034), (49, 0.13)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9441551 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

Author: Trevor Cohn ; Gholamreza Haffari

Abstract: Modern phrase-based machine translation systems make extensive use of wordbased translation models for inducing alignments from parallel corpora. This is problematic, as the systems are incapable of accurately modelling many translation phenomena that do not decompose into word-for-word translation. This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior. Overall this leads to a model which learns translations of entire sentences, while also learning their decomposition into smaller units (phrase-pairs) recursively, terminating at word translations. Our experiments on Arabic, Urdu and Farsi to English demonstrate improvements over competitive baseline systems.

2 0.76117611 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

Author: Shay B. Cohen ; Mark Johnson

Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.

3 0.73858041 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

Author: Wenduan Xu ; Yue Zhang ; Philip Williams ; Philipp Koehn

Abstract: We present a context-sensitive chart pruning method for CKY-style MT decoding. Source phrases that are unlikely to have aligned target constituents are identified using sequence labellers learned from the parallel corpus, and speed-up is obtained by pruning corresponding chart cells. The proposed method is easy to implement, orthogonal to cube pruning and additive to its pruning power. On a full-scale Englishto-German experiment with a string-totree model, we obtain a speed-up of more than 60% over a strong baseline, with no loss in BLEU.

4 0.72890657 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

Author: Conghui Zhu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao

Abstract: Typical statistical machine translation systems are batch trained with a given training data and their performances are largely influenced by the amount of data. With the growth of the available data across different domains, it is computationally demanding to perform batch training every time when new data comes. In face of the problem, we propose an efficient phrase table combination method. In particular, we train a Bayesian phrasal inversion transduction grammars for each domain separately. The learned phrase tables are hierarchically combined as if they are drawn from a hierarchical Pitman-Yor process. The performance measured by BLEU is at least as comparable to the traditional batch training method. Furthermore, each phrase table is trained separately in each domain, and while computational overhead is significantly reduced by training them in parallel.

5 0.72152972 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference

Author: Yang Feng ; Trevor Cohn

Abstract: Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phraseinternal translation and reordering. However phrase-based approaches are much less able to model sentence level effects between different phrase-pairs. We propose a new model to address this imbalance, based on a word-based Markov model of translation which generates target translations left-to-right. Our model encodes word and phrase level phenomena by conditioning translation decisions on previous decisions and uses a hierarchical Pitman-Yor Process prior to provide dynamic adaptive smoothing. This mechanism implicitly supports not only traditional phrase pairs, but also gapping phrases which are non-consecutive in the source. Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU.

6 0.71610695 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

7 0.71249151 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation

8 0.71247989 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

9 0.69699669 15 acl-2013-A Novel Graph-based Compact Representation of Word Alignment

10 0.66551131 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

11 0.65538949 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

12 0.65499675 312 acl-2013-Semantic Parsing as Machine Translation

13 0.65400422 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment

14 0.63925844 328 acl-2013-Stacking for Statistical Machine Translation

15 0.63851982 165 acl-2013-General binarization for parsing and translation

16 0.62993896 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

17 0.62932861 77 acl-2013-Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?

18 0.61400473 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

19 0.61236608 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

20 0.60093576 330 acl-2013-Stem Translation with Affix-Based Rule Selection for Agglutinative Languages


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.065), (6, 0.04), (9, 0.153), (11, 0.06), (24, 0.039), (26, 0.059), (28, 0.012), (35, 0.118), (42, 0.062), (48, 0.049), (70, 0.079), (77, 0.033), (88, 0.031), (90, 0.056), (95, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8910768 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization

Author: Lu Wang ; Claire Cardie

Abstract: We address the challenge of generating natural language abstractive summaries for spoken meetings in a domain-independent fashion. We apply Multiple-Sequence Alignment to induce abstract generation templates that can be used for different domains. An Overgenerateand-Rank strategy is utilized to produce and rank candidate abstracts. Experiments using in-domain and out-of-domain training on disparate corpora show that our system uniformly outperforms state-of-the-art supervised extract-based approaches. In addition, human judges rate our system summaries significantly higher than compared systems in fluency and overall quality.

same-paper 2 0.8712765 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

Author: Trevor Cohn ; Gholamreza Haffari

Abstract: Modern phrase-based machine translation systems make extensive use of wordbased translation models for inducing alignments from parallel corpora. This is problematic, as the systems are incapable of accurately modelling many translation phenomena that do not decompose into word-for-word translation. This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior. Overall this leads to a model which learns translations of entire sentences, while also learning their decomposition into smaller units (phrase-pairs) recursively, terminating at word translations. Our experiments on Arabic, Urdu and Farsi to English demonstrate improvements over competitive baseline systems.

3 0.78100991 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models

Author: Wen-tau Yih ; Ming-Wei Chang ; Christopher Meek ; Andrzej Pastusiak

Abstract: In this paper, we study the answer sentence selection problem for question answering. Unlike previous work, which primarily leverages syntactic analysis through dependency tree matching, we focus on improving the performance using models of lexical semantic resources. Experiments show that our systems can be consistently and significantly improved with rich lexical semantic information, regardless of the choice of learning algorithms. When evaluated on a benchmark dataset, the MAP and MRR scores are increased by 8 to 10 points, compared to one of our baseline systems using only surface-form matching. Moreover, our best system also outperforms pervious work that makes use of the dependency tree structure by a wide margin.

4 0.77329975 275 acl-2013-Parsing with Compositional Vector Grammars

Author: Richard Socher ; John Bauer ; Christopher D. Manning ; Ng Andrew Y.

Abstract: Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and implemented approximately as an efficient reranker it is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information such as PP attachments.

5 0.76988763 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction

Author: Wei Xu ; Raphael Hoffmann ; Le Zhao ; Ralph Grishman

Abstract: Distant supervision has attracted recent interest for training information extraction systems because it does not require any human annotation but rather employs existing knowledge bases to heuristically label a training corpus. However, previous work has failed to address the problem of false negative training examples mislabeled due to the incompleteness of knowledge bases. To tackle this problem, we propose a simple yet novel framework that combines a passage retrieval model using coarse features into a state-of-the-art relation extractor using multi-instance learning with fine features. We adapt the information retrieval technique of pseudo- relevance feedback to expand knowledge bases, assuming entity pairs in top-ranked passages are more likely to express a relation. Our proposed technique significantly improves the quality of distantly supervised relation extraction, boosting recall from 47.7% to 61.2% with a consistently high level of precision of around 93% in the experiments.

6 0.76946241 167 acl-2013-Generalizing Image Captions for Image-Text Parallel Corpus

7 0.76875114 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering

8 0.76805723 172 acl-2013-Graph-based Local Coherence Modeling

9 0.76782316 4 acl-2013-A Context Free TAG Variant

10 0.76666534 250 acl-2013-Models of Translation Competitions

11 0.76626849 329 acl-2013-Statistical Machine Translation Improves Question Retrieval in Community Question Answering via Matrix Factorization

12 0.76387036 249 acl-2013-Models of Semantic Representation with Visual Attributes

13 0.76351535 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions

14 0.76350105 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

15 0.76294714 224 acl-2013-Learning to Extract International Relations from Political Context

16 0.76095492 215 acl-2013-Large-scale Semantic Parsing via Schema Matching and Lexicon Extension

17 0.76049381 158 acl-2013-Feature-Based Selection of Dependency Paths in Ad Hoc Information Retrieval

18 0.76049286 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors

19 0.75906074 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

20 0.75902849 175 acl-2013-Grounded Language Learning from Video Described with Sentences