acl acl2013 acl2013-307 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sujith Ravi
Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report, for the first time, BLEU score results for a large-scale MT task using only non-parallel data (EMEA corpus).
Reference: text
sentIndex sentText sentNum sentScore
1 sravi@google.com Abstract In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. [sent-2, score-0.486]
2 Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. [sent-3, score-0.949]
3 In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. [sent-4, score-0.638]
4 The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. [sent-6, score-0.703]
5 The parallel corpora are used to estimate translation model parameters involving word-to-word translation tables, fertilities, distortion, phrase translations, syntactic transformations, etc. [sent-10, score-0.578]
6 Learning translation models from monolingual corpora could help address the challenges faced by modern-day MT systems, especially for low resource language pairs. [sent-13, score-0.405]
7 Recently, this topic has been receiving increasing attention from researchers and new methods have been proposed to train statistical machine translation models using only monolingual data in the source and target language. [sent-14, score-0.558]
8 However, none of these methods attempt to train end-to-end MT models; instead, they focus on mining bilingual lexicons from monolingual corpora, and often they require parallel seed lexicons as a starting point. [sent-18, score-0.306]
9 Unsupervised training methods have also been proposed in the past for related problems in decipherment (Knight and Yamada, 1999; Snyder et al. [sent-24, score-0.382]
10 The body of work that is more closely related to ours includes that of Ravi and Knight (2011b), who introduced a decipherment approach for training translation models using only monolingual corpora. [sent-26, score-0.74]
11 (2012) extend the former approach and improve training efficiency by pruning translation candidates prior to EM training with the help of context similarities computed from monolingual corpora. [sent-31, score-0.573]
12 In this work we propose a new Bayesian inference method for estimating translation models from scratch using only monolingual corpora. [sent-32, score-0.443]
13 Secondly, we introduce a new feature-based representation for sampling translation candidates that allows one to incorporate any amount of additional features (beyond simple bag-of-words) as side-information during decipherment training. [sent-33, score-0.885]
14 Finally, we also derive a new accelerated sampling mechanism using locality sensitive hashing inspired by recent work on fast, probabilistic inference for unsupervised clustering (Ahmed et al. [sent-34, score-0.589]
15 The new sampler allows us to perform fast, efficient inference with more complex translation models (than previously used) and scale better to large vocabulary and corpora sizes compared to existing methods as evidenced by our experimental results on two different corpora. [sent-36, score-0.61]
16 2 Decipherment Model for Machine Translation We now describe the decipherment problem formulation for machine translation. [sent-37, score-0.35]
17 fm) and a monolingual target language corpus, our goal is to decipher the source text and produce a target translation. [sent-43, score-0.444]
18 Contrary to standard machine translation training scenarios, here we have to estimate the translation model Pθ (f|e) parameters using only monolingual data. [sent-44, score-0.652]
19 During decipherment training, our objective is to estimate the model parameters in order to maximize the probability of the source text f as suggested by Ravi and Knight (2011b). [sent-45, score-0.492]
20 Translation Model: Machine translation is a much more complex task than solving other decipherment tasks such as word substitution ciphers (Ravi and Knight, 2011b; Dou and Knight, 2012). [sent-48, score-0.64]
21 But training becomes intractable with complex translation models, and scalability is also an issue when large corpora sizes are involved and the translation tables become too huge to fit in memory. [sent-54, score-0.599]
22 For each target word token ei (including NULLs), choose a source word translation fi, with probability Pθ (fi |ei). [sent-66, score-0.704]
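The generative story in the preceding sentence is compact enough to sketch directly. The following toy Python rendering is a hedged illustration, not the paper's code; the language-model and channel tables are made-up placeholders standing in for the learned distributions.

```python
import random

# Toy rendering of the generative story: sample a target sentence e from the
# language model, then emit each source word fi from the channel P(fi | ei).
# All tables below are illustrative placeholders, not learned values.
LM_SENTENCES = [(("the", "house"), 0.7), (("a", "house"), 0.3)]
CHANNEL = {"the": {"la": 0.9, "el": 0.1},
           "a": {"una": 1.0},
           "house": {"casa": 1.0}}

def sample_categorical(pairs):
    items, weights = zip(*pairs)
    return random.choices(items, weights=weights)[0]

def generate():
    e = sample_categorical(LM_SENTENCES)                        # target tokens e1..en
    f = [sample_categorical(CHANNEL[ei].items()) for ei in e]   # fi ~ P(fi | ei)
    return f, e
```

During decipherment only f is observed, so training must estimate Pθ(f|e) while marginalizing over all hidden target sentences e, which is what makes the problem hard.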
23 Instead, we propose a new Bayesian inference framework to estimate the translation model parameters. [sent-78, score-0.347]
24 In spite of using Bayesian inference, which is typically slow in practice (with standard Gibbs sampling), we show later that our method is scalable and permits decipherment training using more complex translation models (with several additional parameters). [sent-79, score-0.54]
25 3 Modeling Phrases: Finally, we extend the translation candidate set in Pθ (fi |ei) to model phrases in addition to words for the target side (i. [sent-98, score-0.32]
, ei can now be a word or a phrase (footnote 4) previously seen in the monolingual target corpus). [sent-100, score-0.451]
27 This greatly increases the training time since in each sampling step, we now have many more ei candidates to choose from. [sent-101, score-0.528]
28 In Section 4, we describe how we deal (continued in sentence 30 below) [footnote 1:] Each component in the translation model (word/phrase translations Pθ (fi |ei), fertility Pθfert , etc. [sent-102, score-0.442]
29 For short sentences, a sparse prior on fertility αfert typically discourages a target word from being aligned to too many different source words. [sent-109, score-0.4]
30 with this problem by using a fast, efficient sampler based on hashing that allows us to speed up the Bayesian inference significantly whereas standard Gibbs sampling would be extremely slow. [sent-111, score-0.58]
31 As the source and target vocabulary sizes increase, the size of the translation table (|Vf | · |Ve |) increases significantly and often becomes too huge to fit in memory. [sent-113, score-0.501]
32 Additionally, performing Bayesian inference with such a complex model using standard Gibbs sampling can be very slow in practice. [sent-114, score-0.402]
33 Here, we describe a new method for doing Bayesian inference by first introducing a feature-based representation for the source and target words (or phrases) from which we then derive a novel proposal distribution for sampling translation candidates. [sent-115, score-0.903]
34 Source Language: Words appearing in a source sentence f are represented using the corresponding target translation e = e1. [sent-129, score-0.417]
35 . . . en. We then extract all the context features of ej in the target translation sample sentence e and add these features (f−context, f+context, fscontext) with weights to the feature representation for fj. [sent-134, score-0.421]
36 Unlike the target word feature vectors (which can be pre-computed from the monolingual target corpus), the feature vector for every source word fj is dynamically constructed from the target translation sampled in each training iteration. [sent-135, score-0.934]
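A hedged sketch of this dynamic construction follows; the feature names are assumptions, since the exact feature templates and weights are not fully specified in the extracted text.

```python
from collections import Counter

# Rebuild the feature vector of source word f_j from the context of its
# currently sampled target translation e_j; this runs once per word per
# sampling iteration, unlike the precomputed target-word vectors.
def source_feature_vector(j, target_sample):
    feats = Counter()
    feats[("word", target_sample[j])] += 1.0              # the sampled translation
    if j > 0:
        feats[("-context", target_sample[j - 1])] += 1.0  # left-context feature
    if j + 1 < len(target_sample):
        feats[("+context", target_sample[j + 1])] += 1.0  # right-context feature
    return feats
```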
37 In the next section, we will describe how we mitigate this problem by projecting into a low-dimensional space by computing hash signatures. [sent-140, score-0.309]
38 We note that the new sampling framework is easily extensible to many additional feature types (for example, monolingual topic model features, etc. [sent-142, score-0.385]
39 ) which can be efficiently handled by our inference algorithm and could further improve translation performance but we leave this for future work. [sent-143, score-0.302]
40 4 Bayesian MT Decipherment via Hash Sampling The next step is to use the feature representations described earlier and iteratively sample a target word (or phrase) translation candidate ei for every word fi in the source text f. [sent-144, score-1.004]
41 One possible strategy is to compute similarity scores s(wfi , we′) between the current source word feature vector wfi and feature vectors we′ ∈ Ve for all possible candidates in the target vocabulary. [sent-146, score-0.39]
42 Following this, we can prune the translation candidate set by keeping only the top candidates e∗ according to the similarity scores. [sent-147, score-0.291]
43 Secondly, for Bayesian inference we need to sample from a distribution that involves computing probabilities for all the components (language model, translation model, fertility, etc. [sent-157, score-0.443]
44 This distribution needs to be computed for every source word token fi in the corpus, for all possible candidates ei ∈ Ve and the process has to be repeated for multiple sampling iterations (typically more than 1000). [sent-159, score-1.012]
45 Doing standard collapsed Gibbs sampling in this scenario would be very slow and intractable. [sent-160, score-0.303]
46 We now present an alternative fast, efficient inference strategy that overcomes many of the challenges described above and helps accelerate the sampling process significantly. [sent-161, score-0.374]
47 First, we set our translation models within the context of a more generic and widely known family of distributions—mixtures of exponential families. [sent-162, score-0.323]
48 Then we derive a novel proposal distribution for sampling translation candidates and introduce a new sampler for decipherment training that 365 is based on locality sensitive hashing (LSH). [sent-163, score-1.409]
49 Mixtures of Exponential Families: The translation models described earlier (Section 2) can be represented as mixtures of exponential families, specifically mixtures of multinomials. [sent-171, score-0.413]
50 Note that the (translation) model in our case consists of multiple exponential family components—a multinomial pertaining to the language model (which remains fixed; footnote 5), and other components pertaining to translation probabilities Pθ(fi |ei), fertility Pθfert , etc. [sent-183, score-0.506]
51 For a given source word token fi, draw target [footnote 5: A high value for the LM concentration parameter α ensures that the LM probabilities do not deviate too far from the original fixed base distribution during sampling.] [sent-185, score-0.576]
52 translation ei ∼ p(ei | F, E−i) ∝ p(e) · p(fi | ei, F−i, E−i) · pfert(· | ei, F−i, E−i) · . . . [sent-186, score-0.395]
53 (5) where F is the full source text and E the full target translation generated during sampling. [sent-189, score-0.417]
54 Update the sufficient statistics for the changed target translation assignments. [sent-191, score-0.32]
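The sampling step in Equation 5 together with the statistics update can be sketched as follows. This is a reduced, hedged illustration assuming a collapsed Dirichlet-multinomial translation component; fertility and other factors would enter the score as extra multiplicative terms, and the constants are assumptions.

```python
import numpy as np
from collections import defaultdict

# Reduced sketch of one sampling step for Equation 5 with assumed constants.
ALPHA, BASE = 0.5, 1e-4
tf_counts = defaultdict(int)   # counts of (e, f) translation assignments
e_counts = defaultdict(int)    # counts of target tokens e

def p_f_given_e(f, e):
    # Collapsed estimate of P(f | e) from the current sufficient statistics.
    return (tf_counts[(e, f)] + ALPHA * BASE) / (e_counts[e] + ALPHA)

def gibbs_step(f_i, e_old, candidates, lm_prob, rng):
    if e_old is not None:              # remove the current assignment first
        tf_counts[(e_old, f_i)] -= 1
        e_counts[e_old] -= 1
    # The expensive part: score every candidate (the hash sampler avoids this).
    scores = np.array([lm_prob(e) * p_f_given_e(f_i, e) for e in candidates])
    e_new = candidates[rng.choice(len(candidates), p=scores / scores.sum())]
    tf_counts[(e_new, f_i)] += 1       # update sufficient statistics
    e_counts[e_new] += 1
    return e_new
```

The per-token loop over all candidates in Ve is exactly the cost that makes this standard scheme intractable, which motivates the hash sampler described next.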
55 So, during decipherment training a standard collapsed Gibbs sampler will waste most of its time on expensive computations that will be discarded in the end anyway. [sent-197, score-0.545]
56 Instead, we can accelerate the computation of the inner product ⟨φ(fi), θe′⟩ using a hash sampling strategy similar to (Ahmed et al. [sent-201, score-0.639]
57 The underlying idea here is to use binary hashing (Charikar, 2002) to explore only those candidates e′ that are sufficiently close to the best matching translation via a proposal distribution. [sent-203, score-0.518]
58 Next, we briefly introduce some notations and existing theoretical results related to binary hashing before describing the hash sampling procedure. [sent-204, score-0.671]
59 Let hl(v) ∈ {0, 1}^l be an l-bit binary hash of v where: [hl(v)]_i := sgn⟨v, wi⟩ ; wi ∼ U^m. [sent-206, score-0.309]
60 Then the probability of matching signs is given by: zl(u, v) := (1/l) ‖hl(u) − hl(v)‖1 (9) So, zl(u, v) measures the fraction of bits that differ between the hash vectors h(u) and h(v) associated with u, v. [sent-207, score-0.435]
61 The binary hash representation for the two vectors yields significant speedups during sampling since Hamming distance computation between h(u) and h(v) is highly optimized on modern CPUs. [sent-209, score-0.644]
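A minimal sketch of the binary hashing just described (sign random projections in the style of Charikar, 2002); the helper names and the 64-bit signature length are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hyperplanes(dim, l=64):
    # One random hyperplane per signature bit.
    return rng.standard_normal((l, dim))

def signature(v, W):
    # l-bit binary hash: bit i is 1 iff <v, w_i> > 0, packed into bytes.
    return np.packbits(W @ v > 0)

def hamming(a, b):
    # Differing bits between two packed signatures (XOR then popcount).
    return int(np.unpackbits(a ^ b).sum())

def approx_inner(u_norm, v_norm, sig_u, sig_v, l=64):
    # The fraction of differing bits z estimates angle(u, v) / pi, so the
    # inner product is approximately ||u|| ||v|| cos(pi * z).
    z = hamming(sig_u, sig_v) / l
    return u_norm * v_norm * np.cos(np.pi * z)
```

Signatures for all target candidates can be computed once up front; each sampling step then needs only cheap Hamming distances instead of full inner products.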
62 Updating the hash signatures: During training, we compute the target candidate projection h(θe′) and the corresponding norm only once (footnote 7), which is different from the setup of Ahmed et al. [sent-211, score-0.412]
63 The source word projection φ(fi) is dynamically updated in every sampling step. [sent-213, score-0.404]
64 Note that doing this naïvely would scale slowly as O(Dl) where D is the total number of features, but instead we can update the hash signatures in a more efficient manner that scales as O(Di>0 l) where Di>0 is the number of non-zero entries in the feature representation for the source word φ(fi). [sent-214, score-0.473]
65 Also, we do not need to store the random vectors w in practice since these can be computed on the fly using hash functions. [sent-215, score-0.355]
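The two implementation details in the last two sentences can be sketched together. The seeded-generator construction below is an assumption about how "computing w on the fly using hash functions" might look; it is not the paper's implementation.

```python
import numpy as np

def w_entry(bit, feat_id):
    # Deterministic pseudo-random entry of hyperplane `bit` at feature
    # dimension `feat_id`; nothing is stored, everything is re-derived.
    seed = hash((bit, feat_id)) & 0xFFFFFFFF
    return np.random.default_rng(seed).standard_normal()

def update_projection(proj, delta, l=64):
    # Incremental O(D_{i>0} * l) update: `delta` maps only the changed
    # feature ids to (new_weight - old_weight); untouched dimensions cost nothing.
    for f, d in delta.items():
        for b in range(l):
            proj[b] += d * w_entry(b, f)
    return proj
# The bit signature is re-derived as np.packbits(proj > 0) after each update.
```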
66 The inner product approximation also yields some theoretical guarantees for the hash sampler. [sent-216, score-0.35]
67 [Footnote 7:] In practice, we can ignore the norm terms to further speed up sampling since this is only an estimate for the proposal distribution and we follow this with the Metropolis Hastings step. [sent-218, score-0.446]
68 4.1 Metropolis Hastings: In each sampling step, we use the distribution from Equation 10 as a proposal distribution in a Metropolis Hastings scheme to sample target translations for each source word. [sent-222, score-0.836]
69 Initialization: We initialize the starting sample as follows: for each source word token, randomly sample a target word. [sent-225, score-0.359]
70 If the source word also exists in the target vocabulary, then choose the identity translation instead of the random one. [sent-226, score-0.504]
71 Hash Sampling Steps: For each source word token fi, run the hash sampler: (a) Generate a proposal distribution by computing the Hamming distance between the feature vectors for the source word and each target translation candidate. [sent-228, score-1.167]
72 Sample a new target translation ei for fi from this distribution. [sent-229, score-0.718]
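A hedged sketch of one such Metropolis-Hastings step is given below. The exp(-dist/l) transform is an illustrative stand-in for the paper's proposal in Equation 10, `model_prob` stands for the exact score of Equation 5, and the current translation e_cur is assumed to be in the candidate list.

```python
import numpy as np

def bits_differ(a, b):
    return int(np.unpackbits(a ^ b).sum())

def mh_step(f_i, e_cur, candidates, sig_f, sig_e, model_prob, l=64,
            rng=np.random.default_rng(1)):
    # Cheap proposal from Hamming distances between the source-word signature
    # and each candidate's signature; closer signatures get more proposal mass.
    dists = np.array([bits_differ(sig_f, sig_e[e]) for e in candidates])
    q = np.exp(-dists / l)
    q /= q.sum()
    idx = rng.choice(len(candidates), p=q)
    e_new = candidates[idx]
    # Exact model probabilities (Equation 5) are computed only for the current
    # and proposed translations; accept with the standard MH ratio.
    q_cur = q[candidates.index(e_cur)]
    ratio = (model_prob(f_i, e_new) * q_cur) / (model_prob(f_i, e_cur) * q[idx])
    return e_new if rng.random() < min(1.0, ratio) else e_cur
```

The design point is that the expensive model score is evaluated for two translations per step instead of all of Ve, while the acceptance test keeps the sampler targeting the true posterior.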
73 in the source sentence and swap the translations ei with ej. [sent-239, score-0.398]
74 During the sampling process, we compute the probabilities for the two samples—the original and the swapped versions, and then sample an alignment from this distribution. [sent-240, score-0.337]
75 (b) Deletion: For each source word token, delete the current target translation (i. [sent-241, score-0.446]
76 , after all sampling iterations) we choose the final sample as our target translation output for the source text. [sent-248, score-0.726]
77 Here the vocabulary sizes are much larger and we show how our new Bayesian decipherment method scales well to this task in spite of using complex translation models. [sent-256, score-0.695]
78 We use the entire Spanish source text for decipherment training and evaluate the final English output to report BLEU scores. [sent-264, score-0.479]
79 We reserve the first 1k sentences in French as our source text (also used in decipherment training). [sent-266, score-0.447]
80 The latter is used to construct a target language model used for decipherment training. [sent-269, score-0.453]
81 The last two rows display results for the new method using Bayesian hash sampling. [sent-277, score-0.309]
82 Overall, using a 3-gram language model (instead of 2-gram) for decipherment training improves the performance for all methods. [sent-278, score-0.382]
83 It is also interesting to note that the hash sampling method yields much better results than the Bayesian inference method presented in (Ravi and Knight, 2011b). [sent-281, score-0.638]
84 This is due to the accelerated sampling scheme introduced earlier which helps it converge to better solutions faster. [sent-282, score-0.341]
85 Table 3: MT results on the French/Spanish EMEA corpus using the new hash sampling method. [sent-307, score-0.553]
86 ∗The last row displays results when we sample target translations from a pruned candidate set (most frequent 1k Spanish words + identity translation candidates) which enables the sampler to run much faster when using more complex models. [sent-308, score-0.775]
87 (2012) reported obtaining a speedup by pruning translation candidates (to ∼1/8th the original size) prior to EM training. [sent-312, score-0.323]
88 The table also demonstrates the significant speedup achieved by the hash sampler over a standard Gibbs sampler for the same model (∼85 times faster when using a 2-gram LM). [sent-314, score-0.647]
89 In spite of this challenge and the model complexity, we can still perform decipherment training using Bayesian inference. [sent-322, score-0.382]
90 While their work also uses Bayesian inference with a slice sampling scheme, our new approach uses a novel hash sampling scheme for decipherment that can easily scale to more complex models. [sent-339, score-1.312]
91 The new decipherment framework also allows one to easily incorporate additional information (besides standard word translations) as features (e. [sent-340, score-0.379]
92 ) for unsupervised machine translation which can help further improve the performance in addition to accelerating the sampling process. [sent-343, score-0.49]
93 We already demonstrated the utility of this system by going beyond words and incorporating phrase translations in a decipherment model for the first time. [sent-344, score-0.436]
94 In the future, we can obtain further speedups (especially for large-scale tasks) by parallelizing the sampling scheme seamlessly across multiple machines and CPU cores. [sent-345, score-0.325]
95 The new framework can also be stacked with complementary techniques such as slice sampling, blocked (and type) sampling to further improve inference efficiency. [sent-346, score-0.329]
96 catalogId=LDC2003T05 8 Conclusion: To summarize, our method is significantly faster than previous methods based on EM or Bayesian inference with standard Gibbs sampling, and obtains better results than any previously published methods for the same task. [sent-352, score-0.284]
97 The new framework also allows performing Bayesian inference for decipherment applications with more complex models than previously shown. [sent-353, score-0.479]
98 We believe this framework will be useful for further extending MT models in the future to improve translation performance and for many other unsupervised decipherment application scenarios. [sent-354, score-0.596]
99 Estimating word translation probabilities from unrelated monolingual corpora using the EM algorithm. [sent-404, score-0.554]
100 Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering. [sent-432, score-0.393]
wordName wordTfidf (topN-words)
[('decipherment', 0.35), ('hash', 0.309), ('sampling', 0.244), ('nuhn', 0.238), ('fi', 0.22), ('translation', 0.217), ('ravi', 0.184), ('ei', 0.178), ('fert', 0.145), ('monolingual', 0.141), ('fertility', 0.139), ('emea', 0.135), ('bayesian', 0.134), ('sampler', 0.133), ('knight', 0.126), ('opus', 0.121), ('hashing', 0.118), ('proposal', 0.109), ('target', 0.103), ('source', 0.097), ('em', 0.092), ('mt', 0.088), ('bleu', 0.088), ('translations', 0.086), ('inference', 0.085), ('ahmed', 0.084), ('metropolis', 0.083), ('candidates', 0.074), ('hastings', 0.073), ('lsh', 0.073), ('ve', 0.071), ('gibbs', 0.071), ('exponential', 0.07), ('sample', 0.065), ('lm', 0.062), ('identity', 0.058), ('dou', 0.058), ('tiedemann', 0.056), ('locality', 0.054), ('sujith', 0.054), ('families', 0.052), ('parallel', 0.052), ('token', 0.051), ('sgn', 0.051), ('zl', 0.051), ('distribution', 0.048), ('mixtures', 0.047), ('corpora', 0.047), ('vectors', 0.046), ('accelerate', 0.045), ('deciphering', 0.045), ('speedups', 0.045), ('estimate', 0.045), ('foreign', 0.044), ('complex', 0.044), ('vocabulary', 0.042), ('sizes', 0.042), ('vf', 0.042), ('fscontext', 0.041), ('wfi', 0.041), ('efficiency', 0.041), ('inner', 0.041), ('faster', 0.04), ('klementiev', 0.039), ('scenarios', 0.039), ('signatures', 0.038), ('moses', 0.037), ('swap', 0.037), ('iterations', 0.037), ('speeding', 0.037), ('haghighi', 0.037), ('scheme', 0.036), ('context', 0.036), ('cpu', 0.035), ('kevin', 0.035), ('koehn', 0.035), ('every', 0.034), ('null', 0.033), ('lexicons', 0.033), ('earlier', 0.032), ('training', 0.032), ('hamming', 0.032), ('vely', 0.032), ('speedup', 0.032), ('discourages', 0.032), ('position', 0.032), ('sensitive', 0.03), ('collapsed', 0.03), ('word', 0.029), ('slow', 0.029), ('displays', 0.029), ('hv', 0.029), ('crp', 0.029), ('accelerated', 0.029), ('bits', 0.029), ('unsupervised', 0.029), ('probabilities', 0.028), ('acceptance', 0.028), ('ravichandran', 0.028), ('dempster', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
Author: Sujith Ravi
Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report, for the first time, BLEU score results for a large-scale MT task using only non-parallel data (EMEA corpus).
2 0.30118826 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers
Author: Malte Nuhn ; Hermann Ney
Abstract: In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard. We show that in this case the decipherment problem is equivalent to the quadratic assignment problem (QAP). To the best of our knowledge, this connection between the QAP and the decipherment problem has not been known in the literature before.
3 0.28766418 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
Author: Jiajun Zhang ; Chengqing Zong
Abstract: Currently, almost all of the statistical machine translation (SMT) models are trained with the parallel corpora in some specific domains. However, when it comes to a language pair or a different domain without any bilingual resources, the traditional SMT loses its power. Recently, some research works study the unsupervised SMT for inducing a simple word-based translation model from the monolingual corpora. It successfully bypasses the constraint of bitext for SMT and obtains a relatively promising result. In this paper, we take a step forward and propose a simple but effective method to induce a phrase-based model from the monolingual corpora given an automatically-induced translation lexicon or a manually-edited translation dictionary. We apply our method for the domain adaptation task and the extensive experiments show that our proposed method can substantially improve the translation quality. 1
4 0.22821806 66 acl-2013-Beam Search for Solving Substitution Ciphers
Author: Malte Nuhn ; Julian Schamper ; Hermann Ney
Abstract: In this paper we address the problem of solving substitution ciphers using a beam search approach. We present a conceptually consistent and easy to implement method that improves the current state of the art for decipherment of substitution ciphers and is able to use high order n-gram language models. We show experiments with 1:1 substitution ciphers in which the guaranteed optimal solution for 3-gram language models has 38.6% decipherment error, while our approach achieves 4.13% decipherment error in a fraction of time by using a 6-gram language model. We also apply our approach to the famous Zodiac-408 cipher and obtain slightly bet- ter (and near to optimal) results than previously published. Unlike the previous state-of-the-art approach that uses additional word lists to evaluate possible decipherments, our approach only uses a letterbased 6-gram language model. Furthermore we use our algorithm to solve large vocabulary substitution ciphers and improve the best published decipherment error rate based on the Gigaword corpus of 7.8% to 6.0% error rate.
5 0.18560202 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
Author: Yang Feng ; Trevor Cohn
Abstract: Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phraseinternal translation and reordering. However phrase-based approaches are much less able to model sentence level effects between different phrase-pairs. We propose a new model to address this imbalance, based on a word-based Markov model of translation which generates target translations left-to-right. Our model encodes word and phrase level phenomena by conditioning translation decisions on previous decisions and uses a hierarchical Pitman-Yor Process prior to provide dynamic adaptive smoothing. This mechanism implicitly supports not only traditional phrase pairs, but also gapping phrases which are non-consecutive in the source. Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU.
6 0.18082234 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
7 0.17584033 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model
8 0.16663851 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
9 0.15936449 108 acl-2013-Decipherment
10 0.15837462 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
11 0.1464013 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
12 0.1430428 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
13 0.13546291 255 acl-2013-Name-aware Machine Translation
14 0.13395981 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
15 0.12686531 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
16 0.12667957 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain
17 0.11335694 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
18 0.11000935 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
19 0.10901002 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation
20 0.10897954 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
topicId topicWeight
[(0, 0.28), (1, -0.152), (2, 0.23), (3, 0.095), (4, 0.0), (5, -0.031), (6, 0.0), (7, 0.024), (8, -0.109), (9, -0.053), (10, 0.025), (11, -0.124), (12, -0.036), (13, -0.169), (14, 0.022), (15, -0.316), (16, -0.072), (17, -0.084), (18, -0.071), (19, 0.149), (20, 0.028), (21, 0.009), (22, -0.092), (23, -0.027), (24, -0.029), (25, -0.001), (26, 0.029), (27, 0.025), (28, -0.013), (29, 0.023), (30, 0.039), (31, -0.011), (32, 0.054), (33, 0.052), (34, -0.005), (35, 0.035), (36, 0.004), (37, -0.091), (38, 0.03), (39, -0.046), (40, -0.002), (41, -0.014), (42, -0.015), (43, -0.038), (44, 0.041), (45, -0.012), (46, -0.034), (47, 0.05), (48, -0.028), (49, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.88900542 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
Author: Sujith Ravi
Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocab- ulary/corpora sizes. We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report for the first time—BLEU score results for a largescale MT task using only non-parallel data (EMEA corpus).
2 0.82065248 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers
Author: Malte Nuhn ; Hermann Ney
Abstract: In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard. We show that in this case the decipherment problem is equivalent to the quadratic assignment problem (QAP). To the best of our knowledge, this connection between the QAP and the decipherment problem has not been known in the literature before.
3 0.76059002 108 acl-2013-Decipherment
Author: Kevin Knight
Abstract: The first natural language processing systems had a straightforward goal: decipher coded messages sent by the enemy. This tutorial explores connections between early decipherment research and today’s NLP work. We cover classic military and diplomatic ciphers, automatic decipherment algorithms, unsolved ciphers, language translation as decipherment, and analyzing ancient writing as decipherment. 1 Tutorial Overview The first natural language processing systems had a straightforward goal: decipher coded messages sent by the enemy. Sixty years later, we have many more applications, including web search, question answering, summarization, speech recognition, and language translation. This tutorial explores connections between early decipherment research and today’s NLP work. We find that many ideas from the earlier era have become core to the field, while others still remain to be picked up and developed. We first cover classic military and diplomatic cipher types, including complex substitution ciphers implemented in the first electro-mechanical encryption machines. We look at mathematical tools (language recognition, frequency counting, smoothing) developed to decrypt such ciphers on proto-computers. We show algorithms and extensive empirical results for solving different types of ciphers, and we show the role of algorithms in recent decipherments of historical documents. We then look at how foreign language can be viewed as a code for English, a concept developed by Alan Turing and Warren Weaver. We describe recently published work on building automatic translation systems from non-parallel data. We also demonstrate how some of the same algorithmic tools can be applied to natural language tasks like part-of-speech tagging and word alignment. Turning back to historical ciphers, we explore a number of unsolved ciphers, giving results of initial computer experiments on several of them. Finally, we look briefly at writing as a way to encipher phoneme sequences, covering ancient scripts and modern applications. 2 Outline 1. Classical military/diplomatic ciphers (15 minutes) • 60 cipher types (ACA) • Ciphers vs. codes • Enigma cipher: the mother of natural language processing (computer analysis of text, language recognition, Good-Turing smoothing) 2. Foreign language as a code (10 minutes) • Alan Turing’s ”Thinking Machines” • Warren Weaver’s Memorandum 3. Automatic decipherment (55 minutes) • Cipher type detection • Substitution ciphers (simple, homophonic, polyalphabetic, etc): ∗ plaintext language recognition ∗ how much plaintext knowledge is needed ∗ index of coincidence, unicity distance, and other measures • Navigating a difficult search space: ∗ frequencies of letters and words ∗ pattern words and cribs ∗ EM, ILP, Bayesian models, sampling • Recent decipherments: ∗ Jefferson cipher, Copiale cipher, civil war ciphers, naval Enigma • Application to part-of-speech tagging, word alignment • Application to machine translation without parallel text • Parallel development of cryptography and translation • Recently released NSA internal newsletter (1974-1997) 4. *** Break *** (30 minutes) 5.
Unsolved ciphers (40 minutes) • Zodiac 340 (1969), including computational work • Voynich Manuscript (early 1400s), including computational work • Beale (1885) • Dorabella (1897) • Taman Shud (1948) • Kryptos (1990), including computational work • McCormick (1999) • Shoeboxes in attics: DuPonceau journal, Finnerana, SYP, Mopse, diptych 6. Writing as a code (20 minutes) • Does writing encode ideas, or does it encode phonemes? • Ancient script decipherment: Egyptian hieroglyphs, Linear B, Mayan glyphs, Ugaritic (including computational work), Chinese Nüshu (including computational work) • Automatic phonetic decipherment • Application to transliteration 7. Undeciphered writing systems (15 minutes) • Indus Valley Script (3300BC) • Linear A (1900BC) • Phaistos disc (1700BC?) • Rongorongo (1800s?) 8. Conclusion and further questions (15 minutes) 3 About the Presenter Kevin Knight is a Senior Research Scientist and Fellow at the Information Sciences Institute of the University of Southern California (USC), and a Research Professor in USC’s Computer Science Department. He received a PhD in computer science from Carnegie Mellon University and a bachelor’s degree from Harvard University. Professor Knight’s research interests include natural language processing, machine translation, automata theory, and decipherment. In 2001, he co-founded Language Weaver, Inc., and in 2011, he served as President of the Association for Computational Linguistics. Dr. Knight has taught computer science courses at USC for more than fifteen years and co-authored the widely adopted textbook Artificial Intelligence.
4 0.74678779 66 acl-2013-Beam Search for Solving Substitution Ciphers
Author: Malte Nuhn ; Julian Schamper ; Hermann Ney
Abstract: In this paper we address the problem of solving substitution ciphers using a beam search approach. We present a conceptually consistent and easy to implement method that improves the current state of the art for decipherment of substitution ciphers and is able to use high order n-gram language models. We show experiments with 1:1 substitution ciphers in which the guaranteed optimal solution for 3-gram language models has 38.6% decipherment error, while our approach achieves 4.13% decipherment error in a fraction of time by using a 6-gram language model. We also apply our approach to the famous Zodiac-408 cipher and obtain slightly bet- ter (and near to optimal) results than previously published. Unlike the previous state-of-the-art approach that uses additional word lists to evaluate possible decipherments, our approach only uses a letterbased 6-gram language model. Furthermore we use our algorithm to solve large vocabulary substitution ciphers and improve the best published decipherment error rate based on the Gigaword corpus of 7.8% to 6.0% error rate.
5 0.64669877 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
Author: Yang Feng ; Trevor Cohn
Abstract: Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phraseinternal translation and reordering. However phrase-based approaches are much less able to model sentence level effects between different phrase-pairs. We propose a new model to address this imbalance, based on a word-based Markov model of translation which generates target translations left-to-right. Our model encodes word and phrase level phenomena by conditioning translation decisions on previous decisions and uses a hierarchical Pitman-Yor Process prior to provide dynamic adaptive smoothing. This mechanism implicitly supports not only traditional phrase pairs, but also gapping phrases which are non-consecutive in the source. Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU.
6 0.6416229 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
7 0.62514102 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
8 0.57028681 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
9 0.56701654 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
10 0.56674463 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
11 0.55899596 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
12 0.55368966 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
13 0.54362369 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
14 0.53933305 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
15 0.5374034 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
16 0.53693855 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model
17 0.53589088 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment
18 0.53333002 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
19 0.53315139 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain
20 0.52310449 255 acl-2013-Name-aware Machine Translation
topicId topicWeight
[(0, 0.39), (6, 0.044), (11, 0.045), (24, 0.033), (26, 0.038), (35, 0.069), (42, 0.057), (48, 0.034), (70, 0.038), (88, 0.024), (90, 0.031), (95, 0.122)]
simIndex simValue paperId paperTitle
1 0.99057305 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity
Author: Daniel Bar ; Torsten Zesch ; Iryna Gurevych
Abstract: We present DKPro Similarity, an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. DKPro Similarity comprises a wide variety of measures ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and phonetic measures. In order to promote the reproducibility of experimental results and to provide reliable, permanent experimental conditions for future studies, DKPro Similarity additionally comes with a set of full-featured experimental setups which can be run out-of-the-box and be used for future systems to built upon.
Author: Georgios Kontonatsios ; Paul Thompson ; Riza Theresa Batista-Navarro ; Claudiu Mihaila ; Ioannis Korkontzelos ; Sophia Ananiadou
Abstract: U-Compare is a UIMA-based workflow construction platform for building natural language processing (NLP) applications from heterogeneous language resources (LRs), without the need for programming skills. U-Compare has been adopted within the context of the METANET Network of Excellence, and over 40 LRs that process 15 European languages have been added to the U-Compare component library. In line with METANET’s aims of increasing communication between citizens of different European countries, U-Compare has been extended to facilitate the development of a wider range of applications, including both multilingual and multimodal workflows. The enhancements exploit the UIMA Subject of Analysis (Sofa) mechanism, which allows different facets of the input data to be represented. We demonstrate how our customised extensions to U-Compare allow the construction and testing of NLP applications that transform the input data in different ways, e.g., machine translation, automatic summarisation and text-to-speech.
3 0.98421097 269 acl-2013-PLIS: a Probabilistic Lexical Inference System
Author: Eyal Shnarch ; Erel Segal-haLevi ; Jacob Goldberger ; Ido Dagan
Abstract: This paper presents PLIS, an open source Probabilistic Lexical Inference System which combines two functionalities: (i) a tool for integrating lexical inference knowledge from diverse resources, and (ii) a framework for scoring textual inferences based on the integrated knowledge. We provide PLIS with two probabilistic implementation of this framework. PLIS is available for download and developers of text processing applications can use it as an off-the-shelf component for injecting lexical knowledge into their applications. PLIS is easily configurable, components can be extended or replaced with user generated ones to enable system customization and further research. PLIS includes an online interactive viewer, which is a powerful tool for investigating lexical inference processes. 1 Introduction and background Semantic Inference is the process by which machines perform reasoning over natural language texts. A semantic inference system is expected to be able to infer the meaning of one text from the meaning of another, identify parts of texts which convey a target meaning, and manipulate text units in order to deduce new meanings. Semantic inference is needed for many Natural Language Processing (NLP) applications. For instance, a Question Answering (QA) system may encounter the following question and candidate answer (Example 1): Q: which explorer discovered the New World? A: Christopher Columbus revealed America. As there are no overlapping words between the two sentences, to identify that A holds an answer for Q, background world knowledge is needed to link Christopher Columbus with explorer and America with New World. Linguistic knowledge is also needed to identify that reveal and discover refer to the same concept. Knowledge is needed in order to bridge the gap between text fragments, which may be dissimilar on their surface form but share a common meaning. For the purpose of semantic inference, such knowledge can be derived from various resources (e.g. WordNet (Fellbaum, 1998) and others, detailed in Section 2.1) in a form which we denote as inference links (often called inference/entailment rules), each is an ordered pair of elements in which the first implies the meaning of the second. For instance, the link ship→vessel can be derived from tshtaen hypernym rkel sahtiiopn→ ovfe Wsseolr cdNanet b. Other applications can benefit from utilizing inference links to identify similarity between language expressions. In Information Retrieval, the user’s information need may be expressed in relevant documents differently than it is expressed in the query. Summarization systems should identify text snippets which convey the same meaning. Our work addresses a generic, application in- dependent, setting of lexical inference. We therefore adopt the terminology of Textual Entailment (Dagan et al., 2006), a generic paradigm for applied semantic inference which captures inference needs of many NLP applications in a common underlying task: given two textual fragments, termed hypothesis (H) and text (T), the task is to recognize whether T implies the meaning of H, denoted T→H. For instance, in a QA application, H reprTe→seHnts. Fthoer question, a innd a T Q a c aanpdpilidcaattei answer. pInthis setting, T is likely to hold an answer for the question if it entails the question. It is challenging to properly extract the needed inference knowledge from available resources, and to effectively utilize it within the inference process. 
The integration of resources, each of which has its own format, is technically complex, and the quality of the resulting inference links is often unknown in advance and varies considerably. [Figure 1: PLIS schema - a text-hypothesis pair is processed by the Lexical Integrator which uses a set of lexical resources to extract inference chains which connect the two. The Lexical Inference component provides probability estimations for the validity of each level of the process.] For coping with this challenge we developed PLIS, a Probabilistic Lexical Inference System [footnote 1: The complete software package is available at http://www.cs.biu.ac.il/nlp/downloads/PLIS.html and an online interactive viewer is available for examination at http://irsrv2.cs.biu.ac.il/nlp-net/PLIS.html]. PLIS, illustrated in Fig 1, has two main modules: the Lexical Integrator (Section 2) accepts a set of lexical resources and a text-hypothesis pair, and finds all the lexical inference relations between any pair of text term ti and hypothesis term hj, based on the available lexical relations found in the resources (and their combination). The Lexical Inference module (Section 3) provides validity scores for these relations. These term-level scores are used to estimate the sentence-level likelihood that the meaning of the hypothesis can be inferred from the text, thus making PLIS a complete lexical inference system. Lexical inference systems do not look into the structure of texts but rather consider them as bags of terms (words or multi-word expressions). These systems are easy to implement, fast to run, and practical across different genres and languages, while maintaining a competitive level of performance. PLIS can be used as a stand-alone efficient inference system or as the lexical component of any NLP application. PLIS is a flexible system, allowing users to choose the set of knowledge resources as well as the model by which inference is done. PLIS can be easily extended with new knowledge resources and new inference models. It comes with a set of ready-to-use plug-ins for many common lexical resources (Section 2.1) as well as two implementations of the scoring framework. These implementations, described in (Shnarch et al., 2011; Shnarch et al., 2012), provide probability estimations for inference. PLIS has an interactive online viewer (Section 4) which provides a visualization of the entire inference process, and is very helpful for analysing lexical inference models and lexical resources usability. 2 Lexical integrator The input for the lexical integrator is a set of lexical resources and a pair of text T and hypothesis H. The lexical integrator extracts lexical inference links from the various lexical resources to connect each text term ti ∈ T with each hypothesis term hj ∈ H [footnote 2]. A lexical inference link indicates a semantic relation between two terms. It could be a directional relation (Columbus→navigator) or a bidirectional one (car ←→ automobile). Since knowledge resources vary in their representation methods, the lexical integrator wraps each lexical resource in a common plug-in interface which encapsulates the resource’s inner representation method and exposes its knowledge as a list of inference links. The implemented plug-ins that come with PLIS are described in Section 2.1.
Adding a new lexical resource and integrating it with the others only demands the implementation of the plug-in interface. As the knowledge needed to connect a pair of terms, ti and hj, may be scattered across a few resources, the lexical integrator combines inference links into lexical inference chains to deduce new pieces of knowledge, such as Columbus →(resource 2)→ navigator →(resource 1)→ explorer. Therefore, the only assumption the lexical integrator makes, regarding its input lexical resources, is that the inferential lexical relations they provide are transitive. The lexical integrator generates lexical inference chains by expanding the text and hypothesis terms with inference links. These links lead to new terms (e.g. navigator in the above chain example and t0 in Fig 1) which can be further expanded, as all inference links are transitive. A transitivity [footnote 2: Where i and j run from 1 to the length of the text and hypothesis respectively.] limit is set by the user to determine the maximal length for inference chains. The lexical integrator uses a graph-based representation for the inference chains, as illustrated in Fig 1. A node holds the lemma, part-of-speech and sense of a single term. The sense is the ordinal number of the WordNet sense. Whenever we do not know the sense of a term we implement the most frequent sense heuristic.3 An edge represents an inference link and is labeled with the semantic relation of this link (e.g. cytokine→protein is labeled with the WordNet relation hypernym). 2.1 Available plug-ins for lexical resources We have implemented plug-ins for the following resources: the English lexicon WordNet (Fellbaum, 1998) (based on either the JWI, JWNL or extJWNL java APIs4), CatVar (Habash and Dorr, 2003), a categorial variations database, a Wikipedia-based resource (Shnarch et al., 2009), which applies several extraction methods to derive inference links from the text and structure of Wikipedia, VerbOcean (Chklovski and Pantel, 2004), a knowledge base of fine-grained semantic relations between verbs, Lin’s distributional similarity thesaurus (Lin, 1998), and DIRECT (Kotlerman et al., 2010), a directional distributional similarity thesaurus geared for lexical inference. To summarize, the lexical integrator finds all possible inference chains (of a predefined length), resulting from any combination of inference links extracted from lexical resources, which link any t, h pair of a given text-hypothesis. Developers can use this tool to save the hassle of interfacing with the different lexical knowledge resources, and spare the labor of combining their knowledge via inference chains. The lexical inference model, described next, provides a means to decide whether a given hypothesis is inferred from a given text, based on weighing the lexical inference chains extracted by the lexical integrator. 3 Lexical inference There are many ways to implement an inference model which identifies inference relations between texts. A simple model may consider the [footnote 3: This disambiguation policy was better than considering all senses of an ambiguous term in preliminary experiments. However, it is a matter of changing a variable in the configuration of PLIS to switch between these two policies. footnote 4: http://wordnet.princeton.edu/wordnet/related-projects/] number of hypothesis terms for which inference chains, originating from text terms, were found.
In PLIS, the inference model is a plug-in, similar to the lexical knowledge resources, and can be easily replaced to change the inference logic. We provide PLIS with two implemented baseline lexical inference models which are mathematically based. These are two Probabilistic Lexical Models (PLMs), HN-PLM and M-PLM, which are described in (Shnarch et al., 2011; Shnarch et al., 2012) respectively. A PLM provides probability estimations for the three parts of the inference process (as shown in Fig 1): the validity probability of each inference chain (i.e. the probability for a valid inference relation between its endpoint terms) P(ti → hj), the probability of each hypothesis term to be inferred by the entire text P(T → hj) (term-level probability), and the probability of the entire hypothesis to be inferred by the text P(T → H) (sentence-level probability). HN-PLM describes a generative process by which the hypothesis is generated from the text. Its parameters are the reliability level of each of the resources it utilizes (that is, the prior probability that applying an arbitrary inference link derived from each resource corresponds to a valid inference). For learning these parameters HN-PLM applies a schema of the EM algorithm (Dempster et al., 1977). Its performance on the recognizing textual entailment task, RTE (Bentivogli et al., 2009; Bentivogli et al., 2010), is in line with the state of the art inference systems, including complex systems which perform syntactic analysis. This model is improved by M-PLM, which deduces sentence-level probability from term-level probabilities by a Markovian process. PLIS with this model was used for passage retrieval for a question answering task (Wang et al., 2007), and outperformed state of the art inference systems. Both PLMs model the following prominent aspects of the lexical inference phenomenon: (i) considering the different reliability levels of the input knowledge resources, (ii) reducing inference chain probability as its length increases, and (iii) increasing term-level probability as we have more inference chains which suggest that the hypothesis term is inferred by the text. Both PLMs only need sentence-level annotations from which they derive term-level inference probabilities. To summarize, the lexical inference module [Figure 2: PLIS interactive viewer with Example 1 demonstrates knowledge integration of multiple inference chains and resource combination (additional explanations, which are not part of the demo, are provided in orange).] provides the setting for interfacing with the lexical integrator. Additionally, the module provides the framework for probabilistic inference models which estimate term-level probabilities and integrate them into a sentence-level inference decision, while implementing prominent aspects of lexical inference. The user can choose to apply another inference logic, not necessarily probabilistic, by plugging a different lexical inference model into the provided inference infrastructure. 4 The PLIS interactive system PLIS comes with an online interactive viewer5 in which the user sets the parameters of PLIS, inserts a text-hypothesis pair and gets a visualization of the entire inference process. This is a powerful tool for investigating knowledge integration and lexical inference models. Fig 2 presents a screenshot of the processing of Example 1.
On the right side, the user configures the system by selecting knowledge resources, adjusting their configuration, setting the transitivity limit, and choosing the lexical inference model to be applied by PLIS. After inserting a text and a hypothesis to the appropriate text boxes, the user clicks on the infer button and PLIS generates all lexical inference chains, of length up to the transitivity limit, that connect text terms with hypothesis terms, as available from the combination of the selected input resources [footnote 5: http://irsrv2.cs.biu.ac.il/nlp-net/PLIS.html]. Each inference chain is presented in a line between the text and hypothesis. PLIS also displays the probability estimations for all inference levels; the probability of each chain is presented at the end of its line. For each hypothesis term, the term-level probability, which weighs all inference chains found for it, is given below the dashed line. The overall sentence-level probability integrates the probabilities of all hypothesis terms and is displayed in the box at the bottom right corner. Next, we detail the inference process of Example 1, as presented in Fig 2. In this QA example, the probability of the candidate answer (set as the text) to be relevant for the given question (the hypothesis) is estimated. When utilizing only two knowledge resources (WordNet and Wikipedia), PLIS is able to recognize that explorer is inferred by Christopher Columbus and that New World is inferred by America. Each one of these pairs has two independent inference chains, numbered 1–4, as evidence for its inference relation. Both inference chains 1 and 3 include a single inference link, each derived from a different relation of the Wikipedia-based resource. The inference model assigns a higher probability for chain 1 since the BeComp relation is much more reliable than the Link relation. This comparison illustrates the ability of the inference model to learn how to differentiate knowledge resources by their reliability. Comparing the probability assigned by the inference model for inference chain 2 with the probabilities assigned for chains 1 and 3 reveals the sophisticated way by which the inference model integrates lexical knowledge. Inference chain 2 is longer than chain 1, therefore its probability is lower. However, the inference model assigns chain 2 a higher probability than chain 3, even though the latter is shorter, since the model is sensitive enough to consider the difference in reliability levels between the two highly reliable hypernym relations (from WordNet) of chain 2 and the less reliable Link relation (from Wikipedia) of chain 3. Another aspect of knowledge integration is exemplified in Fig 2 by the three circled probabilities. The inference model takes into consideration the multiple pieces of evidence for the inference of New World (inference chains 3 and 4, whose probabilities are circled). This results in a term-level probability estimation for New World (the third circled probability) which is higher than the probabilities of each chain separately. The third term of the hypothesis, discover, remains uncovered by the text as no inference chain was found for it. Therefore, the sentence-level inference probability is very low, 37%. In order to identify that the hypothesis is indeed inferred from the text, the inference model should be provided with indications for the inference of discover. To that end, the user may increase the transitivity limit in hope that longer inference chains provide the needed information.
In addition, the user can examine other knowledge resources in search of the missing inference link. In this example, it is enough to add VerbOcean to the input of PLIS to expose two inference chains which connect reveal with discover, by combining an inference link from WordNet with another from VerbOcean. With this additional information, the sentence-level probability increases to 76%. This is a typical scenario of utilizing PLIS, either via the interactive system or via the software, for analyzing the usability of the different knowledge resources and their combination. A feature of the interactive system which is useful for lexical resource analysis is that each term in a chain is clickable and links to another screen which presents all the terms that are inferred from it and all those from which it is inferred. Additionally, the interactive system communicates with a server which runs PLIS over a full-duplex WebSocket connection (we used the socket.io implementation). This mode of operation is publicly available and provides a way to utilize PLIS without having to install it or the lexical resources it uses. Finally, since PLIS is a lexical system it can easily be adjusted to other languages. One only needs to replace the basic lexical text processing tools and plug in knowledge resources in the target language. If PLIS is provided with bilingual resources (a bilingual resource holds inference links which connect terms in different languages; e.g. an English-Spanish dictionary can provide the inference link explorer → explorador), it can also operate as a cross-lingual inference system (Negri et al., 2012). For instance, the text in Fig 3 is given in English, while the hypothesis is written in Spanish (given as a list of lemma:part-of-speech pairs). The left side of the figure depicts a cross-lingual inference process in which the only lexical knowledge resource used is a manually built English-Spanish dictionary. As can be seen, two Spanish terms, jugador and casa, remain uncovered, since the dictionary alone cannot connect them to any of the English terms in the text. As illustrated in the right side of Fig 3, PLIS enables the combination of the bilingual dictionary with monolingual resources to produce cross-lingual inference chains, such as footballer −hypernym→ player −manual→ jugador. Such inference chains have the capability to overcome monolingual language variability (the first link in this chain) as well as to provide cross-lingual translation (the second link).

Figure 3: PLIS as a cross-lingual inference system. Left: the process with a single manual bilingual resource. Right: PLIS composes cross-lingual inference chains to increase hypothesis coverage and the sentence-level inference probability.
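A small self-contained sketch of this kind of chain composition follows; the edge table, resource labels and search routine are invented for illustration and do not reflect PLIS's actual data structures.

```python
# Hypothetical illustration of composing monolingual and bilingual
# inference links into a cross-lingual chain; not PLIS's internals.
from collections import deque
from typing import Dict, List, Tuple

# term -> (inferred term, resource) edges; monolingual and bilingual mixed
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "footballer": [("player", "WordNet-hypernym")],  # monolingual (EN)
    "player": [("jugador", "EN-ES dictionary")],     # bilingual link
}


def chains(text_term: str, hyp_term: str, limit: int) -> List[List[str]]:
    """Breadth-first search for inference chains up to the transitivity limit."""
    found: List[List[str]] = []
    queue = deque([[text_term]])
    while queue:
        path = queue.popleft()
        if path[-1] == hyp_term:
            found.append(path)
            continue
        if len(path) - 1 >= limit:  # chain length = number of links used
            continue
        for nxt, _resource in EDGES.get(path[-1], []):
            if nxt not in path:  # avoid cycles
                queue.append(path + [nxt])
    return found


print(chains("footballer", "jugador", limit=2))
# [['footballer', 'player', 'jugador']]
```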
5 Conclusions

To utilize PLIS one should gather lexical resources, obtain sentence-level annotations and train the inference model. Annotations are available in common data sets for tasks such as QA, Information Retrieval (queries are hypotheses and snippets are texts) and Student Response Analysis (reference answers are the hypotheses that should be inferred by the student answers). For developers of NLP applications, PLIS offers a ready-to-use lexical knowledge integrator which can interface with many common lexical knowledge resources and construct lexical inference chains which combine the knowledge in them. A developer who wants to overcome lexical language variability, or to incorporate background knowledge, can utilize PLIS to inject lexical knowledge into any text understanding application. PLIS can be used as a lightweight inference system or as the lexical component of larger, more complex inference systems. Additionally, PLIS provides scores for inference chains and determines the way to combine them in order to recognize sentence-level inference. PLIS comes with two probabilistic lexical inference models which achieved competitive performance levels in the tasks of recognizing textual entailment and passage retrieval for QA. All aspects of PLIS are configurable. The user can easily switch between the built-in lexical resources, inference models and even languages, or extend the system with additional lexical resources and new inference models.

Acknowledgments

The authors thank Eden Erez for his help with the interactive viewer and Miquel Esplà Gomis for the bilingual dictionaries. This work was partially supported by the European Community's 7th Framework Programme (FP7/2007-2013) under grant agreement no. 287923 (EXCITEMENT) and the Israel Science Foundation grant 880/12.

References

Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proc. of TAC.
Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2010. The sixth PASCAL recognizing textual entailment challenge. In Proc. of TAC.
Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proc. of EMNLP.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Lecture Notes in Computer Science, volume 3944, pages 177–190.
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts.
Nizar Habash and Bonnie Dorr. 2003. A categorial variation database for English. In Proc. of NAACL.
Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proc. of COLING-ACL.
Matteo Negri, Alessandro Marchetti, Yashar Mehdad, Luisa Bentivogli, and Danilo Giampiccolo. 2012. SemEval-2012 task 8: Cross-lingual textual entailment for content synchronization. In Proc. of SemEval.
Eyal Shnarch, Libby Barak, and Ido Dagan. 2009. Extracting lexical reference rules from Wikipedia. In Proc. of ACL.
Eyal Shnarch, Jacob Goldberger, and Ido Dagan. 2011. Towards a probabilistic model for lexical entailment. In Proc. of the TextInfer Workshop.
Eyal Shnarch, Ido Dagan, and Jacob Goldberger. 2012. A probabilistic lexical model for ranking textual inferences. In Proc. of *SEM.
Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proc. of EMNLP.
4 0.98365575 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers
Author: André Martins ; Miguel Almeida ; Noah A. Smith
Abstract: We present fast, accurate, direct non-projective dependency parsers with third-order features. Our approach uses AD3, an accelerated dual decomposition algorithm which we extend to handle specialized head automata and sequential head bigram models. Experiments in fourteen languages yield parsing speeds competitive to projective parsers, with state-of-the-art accuracies for the largest datasets (English, Czech, and German).
5 0.98281825 277 acl-2013-Part-of-speech tagging with antagonistic adversaries
Author: Anders Søgaard
Abstract: Supervised NLP tools and on-line services are often used on data that is very different from the manually annotated data used during development. The performance loss observed in such cross-domain applications is often attributed to covariate shifts, with out-of-vocabulary effects as an important subclass. Many discriminative learning algorithms are sensitive to such shifts because highly indicative features may swamp other indicative features. Regularized and adversarial learning algorithms have been proposed to be more robust against covariate shifts. We present a new perceptron learning algorithm using antagonistic adversaries and compare it to previous proposals on 12 multilingual cross-domain part-of-speech tagging datasets. While previous approaches do not improve on our supervised baseline, our approach is better across the board with an average 4% error reduction.
6 0.97881788 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures
7 0.94175065 284 acl-2013-Probabilistic Sense Sentiment Similarity through Hidden Emotions
same-paper 8 0.94074798 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
9 0.87652469 118 acl-2013-Development and Analysis of NLP Pipelines in Argo
10 0.8508786 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation
11 0.82202423 51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP
12 0.80366856 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
13 0.78088135 297 acl-2013-Recognizing Partial Textual Entailment
14 0.76714414 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit
15 0.76604354 239 acl-2013-Meet EDGAR, a tutoring agent at MONSERRATE
16 0.74185961 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning
17 0.741171 237 acl-2013-Margin-based Decomposed Amortized Inference
18 0.73988044 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations
19 0.73408878 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages
20 0.73189974 234 acl-2013-Linking and Extending an Open Multilingual Wordnet