acl acl2011 acl2011-94 knowledge-graph by maker-knowledge-mining

94 acl-2011-Deciphering Foreign Language

Source: pdf

Author: Sujith Ravi ; Kevin Knight

Abstract: In this work, we tackle the task of machine translation (MT) without parallel training data. We frame the MT problem as a decipherment task, treating the foreign text as a cipher for English and present novel methods for training translation models from nonparallel text.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract In this work, we tackle the task of machine translation (MT) without parallel training data. [sent-2, score-0.235]

2 We frame the MT problem as a decipherment task, treating the foreign text as a cipher for English and present novel methods for training translation models from nonparallel text. [sent-3, score-1.366]

3 From these corpora, we estimate translation model parameters: wordto-word translation tables, fertilities, distortion parameters, phrase tables, syntactic transformations, etc. [sent-5, score-0.265]

4 In this paper, we address the problem of learning a full translation model from non-parallel data, and we use the 12 learned model to translate new foreign strings. [sent-10, score-0.337]

5 Intuitively, we try to construct translation model tables which, when applied to observed foreign text, consistently yield sensible En- glish. [sent-13, score-0.325]

6 A language model P(e) is typically used in SMT decoding (Koehn, 2009), but here P(e) actually plays a central role in training translation model parameters. [sent-22, score-0.195]

7 We can now draw on previous decipherment work for solving simpler substitution/transposition ciphers (Bauer, 2006; Knight et al. [sent-24, score-0.818]

8 We must keep in mind, however, that foreign language is a much more demanding code, involving highly nondeterministic mappings and very large substitution tables. [sent-26, score-0.342]

9 Word Substitution Decipherment Before we tackle machine translation without parallel data, we first solve a simpler problem—word substitution decipherment. [sent-33, score-0.35]

10 In a word substitution cipher, every word in the natural language (plaintext) sequence is substituted by a cipher token, according to a substitution key. [sent-35, score-0.588]

11 The key is deterministic—there exists a 1-to-1 mapping between cipher units and the plaintext words they encode. [sent-36, score-0.529]

12 For example, the following English plaintext se- quences: I SAW THE BOY . [sent-37, score-0.241]

13 13 may be enciphered as: xy z z fxyy crqq tmn z lxwz crqq tmn z gdxx lxwz according to the key: THE → crqq, SAW → fxyy RAN → gdxx . [sent-39, score-0.342]

14 → lx crwzq , BOY → tmn z , I → xy z z , , The goal of word substitution decipherment is to guess the original plaintext from given cipher data without any knowledge of the substitution key. [sent-40, score-1.593]

15 Probabilistic decipherment: Our decipherment method follows a noisy-channel approach. [sent-42, score-0.733]

16 The generative story for decipherment is described here: 1. [sent-47, score-0.757]

17 Substitute each plaintext word ei with a ciphertext token ci, with probability Pθ (ci |ei) in order to generate the ciphertext sequence c = c1. [sent-53, score-0.589]

18 During decipherment, our goal is to estimate the channel model parameters θ. [sent-58, score-0.222]

19 This poses some serious scalability challenges for word substitution decipherment. [sent-60, score-0.198]

20 We propose novel methods that can deal with these challenges effectively and solve word substitution ciphers: 1. [sent-61, score-0.217]

21 Secondly, we need to instantiate the entire channel and resulting derivation lattice before we can run EM, and this is too big to be stored in memory. [sent-65, score-0.219]

22 Bayesian decipherment: We also propose a novel decipherment approach using Bayesian inference. [sent-68, score-0.752]

23 Our method overcomes these challenges and does fast, efficient inference using (a) a novel strategy for selecting sampling choices, and (b) a parallelized sampling scheme. [sent-70, score-0.467]

24 Identify the top K frequent word types in both the plaintext and ciphertext data. [sent-76, score-0.383]

25 Now, instantiate a small channel with just (K + 1)2 parameters and use the EM algorithm to train this model to maximize likelihood of cipher data. [sent-78, score-0.512]

26 Extend the plaintext and ciphertext vocabular- × ies from the previous step by adding the next K most frequent word types (so the new vocabulary size becomes 2K + 1). [sent-80, score-0.383]

27 Finally, we decode the given ciphertext c by using the Viterbi algorithm to choose the plaintext decoding e that maximizes P(e) · Pθtrained (c|e)3, stretching teh eth achta mnnaxelim probabilities (Knight (ect| al. [sent-99, score-0.392]

28 Here, we propose a novel decipherment approach using Bayesian learning. [sent-108, score-0.752]

29 1 We perform inference using point-wise Gibbs sampling (Geman and Geman, 1984). [sent-112, score-0.189]

30 We define a sampling operator that samples plaintext word choices for every cipher token, one at a time. [sent-113, score-0.771]

31 Smart sample-choice selection: In the original sampling step, for each cipher token we have to sample from a list of all possible plaintext choices (10k1M English words). [sent-116, score-0.782]

32 There are 100k cipher tokens in our data which means we have to perform ∼ 109 sampling operations to nmsa wkee one ee ntotir pee pass through the data. [sent-117, score-0.496]

33 Instead, we now reduce our choices in each sampling step. [sent-119, score-0.194]

34 Say that our current plaintext hypothesis contains English words X, Y and Z at positions i− 1, iand iE+n1g respectively. [sent-120, score-0.241]

35 , X and Z never co-occurred), we randomly pick K words from the plaintext vocabulary. [sent-124, score-0.241]

36 This significantly reduces the sampling possibilities (10k-1M reduces to 100) at each step and allows us to scale to large plaintext vocabulary sizes without enumerating all possible choices at every cipher position. [sent-126, score-0.787]

37 2 Parallelized Gibbs sampling: Secondly, we parallelize our sampling step using a Map-Reduce framework. [sent-127, score-0.189]

38 In the past, others have proposed parallelized sampling schemes for topic modeling applications (Newman et al. [sent-128, score-0.213]

39 In our method, we split the entire corpus into separate chunks and we run the sampling procedure on each chunk in parallel. [sent-130, score-0.193]

40 At 1For word substitution decipherment, we want to keep the language model probabilities fixed during training, and hence we set the prior on that model to be high (α = 104). [sent-131, score-0.216]

41 15 the end of each sampling iteration, we combine the samples corresponding to each chunk and collect the counts of all events—this forms our cache for the next sampling iteration. [sent-141, score-0.354]

42 In practice, we observe that the parallelized sampling run converges quickly and runs much faster than the conventional point-wise sampling—for example, 3. [sent-142, score-0.241]

43 1 hours (using 10 nodes) versus 11 hours for one of the word substitution experiments. [sent-143, score-0.188]

44 3 Decoding the ciphertext: After the sampling run has finished, we choose the final sample and extract a trained version of the channel model Pθ (c|e) from this sample following the technique of (Cc|hei)ang et al. [sent-145, score-0.42]

45 We then use the Viterbi algorithm to choose the English plaintext e that maximizes P(e) · Pθtrained(c|e)3. [sent-147, score-0.241]

46 3 Experiments and Results Data: For the word substitution experiments, we use two corpora: • Temporal expression corpus containing short English temporal expressions osuntcahin as “ TshHoErt NEXT MONTH”, “THE LAST THREE YEARS”, etc. [sent-149, score-0.224]

47 The cipher data contains 5000 expressions (9619 tokens, 153 word types). [sent-150, score-0.343]

48 We also have access to a separate English corpus (which is not parallel to the ciphertext) containing 125k temporal expressions (242k word tokens, 201 word types) for LM training. [sent-151, score-0.218]

49 a cT choer pduasta c ocontnasiisntisn gof f 1l0lk E cipher sentences (102k tokens, 3397 word types); and a plaintext corpus of 402k English sentences (2. [sent-153, score-0.553]

50 We use all the cipher data for decipherment training but evaluate on the first 1000 cipher sentences. [sent-155, score-1.34]

51 The cipher data was originally generated from English text by substituting each English word with a unique cipher word. [sent-156, score-0.6]

52 We use the plaintext corpus to • 3Type sampling could be applied on top of our methods to further optimize performance. [sent-157, score-0.406]

53 For the Transtac corpus, decipherment performance is also shown for different training data sizes (9k versus 100k cipher tokens). [sent-160, score-1.135]

54 build an English word n-gram LM, which is used in the decipherment process. [sent-161, score-0.757]

55 Evaluation: We compute the accuracy of a particular decipherment as the percentage of cipher tokens that were correctly deciphered from the whole corpus. [sent-162, score-1.064]

56 We run the two methods (Iterative and EM4 Bayesian) and then compare them in terms of word substitution decipherment accuracies. [sent-163, score-0.911]

57 Both methods achieve high accuracies, decoding 70-90% of the two word substitution ciphers. [sent-165, score-0.183]

58 Overall, Bayesian decipherment (with sparse priors) performs better than Iterative EM and achieves the best results on this task. [sent-166, score-0.733]

59 16 text with output from Bayesian word substitution decipherment (D) for a few samples cipher (C) sentences from the Transtac corpus. [sent-171, score-1.195]

60 Problem Formulation: We formulate the MT decipherment problem as—given a foreign text f (i. [sent-173, score-0.915]

61 fm) and a monolingual English corpus, our goal is to decipher the foreign text and produce an English translation. [sent-178, score-0.227]

62 Probabilistic decipherment: Unlike parallel training, here we have to estimate the translation model Pθ(f|e) parameters using only monolingual data. [sent-179, score-0.322]

63 During decipherment training, our objective aisl t od estimate the model parameters θ in order to maximize the probability of the foreign corpus f. [sent-180, score-1.007]

64 We then estimate parameters of the translation model Pθ (f|e) during training. [sent-182, score-0.181]

65 Next, we present two novel( decipherment approaches for MT training without parallel data. [sent-183, score-0.879]

66 EM Decipherment: We propose a new translation model for MT decipherment which can be efficiently trained using the EM algorithm. [sent-185, score-0.843]

67 Bayesian Decipherment: We introduce a novel method for estimating IBM Model 3 parameters without parallel data, using Bayesian learning. [sent-187, score-0.182]

68 Unlike EM, this method does not face any memory issues and we use sampling to perform efficient inference during training. [sent-188, score-0.189]

69 But without parallel training data, EM training for IBM Model 3 becomes intractable due to (1) scalability and efficiency issues because of large-sized fertility and distortion parameter tables, and (2) the resulting derivation lattices become too big to be stored in memory. [sent-191, score-0.302]

70 For each English word token ei (including NULLs), choose a foreign word translation fi, with probability Pθ (fi |ei). [sent-202, score-0.407]

71 Swap any pair of adjacent foreign words fi−1 , fi, with probability Pθ (swap). [sent-205, score-0.182]

72 We use the EM algorithm to estimate all the parameters θ in order to maximize likelihood of the foreign corpus. [sent-213, score-0.253]

73 Finally, we use the Viterbi algorithm to decode the foreign sentence f and produce an English translation e that maximizes P(e) · Pθtrained(f|e). [sent-214, score-0.271]

74 We use identity mappings for numeric values (for example, “8” maps to “8”), and we split nouns into 17 morpheme units prior to decipherment training (for example, “YEARS” → “YEAR” “+S”). [sent-216, score-0.822]

75 Whole-segment Language Models: When using word n-gram models of English for decipherment, we find that some of the foreign sentences are decoded into sequences (such as “THANK YOU TALKING ABOUT ? [sent-217, score-0.206]

76 We then use this model (in place of word ngram LMs) for decipherment training and decoding. [sent-221, score-0.809]

77 (1993) provide an efficient algorithm for training IBM Model 3 translation model when parallel sentence pairs are available. [sent-224, score-0.237]

78 ·pφ1θ0 · p0mθ−2φ0 (8) The alignment a is represented as a vector; aj = i implies that the foreign word fj is produced by the English word ei during translation. [sent-232, score-0.345]

79 Sampling IBM Model 3: We use point-wise Gibbs sampling to estimate the IBM Model 3 parameters. [sent-240, score-0.188]

80 The sampler is seeded with an initial English sample translation and a corresponding alignment for every foreign sentence. [sent-241, score-0.35]

81 We define several sampling oper- ators, which are applied in sequence one after the other to generate English samples for the entire foreign corpus. [sent-242, score-0.371]

82 Some of the sampling operators are described below: • • • TranslateWord(j): Sample a new English word tTrarannssllaatitoenW foorrd foreign wpolerd a fj, wfr Eomng laisl h possibilities (including NULL). [sent-243, score-0.396]

83 During sampling, we apply each of these operators to generate a new derivation e, a for the foreign text f and compute its score as P(e) · Pθ (f, a|e). [sent-246, score-0.243]

84 But unlike the greedy method, which can easily get stuck, our Bayesian approach guarantees that once the sampler converges we will be sampling from the true posterior distribution. [sent-250, score-0.206]

85 As with Bayesian decipherment for word substitution, we compute the probability of each new derivation incrementally, which makes sampling ef- ficient. [sent-251, score-0.958]

86 We also apply blocked sampling on top of point-wise sampling—we treat all occurrences of a particular foreign sentence as a single block and sample a single derivation for the entire block. [sent-252, score-0.421]

87 18 We also parallelize the sampling procedure (as described in Section 2. [sent-253, score-0.189]

88 5 Choosing the best translation: Once the sampling run finishes, we select the final sample and extract the corresponding English translations for every foreign sentence. [sent-255, score-0.432]

89 We create the following splits out of the resulting parallel corpus: TRAIN (English): 195k temporal expressions (7588 unique), 382k word tokens, 163 types. [sent-261, score-0.194]

90 We use the output from EM decipherment as the initial sample and run the sampler for 2000 iterations, during which we apply annealing with a linear schedule (2 → 0. [sent-270, score-0.866]

91 MOSES, (b) IBM 3 without The scores reported here are normalized distortion, and (2) decipherment settings— edit distance values with BLEU scores shown in parentheses. [sent-278, score-0.788]

92 Both Spanish/English sides of TRAIN are used for parallel MT training, whereas decipherment uses only monolingual English data for training LMs. [sent-281, score-0.905]

93 Results: Figure 3 compares the results of various MT systems (using parallel versus decipherment training) on the two test corpora in terms of edit distance scores (a lower score indicates closer match to the gold translation). [sent-292, score-0.932]

94 We observe that even without parallel training data, our decipherment strategies achieve MT accuracies comparable to parallel-trained systems. [sent-294, score-0.879]

95 On the Time corpus, the best decipherment (Method 2a in the figure) achieves an edit distance score of 28. [sent-295, score-0.769]

96 Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment. [sent-298, score-0.865]

97 However, higher improvements are observed when using parallel data in comparison to decipherment training which only uses monolingual data. [sent-302, score-0.905]

98 We see that deciphering with 10k monolingual Spanish sentences yields the same performance as training with around 200-500 parallel English/Spanish sentence pairs. [sent-309, score-0.198]

99 We discussed several novel decipherment approaches for achieving this goal. [sent-313, score-0.752]

100 Estimating word translation probabilities from unrelated monolingual corpora using the EM algorithm. [sent-379, score-0.187]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('decipherment', 0.733), ('cipher', 0.288), ('plaintext', 0.241), ('foreign', 0.182), ('sampling', 0.165), ('channel', 0.13), ('mt', 0.126), ('substitution', 0.126), ('em', 0.125), ('ciphertext', 0.118), ('bayesian', 0.114), ('parallel', 0.096), ('translation', 0.089), ('ei', 0.067), ('ciphers', 0.065), ('ibm', 0.056), ('english', 0.052), ('iterative', 0.052), ('parallelized', 0.048), ('parameters', 0.048), ('sizes', 0.045), ('monolingual', 0.045), ('crqq', 0.045), ('maxyfxep', 0.045), ('knight', 0.044), ('lms', 0.044), ('distortion', 0.043), ('tokens', 0.043), ('temporal', 0.043), ('lm', 0.043), ('sampler', 0.041), ('gibbs', 0.039), ('sample', 0.038), ('versus', 0.038), ('tmn', 0.036), ('transtac', 0.036), ('derivation', 0.036), ('edit', 0.036), ('mappings', 0.034), ('crp', 0.034), ('geman', 0.034), ('arg', 0.034), ('decoding', 0.033), ('tables', 0.033), ('bleu', 0.033), ('boy', 0.032), ('expressions', 0.031), ('training', 0.031), ('fxyy', 0.03), ('gdxx', 0.03), ('lxwz', 0.03), ('moses', 0.029), ('choices', 0.029), ('kevin', 0.029), ('yl', 0.029), ('swap', 0.029), ('spanish', 0.029), ('corpora', 0.029), ('run', 0.028), ('vocabularies', 0.026), ('deciphering', 0.026), ('schedule', 0.026), ('fertilities', 0.026), ('opus', 0.026), ('subtitle', 0.026), ('deal', 0.026), ('yn', 0.026), ('scalability', 0.026), ('operators', 0.025), ('instantiate', 0.025), ('prior', 0.024), ('fj', 0.024), ('word', 0.024), ('inference', 0.024), ('story', 0.024), ('parallelize', 0.024), ('nonparallel', 0.024), ('overcomes', 0.024), ('samples', 0.024), ('translate', 0.024), ('aj', 0.024), ('estimate', 0.023), ('fi', 0.023), ('challenges', 0.022), ('model', 0.021), ('smart', 0.021), ('token', 0.021), ('simpler', 0.02), ('hidden', 0.02), ('priors', 0.02), ('koehn', 0.02), ('argm', 0.02), ('fertility', 0.02), ('novel', 0.019), ('translations', 0.019), ('sujith', 0.019), ('fung', 0.019), ('month', 0.019), ('viterbi', 0.019), ('without', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 94 acl-2011-Deciphering Foreign Language

Author: Sujith Ravi ; Kevin Knight

2 0.7366907 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers

Author: Sujith Ravi ; Kevin Knight

Abstract: We introduce a novel Bayesian approach for deciphering complex substitution ciphers. Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. Bayesian inference is performed on our model using an efficient sampling technique. We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. The new method can be applied on more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.

3 0.21189313 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

Author: Coskun Mermer ; Murat Saraclar

Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs. .t r

4 0.12325773 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

Author: Dipanjan Das ; Slav Petrov

Abstract: We describe a novel approach for inducing unsupervised part-of-speech taggers for languages that have no labeled training data, but have translated text in a resource-rich language. Our method does not assume any knowledge about the target language (in particular no tagging dictionary is assumed), making it applicable to a wide array of resource-poor languages. We use graph-based label propagation for cross-lingual knowledge transfer and use the projected labels as features in an unsupervised model (BergKirkpatrick et al., 2010). Across eight European languages, our approach results in an average absolute improvement of 10.4% over a state-of-the-art baseline, and 16.7% over vanilla hidden Markov models induced with the Expectation Maximization algorithm.

5 0.09251757 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

Author: Jinxi Xu ; Jinying Chen

Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned ChineseEnglish corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state of the art unsupervised method (GIZA++) is less than 1point in BLEU. Furthermore, we showed the benefit of improved alignment becomes smaller with more training data, implying the above limit also holds for large training conditions. 1

6 0.091389686 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

7 0.086383671 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

8 0.085427329 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction

9 0.081436254 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

10 0.081045553 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

11 0.079581328 162 acl-2011-Identifying the Semantic Orientation of Foreign Words

12 0.079064339 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

13 0.077387035 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

14 0.075252928 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

15 0.075169556 313 acl-2011-Two Easy Improvements to Lexical Weighting

16 0.074286081 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering

17 0.074256502 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

18 0.073646694 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

19 0.07235872 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

20 0.070033588 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.18), (1, -0.112), (2, 0.07), (3, 0.1), (4, 0.027), (5, -0.0), (6, 0.054), (7, 0.014), (8, -0.011), (9, 0.08), (10, 0.031), (11, 0.02), (12, 0.103), (13, 0.148), (14, -0.033), (15, 0.061), (16, -0.005), (17, 0.131), (18, 0.122), (19, -0.111), (20, 0.113), (21, 0.392), (22, 0.404), (23, 0.122), (24, 0.461), (25, 0.165), (26, -0.002), (27, -0.158), (28, 0.08), (29, 0.047), (30, -0.031), (31, 0.046), (32, 0.075), (33, 0.011), (34, -0.033), (35, 0.041), (36, -0.085), (37, 0.008), (38, 0.025), (39, 0.09), (40, 0.04), (41, -0.034), (42, 0.006), (43, -0.0), (44, 0.063), (45, -0.007), (46, 0.037), (47, -0.024), (48, -0.051), (49, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98062843 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers

Author: Sujith Ravi ; Kevin Knight

same-paper 2 0.89005214 94 acl-2011-Deciphering Foreign Language

Author: Sujith Ravi ; Kevin Knight

3 0.39609683 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

Author: Coskun Mermer ; Murat Saraclar

4 0.35301352 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

Author: Phil Blunsom ; Trevor Cohn

Abstract: In this work we address the problem of unsupervised part-of-speech induction by bringing together several strands of research into a single model. We develop a novel hidden Markov model incorporating sophisticated smoothing using a hierarchical Pitman-Yor processes prior, providing an elegant and principled means of incorporating lexical characteristics. Central to our approach is a new type-based sampling algorithm for hierarchical Pitman-Yor models in which we track fractional table counts. In an empirical evaluation we show that our model consistently out-performs the current state-of-the-art across 10 languages.

5 0.31322142 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

Author: Kristina Toutanova ; Michel Galley

Abstract: Contrary to popular belief, we show that the optimal parameters for IBM Model 1 are not unique. We demonstrate that, for a large class of words, IBM Model 1 is indifferent among a continuum of ways to allocate probability mass to their translations. We study the magnitude of the variance in optimal model parameters using a linear programming approach as well as multiple random trials, and demonstrate that it results in variance in test set log-likelihood and alignment error rate.

6 0.30510771 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation

7 0.29768607 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

8 0.26745626 321 acl-2011-Unsupervised Discovery of Rhyme Schemes

9 0.26068172 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

10 0.25117499 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

11 0.2424507 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach

12 0.2420505 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

13 0.24117282 291 acl-2011-SystemT: A Declarative Information Extraction System

14 0.23119374 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation

15 0.23115189 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

16 0.22878785 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability

17 0.22260013 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers

18 0.22211301 162 acl-2011-Identifying the Semantic Orientation of Foreign Words

19 0.21885696 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages

20 0.21690211 301 acl-2011-The impact of language models and loss functions on repair disfluency detection

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.02), (17, 0.051), (26, 0.034), (36, 0.07), (37, 0.072), (39, 0.025), (41, 0.246), (53, 0.031), (55, 0.039), (59, 0.053), (72, 0.023), (91, 0.04), (96, 0.184)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9651221 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers

Author: Sujith Ravi ; Kevin Knight

2 0.96420026 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars

Author: Antske Fokkens

Abstract: When designing grammars of natural language, typically, more than one formal analysis can account for a given phenomenon. Moreover, because analyses interact, the choices made by the engineer influence the possibilities available in further grammar development. The order in which phenomena are treated may therefore have a major impact on the resulting grammar. This paper proposes to tackle this problem by using metagrammar development as a methodology for grammar engineering. Iargue that metagrammar engineering as an approach facilitates the systematic exploration of grammars through comparison of competing analyses. The idea is illustrated through a comparative study of auxiliary structures in HPSG-based grammars for German and Dutch. Auxiliaries form a central phenomenon of German and Dutch and are likely to influence many components of the grammar. This study shows that a special auxiliary+verb construction significantly improves efficiency compared to the standard argument-composition analysis for both parsing and generation.

3 0.95983529 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we de- velop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

4 0.95724571 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

Author: Yun Huang ; Min Zhang ; Chew Lim Tan

Abstract: Machine transliteration is defined as automatic phonetic translation of names across languages. In this paper, we propose synchronous adaptor grammar, a novel nonparametric Bayesian learning approach, for machine transliteration. This model provides a general framework without heuristic or restriction to automatically learn syllable equivalents between languages. The proposed model outperforms the state-of-the-art EMbased model in the English to Chinese transliteration task.

5 0.95541835 189 acl-2011-K-means Clustering with Feature Hashing

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

6 0.95433867 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction

7 0.95237619 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems

8 0.95124424 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

9 0.95058763 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport

same-paper 10 0.89634299 94 acl-2011-Deciphering Foreign Language

11 0.87545109 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

12 0.87057692 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

13 0.85573316 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

14 0.85441643 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

15 0.84955966 135 acl-2011-Faster and Smaller N-Gram Language Models

16 0.84794903 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue

17 0.84504277 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

18 0.84171522 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

19 0.84121656 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts

20 0.83721685 129 acl-2011-Extending the Entity Grid with Entity-Specific Features