acl acl2013 acl2013-66 knowledge-graph by maker-knowledge-mining

66 acl-2013-Beam Search for Solving Substitution Ciphers


Source: pdf

Author: Malte Nuhn ; Julian Schamper ; Hermann Ney

Abstract: In this paper we address the problem of solving substitution ciphers using a beam search approach. We present a conceptually consistent and easy to implement method that improves the current state of the art for decipherment of substitution ciphers and is able to use high order n-gram language models. We show experiments with 1:1 substitution ciphers in which the guaranteed optimal solution for 3-gram language models has 38.6% decipherment error, while our approach achieves 4.13% decipherment error in a fraction of the time by using a 6-gram language model. We also apply our approach to the famous Zodiac-408 cipher and obtain slightly better (and near to optimal) results than previously published. Unlike the previous state-of-the-art approach that uses additional word lists to evaluate possible decipherments, our approach only uses a letter-based 6-gram language model. Furthermore we use our algorithm to solve large vocabulary substitution ciphers and improve the best published decipherment error rate on the Gigaword corpus from 7.8% to 6.0%.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We present a conceptually consistent and easy to implement method that improves the current state of the art for decipherment of substitution ciphers and is able to use high order n-gram language models. [sent-2, score-0.955]

2 We show experiments with 1:1 substitution ciphers in which the guaranteed optimal solution for 3-gram language models has 38.6% decipherment error. [sent-3, score-0.555]

3 Our approach achieves 4.13% decipherment error in a fraction of the time by using a 6-gram language model. [sent-5, score-0.48]

4 We also apply our approach to the famous Zodiac-408 cipher and obtain slightly better (and near to optimal) results than previously published. [sent-6, score-0.512]

5 Furthermore we use our algorithm to solve large vocabulary substitution ciphers and improve the best published decipherment error rate on the Gigaword corpus from 7.8% to 6.0%. [sent-8, score-1.137]

6 The decipherment accuracy of the proposed algorithms is still low. [sent-16, score-0.388]

7 Improving the core decipherment algorithms is an important step for making decipherment techniques useful for practical applications. [sent-17, score-0.776]

8 In this paper we present an effective beam search algorithm which provides high decipherment accuracies while having low computational requirements. [sent-18, score-0.674]

9 We show significant improvements in decipherment accuracy in a variety of experiments while being computationally more effective than previously published works. [sent-20, score-0.426]

10 2 Related Work The experiments proposed in this paper touch on many previously published works in the decipherment field. [sent-21, score-0.426]

11 Regarding the decipherment of 1:1 substitution ciphers, various works have been published: most older papers do not use a statistical approach and instead define some heuristic measures for scoring candidate decipherments. [sent-22, score-0.942]

12 Approaches like (Hart, 1994) and (Olson, 2007) use a dictionary to check if a decipherment is useful. [sent-23, score-0.406]

13 We use our own implementation of these methods to report optimal solutions to 1:1 substitution ciphers. [sent-26, score-0.308]

14 (Ravi and Knight, 2011a) report the first automatic decipherment of the Zodiac-408 cipher. [sent-29, score-0.388]

15 We run our beam search approach on the same cipher and report better results without using an additional word dictionary—just by using a high order n-gram language model. [sent-31, score-0.823]

16 (Ravi and Knight, 2011b) report experiments on large vocabulary substitution ciphers based on the Transtac corpus. [sent-32, score-0.583]

17 (Dou and Knight, 2012) improve upon these results and provide state-of-the-art results on a large vocabulary word substitution cipher based on the Gigaword corpus. [sent-33, score-0.848]

18 Even though this work is currently only able to deal with substitution ciphers, phenomena like reordering, insertions and deletions can in principle be included in our approach. [sent-37, score-0.283]

19 3 Definitions In the following we will use machine translation notation and denote the ciphertext with f_1^N = f_1 ... f_N. [sent-38, score-0.187]

20 A general substitution cipher uses a table s(e|f) which contains for each cipher token f the probability that the token f is substituted with the plaintext token e. [sent-54, score-1.514]

21 Such a table for substituting cipher tokens {A, B, C, D} with plaintext tokens {a, b, c, d} could for example look like a table with one row per cipher token A, B, C, D and one column per plaintext token a, b, c, d, holding the substitution probabilities. [sent-55, score-0.683]

22 The 1:1 substitution cipher encrypts a given plaintext into a ciphertext by replacing each plaintext token with a unique substitute: this means that the table s(e|f) contains all zeroes, except for one "1" per cipher token f. [sent-71, score-1.326]
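To make this concrete, here is a minimal Python sketch (ours, not from the paper) that represents a 1:1 substitution key as a bijective mapping and enciphers a plaintext with it; the alphabet and example string are illustrative. The equivalent table s(e|f) has a 1 exactly at the (e, f) pairs given by the mapping and 0 everywhere else.

```python
import random

def random_1to1_key(alphabet):
    """Draw a bijective substitution: each plaintext token gets a unique substitute."""
    substitutes = list(alphabet)
    random.shuffle(substitutes)
    return dict(zip(alphabet, substitutes))

def encipher(plaintext, key):
    """Replace every plaintext token with its unique substitute.
    The decipherment function phi of the paper is the inverse of this key."""
    return "".join(key[c] for c in plaintext)

alphabet = "abcdefghijklmnopqrstuvwxyz "
key = random_1to1_key(alphabet)
ciphertext = encipher("beam search for ciphers", key)
```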

23 We formalize the 1:1 substitutions with a bijective function φ : Vf → Ve and homophonic substitutions with a general function φ : Vf → Ve. [sent-78, score-0.205]

24 Following (Corlett and Penn, 2010), we call cipher functions φ, for which not all φ(f)'s are fixed, partial cipher functions. [sent-79, score-1.111]

25 4 Beam Search In this Section we present our beam search approach to solving Equation 3. [sent-91, score-0.281]

26 1 General Algorithm Figure 1 shows the general structure of the beam search algorithm for the decipherment of substitution ciphers. [sent-95, score-0.957]

27 The general idea is to keep track of all partial hypotheses in two arrays Hs and Ht. [sent-96, score-0.138]

28 During search all possible extensions of the partial hypotheses in Hs are generated and scored. [sent-97, score-0.216]

29 Here, the function EXT ORDER chooses which cipher symbol is used next for extension, EXT LIMITS decides which extensions are allowed, and SCORE scores the new partial hypotheses. [sent-98, score-0.7]

30 Due to the structure of the algorithm the cardinality of all hypotheses in Hs increases in each step. [sent-101, score-0.197]

31 Thus only hypotheses of the same cardinality are contained in Hs at any point in the search ("φ' extends φ" is used as shorthand in the pseudocode). The pseudocode of function BEAM SEARCH(EXT ORDER, EXT LIMITS, PRUNE) initializes the sets Hs and Ht and sets CARDINALITY = 0. [sent-102, score-0.19]

32 Inside the while loop the hypotheses in Hs are extended and scored into Ht, Ht is pruned and becomes the new Hs, and Ht is then cleared; when the loop ends, the best scoring cipher function in Hs is returned. Figure 1: The general structure of the beam search algorithm for decipherment of substitution ciphers. [sent-105, score-1.469]
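The following Python sketch mirrors the structure of Figure 1 under our own naming; EXT ORDER, EXT LIMITS, SCORE, and PRUNE are passed in as functions, and each hypothesis is a (score, φ) pair with φ a partial mapping from cipher to plaintext symbols. It is a simplified illustration, not the authors' implementation.

```python
def beam_search(plain_symbols, ext_order, ext_limits, score, prune):
    """Beam search over partial cipher functions phi (dict: cipher -> plain)."""
    hs = [(0.0, {})]                       # hypotheses of the current cardinality
    for f in ext_order:                    # EXT ORDER: cipher symbol fixed next
        ht = []                            # extensions (cardinality + 1)
        for _, phi in hs:
            for e in plain_symbols:
                phi_ext = {**phi, f: e}
                if ext_limits(phi_ext):    # EXT LIMITS: is this extension allowed?
                    # score is recomputed from scratch here for clarity;
                    # Equation 8 below makes this update incremental.
                    ht.append((score(phi_ext), phi_ext))
        hs = prune(ht)                     # PRUNE: keep only promising hypotheses
    return max(hs, key=lambda h: h[0])[1]  # best scoring full cipher function
```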

33 When Hs contains full cipher relations, the cipher relation with the maximal score is returned. [sent-108, score-1.052]

34 Figure 2 illustrates how the algorithm explores the search space for a homophonic substitution cipher. [sent-109, score-0.53]

35 Since partial hypotheses violating this condition can never "recover" when being extended, it becomes clear that these partial hypotheses can be left out from search. [sent-113, score-0.293]

36 n-best output can be implemented by returning the n best scoring hypotheses in the final array Hs. [sent-114, score-0.125]

37 At each level only those 4 hypotheses that survived the histogram pruning process are extended. [sent-118, score-0.291]

38 Homophonic substitution ciphers can be handled by the beam search algorithm, too; only the condition that φ must fulfill changes. [sent-119, score-0.833]

39 Here the condition is that the number of cipher letters f ∈ Vf that map to any e ∈ Ve is at most nmax (which we will call EXT LIMITS HOMOPHONIC); if this condition is violated, all further extensions will also violate it, as sketched in the code below. [sent-120, score-0.679]
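A small sketch of this check (the function name follows the paper's EXT LIMITS HOMOPHONIC; the code is ours): a partial φ is admissible as long as no plaintext symbol has more than nmax cipher symbols mapped to it, and nmax = 1 recovers the 1:1 constraint.

```python
from collections import Counter

def ext_limits_homophonic(phi, nmax):
    """Allow at most nmax cipher symbols f to map to the same plaintext e."""
    counts = Counter(phi.values())
    return all(c <= nmax for c in counts.values())

def ext_limits_1to1(phi):
    """The 1:1 substitution constraint is the special case nmax = 1."""
    return ext_limits_homophonic(phi, nmax=1)
```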

40 Given a partial hypothesis φ with score SCORE(φ), the score of an extension φ' can be calculated as SCORE(φ') = SCORE(φ) + NEWLY FIXED(φ, φ') (8), where NEWLY FIXED only includes scores for n-grams that have been newly fixed in φ' during the extension step from φ to φ'. [sent-136, score-0.261]
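A sketch of this incremental update (Equation 8), assuming a character n-gram language model exposed as a log-probability function lm_logprob (an assumption of ours): only positions whose n-gram window becomes fully deciphered when the new symbol f_new is fixed contribute new terms.

```python
def newly_fixed(ciphertext, phi_ext, f_new, lm_logprob, n):
    """Sum LM scores of n-grams that become fully fixed only once f_new is mapped."""
    delta = 0.0
    for i in range(len(ciphertext) - n + 1):
        window = ciphertext[i:i + n]
        # An n-gram is newly fixed iff it contains f_new and is now fully mapped.
        if f_new in window and all(f in phi_ext for f in window):
            delta += lm_logprob(tuple(phi_ext[f] for f in window))
    return delta

def score_extension(score_phi, ciphertext, phi_ext, f_new, lm_logprob, n):
    # Equation 8: SCORE(phi') = SCORE(phi) + NEWLY FIXED(phi, phi')
    return score_phi + newly_fixed(ciphertext, phi_ext, f_new, lm_logprob, n)
```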

41 4 Extension Order (EXT ORDER) For the choice of which ciphertext symbol should be fixed next during search, several possibilities exist: the overall goal is to choose an extension order that leads to an overall low error rate. [sent-138, score-0.468]

42 It is also clear that the choice of a good extension order is dependent on the score estimation function SCORE: The extension order should lead to informative scores early on so that misleading hypotheses can be pruned out early. [sent-140, score-0.305]

43 In most of our experiments we will make use of a very simple extension order: HIGHEST UNIGRAM FREQUENCY simply fixes the most frequent symbols first. [sent-141, score-0.143]

44 In each step it greedily chooses the symbol that will maximize the number of fixed ciphertext n-grams. [sent-143, score-0.334]
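Both extension orders can be sketched in a few lines of Python (function names are ours): HIGHEST UNIGRAM FREQUENCY is a fixed order by ciphertext frequency, while the greedy variant repeatedly picks the symbol that covers the most ciphertext n-grams once added to the already fixed set.

```python
from collections import Counter

def highest_unigram_frequency_order(ciphertext):
    """Fix the most frequent cipher symbols first."""
    return [f for f, _ in Counter(ciphertext).most_common()]

def greedy_ngram_order(ciphertext, n):
    """Greedily pick the symbol maximizing the number of fixed ciphertext n-grams."""
    ngrams = [ciphertext[i:i + n] for i in range(len(ciphertext) - n + 1)]
    fixed, order = set(), []
    remaining = set(ciphertext)
    while remaining:
        # Count n-grams fully covered by the fixed symbols plus candidate f.
        best = max(remaining,
                   key=lambda f: sum(all(c in fixed or c == f for c in g)
                                     for g in ngrams))
        order.append(best)
        fixed.add(best)
        remaining.remove(best)
    return order
```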

45 5 Pruning (PRUNE) We propose two pruning methods: HISTOGRAM PRUNING sorts all hypotheses according to their score and then keeps only the best nkeep hypotheses. [sent-146, score-0.279]

46 THRESHOLD PRUNING keeps only those hypotheses φ_keep for which SCORE(φ_keep) ≥ SCORE(φ_best) − β (9) holds for a given parameter β ≥ 0, as sketched below. [sent-147, score-0.112]
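Both pruning methods as minimal Python sketches (ours), operating on (score, φ) pairs:

```python
def histogram_prune(hypotheses, n_keep):
    """HISTOGRAM PRUNING: sort by score and keep only the n_keep best."""
    return sorted(hypotheses, key=lambda h: h[0], reverse=True)[:n_keep]

def threshold_prune(hypotheses, beta):
    """THRESHOLD PRUNING (Equation 9): keep scores within beta of the best."""
    best = max(s for s, _ in hypotheses)
    return [(s, phi) for s, phi in hypotheses if s >= best - beta]
```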

47 The basic idea is to run a decipherment algorithm—in their case an EM algorithm based approach—on a subset of the vocabulary. [sent-151, score-0.439]

48 After having obtained the results from the restricted vocabulary run, these results are used to initialize a decipherment run with a larger vocabulary. [sent-152, score-0.469]

49 The results from this run will then be used for a further decipherment run with an even larger vocabulary and so on. [sent-153, score-0.497]

50 In our large vocabulary word substitution cipher experiments we iteratively increase the vocabulary from the 1000 most frequent words, until we finally reach the 50000 most frequent words. [sent-154, score-0.951]
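A sketch of this iterative scheme (run_beam_search and its initial_phi argument are our assumptions about the interface): each run is restricted to the top-k vocabulary, and its result seeds the next, larger run. The schedule 1k → 10k → 20k → 50k matches the vocabulary sizes reported in Table 3.

```python
def iterative_decipherment(run_beam_search, words_by_frequency,
                           schedule=(1000, 10000, 20000, 50000)):
    """Decipher on growing vocabularies, seeding each run with the previous mapping."""
    phi = {}
    for top_k in schedule:
        vocab = words_by_frequency[:top_k]
        # run_beam_search is assumed to accept a partial mapping as initialization
        phi = run_beam_search(vocab, initial_phi=phi)
    return phi
```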

51 6 Experimental Evaluation We conduct experiments on letter-based 1:1 substitution ciphers, the homophonic substitution cipher Zodiac-408, and word-based 1:1 substitution ciphers. [sent-155, score-1.085]

52 Roughly speaking, SER reports the fraction of symbols in the deciphered text that are not correct, while MER reports the fraction of incorrect mappings in φ. [sent-157, score-0.144]

53 In decipherment experiments, SER will often be lower than MER, since it is often easier to decipher frequent words. [sent-159, score-0.436]
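A sketch of the two measures as defined above (code is ours): SER compares the deciphered text position by position against the reference, while MER compares the obtained mapping φ against the reference mapping.

```python
def symbol_error_rate(deciphered, reference):
    """SER: fraction of positions where the deciphered symbol is wrong."""
    wrong = sum(d != r for d, r in zip(deciphered, reference))
    return wrong / len(reference)

def mapping_error_rate(phi, phi_reference):
    """MER: fraction of cipher symbols mapped to the wrong plaintext symbol."""
    wrong = sum(phi.get(f) != e for f, e in phi_reference.items())
    return wrong / len(phi_reference)
```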

54 1 Letter Substitution Ciphers As ciphertext we use the text of the English Wikipedia article about History, remove all pictures, tables, and captions, convert all letters to lowercase, and then remove all non-letter and non-space symbols. [sent-161, score-0.209]

55 We create the ciphertext using a 1:1 substitution cipher in which we fix the mapping of the space symbol ' '. [sent-164, score-1.102]

56 Table 1: Symbol error rates (SER), mapping error rates (MER) and runtimes (RT) as a function of language model order (ORDER) and histogram pruning size (BEAM) for decipherment of letter substitution ciphers of length 128. [sent-230, score-1.576]

57 Results for beam size “∞” were obtained using A∗ search. [sent-232, score-0.229]

58 Note that fixing the space symbol makes the problem much easier: the exact methods show much higher computational demands for lengths beyond 256 letters when not fixing the space symbol. [sent-234, score-0.195]

59 As plaintext language model we use a letter-based model with vocabulary Ve = {a, ..., z}. [sent-235, score-0.234]

60 We use extension limits fitting the 1:1 substitution cipher (nmax = 1) and histogram pruning with different beam sizes. [sent-242, score-1.443]

61 Figure 3 shows the results of our algorithm for different cipher lengths. [sent-244, score-0.535]

62 We use a beam size of 100k for the 4-, 5- and 6-gram cases. [sent-245, score-0.229]

63 Most remarkably, our 6-gram beam search results are significantly better than all methods presented in the literature. [sent-246, score-0.263]

64 For the cipher length of 32 we obtain a symbol error rate of just 4.13%. [sent-247, score-0.719]

65 Figure 3: Symbol error rates for decipherment of letter substitution ciphers of different lengths. [sent-250, score-1.163]

66 Error bars show the 95% confidence interval based on decipherment on 50 different ciphers. [sent-251, score-0.388]

67 Beam search was performed with a beam size of “100k”. [sent-252, score-0.28]

68 The optimal solution (i.e. without search errors) for a 3-gram language model has a symbol error rate as high as 38.6%. [sent-253, score-0.207]

69 Table 1 shows error rates and runtimes of our algorithm for different beam sizes and language model orders given a fixed ciphertext length of 128 letters. [sent-255, score-0.722]

70 To summarize: the beam search method is significantly faster and obtains significantly better results than previously published methods. [sent-258, score-0.301]

71 Furthermore it offers a good trade-off between CPU time and decipherment accuracy. [sent-259, score-0.388]

72 2 Zodiac-408 Cipher As ciphertext we use a transcription of the Zodiac-408 cipher. [sent-262, score-0.171]

73 Furthermore, the last 17 letters of the cipher do not form understandable English when applying the same homophonic substitution that deciphers the rest of the cipher. [sent-269, score-1.006]

74 This makes the Zodiac-408 a good candidate for testing the robustness of a decipherment algorithm. [sent-270, score-0.388]

75 We assume a homophonic substitution cipher, even though the cipher is not strictly homophonic: It contains three cipher symbols that correspond to two or more plaintext symbols. [sent-271, score-1.706]

76 We ignore this fact for our experiments, and count—in case of the MER only—the decipherment for these symbols as correct when the obtained mapping is contained in the set of reference symbols. [sent-272, score-0.477]

77 We use extension limits with nmax = 8 and histogram pruning with beam sizes of 10k up to 10M. [sent-273, score-0.67]

78 The plaintext language model is based on the same subset of Gigaword (Graff et al., 2007). [sent-274, score-0.171]

79 It is the same data as used in the experiments for the letter substitution ciphers. [sent-275, score-0.346]

80 Table 2: Symbol error rates (SER), mapping error rates (MER) and runtimes (RT) as a function of language model order (ORDER) and histogram pruning size (BEAM) for the decipherment of the Zodiac-408 cipher. [sent-300, score-0.983]

81 Figure 4 shows the first parts of the cipher and our best decipherment. [sent-304, score-0.512]

82 Table 2 shows the results of our algorithm on the Zodiac-408 cipher for different language model orders and pruning settings. [sent-305, score-0.678]

83 To summarize: our final decipherment—for which we only use a 6-gram language model—has a symbol error rate of only 2.0%. [sent-306, score-0.207]

84 This is slightly better than the best decipherment reported in (Ravi and Knight, 2011a). [sent-307, score-0.388]

85 They used an n-gram language model together with a word dictionary and obtained a symbol error rate of 2. [sent-308, score-0.225]

86 We run experiments for three different setups: the "JRC" and "Gigaword" setups use the first half of the respective corpus as ciphertext, while the plaintext language model of order n = 3 was trained on the second half of the corpus. [sent-317, score-0.241]

87 Table 3: Word error rates (WER), mapping error rates (MER) and runtimes (RT) for iterative decipherment runs on the (TOP) most frequent words. [sent-333, score-0.818]

88 We create the ciphertext using a 1:1 substitution cipher on the word level, which imposes a much larger vocabulary size. [sent-338, score-1.019]

89 We use histogram pruning with a beam size of 128 and use extension limits of nmax = 1. [sent-339, score-0.665]

90 In contrast to the previous experiments, we use iterative beam search, with iterations as shown in Table 3. [sent-340, score-0.296]

91 The results for the Gigaword task are directly comparable to the word substitution experiments presented in (Dou and Knight, 2012). [sent-341, score-0.283]

92 Their final decipherment has a symbol error rate of 7.8%. [sent-342, score-0.595]

93 The improvement from 7.8% to 6.0% symbol error rate corresponds to a larger improvement in terms of mapping error rate. [sent-347, score-0.308]

94 This can also be seen when looking at Table 3: An improvement of the symbol error rate from 6. [sent-348, score-0.207]

95 96% corresponds to an improvement of mapping error rate from 21. [sent-350, score-0.139]

96 To summarize: Using our beam search algorithm in an iterative fashion, we are able to improve the state-of-the-art decipherment accuracy for word substitution ciphers. [sent-353, score-0.99]

97 7 Conclusion We have presented a simple and effective beam search approach to the decipherment problem. [sent-354, score-0.651]

98 We have shown in a variety of experiments—letter substitution ciphers, the Zodiac-408, and word substitution ciphers—that our approach outperforms the current state of the art while being conceptually simpler and keeping computational demands low. [sent-355, score-0.602]

99 We want to note that the presented algorithm is not restricted to 1:1 and homophonic substitution ciphers: It is possible to extend the algorithm to solve n:m mappings. [sent-356, score-0.502]

100 Along with more sophisticated pruning strategies, score estimation functions, and extension orders, this will be left for future research. [sent-357, score-0.214]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cipher', 0.512), ('decipherment', 0.388), ('substitution', 0.283), ('ciphers', 0.247), ('beam', 0.212), ('mer', 0.193), ('ext', 0.173), ('homophonic', 0.173), ('plaintext', 0.171), ('ciphertext', 0.171), ('vf', 0.119), ('ser', 0.113), ('ravi', 0.108), ('pruning', 0.107), ('symbol', 0.102), ('hypotheses', 0.095), ('limits', 0.092), ('histogram', 0.089), ('nmax', 0.085), ('runtimes', 0.082), ('jrc', 0.08), ('cardinality', 0.079), ('hs', 0.073), ('error', 0.067), ('rates', 0.064), ('letter', 0.063), ('extension', 0.063), ('ve', 0.061), ('gigaword', 0.061), ('knight', 0.06), ('corlett', 0.057), ('symbols', 0.055), ('vocabulary', 0.053), ('ref', 0.051), ('search', 0.051), ('fixed', 0.045), ('partial', 0.043), ('prune', 0.04), ('deciphered', 0.039), ('fj', 0.038), ('rt', 0.038), ('letters', 0.038), ('published', 0.038), ('rate', 0.038), ('aachen', 0.037), ('nuhn', 0.037), ('graff', 0.037), ('ht', 0.037), ('orders', 0.036), ('deciphering', 0.035), ('mapping', 0.034), ('dou', 0.034), ('iterative', 0.033), ('nkeep', 0.032), ('veval', 0.032), ('sujith', 0.031), ('array', 0.03), ('cryptograms', 0.028), ('enciphered', 0.028), ('score', 0.028), ('run', 0.028), ('extensions', 0.027), ('optimal', 0.025), ('fn', 0.025), ('frequent', 0.025), ('fraction', 0.025), ('malte', 0.025), ('heuristics', 0.024), ('heuristic', 0.024), ('optimally', 0.023), ('rwth', 0.023), ('decipher', 0.023), ('algorithm', 0.023), ('summarize', 0.023), ('fulfill', 0.023), ('functions', 0.022), ('sizes', 0.022), ('steinberger', 0.022), ('setups', 0.022), ('order', 0.02), ('demands', 0.019), ('newly', 0.019), ('cpu', 0.018), ('kevin', 0.018), ('solving', 0.018), ('dictionary', 0.018), ('token', 0.018), ('fixing', 0.018), ('dependence', 0.018), ('conceptually', 0.017), ('condition', 0.017), ('keeps', 0.017), ('size', 0.017), ('notation', 0.016), ('trivial', 0.016), ('estimation', 0.016), ('roughly', 0.016), ('substitutions', 0.016), ('ilp', 0.016), ('chooses', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 66 acl-2013-Beam Search for Solving Substitution Ciphers

Author: Malte Nuhn ; Julian Schamper ; Hermann Ney

Abstract: In this paper we address the problem of solving substitution ciphers using a beam search approach. We present a conceptually consistent and easy to implement method that improves the current state of the art for decipherment of substitution ciphers and is able to use high order n-gram language models. We show experiments with 1:1 substitution ciphers in which the guaranteed optimal solution for 3-gram language models has 38.6% decipherment error, while our approach achieves 4.13% decipherment error in a fraction of the time by using a 6-gram language model. We also apply our approach to the famous Zodiac-408 cipher and obtain slightly better (and near to optimal) results than previously published. Unlike the previous state-of-the-art approach that uses additional word lists to evaluate possible decipherments, our approach only uses a letter-based 6-gram language model. Furthermore we use our algorithm to solve large vocabulary substitution ciphers and improve the best published decipherment error rate on the Gigaword corpus from 7.8% to 6.0%.

2 0.4769572 108 acl-2013-Decipherment

Author: Kevin Knight

Abstract: The first natural language processing systems had a straightforward goal: decipher coded messages sent by the enemy. This tutorial explores connections between early decipherment research and today's NLP work. We cover classic military and diplomatic ciphers, automatic decipherment algorithms, unsolved ciphers, language translation as decipherment, and analyzing ancient writing as decipherment. 1 Tutorial Overview The first natural language processing systems had a straightforward goal: decipher coded messages sent by the enemy. Sixty years later, we have many more applications, including web search, question answering, summarization, speech recognition, and language translation. This tutorial explores connections between early decipherment research and today's NLP work. We find that many ideas from the earlier era have become core to the field, while others still remain to be picked up and developed. We first cover classic military and diplomatic cipher types, including complex substitution ciphers implemented in the first electro-mechanical encryption machines. We look at mathematical tools (language recognition, frequency counting, smoothing) developed to decrypt such ciphers on proto-computers. We show algorithms and extensive empirical results for solving different types of ciphers, and we show the role of algorithms in recent decipherments of historical documents. We then look at how foreign language can be viewed as a code for English, a concept developed by Alan Turing and Warren Weaver. We describe recently published work on building automatic translation systems from non-parallel data. We also demonstrate how some of the same algorithmic tools can be applied to natural language tasks like part-of-speech tagging and word alignment. Turning back to historical ciphers, we explore a number of unsolved ciphers, giving results of initial computer experiments on several of them. Finally, we look briefly at writing as a way to encipher phoneme sequences, covering ancient scripts and modern applications. 2 Outline 1. Classical military/diplomatic ciphers (15 minutes) • 60 cipher types (ACA) • Ciphers vs. codes • Enigma cipher: the mother of natural language processing – computer analysis of text – language recognition – Good-Turing smoothing 2. Foreign language as a code (10 minutes) • Alan Turing's "Thinking Machines" • Warren Weaver's Memorandum 3. Automatic decipherment (55 minutes) • Cipher type detection • Substitution ciphers (simple, homophonic, polyalphabetic, etc.) – plaintext language recognition ∗ how much plaintext knowledge is needed ∗ index of coincidence, unicity distance, and other measures – navigating a difficult search space ∗ frequencies of letters and words ∗ pattern words and cribs ∗ EM, ILP, Bayesian models, sampling – recent decipherments ∗ Jefferson cipher, Copiale cipher, civil war ciphers, naval Enigma • Application to part-of-speech tagging, word alignment • Application to machine translation without parallel text • Parallel development of cryptography and translation • Recently released NSA internal newsletter (1974-1997) 4. *** Break *** (30 minutes) 5.
Unsolved ciphers (40 minutes) • Zodiac 340 (1969), including computational work • Voynich Manuscript (early 1400s), including computational work • Beale (1885) • Dorabella (1897) • Taman Shud (1948) • Kryptos (1990), including computational work • McCormick (1999) • Shoeboxes in attics: DuPonceau journal, Finnerana, SYP, Mopse, diptych 6. Writing as a code (20 minutes) • Does writing encode ideas, or does it encode phonemes? • Ancient script decipherment – Egyptian hieroglyphs – Linear B – Mayan glyphs – Ugaritic, including computational work – Chinese Nüshu, including computational work • Automatic phonetic decipherment • Application to transliteration 7. Undeciphered writing systems (15 minutes) • Indus Valley Script (3300BC) • Linear A (1900BC) • Phaistos disc (1700BC?) • Rongorongo (1800s?) 8. Conclusion and further questions (15 minutes) 3 About the Presenter Kevin Knight is a Senior Research Scientist and Fellow at the Information Sciences Institute of the University of Southern California (USC), and a Research Professor in USC's Computer Science Department. He received a PhD in computer science from Carnegie Mellon University and a bachelor's degree from Harvard University. Professor Knight's research interests include natural language processing, machine translation, automata theory, and decipherment. In 2001, he co-founded Language Weaver, Inc., and in 2011, he served as President of the Association for Computational Linguistics. Dr. Knight has taught computer science courses at USC for more than fifteen years and co-authored the widely adopted textbook Artificial Intelligence.

3 0.45453888 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers

Author: Malte Nuhn ; Hermann Ney

Abstract: In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard. We show that in this case the decipherment problem is equivalent to the quadratic assignment problem (QAP). To the best of our knowledge, this connection between the QAP and the decipherment problem has not been known in the literature before.

4 0.22821806 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

Author: Sujith Ravi

Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report—for the first time—BLEU score results for a large-scale MT task using only non-parallel data (EMEA corpus).

5 0.13823324 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

Author: Ji Ma ; Jingbo Zhu ; Tong Xiao ; Nan Yang

Abstract: In this paper, we combine easy-first dependency parsing and POS tagging algorithms with beam search and structured perceptron. We propose a simple variant of "early-update" to ensure valid update in the training process. The proposed solution can also be applied to combine beam search and structured perceptron with other systems that exhibit spurious ambiguity. On CTB, we achieve 94.01% tagging accuracy and 86.33% unlabeled attachment score with a relatively small beam width. On PTB, we also achieve state-of-the-art performance.

6 0.10116324 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

7 0.093444638 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

8 0.079735167 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

9 0.068399638 133 acl-2013-Efficient Implementation of Beam-Search Incremental Parsers

10 0.058168367 206 acl-2013-Joint Event Extraction via Structured Prediction with Global Features

11 0.054615788 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

12 0.046606284 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

13 0.046041623 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

14 0.042624768 370 acl-2013-Unsupervised Transcription of Historical Documents

15 0.041970771 325 acl-2013-Smoothed marginal distribution constraints for language modeling

16 0.040223982 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

17 0.038645729 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation

18 0.037849456 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation

19 0.037466638 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

20 0.036689125 57 acl-2013-Arguments and Modifiers from the Learner's Perspective


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.112), (1, -0.064), (2, 0.007), (3, 0.015), (4, -0.029), (5, -0.015), (6, 0.076), (7, -0.005), (8, -0.116), (9, -0.054), (10, -0.041), (11, -0.1), (12, -0.124), (13, -0.302), (14, 0.075), (15, -0.431), (16, -0.094), (17, -0.168), (18, -0.103), (19, 0.304), (20, -0.006), (21, 0.053), (22, -0.149), (23, -0.155), (24, 0.096), (25, 0.046), (26, 0.13), (27, -0.06), (28, -0.118), (29, -0.03), (30, -0.057), (31, 0.048), (32, 0.03), (33, -0.058), (34, -0.062), (35, -0.047), (36, 0.044), (37, -0.008), (38, 0.062), (39, -0.057), (40, 0.008), (41, 0.003), (42, 0.046), (43, 0.099), (44, 0.046), (45, 0.035), (46, 0.014), (47, -0.071), (48, -0.017), (49, -0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95426565 66 acl-2013-Beam Search for Solving Substitution Ciphers

Author: Malte Nuhn ; Julian Schamper ; Hermann Ney

Abstract: In this paper we address the problem of solving substitution ciphers using a beam search approach. We present a conceptually consistent and easy to implement method that improves the current state of the art for decipherment of substitution ciphers and is able to use high order n-gram language models. We show experiments with 1:1 substitution ciphers in which the guaranteed optimal solution for 3-gram language models has 38.6% decipherment error, while our approach achieves 4.13% decipherment error in a fraction of the time by using a 6-gram language model. We also apply our approach to the famous Zodiac-408 cipher and obtain slightly better (and near to optimal) results than previously published. Unlike the previous state-of-the-art approach that uses additional word lists to evaluate possible decipherments, our approach only uses a letter-based 6-gram language model. Furthermore we use our algorithm to solve large vocabulary substitution ciphers and improve the best published decipherment error rate on the Gigaword corpus from 7.8% to 6.0%.

2 0.94775707 108 acl-2013-Decipherment

Author: Kevin Knight

Abstract: The first natural language processing systems had a straightforward goal: decipher coded messages sent by the enemy. This tutorial explores connections between early decipherment research and today's NLP work. We cover classic military and diplomatic ciphers, automatic decipherment algorithms, unsolved ciphers, language translation as decipherment, and analyzing ancient writing as decipherment. 1 Tutorial Overview The first natural language processing systems had a straightforward goal: decipher coded messages sent by the enemy. Sixty years later, we have many more applications, including web search, question answering, summarization, speech recognition, and language translation. This tutorial explores connections between early decipherment research and today's NLP work. We find that many ideas from the earlier era have become core to the field, while others still remain to be picked up and developed. We first cover classic military and diplomatic cipher types, including complex substitution ciphers implemented in the first electro-mechanical encryption machines. We look at mathematical tools (language recognition, frequency counting, smoothing) developed to decrypt such ciphers on proto-computers. We show algorithms and extensive empirical results for solving different types of ciphers, and we show the role of algorithms in recent decipherments of historical documents. We then look at how foreign language can be viewed as a code for English, a concept developed by Alan Turing and Warren Weaver. We describe recently published work on building automatic translation systems from non-parallel data. We also demonstrate how some of the same algorithmic tools can be applied to natural language tasks like part-of-speech tagging and word alignment. Turning back to historical ciphers, we explore a number of unsolved ciphers, giving results of initial computer experiments on several of them. Finally, we look briefly at writing as a way to encipher phoneme sequences, covering ancient scripts and modern applications. 2 Outline 1. Classical military/diplomatic ciphers (15 minutes) • 60 cipher types (ACA) • Ciphers vs. codes • Enigma cipher: the mother of natural language processing – computer analysis of text – language recognition – Good-Turing smoothing 2. Foreign language as a code (10 minutes) • Alan Turing's "Thinking Machines" • Warren Weaver's Memorandum 3. Automatic decipherment (55 minutes) • Cipher type detection • Substitution ciphers (simple, homophonic, polyalphabetic, etc.) – plaintext language recognition ∗ how much plaintext knowledge is needed ∗ index of coincidence, unicity distance, and other measures – navigating a difficult search space ∗ frequencies of letters and words ∗ pattern words and cribs ∗ EM, ILP, Bayesian models, sampling – recent decipherments ∗ Jefferson cipher, Copiale cipher, civil war ciphers, naval Enigma • Application to part-of-speech tagging, word alignment • Application to machine translation without parallel text • Parallel development of cryptography and translation • Recently released NSA internal newsletter (1974-1997) 4. *** Break *** (30 minutes) 5.
Unsolved ciphers (40 minutes) • Zodiac 340 (1969), including computational work • Voynich Manuscript (early 1400s), including computational work • Beale (1885) • Dorabella (1897) • Taman Shud (1948) • Kryptos (1990), including computational work • McCormick (1999) • Shoeboxes in attics: DuPonceau journal, Finnerana, SYP, Mopse, diptych 6. Writing as a code (20 minutes) • Does writing encode ideas, or does it encode phonemes? • Ancient script decipherment – Egyptian hieroglyphs – Linear B – Mayan glyphs – Ugaritic, including computational work – Chinese Nüshu, including computational work • Automatic phonetic decipherment • Application to transliteration 7. Undeciphered writing systems (15 minutes) • Indus Valley Script (3300BC) • Linear A (1900BC) • Phaistos disc (1700BC?) • Rongorongo (1800s?) 8. Conclusion and further questions (15 minutes) 3 About the Presenter Kevin Knight is a Senior Research Scientist and Fellow at the Information Sciences Institute of the University of Southern California (USC), and a Research Professor in USC's Computer Science Department. He received a PhD in computer science from Carnegie Mellon University and a bachelor's degree from Harvard University. Professor Knight's research interests include natural language processing, machine translation, automata theory, and decipherment. In 2001, he co-founded Language Weaver, Inc., and in 2011, he served as President of the Association for Computational Linguistics. Dr. Knight has taught computer science courses at USC for more than fifteen years and co-authored the widely adopted textbook Artificial Intelligence.

3 0.92487979 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers

Author: Malte Nuhn ; Hermann Ney

Abstract: In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard. We show that in this case the decipherment problem is equivalent to the quadratic assignment problem (QAP). To the best of our knowledge, this connection between the QAP and the decipherment problem has not been known in the literature before.

4 0.38291961 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

Author: Sujith Ravi

Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report—for the first time—BLEU score results for a large-scale MT task using only non-parallel data (EMEA corpus).

5 0.27097976 370 acl-2013-Unsupervised Transcription of Historical Documents

Author: Taylor Berg-Kirkpatrick ; Greg Durrett ; Dan Klein

Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google's open source OCR system.

6 0.26764017 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

7 0.22506955 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

8 0.19402516 133 acl-2013-Efficient Implementation of Beam-Search Incremental Parsers

9 0.18296865 324 acl-2013-Smatch: an Evaluation Metric for Semantic Feature Structures

10 0.18235759 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

11 0.1807857 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

12 0.17845201 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation

13 0.17388247 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations

14 0.16912186 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

15 0.16487786 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

16 0.16128479 382 acl-2013-Variational Inference for Structured NLP Models

17 0.15504937 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information

18 0.15133837 381 acl-2013-Variable Bit Quantisation for LSH

19 0.15023977 325 acl-2013-Smoothed marginal distribution constraints for language modeling

20 0.14871068 225 acl-2013-Learning to Order Natural Language Texts


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.084), (6, 0.029), (8, 0.026), (11, 0.025), (24, 0.02), (26, 0.025), (35, 0.049), (42, 0.061), (48, 0.042), (70, 0.031), (88, 0.015), (90, 0.014), (95, 0.479)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99027485 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

Author: Chi-kiu Lo ; Karteek Addanki ; Markus Saers ; Dekai Wu

Abstract: We present the first ever results showing that tuning a machine translation system against a semantic frame based objective function, MEANT, produces more robustly adequate translations than tuning against BLEU or TER as measured across commonly used metrics and human subjective evaluation. Moreover, for informal web forum data, human evaluators preferred MEANT-tuned systems over BLEU- or TER-tuned systems by a significantly wider margin than that for formal newswire—even though automatic semantic parsing might be expected to fare worse on informal language. We argue thatbypreserving the meaning ofthe trans- lations as captured by semantic frames right in the training process, an MT system is constrained to make more accurate choices of both lexical and reordering rules. As a result, MT systems tuned against semantic frame based MT evaluation metrics produce output that is more adequate. Tuning a machine translation system against a semantic frame based objective function is independent ofthe translation model paradigm, so, any translation model can benefit from the semantic knowledge incorporated to improve translation adequacy through our approach.

2 0.98194575 336 acl-2013-Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews

Author: Kang Liu ; Liheng Xu ; Jun Zhao

Abstract: Mining opinion targets is a fundamental and important task for opinion mining from online reviews. To this end, there are usually two kinds of methods: syntax based and alignment based methods. Syntax based methods usually exploited syntactic patterns to extract opinion targets, which were however prone to suffer from parsing errors when dealing with online informal texts. In contrast, alignment based methods used word alignment model to fulfill this task, which could avoid parsing errors without using parsing. However, there is no research focusing on which kind of method is more better when given a certain amount of reviews. To fill this gap, this paper empiri- cally studies how the performance of these two kinds of methods vary when changing the size, domain and language of the corpus. We further combine syntactic patterns with alignment model by using a partially supervised framework and investigate whether this combination is useful or not. In our experiments, we verify that our combination is effective on the corpus with small and medium size.

3 0.98119062 359 acl-2013-Translating Dialectal Arabic to English

Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov

Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic characterlevel transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.

4 0.98097801 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example

Author: Kareem Darwish

Abstract: Some languages lack large knowledge bases and good discriminative features for Name Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in Fmeasure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and mi- croblogs test sets respectively.

5 0.96916342 37 acl-2013-Adaptive Parser-Centric Text Normalization

Author: Congle Zhang ; Tyler Baldwin ; Howard Ho ; Benny Kimelfeld ; Yunyao Li

Abstract: Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. While many of these tasks, such as parsing, perform the best over fully grammatically correct text, most existing text normalization approaches narrowly define the task in the word-to-word sense; that is, the task is seen as that of mapping all out-of-vocabulary non-standard words to their in-vocabulary standard forms. In this paper, we take a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text. To understand the real effect of normalization on the parser, we tie normal- ization performance directly to parser performance. Additionally, we design a customizable framework to address the often overlooked concept of domain adaptability, and illustrate that the system allows for transfer to new domains with a minimal amount of data and effort. Our experimental study over datasets from three domains demonstrates that our approach outperforms not only the state-of-the-art wordto-word normalization techniques, but also manual word-to-word annotations.

6 0.96905059 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information

same-paper 7 0.96453077 66 acl-2013-Beam Search for Solving Substitution Ciphers

8 0.95337176 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection

9 0.92591166 289 acl-2013-QuEst - A translation quality estimation framework

10 0.87168932 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

11 0.86190522 135 acl-2013-English-to-Russian MT evaluation campaign

12 0.85745811 255 acl-2013-Name-aware Machine Translation

13 0.84546041 317 acl-2013-Sentence Level Dialect Identification in Arabic

14 0.84144568 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation

15 0.83862096 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation

16 0.83717233 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks

17 0.82905728 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art

18 0.81458747 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

19 0.81323469 240 acl-2013-Microblogs as Parallel Corpora

20 0.80277658 97 acl-2013-Cross-lingual Projections between Languages from Different Families