acl acl2011 acl2011-56 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sujith Ravi ; Kevin Knight
Abstract: We introduce a novel Bayesian approach for deciphering complex substitution ciphers. Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. Bayesian inference is performed on our model using an efficient sampling technique. We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. The new method can be applied on more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.
Reference: text
sentIndex sentText sentNum sentScore
1 Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. [sent-3, score-0.692]
2 We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. [sent-5, score-1.384]
3 The new method can be applied on more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before. [sent-6, score-1.057]
4 These ciphers replace (English) plaintext letters with cipher symbols in order to generate the ciphertext sequence. [sent-8, score-1.503]
5 There exist many published works on automatic decipherment methods for solving simple letter-substitution ciphers. [sent-9, score-0.529]
6 Many existing methods use dictionary-based attacks employing huge word dictionaries to find plaintext patterns within the ciphertext (Peleg and Rosenfeld, 1979; Ganesan and Sherman, 1993; Jakobsen, 1995; Olson, 2007). [sent-10, score-0.643]
7 Ravi and Knight (2008) formulate decipherment as an integer programming problem and provide an exact method to solve simple substitution ciphers by using letter n-gram models along with deterministic key constraints. [sent-16, score-1.269]
8 Corlett and Penn (2010) work with large ciphertexts containing thousands of characters and provide another exact decipherment method using an A* search algorithm. [sent-17, score-0.519]
9 The famous Zodiac serial killer used one such cipher system for communication. [sent-21, score-0.599]
10 In 1969, the killer sent a three-part cipher message to newspapers claiming credit for recent shootings and crimes committed near the San Francisco area. [sent-22, score-0.623]
11 Oranchak (2008) presents a method for solving the Zodiac-408 cipher automatically with a dictionary-based attack using a genetic algorithm. [sent-24, score-0.591]
12 However, his method relies on using plaintext words from the known solution to solve the cipher, which departs from a strict decipherment scenario. [sent-25, score-0.945]
13 In this work, we present a novel approach for solving substitution ciphers using Bayesian learning. [sent-28, score-0.528]
14 Our novel contributions are as follows: We present a new probabilistic decipherment approach using Bayesian inference with sparse priors, which can be used to solve different types of substitution ciphers. [sent-29, score-0.711]
15 Our new method combines information from word dictionaries along with letter n-gram models, providing a robust decipherment model which offsets the disadvantages faced by previous approaches. [sent-30, score-0.515]
16 We evaluate the Bayesian decipherment output on three different types of substitution ciphers and show that unlike a previous approach, our new method solves all the ciphers completely. [sent-31, score-1.17]
17 In a letter substitution cipher, every letter p in the natural language (plaintext) sequence is replaced by a cipher token c, according to some substitution key. [sent-34, score-1.26]
18 The sender can encrypt the message using one of many different cipher systems. [sent-43, score-0.618]
19 The particular type of cipher system chosen determines the properties of the key. [sent-44, score-0.533]
20 For example, the substitution key can be deterministic in both the encipherment and decipherment directions, as shown in the above example. [sent-45, score-0.775]
21 That is, there is a 1-to-1 correspondence between the plaintext letters and ciphertext symbols. [sent-47, score-0.643]
22 2.1 Simple Substitution Ciphers: The key used in a simple substitution cipher is deterministic in both the encipherment and decipherment directions. [sent-50, score-1.305]
23 That is, there is a 1-to-1 mapping between plaintext letters and ciphertext symbols. [sent-52, score-0.643]
24 The example shown earlier depicts how a simple substitution cipher works. [sent-53, score-0.711]
25 We encrypt an original English plaintext message using a randomly generated simple substitution key to create the ciphertext. [sent-55, score-0.76]
26 For example, plaintext character “[glyph omitted]” maps to ciphertext character “[glyph omitted]”. [sent-58, score-0.659]
27 Figure 1 (top) shows a portion of the ciphertext along with the original plaintext used to create the cipher. [sent-59, score-0.621]
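To make the setup concrete, here is a minimal Python sketch of the scheme just described. The seed, key, and sample text are illustrative assumptions, not the paper's actual 414-letter experiment.

```python
import random
import string

def make_simple_key(seed=0):
    """Random 1-to-1 substitution key: plaintext letter -> cipher letter."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encrypt(plaintext, key):
    """Replace every letter via the key; non-letters (spaces) pass through."""
    return "".join(key.get(ch, ch) for ch in plaintext.lower())

key = make_simple_key()
ciphertext = encrypt("decipherment is fun", key)
# A 1-to-1 key is trivially invertible, so decryption with the TRUE key
# recovers the plaintext exactly.
inverse = {c: p for p, c in key.items()}
assert "".join(inverse.get(ch, ch) for ch in ciphertext) == "decipherment is fun"
```

Because the key is a bijection, decryption with the true key is trivial; the hard part, addressed below, is recovering the key from the ciphertext alone.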
28 2.2 Homophonic Ciphers: A homophonic cipher uses a substitution key that maps a plaintext letter to more than one cipher symbol. [sent-61, score-2.039]
29 These ciphers are more complex than simple substitution ciphers. [sent-79, score-0.515]
30 The number of potential cipher symbol substitutes for a particular plaintext letter is often proportional to the frequency of that letter in the plaintext language—for example, the English letter “E” is assigned more cipher symbols than “Z”. [sent-81, score-2.457]
31 The substitution key is, however, deterministic in the decipherment direction—each ciphertext symbol maps to a single plaintext letter. [sent-83, score-1.374]
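A hypothetical sketch of such a homophonic key follows. The symbol inventory and the toy frequency table are assumptions made for illustration, not the key used in the paper's experiments.

```python
import random
import string

TOY_FREQ = {"e": 4, "t": 3, "a": 3}   # toy pool sizes; all other letters get 1

def make_homophonic_key():
    """Plaintext letter -> pool of cipher symbol ids (pool size ~ frequency)."""
    next_id = iter(range(10_000))
    return {p: [next(next_id) for _ in range(TOY_FREQ.get(p, 1))]
            for p in string.ascii_lowercase}

def encrypt(plaintext, key, seed=0):
    """Enciphering is non-deterministic: any symbol from the pool may be used."""
    rng = random.Random(seed)
    return [rng.choice(key[ch]) for ch in plaintext if ch.isalpha()]

key = make_homophonic_key()
cipher = encrypt("better", key)
# Deciphering is deterministic: each symbol belongs to exactly one letter.
inverse = {c: p for p, pool in key.items() for c in pool}
assert "".join(inverse[c] for c in cipher) == "better"
```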
32 Data: For our decipherment experiments on homophonic ciphers, we use the same 414-letter English plaintext used in Section 2.1. [sent-85, score-1.092]
33 We encrypt this message using a homophonic substitution key (available from http://www. [sent-87, score-0.492]
34 Figure 1 (middle) displays a section of the homophonic cipher (with spaces) and the original plaintext message used in our experiments. [sent-92, score-1.182]
35 2.3 Homophonic Ciphers without Spaces (Zodiac-408 Cipher): In the previous two cipher systems, the word-boundary information was preserved in the cipher. [sent-94, score-0.567]
36 We now consider a more difficult homophonic cipher by removing space characters from the original plaintext. [sent-95, score-0.724]
37 Without the word boundary information, typical dictionary-based decipherment attacks fail on such ciphers. [sent-102, score-0.522]
38 One of the most famous homophonic ciphers in history was used by the infamous Zodiac serial killer in the 1960s. [sent-104, score-0.581]
39 The Zodiac messages include two interesting ciphers: (1) a 408-symbol homophonic cipher without spaces (which was solved by hand), and (2) a similar-looking 340-symbol cipher that has yet to be solved. [sent-108, score-1.263]
40 Here is a sample of the Zodiac-408 cipher message: [cipher symbols omitted]. [sent-109, score-0.541]
41 Besides the difficulty with missing word boundaries and non-determinism associated with the key, the Zodiac-408 cipher poses several additional challenges which make it harder to solve than any standard homophonic cipher. [sent-115, score-0.716]
42 Also, the last 18 characters of the plaintext message do not seem to make any sense (“EBEORIETEMETHHPITI”). [sent-117, score-0.49]
43 Data: Figure 1 (bottom) displays the Zodiac-408 cipher (consisting of 408 tokens, 54 symbol types) along with the original plaintext message. [sent-118, score-0.978]
44 We run the new decipherment method (described in Section 3) on this cipher. [sent-119, score-0.515]
45 Given a ciphertext c1...cn, the goal of decipherment is to uncover the hidden plaintext message p1...pn. [sent-124, score-0.966]
46 The key space (i.e., the number of possible key mappings) that we have to navigate during decipherment is huge—a simple substitution cipher has a keyspace size of 26!. [sent-130, score-1.255]
47 In comparison, a homophonic cipher such as the Zodiac-408 cipher, with its 54 cipher symbol types, has 26^54 possible key mappings. [sent-131, score-1.247]
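As a quick sanity check on these key-space sizes (the 54 comes from the 54 cipher symbol types of the Zodiac-408 cipher noted above):

```python
import math

simple = math.factorial(26)   # 1-to-1 keys over the 26-letter alphabet
homophonic = 26 ** 54         # each of 54 cipher symbol types -> one of 26 letters
print(f"26!   = {simple:.3e}")      # ~4.033e+26
print(f"26^54 = {homophonic:.3e}")  # ~2.562e+76
```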
48 Next, we describe a new Bayesian decipherment approach for tackling substitution ciphers. [sent-132, score-0.685]
49 Snyder et al. (2010) proposed a Bayesian approach in an archaeological decipherment scenario. [sent-139, score-0.492]
50 For simple letter substitution ciphers, the original substitution key exhibits a 1-to-1 correspondence between the plaintext letters and cipher types. [sent-142, score-1.57]
51 Here, we propose a novel approach for deciphering substitution ciphers using Bayesian inference. [sent-145, score-0.537]
52 Rather than enumerating all possible keys (e.g., 26! for a simple substitution cipher), our Bayesian framework requires us to sample only a small number of keys during the decipherment process. [sent-147, score-0.731]
53 Probabilistic Decipherment: Our decipherment method follows a noisy-channel approach. [sent-148, score-0.503]
54 Substitute each plaintext letter pi with a ciphertext token ci, with probability P(ci|pi), in order to generate the ciphertext sequence c = c1...cn. [sent-164, score-1.096]
55 We build a statistical English language model (LM) for the plaintext source model P(p), which assigns a probability to any English letter sequence. [sent-168, score-0.611]
56 In our decipherment framework, a Chinese Restaurant Process formulation is used to model both the source and channel. [sent-170, score-0.492]
57 Substitute p1 with cipher token c1, with probability P0(c1|p1). [sent-174, score-0.551]
58 Generate English plaintext letter pi with probability (α · P0(pi|pi−1) + C^{i−1}_1(pi−1, pi)) / (α + C^{i−1}_1(pi−1)). [sent-176, score-0.599]
59 Figure 1 caption: Output from Bayesian decipherment (using word+3-gram LM) for three different ciphers: (a) Simple Substitution Cipher (top), (b) Homophonic Substitution Cipher with spaces (middle), and (c) Zodiac-408 Cipher (bottom). [sent-181, score-0.541]
60 Substitute pi with cipher token ci, with probability (β · P0(ci|pi) + C^{i−1}_1(pi, ci)) / (β + C^{i−1}_1(pi)). [sent-183, score-0.636]
61 This defines the probability of any derivation, i.e., any plaintext hypothesis corresponding to the given ciphertext sequence. [sent-187, score-0.62]
62 The base distribution P0 in the channel is uniform (i.e., a plaintext letter can be substituted with any given cipher type with equal probability). [sent-191, score-1.143]
63 C^{i−1}_1 represents the count of events occurring before plaintext letter pi in the derivation (we call this the “cache”). [sent-192, score-0.715]
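The two cache-smoothed steps above are easy to state in code. The sketch below illustrates the CRP formulas only; the count tables stand in for the cache C^{i−1}_1, and the α and β values (α = 10^4, a sparse β for the channel) are taken from values reported later on this page, used here as assumptions.

```python
from collections import defaultdict

ALPHA, BETA = 1e4, 0.01   # illustrative prior values

source_counts = defaultdict(int)   # cache: (p_prev, p) -> count in derivation so far
source_totals = defaultdict(int)   # cache: p_prev -> count so far
channel_counts = defaultdict(int)  # cache: (p, c) -> count so far
channel_totals = defaultdict(int)  # cache: p -> count so far

def p_source(p, p_prev, p0):
    """Source step: (alpha * P0(p|p_prev) + cache count) / (alpha + cache total)."""
    return (ALPHA * p0 + source_counts[(p_prev, p)]) / (ALPHA + source_totals[p_prev])

def p_channel(c, p, n_symbol_types):
    """Channel step, with a uniform base distribution P0(c|p) = 1/n_symbol_types."""
    return (BETA / n_symbol_types + channel_counts[(p, c)]) / (BETA + channel_totals[p])
```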
64 We could follow a point-wise sampling strategy, where we sample plaintext letter choices for every cipher token, one at a time. [sent-196, score-1.242]
65 But we already know that the substitution ciphers described here exhibit determinism in the deciphering direction (see footnote 1). [sent-197, score-0.55]
66 That is, although we have no idea about the key mappings themselves, we do know that there exists only a single plaintext letter mapping for every cipher symbol type in the true key. [sent-199, score-1.191]
67 So sampling plaintext choices for every cipher token separately is not an efficient strategy—our sampler may spend too much time exploring invalid keys (which map the same cipher symbol to different plaintext letters). [sent-200, score-2.073]
68 Footnote 1: This assumption does not strictly apply to the Zodiac-408 cipher, where a few cipher symbols exhibit non-determinism in the decipherment direction as well. [sent-203, score-1.561]
69 Under this scheme, we sample plaintext letter choices for each cipher symbol type. [sent-204, score-1.189]
70 In every step, we sample a new plaintext letter for a cipher type and update the entire plaintext hypothesis to reflect this change. [sent-205, score-1.588]
71 That is, we update the plaintext letters at all corresponding positions. [sent-207, score-0.456]
72 For example, if we sample a new choice pnew for a cipher symbol which occurs at positions 4, 10, 18, then we update plaintext letters p4, p10 and p18 with the new choice pnew. [sent-208, score-1.023]
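A hypothetical sketch of this type-level update follows; `score` stands in for the (unnormalized) derivation probability P(p) · P(c|p) computed from the cached model, and is an assumption of this illustration rather than the paper's exact scoring code.

```python
import string

def resample_type(cipher, plaintext, symbol, score, rng):
    """Sample one plaintext letter for a cipher symbol TYPE and rewrite every
    position where that symbol occurs, keeping the sampled key deterministic
    in the deciphering direction."""
    positions = [i for i, c in enumerate(cipher) if c == symbol]
    hypotheses, weights = [], []
    for p_new in string.ascii_lowercase:
        hyp = list(plaintext)
        for i in positions:        # e.g. positions 4, 10, 18 all flip together
            hyp[i] = p_new
        hypotheses.append(hyp)
        weights.append(score(hyp))
    # rng is a random.Random instance; choices() draws proportionally to weight.
    return rng.choices(hypotheses, weights=weights, k=1)[0]
```

Updating all occurrences of a symbol at once is exactly what keeps every sampled key valid, i.e., deterministic in the deciphering direction.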
73 Combining letter n-gram language models with word dictionaries: Many existing probabilistic approaches use statistical letter n-gram language models of English to assign P(p) probabilities to plaintext hypotheses during decipherment. [sent-211, score-0.776]
74 Unlike previous approaches, our decipherment method combines information from both sources—letter n-grams and word dictionaries. [sent-213, score-0.692]
75 Sampling for ciphers without spaces: For ciphers without spaces, dictionaries are hard to use because we do not know where words start and end. [sent-227, score-0.662]
76 We introduce a new sampling operator which counters this problem and allows us to perform inference using the same decipherment model described earlier. [sent-228, score-0.594]
77 In a first sampling pass, we sample from the 26 plaintext letter choices (“a” through “z”). [sent-229, score-0.724]
78 The new strategy allows us to apply Bayesian decipherment even to ciphers without spaces. [sent-251, score-0.814]
79 As a result, we now have a new decipherment method that consistently works for a range of different types of substitution ciphers. [sent-252, score-0.684]
80 Decoding the ciphertext: After the sampling run has finished, we choose the final sample as our English plaintext decipherment output. [sent-253, score-1.028]
81 Footnote 4: For letter substitution decipherment we want to keep the language model probabilities fixed during training, and hence we set the prior on that model to be high (α = 10^4). [sent-254, score-0.864]
82 We instantiate a key which matches frequently occurring plaintext letters to frequent cipher symbols and use this to generate an initial sample for the given ciphertext and run the sampler for 5000 iterations. [sent-257, score-1.286]
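A sketch of that frequency-matching initialization; the English frequency ordering is a standard one assumed here, not necessarily the paper's exact list.

```python
from collections import Counter

ENGLISH_BY_FREQ = "etaoinshrdlcumwfgypbvkjxqz"   # assumed frequency ordering

def init_key(cipher):
    """Map the i-th most frequent cipher symbol to the i-th most frequent
    English letter, cycling through the alphabet for homophonic ciphers
    (a crude but serviceable starting point for the sampler)."""
    symbols = [s for s, _ in Counter(cipher).most_common()]
    return {s: ENGLISH_BY_FREQ[i % 26] for i, s in enumerate(symbols)}

key0 = init_key(list("abccabddc"))
initial_sample = [key0[c] for c in "abccabddc"]   # starting point for 5000 iterations
```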
83 4 Experiments and Results: We run decipherment experiments on different types of letter substitution ciphers (described in Section 2). [sent-259, score-1.184]
84 In particular, we work with the following three ciphers: (a) the 414-letter Simple Substitution Cipher, (b) the 414-letter Homophonic Cipher (with spaces), and (c) the Zodiac-408 Cipher. Methods: For each cipher, we run and compare the output from two different decipherment approaches. [sent-260, score-0.504]
85 The first is the EM-based baseline, which uses the EM algorithm to estimate the channel parameters θ during decipherment training. [sent-263, score-0.515]
86 The given ciphertext c is then decoded by using the Viterbi algorithm to choose the plaintext decoding p that maximizes P(p) · Pθ(c|p)^3, stretching the channel probabilities. [sent-264, score-0.629]
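For reference, a compact sketch of that decoding rule under a bigram letter LM. Here `lm` and `channel` are assumed dictionaries of log-probabilities estimated elsewhere; this illustrates the decision rule, not the baseline's actual implementation.

```python
import string

FLOOR = -99.0   # log-probability floor for unseen events

def viterbi_decode(cipher, lm, channel):
    """argmax_p [log P(p) + 3 * log P_theta(c|p)] under a bigram letter LM."""
    letters = string.ascii_lowercase
    # best[p] = (log score of best prefix ending in plaintext letter p, path)
    best = {p: (lm.get(("<s>", p), FLOOR) + 3 * channel.get((p, cipher[0]), FLOOR), [p])
            for p in letters}
    for c in cipher[1:]:
        step = {}
        for p in letters:
            score, path = max(
                (best[q][0] + lm.get((q, p), FLOOR), best[q][1] + [p]) for q in letters)
            step[p] = (score + 3 * channel.get((p, c), FLOOR), path)
        best = step
    return max(best.values())[1]
```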
87 Evaluation: We evaluate the quality of a particular decipherment as the percentage of cipher tokens that are decoded correctly. [sent-268, score-1.03]
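In code, this metric is simply:

```python
def decipherment_accuracy(decoded, reference):
    """Percentage of cipher tokens whose decoded letter matches the reference."""
    assert len(decoded) == len(reference)
    return 100.0 * sum(d == r for d, r in zip(decoded, reference)) / len(reference)

# e.g. decipherment_accuracy("hellp", "hello") -> 80.0
```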
88 Results: Figure 2 compares the decipherment performance for the EM method with Bayesian decipherment (using type sampling and sparse priors) on three different types of substitution ciphers. [sent-269, score-1.288]
89 Even with a 3-gram letter LM, our method yields a +63% improvement in decipherment accuracy over EM on the homophonic cipher with spaces. [sent-271, score-1.376]
90 Figure 1 shows samples from the Bayesian decipherment output for all three ciphers. [sent-273, score-0.492]
91 For ciphers without spaces, our method automatically guesses the word boundaries for the plaintext hypothesis. [sent-274, score-0.755]
92 For the Zodiac-408 cipher, we compare the performance achieved by Bayesian decipherment under different settings: Letter n-gram versus Word+n-gram LMs—Figure 2 shows that using a word+3-gram LM instead of a 3-gram LM results in a +75% improvement in decipherment accuracy. [sent-281, score-1.002]
93 Sparse versus non-sparse priors—we find that using a sparse prior for the channel helps for such problems and produces better decipherment results (97. [sent-284, score-0.492]
94 Type versus Point-wise sampling—Unlike point-wise sampling, type sampling quickly converges to better decipherment solutions. [sent-287, score-0.604]
95 After 5000 sampling passes over the entire data, decipherment output from type sampling scores 97. [sent-288, score-0.665]
96 5 Conclusion: In this work, we presented a novel Bayesian decipherment approach that can effectively solve a variety of substitution ciphers. Footnote 5: Both sampling runs were seeded with the same random initial sample. [sent-294, score-0.591]
97 Unlike previous approaches, our method combines information from letter n-gram language models and word dictionaries and provides a robust decipherment model. [sent-296, score-0.71]
98 Using Bayesian decipherment, we can successfully solve the Zodiac-408 cipher—the first time this is achieved by a fully automatic method in a strict decipherment scenario. [sent-298, score-0.523]
99 For future work, there are other interesting decipherment tasks where our method can be applied. [sent-299, score-0.503]
100 Oranchak (2008): Evolutionary algorithm for decryption of monoalphabetic homophonic substitution ciphers encoded as constraint satisfaction problems. [sent-362, score-0.7]
wordName wordTfidf (topN-words)
[('cipher', 0.518), ('decipherment', 0.492), ('plaintext', 0.422), ('ciphers', 0.322), ('ciphertext', 0.187), ('substitution', 0.181), ('homophonic', 0.178), ('letter', 0.177), ('pi', 0.085), ('bayesian', 0.083), ('sampling', 0.079), ('message', 0.052), ('spaces', 0.049), ('encipherment', 0.048), ('encrypt', 0.048), ('zodiac', 0.048), ('killer', 0.042), ('ravi', 0.038), ('lm', 0.035), ('letters', 0.034), ('deciphering', 0.034), ('key', 0.033), ('cryptanalysis', 0.029), ('cryptologia', 0.029), ('sampler', 0.026), ('symbol', 0.026), ('solving', 0.025), ('attack', 0.025), ('sujith', 0.025), ('famous', 0.025), ('channel', 0.023), ('keys', 0.023), ('sample', 0.023), ('choices', 0.023), ('em', 0.022), ('knight', 0.022), ('geman', 0.022), ('ci', 0.022), ('deterministic', 0.021), ('solve', 0.02), ('derivation', 0.02), ('symbols', 0.02), ('decoded', 0.02), ('character', 0.019), ('corlett', 0.019), ('decryption', 0.019), ('enciphering', 0.019), ('ganesan', 0.019), ('keyspace', 0.019), ('maxxpp', 0.019), ('peleg', 0.019), ('dictionaries', 0.018), ('english', 0.018), ('sparse', 0.018), ('versus', 0.018), ('priors', 0.017), ('lms', 0.017), ('bility', 0.017), ('enciphered', 0.017), ('gibbs', 0.017), ('characters', 0.016), ('kevin', 0.016), ('token', 0.016), ('attacking', 0.016), ('attacks', 0.016), ('evolutionary', 0.015), ('type', 0.015), ('boundary', 0.014), ('serial', 0.014), ('exchangeability', 0.014), ('prior', 0.014), ('exhibit', 0.013), ('tackling', 0.012), ('genetic', 0.012), ('solves', 0.012), ('simple', 0.012), ('probability', 0.012), ('maps', 0.012), ('inference', 0.012), ('run', 0.012), ('cache', 0.012), ('substitute', 0.012), ('combines', 0.012), ('donald', 0.012), ('snyder', 0.012), ('original', 0.012), ('newspapers', 0.011), ('operator', 0.011), ('unlike', 0.011), ('alphabet', 0.011), ('arg', 0.011), ('method', 0.011), ('monte', 0.011), ('substituted', 0.011), ('hypothesis', 0.011), ('occurring', 0.011), ('carlo', 0.01), ('preserve', 0.01), ('dempster', 0.01), ('sequence', 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers
Author: Sujith Ravi ; Kevin Knight
Abstract: We introduce a novel Bayesian approach for deciphering complex substitution ciphers. Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. Bayesian inference is performed on our model using an efficient sampling technique. We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. The new method can be applied on more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.
2 0.7366907 94 acl-2011-Deciphering Foreign Language
Author: Sujith Ravi ; Kevin Knight
Abstract: In this work, we tackle the task of machine translation (MT) without parallel training data. We frame the MT problem as a decipherment task, treating the foreign text as a cipher for English and present novel methods for training translation models from nonparallel text.
3 0.099235132 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision
Author: Fei Liu ; Fuliang Weng ; Bingqing Wang ; Yang Liu
Abstract: Most text message normalization approaches are based on supervised learning and rely on human labeled training data. In addition, the nonstandard words are often categorized into different types and specific models are designed to tackle each type. In this paper, we propose a unified letter transformation approach that requires neither pre-categorization nor human supervision. Our approach models the generation process from the dictionary words to nonstandard tokens under a sequence labeling framework, where each letter in the dictionary word can be retained, removed, or substituted by other letters/digits. To avoid the expensive and time-consuming hand labeling process, we automatically collected a large set of noisy training pairs using a novel web-based approach and performed character-level alignment for model training. Experiments on both Twitter and SMS messages show that our system significantly outperformed the state-of-the-art deletion-based abbreviation system and the jazzy spell checker (absolute accuracy gain of 21.69% and 18.16% over jazzy spell checker on the two test sets respectively).
4 0.076816849 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
Author: Coskun Mermer ; Murat Saraclar
Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs.
5 0.046827152 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
Author: Phil Blunsom ; Trevor Cohn
Abstract: In this work we address the problem of unsupervised part-of-speech induction by bringing together several strands of research into a single model. We develop a novel hidden Markov model incorporating sophisticated smoothing using a hierarchical Pitman-Yor processes prior, providing an elegant and principled means of incorporating lexical characteristics. Central to our approach is a new type-based sampling algorithm for hierarchical Pitman-Yor models in which we track fractional table counts. In an empirical evaluation we show that our model consistently out-performs the current state-of-the-art across 10 languages.
6 0.039111108 30 acl-2011-Adjoining Tree-to-String Translation
7 0.035104055 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations
8 0.032691404 173 acl-2011-Insertion Operator for Bayesian Tree Substitution Grammars
9 0.032154657 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
10 0.031757355 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling
11 0.031009471 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
12 0.029909324 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
13 0.028896905 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation
14 0.028332276 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
15 0.028248595 197 acl-2011-Latent Class Transliteration based on Source Language Origin
16 0.027793545 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation
17 0.027706729 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method
18 0.027679248 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements
19 0.026487533 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
20 0.02431659 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars
topicId topicWeight
[(0, 0.073), (1, -0.039), (2, 0.021), (3, 0.036), (4, 0.004), (5, -0.007), (6, 0.022), (7, -0.002), (8, -0.017), (9, 0.072), (10, -0.008), (11, 0.033), (12, 0.087), (13, 0.155), (14, -0.028), (15, 0.05), (16, -0.006), (17, 0.128), (18, 0.127), (19, -0.094), (20, 0.122), (21, 0.38), (22, 0.391), (23, 0.113), (24, 0.474), (25, 0.161), (26, 0.012), (27, -0.171), (28, 0.079), (29, 0.043), (30, -0.051), (31, 0.048), (32, 0.099), (33, 0.029), (34, -0.067), (35, 0.043), (36, -0.126), (37, 0.017), (38, 0.029), (39, 0.113), (40, 0.052), (41, -0.016), (42, -0.016), (43, 0.007), (44, 0.075), (45, -0.014), (46, 0.062), (47, -0.019), (48, -0.054), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.97701359 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers
Author: Sujith Ravi ; Kevin Knight
Abstract: We introduce a novel Bayesian approach for deciphering complex substitution ciphers. Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. Bayesian inference is performed on our model using an efficient sampling technique. We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. The new method can be applied on more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.
2 0.78821039 94 acl-2011-Deciphering Foreign Language
Author: Sujith Ravi ; Kevin Knight
Abstract: In this work, we tackle the task of machine translation (MT) without parallel training data. We frame the MT problem as a decipherment task, treating the foreign text as a cipher for English and present novel methods for training translation models from nonparallel text.
3 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision
Author: Fei Liu ; Fuliang Weng ; Bingqing Wang ; Yang Liu
Abstract: Most text message normalization approaches are based on supervised learning and rely on human labeled training data. In addition, the nonstandard words are often categorized into different types and specific models are designed to tackle each type. In this paper, we propose a unified letter transformation approach that requires neither pre-categorization nor human supervision. Our approach models the generation process from the dictionary words to nonstandard tokens under a sequence labeling framework, where each letter in the dictionary word can be retained, removed, or substituted by other letters/digits. To avoid the expensive and time-consuming hand labeling process, we automatically collected a large set of noisy training pairs using a novel web-based approach and performed character-level alignment for model training. Experiments on both Twitter and SMS messages show that our system significantly outperformed the state-of-the-art deletion-based abbreviation system and the jazzy spell checker (absolute accuracy gain of 21.69% and 18.16% over jazzy spell checker on the two test sets respectively).
4 0.24960034 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
Author: Phil Blunsom ; Trevor Cohn
Abstract: In this work we address the problem of unsupervised part-of-speech induction by bringing together several strands of research into a single model. We develop a novel hidden Markov model incorporating sophisticated smoothing using a hierarchical Pitman-Yor processes prior, providing an elegant and principled means of incorporating lexical characteristics. Central to our approach is a new type-based sampling algorithm for hierarchical Pitman-Yor models in which we track fractional table counts. In an empirical evaluation we show that our model consistently out-performs the current state-of-the-art across 10 languages.
5 0.23549761 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
Author: Coskun Mermer ; Murat Saraclar
Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs.
6 0.1911965 291 acl-2011-SystemT: A Declarative Information Extraction System
8 0.18314373 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
9 0.16887026 173 acl-2011-Insertion Operator for Bayesian Tree Substitution Grammars
10 0.16733773 321 acl-2011-Unsupervised Discovery of Rhyme Schemes
11 0.16045284 294 acl-2011-Temporal Evaluation
12 0.15463969 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity
13 0.15089932 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
14 0.14337642 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation
15 0.14129007 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars
16 0.13968547 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
17 0.13850446 162 acl-2011-Identifying the Semantic Orientation of Foreign Words
18 0.13236353 301 acl-2011-The impact of language models and loss functions on repair disfluency detection
19 0.13127494 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
20 0.12908989 125 acl-2011-Exploiting Readymades in Linguistic Creativity: A System Demonstration of the Jigsaw Bard
topicId topicWeight
[(5, 0.02), (17, 0.048), (26, 0.012), (36, 0.011), (37, 0.048), (39, 0.025), (41, 0.523), (55, 0.017), (59, 0.032), (72, 0.019), (91, 0.027), (96, 0.089)]
simIndex simValue paperId paperTitle
1 0.94441336 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport
Author: Siwei Wang ; Gina-Anne Levow
Abstract: Verbal feedback is an important information source in establishing interactional rapport. However, predicting verbal feedback across languages is challenging due to languagespecific differences, inter-speaker variation, and the relative sparseness and optionality of verbal feedback. In this paper, we employ an approach combining classifier weighting and SMOTE algorithm oversampling to improve verbal feedback prediction in Arabic, English, and Spanish dyadic conversations. This approach improves the prediction of verbal feedback, up to 6-fold, while maintaining a high overall accuracy. Analyzing highly weighted features highlights widespread use of pitch, with more varied use of intensity and duration.
2 0.93913424 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction
Author: Yu Hong ; Jianfeng Zhang ; Bin Ma ; Jianmin Yao ; Guodong Zhou ; Qiaoming Zhu
Abstract: Event extraction is the task of detecting certain specified types of events that are mentioned in the source language data. The state-of-the-art research on the task is transductive inference (e.g. cross-event inference). In this paper, we propose a new method of event extraction by well using cross-entity inference. In contrast to previous inference methods, we regard entitytype consistency as key feature to predict event mentions. We adopt this inference method to improve the traditional sentence-level event extraction system. Experiments show that we can get 8.6% gain in trigger (event) identification, and more than 11.8% gain for argument (role) classification in ACE event extraction. 1
3 0.91663438 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars
Author: Antske Fokkens
Abstract: When designing grammars of natural language, typically, more than one formal analysis can account for a given phenomenon. Moreover, because analyses interact, the choices made by the engineer influence the possibilities available in further grammar development. The order in which phenomena are treated may therefore have a major impact on the resulting grammar. This paper proposes to tackle this problem by using metagrammar development as a methodology for grammar engineering. I argue that metagrammar engineering as an approach facilitates the systematic exploration of grammars through comparison of competing analyses. The idea is illustrated through a comparative study of auxiliary structures in HPSG-based grammars for German and Dutch. Auxiliaries form a central phenomenon of German and Dutch and are likely to influence many components of the grammar. This study shows that a special auxiliary+verb construction significantly improves efficiency compared to the standard argument-composition analysis for both parsing and generation.
4 0.90438688 189 acl-2011-K-means Clustering with Feature Hashing
Author: Hajime Senuma
Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.
5 0.9001075 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations
Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa
Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.
6 0.89806694 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars
7 0.89625931 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing
same-paper 8 0.88945293 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers
10 0.61028087 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
11 0.60826385 94 acl-2011-Deciphering Foreign Language
12 0.59094644 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
13 0.58916551 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts
14 0.5763973 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue
15 0.56529218 135 acl-2011-Faster and Smaller N-Gram Language Models
16 0.55573314 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts
17 0.55560398 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base
18 0.5480656 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents
19 0.54120427 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing
20 0.53789163 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques