acl acl2011 acl2011-56 knowledge-graph by maker-knowledge-mining

56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers


Source: pdf

Author: Sujith Ravi ; Kevin Knight

Abstract: We introduce a novel Bayesian approach for deciphering complex substitution ciphers. Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. Bayesian inference is performed on our model using an efficient sampling technique. We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. The new method can be applied to more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. [sent-3, score-0.692]

2 We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. [sent-5, score-1.384]

3 The new method can be applied to more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before. [sent-6, score-1.057]

4 These ciphers replace (English) plaintext letters with cipher symbols in order to generate the ciphertext sequence. [sent-8, score-1.503]

5 There exist many published works on automatic decipherment methods for solving simple letter-substitution ciphers. [sent-9, score-0.529]

6 Many existing methods use dictionary-based attacks employing huge word dictionaries to find plaintext patterns within the ciphertext (Peleg and Rosenfeld, 1979; Ganesan and Sherman, 1993; Jakobsen, 1995; Olson, 2007). [sent-10, score-0.643]

7 Ravi and Knight (2008) formulate decipherment as an integer programming problem and provide an exact method to solve simple substitution ciphers by using letter n-gram models along with deterministic key constraints. [sent-16, score-1.269]

8 Corlett and Penn (2010) work with large ciphertexts containing thousands of characters and provide another exact decipherment method using an A* search algorithm. [sent-17, score-0.519]

9 The famous Zodiac serial killer used one such cipher system for communication. [sent-21, score-0.599]

10 In 1969, the killer sent a three-part cipher message to newspapers claiming credit for recent shootings and crimes committed in the San Francisco Bay Area. [sent-22, score-0.623]

11 Oranchak (2008) presents a method for solving the Zodiac-408 cipher automatically with a dictionary-based attack using a genetic algorithm. [sent-24, score-0.591]

12 However, his method relies on using plaintext words from the known solution to solve the cipher, which departs from a strict decipherment scenario. [sent-25, score-0.945]

13 ... solving substitution ciphers using Bayesian learning. [sent-28, score-0.528]

14 Our novel contributions are as follows: We present a new probabilistic decipherment approach using Bayesian inference with sparse priors, which can be used to solve different types of substitution ciphers. [sent-29, score-0.711]

15 Our new method combines information from word dictionaries along with letter n-gram models, providing a robust decipherment model which offsets the disadvantages faced by previous approaches. [sent-30, score-0.515]

16 We evaluate the Bayesian decipherment output on three different types of substitution ciphers and show that unlike a previous approach, our new method solves all the ciphers completely. [sent-31, score-1.17]

17 In a letter substitution cipher, every letter p in the natural language (plaintext) sequence is replaced by a cipher token c, according to some substitution key. [sent-34, score-1.26]
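
To make the setup concrete, here is a minimal Python sketch (illustrative, not from the paper) of enciphering with a randomly generated 1-to-1 substitution key:

    import random
    import string

    def random_simple_key():
        """Pair each plaintext letter with a unique cipher letter."""
        letters = list(string.ascii_uppercase)
        shuffled = letters[:]
        random.shuffle(shuffled)
        return dict(zip(letters, shuffled))

    def encipher(plaintext, key):
        """Replace every plaintext letter with its cipher token."""
        return "".join(key.get(ch, ch) for ch in plaintext.upper())

    key = random_simple_key()
    print(encipher("DECIPHERMENT IS FUN", key))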

18 The sender can encrypt the message using one of many different cipher systems. [sent-43, score-0.618]

19 The particular type of cipher system chosen determines the properties of the key. [sent-44, score-0.533]

20 For example, the substitution key can be deterministic in both the encipherment and decipherment directions as shown in the above example—i.e. [sent-45, score-0.775]

21 there is a 1-to-1 correspondence between the plaintext letters and ciphertext symbols. [sent-47, score-0.643]

22 2.1 Simple Substitution Ciphers: The key used in a simple substitution cipher is deterministic in both the encipherment and decipherment directions, i.e. [sent-50, score-1.305]

23 there is a 1-to-1 mapping between plaintext letters and ciphertext symbols. [sent-52, score-0.643]

24 The example shown earlier depicts how a simple substitution cipher works. [sent-53, score-0.711]

25 We encrypt an original English plaintext message using a randomly generated simple substitution key to create the ciphertext. [sent-55, score-0.76]

26 e.g., plaintext character “ ” maps to ciphertext character “ ”. [sent-58, score-0.659]

27 Figure 1 (top) shows a portion of the ciphertext along with the original plaintext used to create the cipher. [sent-59, score-0.621]

28 2.2 Homophonic Ciphers: A homophonic cipher uses a substitution key that maps a plaintext letter to more than one cipher symbol. [sent-61, score-2.039]

29 These ciphers are more complex than simple substitution ciphers. [sent-79, score-0.515]

30 The number of potential cipher symbol substitutes for a particular plaintext letter is often proportional to the frequency of that letter in the plaintext language—for example, the English letter “E” is assigned more cipher symbols than “Z”. [sent-81, score-2.457]
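
A hedged sketch of how such a frequency-weighted homophonic key might be built (the frequency table is hypothetical and truncated to five letters; a real key would cover the full alphabet):

    import math

    # Hypothetical relative letter frequencies, truncated for brevity.
    FREQ = {"E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "Z": 0.07}

    def homophonic_key(symbols, freq=FREQ):
        """Reserve one cipher symbol per letter, then hand out the rest
        roughly in proportion to letter frequency (floor division, so a
        few symbols may go unused in this toy version)."""
        symbols = list(symbols)
        extra = len(symbols) - len(freq)
        total = sum(freq.values())
        key, start = {}, 0
        for letter, f in sorted(freq.items(), key=lambda kv: -kv[1]):
            n = 1 + math.floor(extra * f / total)
            key[letter], start = symbols[start:start + n], start + n
        return key

    key = homophonic_key(range(54))
    print(len(key["E"]), len(key["Z"]))   # frequent "E" gets many symbols, rare "Z" one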

31 The substitution key is, however, deterministic in the decipherment direction—each ciphertext symbol maps to a single plaintext letter. [sent-83, score-1.374]

32 Data: For our decipherment experiments on homophonic ciphers, we use the same 414-letter English plaintext used in Section 2. [sent-85, score-1.092]

33 We encrypt this message using a homophonic substitution key (available from http://www. [sent-87, score-0.492]

34 Figure 1 (middle) displays a section of the homophonic cipher (with spaces) and the original plaintext message used in our experiments. [sent-92, score-1.182]

35 2.3 Homophonic Ciphers without Spaces (Zodiac-408 Cipher): In the previous two cipher systems, the word-boundary information was preserved in the cipher. [sent-94, score-0.567]

36 We now consider a more difficult homophonic cipher by removing space characters from the original plaintext. [sent-95, score-0.724]

37 Without the word boundary information, typical dictionary-based decipherment attacks fail on such ciphers. [sent-102, score-0.522]

38 One of the most famous homophonic ciphers in history was used by the infamous Zodiac serial killer in the 1960s. [sent-104, score-0.581]

39 The Zodiac messages include two interesting ciphers: (1) a 408-symbol homophonic cipher without spaces (which was solved by hand), and (2) a similar-looking 340-symbol cipher that has yet to be solved. [sent-108, score-1.263]

40 Here is a sample of the Zodiac-408 cipher message: [cipher symbols not reproduced in the extraction]. [sent-109, score-0.541]

41 Besides the difficulty with missing word boundaries and non-determinism associated with the key, the Zodiac-408 cipher poses several additional challenges which make it harder to solve than any standard homophonic cipher. [sent-115, score-0.716]

42 Also, the last 18 characters of the plaintext message do not seem to make any sense (“EBEORIETEMETHHPITI”). [sent-117, score-0.49]

43 Data: Figure 1 (bottom) displays the Zodiac-408 cipher (consisting of 408 tokens, 54 symbol types) along with the original plaintext message. [sent-118, score-0.978]

44 We run the new decipherment method (described in Section 3. [sent-119, score-0.515]

45 Given a ciphertext c = c1...cn, the goal of decipherment is to uncover the hidden plaintext message p1...pn. [sent-124, score-0.966]

46 The size of the key space (i.e., the number of possible key mappings) that we have to navigate during decipherment is huge—a simple substitution cipher has a keyspace size of 26! [sent-130, score-0.533]

47 whereas a homophonic cipher such as the Zodiac-408 cipher has 26^54 possible key mappings. [sent-131, score-0.533]
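
To make these sizes concrete, a quick check using only the counts stated above:

    import math

    simple_keyspace = math.factorial(26)   # 1-to-1 keys: 26! ≈ 4.0e26
    zodiac_keyspace = 26 ** 54             # 54 symbol types, each mapping to one of 26 letters
    print(f"{simple_keyspace:.2e}  {zodiac_keyspace:.2e}")   # ~4.03e+26  ~2.56e+76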

48 Next, we describe a new Bayesian decipherment approach for tackling substitution ciphers. [sent-132, score-0.685]

49 Snyder et al. (2010) proposed a Bayesian approach in an archaeological decipherment scenario. [sent-139, score-0.492]

50 For simple letter substitution ciphers, the original substitution key exhibits a 1-to-1 correspondence between the plaintext letters and cipher types. [sent-142, score-1.57]

51 Here, we propose a novel approach for deciphering substitution ciphers using Bayesian inference. [sent-145, score-0.537]

52 ... (26! for a simple substitution cipher), our Bayesian framework requires us to sample only a small number of keys during the decipherment process. [sent-147, score-0.731]

53 Probabilistic Decipherment: Our decipherment method follows a noisy-channel approach. [sent-148, score-0.503]

54 Substitute each plaintext letter pi with a ciphertext token ci, with probability P(ci|pi) in order to generate the ciphertext sequence c = c1...cn. [sent-164, score-1.096]

55 We build a statistical English language model (LM) for the plaintext source model P(p), which assigns a probability to any English letter sequence. [sent-168, score-0.611]
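
Putting the two models together, a plaintext hypothesis p for ciphertext c is scored as P(p) · P(c|p). A minimal sketch in log space, where lm_logprob and channel_logprob are assumed placeholders rather than the paper's code:

    def score_hypothesis(plaintext, ciphertext, lm_logprob, channel_logprob):
        """Noisy-channel score: log P(p) + sum over i of log P(c_i | p_i)."""
        score = lm_logprob(plaintext)              # source model P(p)
        for p_i, c_i in zip(plaintext, ciphertext):
            score += channel_logprob(c_i, p_i)     # channel model P(c_i | p_i)
        return score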

56 In our decipherment framework, a Chinese Restaurant Process formulation is used to model both the source and channel. [sent-170, score-0.492]

57 Substitute p1 with cipher token c1, with probability P0(c1|p1). [sent-174, score-0.551]

58 Generate English plaintext letter pi, with probability (α · P0(pi|pi−1) + C_1^{i−1}(pi−1, pi)) / (α + C_1^{i−1}(pi−1)). [sent-176, score-0.599]

59 Figure 1: Bayesian decipherment output (using the word+3-gram LM) for three different ciphers: (a) Simple Substitution Cipher (top), (b) Homophonic Substitution Cipher with spaces (middle), and (c) Zodiac-408 Cipher (bottom). [sent-181, score-0.541]

60 Substitute pi with cipher token ci, with probability (β · P0(ci|pi) + C_1^{i−1}(pi, ci)) / (β + C_1^{i−1}(pi)). [sent-183, score-0.636]
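
The channel probability above can be computed from running cache counts. A minimal sketch, assuming a uniform base distribution P0 over cipher types and a Counter-based cache (the names are mine, not the paper's):

    from collections import Counter

    def crp_channel_prob(p_i, c_i, cache, beta, num_cipher_types):
        """(beta * P0(c_i|p_i) + cache(p_i, c_i)) / (beta + cache(p_i)),
        where the cache counts events seen earlier in the derivation."""
        p0 = 1.0 / num_cipher_types                # uniform base distribution
        joint = cache[(p_i, c_i)]                  # count of p_i -> c_i so far
        marginal = sum(n for (p, _), n in cache.items() if p == p_i)
        return (beta * p0 + joint) / (beta + marginal)

    cache = Counter({("E", "7"): 3, ("E", "%"): 1})
    print(crp_channel_prob("E", "7", cache, beta=0.01, num_cipher_types=54))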

61 i.e., any plaintext hypothesis corresponding to the given ciphertext sequence. [sent-187, score-0.62]

62 i.e., a plaintext letter can be substituted with any given cipher type with equal probability. [sent-191, score-1.143]

63 C_1^{i−1} represents the count of events occurring before plaintext letter pi in the derivation (we call this the “cache”). [sent-192, score-0.715]

64 We could follow a point-wise sampling strategy, where we sample plaintext letter choices for every cipher token, one at a time. [sent-196, score-1.242]

65 But we already know that the substitution ciphers described here exhibit determinism in the deciphering direction,1 i.e. [sent-197, score-0.55]

66 although we have no idea about the key mappings themselves, we do know that there exists only a single plaintext letter mapping for every cipher symbol type in the true key. [sent-199, score-1.191]

67 So sampling plaintext choices for every cipher token separately is not an efficient strategy— our sampler may spend too much time exploring invalid keys (which map the same cipher symbol to different plaintext letters). [sent-200, score-2.073]

68 (Footnote 1: This assumption does not strictly apply to the Zodiac-408 cipher, where a few cipher symbols exhibit non-determinism in the decipherment direction as well.) [sent-203, score-1.561]

69 Under this scheme, we sample plaintext letter choices for each cipher symbol type. [sent-204, score-1.189]

70 In every step, we sample a new plaintext letter for a cipher type and update the entire plaintext hypothesis (i.e. [sent-205, score-1.588]

71 plaintext letters at all corresponding positions) to reflect this change. [sent-207, score-0.456]

72 For example, if we sample a new choice pnew for a cipher symbol which occurs at positions 4, 10, 18, then we update plaintext letters p4, p10 and p18 with the new choice pnew. [sent-208, score-1.023]
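
A sketch of one such blocked (type-level) update, under stated assumptions: score_fn returns a non-negative model score for a full hypothesis, and plaintext is a mutable list of letters (these names and the representation are mine):

    import random

    def resample_type(symbol, ciphertext, plaintext, score_fn, alphabet):
        """Sample a new plaintext letter for one cipher symbol type and
        update every position where that symbol occurs, keeping the key
        functional (one plaintext letter per cipher type)."""
        positions = [i for i, c in enumerate(ciphertext) if c == symbol]
        scores = []
        for letter in alphabet:
            candidate = list(plaintext)
            for i in positions:
                candidate[i] = letter
            scores.append(score_fn(candidate))
        total = sum(scores)
        r, acc = random.uniform(0, total), 0.0
        for letter, s in zip(alphabet, scores):
            acc += s
            if acc >= r:                    # sample proportionally to score
                for i in positions:
                    plaintext[i] = letter
                return plaintext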

73 Combining letter n-gram language models with word dictionaries: Many existing probabilistic approaches use statistical letter n-gram language models of English to assign P(p) probabilities to plaintext hypotheses during decipherment. [sent-211, score-0.776]

74 Unlike previous approaches, our decipherment method combines information from both sources—letter n-grams and word dictionaries. [sent-213, score-0.692]

75 Sampling for ciphers without spaces: For ciphers without spaces, dictionaries are hard to use because we do not know where words start and end. [sent-227, score-0.662]

76 We introduce a new sampling operator which counters this problem and allows us to perform inference using the same decipherment model described earlier. [sent-228, score-0.594]

77 In a first sampling pass, we sample from 26 plaintext letter choices (e. [sent-229, score-0.724]

78 The new strategy allows us to apply Bayesian decipherment even to ciphers without spaces. [sent-251, score-0.814]
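
The extracted text truncates the operator's details, so the following is only a rough stand-in under assumptions: letters are sampled in a first pass (as stated above), and word boundaries are then hypothesized so the word dictionary can score the result. Here a greedy longest-match segmenter stands in for the boundary step, and in_vocab is a placeholder dictionary-lookup function:

    def segment(letters, in_vocab, max_word_len=12):
        """Insert word boundaries into an unsegmented plaintext
        hypothesis by greedy longest match against the dictionary."""
        words, start = [], 0
        while start < len(letters):
            for end in range(min(len(letters), start + max_word_len), start, -1):
                chunk = letters[start:end]
                if in_vocab(chunk) or end == start + 1:
                    words.append(chunk)
                    start = end
                    break
        return " ".join(words)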

79 As a result, we now have a new decipherment method that consistently works for a range of different types of substitution ciphers. [sent-252, score-0.684]

80 Decoding the ciphertext: After the sampling run has finished, we choose the final sample as our English plaintext decipherment output. [sent-253, score-1.028]

81 (Footnote 4: For letter substitution decipherment we want to keep the language model probabilities fixed during training, and hence we set the prior on that model to be high (α = 10^4).) [sent-254, score-0.864]

82 We instantiate a key which matches frequently occurring plaintext letters to frequent cipher symbols and use this to generate an initial sample for the given ciphertext and run the sampler for 5000 iterations. [sent-257, score-1.286]
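
A small sketch of that frequency-matching initialization (ENGLISH_FREQ_ORDER and the handling of extra symbols are illustrative assumptions):

    from collections import Counter

    ENGLISH_FREQ_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"   # most to least frequent

    def initial_key(ciphertext):
        """Match frequent cipher symbols to frequent English letters
        to seed the sampler; extra symbols fall back to the rarest
        letter in this toy version."""
        symbols = [s for s, _ in Counter(ciphertext).most_common()]
        return {s: ENGLISH_FREQ_ORDER[min(i, 25)] for i, s in enumerate(symbols)}

    def initial_sample(ciphertext, key):
        return [key[c] for c in ciphertext]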

83 4 Experiments and Results: We run decipherment experiments on different types of letter substitution ciphers (described in Section 2). [sent-259, score-1.184]

84 In particular, we work with the following three ciphers: (a) a 414-letter Simple Substitution Cipher, (b) a 414-letter Homophonic Cipher (with spaces), and (c) the Zodiac-408 Cipher. Methods: For each cipher, we run and compare the output from two different decipherment approaches: 1. [sent-260, score-0.504]

85 They use the EM algorithm to estimate the channel parameters θ during decipherment training. [sent-263, score-0.515]

86 The given ciphertext c is then decoded by using the Viterbi algorithm to choose the plaintext decoding p that maximizes P(p) · Pθ(c|p)^3, stretching the channel probabilities. [sent-264, score-0.629]

87 Evaluation: We evaluate the quality of a particular decipherment as the percentage of cipher tokens that are decoded correctly. [sent-268, score-1.03]
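
The metric is plain token accuracy; for concreteness:

    def decipherment_accuracy(predicted, gold):
        """Percentage of cipher tokens decoded to the correct plaintext letter."""
        correct = sum(p == g for p, g in zip(predicted, gold))
        return 100.0 * correct / len(gold)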

88 Results: Figure 2 compares the decipherment performance for the EM method with Bayesian decipherment (using type sampling and sparse priors) on three different types of substitution ciphers. [sent-269, score-1.288]

89 Even with a 3-gram letter LM, our method yields a +63% improvement in decipherment accuracy over EM on the homophonic cipher with spaces. [sent-271, score-1.376]

90 Figure 1 shows samples from the Bayesian decipherment output for all three ciphers. [sent-273, score-0.492]

91 For ciphers without spaces, our method automatically guesses the word boundaries for the plaintext hypothesis. [sent-274, score-0.755]

92 For the Zodiac-408 cipher, we compare the performance achieved by Bayesian decipherment under different settings: Letter n-gram versus Word+n-gram LMs—Figure 2 shows that using a word+3-gram LM instead of a 3-gram LM results in +75% improvement in decipherment accuracy. [sent-281, score-1.002]

93 Using a sparse prior (versus 1.0) helps for such problems and produces better decipherment results (97. [sent-284, score-0.492]

94 Type versus Point-wise sampling—Unlike point-wise sampling, type sampling quickly converges to better decipherment solutions. [sent-287, score-0.604]

95 After 5000 sampling passes over the entire data, decipherment output from type sampling scores 97. [sent-288, score-0.665]

96 5 Conclusion: In this work, we presented a novel Bayesian decipherment approach that can effectively solve a variety of substitution ciphers. (Footnote 5: Both sampling runs were seeded with the same random initial sample.) [sent-294, score-0.591]

97 Unlike previous approaches, our method combines information from letter n-gram language models and word dictionaries and provides a robust decipherment model. [sent-296, score-0.71]

98 Using Bayesian decipherment, we can successfully solve the Zodiac-408 cipher—the first time this is achieved by a fully automatic method in a strict decipherment scenario. [sent-298, score-0.523]

99 For future work, there are other interesting decipherment tasks where our method can be applied. [sent-299, score-0.503]

100 Evolutionary algorithm for decryption of monoalphabetic homophonic substitution ciphers encoded as constraint satisfaction problems. [sent-362, score-0.7]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cipher', 0.518), ('decipherment', 0.492), ('plaintext', 0.422), ('ciphers', 0.322), ('ciphertext', 0.187), ('substitution', 0.181), ('homophonic', 0.178), ('letter', 0.177), ('pi', 0.085), ('bayesian', 0.083), ('sampling', 0.079), ('message', 0.052), ('spaces', 0.049), ('encipherment', 0.048), ('encrypt', 0.048), ('zodiac', 0.048), ('killer', 0.042), ('ravi', 0.038), ('lm', 0.035), ('letters', 0.034), ('deciphering', 0.034), ('key', 0.033), ('cryptanalysis', 0.029), ('cryptologia', 0.029), ('sampler', 0.026), ('symbol', 0.026), ('solving', 0.025), ('attack', 0.025), ('sujith', 0.025), ('famous', 0.025), ('channel', 0.023), ('keys', 0.023), ('sample', 0.023), ('choices', 0.023), ('em', 0.022), ('knight', 0.022), ('geman', 0.022), ('ci', 0.022), ('deterministic', 0.021), ('solve', 0.02), ('derivation', 0.02), ('symbols', 0.02), ('decoded', 0.02), ('character', 0.019), ('corlett', 0.019), ('decryption', 0.019), ('enciphering', 0.019), ('ganesan', 0.019), ('keyspace', 0.019), ('maxxpp', 0.019), ('peleg', 0.019), ('dictionaries', 0.018), ('english', 0.018), ('sparse', 0.018), ('versus', 0.018), ('priors', 0.017), ('lms', 0.017), ('bility', 0.017), ('enciphered', 0.017), ('gibbs', 0.017), ('characters', 0.016), ('kevin', 0.016), ('token', 0.016), ('attacking', 0.016), ('attacks', 0.016), ('evolutionary', 0.015), ('type', 0.015), ('boundary', 0.014), ('serial', 0.014), ('exchangeability', 0.014), ('prior', 0.014), ('exhibit', 0.013), ('tackling', 0.012), ('genetic', 0.012), ('solves', 0.012), ('simple', 0.012), ('probability', 0.012), ('maps', 0.012), ('inference', 0.012), ('run', 0.012), ('cache', 0.012), ('substitute', 0.012), ('combines', 0.012), ('donald', 0.012), ('snyder', 0.012), ('original', 0.012), ('newspapers', 0.011), ('operator', 0.011), ('unlike', 0.011), ('alphabet', 0.011), ('arg', 0.011), ('method', 0.011), ('monte', 0.011), ('substituted', 0.011), ('hypothesis', 0.011), ('occurring', 0.011), ('carlo', 0.01), ('preserve', 0.01), ('dempster', 0.01), ('sequence', 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers

Author: Sujith Ravi ; Kevin Knight

Abstract: We introduce a novel Bayesian approach for deciphering complex substitution ciphers. Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. Bayesian inference is performed on our model using an efficient sampling technique. We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. The new method can be applied to more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.

2 0.7366907 94 acl-2011-Deciphering Foreign Language

Author: Sujith Ravi ; Kevin Knight

Abstract: In this work, we tackle the task of machine translation (MT) without parallel training data. We frame the MT problem as a decipherment task, treating the foreign text as a cipher for English and present novel methods for training translation models from nonparallel text.

3 0.099235132 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

Author: Fei Liu ; Fuliang Weng ; Bingqing Wang ; Yang Liu

Abstract: Most text message normalization approaches are based on supervised learning and rely on human labeled training data. In addition, the nonstandard words are often categorized into different types and specific models are designed to tackle each type. In this paper, we propose a unified letter transformation approach that requires neither pre-categorization nor human supervision. Our approach models the generation process from the dictionary words to nonstandard tokens under a sequence labeling framework, where each letter in the dictionary word can be retained, removed, or substituted by other letters/digits. To avoid the expensive and time consuming hand labeling process, we automatically collected a large set of noisy training pairs using a novel web-based approach and performed character-level alignment for model training. Experiments on both Twitter and SMS messages show that our system significantly outperformed the state-of-the-art deletion-based abbreviation system and the jazzy spell checker (absolute accuracy gain of 21.69% and 18.16% over jazzy spell checker on the two test sets respectively).

4 0.076816849 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

Author: Coskun Mermer ; Murat Saraclar

Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs.

5 0.046827152 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

Author: Phil Blunsom ; Trevor Cohn

Abstract: In this work we address the problem of unsupervised part-of-speech induction by bringing together several strands of research into a single model. We develop a novel hidden Markov model incorporating sophisticated smoothing using a hierarchical Pitman-Yor processes prior, providing an elegant and principled means of incorporating lexical characteristics. Central to our approach is a new type-based sampling algorithm for hierarchical Pitman-Yor models in which we track fractional table counts. In an empirical evaluation we show that our model consistently out-performs the current state-of-the-art across 10 languages.

6 0.039111108 30 acl-2011-Adjoining Tree-to-String Translation

7 0.035104055 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations

8 0.032691404 173 acl-2011-Insertion Operator for Bayesian Tree Substitution Grammars

9 0.032154657 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

10 0.031757355 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

11 0.031009471 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

12 0.029909324 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction

13 0.028896905 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation

14 0.028332276 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

15 0.028248595 197 acl-2011-Latent Class Transliteration based on Source Language Origin

16 0.027793545 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

17 0.027706729 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method

18 0.027679248 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

19 0.026487533 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

20 0.02431659 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.073), (1, -0.039), (2, 0.021), (3, 0.036), (4, 0.004), (5, -0.007), (6, 0.022), (7, -0.002), (8, -0.017), (9, 0.072), (10, -0.008), (11, 0.033), (12, 0.087), (13, 0.155), (14, -0.028), (15, 0.05), (16, -0.006), (17, 0.128), (18, 0.127), (19, -0.094), (20, 0.122), (21, 0.38), (22, 0.391), (23, 0.113), (24, 0.474), (25, 0.161), (26, 0.012), (27, -0.171), (28, 0.079), (29, 0.043), (30, -0.051), (31, 0.048), (32, 0.099), (33, 0.029), (34, -0.067), (35, 0.043), (36, -0.126), (37, 0.017), (38, 0.029), (39, 0.113), (40, 0.052), (41, -0.016), (42, -0.016), (43, 0.007), (44, 0.075), (45, -0.014), (46, 0.062), (47, -0.019), (48, -0.054), (49, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97701359 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers

Author: Sujith Ravi ; Kevin Knight

Abstract: We introduce a novel Bayesian approach for deciphering complex substitution ciphers. Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. Bayesian inference is performed on our model using an efficient sampling technique. We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. The new method can be applied on more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.

2 0.78821039 94 acl-2011-Deciphering Foreign Language

Author: Sujith Ravi ; Kevin Knight

Abstract: In this work, we tackle the task of machine translation (MT) without parallel training data. We frame the MT problem as a decipherment task, treating the foreign text as a cipher for English and present novel methods for training translation models from nonparallel text.

3 0.25893068 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

Author: Fei Liu ; Fuliang Weng ; Bingqing Wang ; Yang Liu

Abstract: Most text message normalization approaches are based on supervised learning and rely on human labeled training data. In addition, the nonstandard words are often categorized into different types and specific models are designed to tackle each type. In this paper, we propose a unified letter transformation approach that requires neither pre-categorization nor human supervision. Our approach models the generation process from the dictionary words to nonstandard tokens under a sequence labeling framework, where each letter in the dictionary word can be retained, removed, or substituted by other letters/digits. To avoid the expensive and time consuming hand labeling process, we automatically collected a large set of noisy training pairs using a novel web-based approach and performed character-level alignment for model training. Experiments on both Twitter and SMS messages show that our system significantly outperformed the state-of-the-art deletion-based abbreviation system and the jazzy spell checker (absolute accuracy gain of 21.69% and 18.16% over jazzy spell checker on the two test sets respectively).

4 0.24960034 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

Author: Phil Blunsom ; Trevor Cohn

Abstract: In this work we address the problem of unsupervised part-of-speech induction by bringing together several strands of research into a single model. We develop a novel hidden Markov model incorporating sophisticated smoothing using a hierarchical Pitman-Yor processes prior, providing an elegant and principled means of incorporating lexical characteristics. Central to our approach is a new type-based sampling algorithm for hierarchical Pitman-Yor models in which we track fractional table counts. In an empirical evaluation we show that our model consistently out-performs the current state-of-the-art across 10 languages.

5 0.23549761 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

Author: Coskun Mermer ; Murat Saraclar

Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs.

6 0.1911965 291 acl-2011-SystemT: A Declarative Information Extraction System

7 0.18769141 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

8 0.18314373 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation

9 0.16887026 173 acl-2011-Insertion Operator for Bayesian Tree Substitution Grammars

10 0.16733773 321 acl-2011-Unsupervised Discovery of Rhyme Schemes

11 0.16045284 294 acl-2011-Temporal Evaluation

12 0.15463969 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

13 0.15089932 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

14 0.14337642 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation

15 0.14129007 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

16 0.13968547 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

17 0.13850446 162 acl-2011-Identifying the Semantic Orientation of Foreign Words

18 0.13236353 301 acl-2011-The impact of language models and loss functions on repair disfluency detection

19 0.13127494 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

20 0.12908989 125 acl-2011-Exploiting Readymades in Linguistic Creativity: A System Demonstration of the Jigsaw Bard


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.02), (17, 0.048), (26, 0.012), (36, 0.011), (37, 0.048), (39, 0.025), (41, 0.523), (55, 0.017), (59, 0.032), (72, 0.019), (91, 0.027), (96, 0.089)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94441336 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport

Author: Siwei Wang ; Gina-Anne Levow

Abstract: Verbal feedback is an important information source in establishing interactional rapport. However, predicting verbal feedback across languages is challenging due to languagespecific differences, inter-speaker variation, and the relative sparseness and optionality of verbal feedback. In this paper, we employ an approach combining classifier weighting and SMOTE algorithm oversampling to improve verbal feedback prediction in Arabic, English, and Spanish dyadic conversations. This approach improves the prediction of verbal feedback, up to 6-fold, while maintaining a high overall accuracy. Analyzing highly weighted features highlights widespread use of pitch, with more varied use of intensity and duration.

2 0.93913424 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction

Author: Yu Hong ; Jianfeng Zhang ; Bin Ma ; Jianmin Yao ; Guodong Zhou ; Qiaoming Zhu

Abstract: Event extraction is the task of detecting certain specified types of events that are mentioned in the source language data. The state-of-the-art research on the task is transductive inference (e.g. cross-event inference). In this paper, we propose a new method of event extraction by well using cross-entity inference. In contrast to previous inference methods, we regard entitytype consistency as key feature to predict event mentions. We adopt this inference method to improve the traditional sentence-level event extraction system. Experiments show that we can get 8.6% gain in trigger (event) identification, and more than 11.8% gain for argument (role) classification in ACE event extraction. 1

3 0.91663438 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars

Author: Antske Fokkens

Abstract: When designing grammars of natural language, typically, more than one formal analysis can account for a given phenomenon. Moreover, because analyses interact, the choices made by the engineer influence the possibilities available in further grammar development. The order in which phenomena are treated may therefore have a major impact on the resulting grammar. This paper proposes to tackle this problem by using metagrammar development as a methodology for grammar engineering. I argue that metagrammar engineering as an approach facilitates the systematic exploration of grammars through comparison of competing analyses. The idea is illustrated through a comparative study of auxiliary structures in HPSG-based grammars for German and Dutch. Auxiliaries form a central phenomenon of German and Dutch and are likely to influence many components of the grammar. This study shows that a special auxiliary+verb construction significantly improves efficiency compared to the standard argument-composition analysis for both parsing and generation.

4 0.90438688 189 acl-2011-K-means Clustering with Feature Hashing

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

5 0.9001075 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we de- velop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

6 0.89806694 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

7 0.89625931 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

same-paper 8 0.88945293 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers

9 0.81004125 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems

10 0.61028087 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

11 0.60826385 94 acl-2011-Deciphering Foreign Language

12 0.59094644 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

13 0.58916551 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts

14 0.5763973 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue

15 0.56529218 135 acl-2011-Faster and Smaller N-Gram Language Models

16 0.55573314 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

17 0.55560398 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

18 0.5480656 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

19 0.54120427 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

20 0.53789163 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques