acl acl2012 acl2012-68 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sravana Reddy ; Kevin Knight
Abstract: There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. [sent-4, score-0.758]
2 We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. [sent-5, score-0.586]
3 We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models. [sent-6, score-0.317]
4 1 Introduction The running key cipher is an encoding scheme that uses a secret key R that is typically a string of words, usually taken from a book or other text that is agreed upon by the sender and receiver. [sent-7, score-0.875]
5 When sending a plaintext message P, the sender truncates R to the length of the plaintext. [sent-8, score-0.715]
6 The scheme also relies on a substitution function f, which is usually publicly known, that maps a plaintext letter p and key letter r to a unique ciphertext letter c. [sent-9, score-1.868]
7 The most common choice for f is the tabula recta, where c = (p + r) mod 26 for letters in the English alphabet, with A = 0, B = 1, and so on. [sent-10, score-0.177]
8 To encode a plaintext with a running key, the spaces in the plaintext and the key are removed, and for every 0 ≤ i < |P|, the ciphertext letter at position i is computed to be Ci ← f(Pi, Ri). [sent-11, score-2.084]
9 Figure 1 shows an example encoding using the tabula recta. [sent-12, score-0.144]
10 For a given ciphertext and known f, the plaintext uniquely determines the running key and vice versa. [sent-13, score-1.226]
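As a concrete illustration, the following is a minimal Python sketch of this encoding (the function names and structure are ours, not the paper's); it reproduces the Figure 1 example, where the plaintext "linguistics is fun" and running key "colorless green ideas" yield the ciphertext NWYULTWLAIJMWSCQ, and shows how the key is recovered from a plaintext/ciphertext pair.

```python
# Running key encryption with the tabula recta: c = (p + r) mod 26, with A = 0, B = 1, ...
def encode(plaintext, key):
    p = [ch for ch in plaintext.upper() if ch.isalpha()]        # remove spaces
    r = [ch for ch in key.upper() if ch.isalpha()][:len(p)]     # truncate key to plaintext length
    return "".join(chr((ord(a) - 65 + ord(b) - 65) % 26 + 65) for a, b in zip(p, r))

def recover_key(plaintext, ciphertext):
    # Given C and a hypothesized P, the key is fully determined: r = (c - p) mod 26
    p = [ch for ch in plaintext.upper() if ch.isalpha()]
    return "".join(chr((ord(c) - ord(a)) % 26 + 65) for a, c in zip(p, ciphertext))

C = encode("linguistics is fun", "colorless green ideas")
print(C)                                     # NWYULTWLAIJMWSCQ
print(recover_key("linguistics is fun", C))  # COLORLESSGREENID
```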
12 Since we know that the plaintext and running key are both drawn from natural language, our objective function for the solution plaintext under some language model is: P̂ = argmax_P log Pr(P) Pr(RP,C) (1), where the running key RP,C is the key that corresponds to plaintext P and ciphertext C. [sent-16, score-2.952]
13 Note that if RP,C is a perfectly random sequence of letters, this scheme is effectively a ‘one-time pad’, which is provably unbreakable (Shannon, 1949). [sent-17, score-0.028]
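As a small illustration of this objective, here is a sketch in Python; log_pr_plain and log_pr_key stand in for the language-model scorers, which are assumptions of the sketch rather than anything the paper specifies beyond Equation 1.

```python
# score(P) = log [ Pr(P) * Pr(R_{P,C}) ], where R_{P,C} is the key induced by P and C
def induced_key(plaintext, ciphertext):
    return "".join(chr((ord(c) - ord(p)) % 26 + 65) for p, c in zip(plaintext, ciphertext))

def objective(plaintext, ciphertext, log_pr_plain, log_pr_key):
    # log_pr_plain / log_pr_key: assumed language-model log-probability functions
    return log_pr_plain(plaintext) + log_pr_key(induced_key(plaintext, ciphertext))
```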
14 The knowledge that both the plaintext and the key are natural language strings is important in breaking a running key cipher. [sent-18, score-1.122]
15 The letter-frequency distribution of running key ciphertexts is notably flatter than the plaintext distribution, unlike substitution ciphers, where the frequency profile remains unchanged, modulo letter substitutions. [sent-19, score-1.874]
16 However, the ciphertext letter distribution is not uniform; there are peaks corresponding to letters (like I) that are formed by high-frequency plaintext/key pairs (like E and E). [sent-20, score-0.503]
17 1 Running Key Ciphers Bauer and Tate (2002) use letter n-grams (without smoothing) up to order 6 to find the most probable plaintext/key character pair at each position in the ciphertext. [sent-22, score-0.25]
18 They test their method on 1000-character ciphertexts produced from plaintexts and keys extracted from Project Gutenberg. [sent-23, score-0.306]
19 5%, where accuracy is measured as the percentage of correctly decoded characters. [sent-26, score-0.048]
20 Figure 1: Example of a running key cipher. [sent-28, score-0.338]
21 Note that the key is truncated to the length of the plaintext. [sent-29, score-0.248]
22 Plaintext – linguistics is fun, Running Key – colorless green ideas, tabula recta substitution where Ci ← (Pi + Ri) mod 26. Plaintext: LINGUISTICSISFUN, Key: COLORLESSGREENID, Ciphertext: NWYULTWLAIJMWSCQ [sent-30, score-0.391]
23 Such figures are too low to produce readable plaintexts, especially if the decoded regions are not contiguous. [sent-31, score-0.028]
24 Griffing (2006) uses Viterbi decoding and letter 6-grams to improve on the above result, achieving a median 87% accuracy over several 1000-character ciphertexts. [sent-32, score-0.318]
25 A key shortcoming of this work is that it requires searching through about 26^5 states at each position in the ciphertext. [sent-33, score-0.179]
26 2 Letter Substitution Ciphers Previous work in decipherment of classical ciphers has mainly focused on letter substitution. [sent-35, score-0.807]
27 These ciphers use a substitution table as the secret key. [sent-36, score-0.53]
28 The ciphertext is generated by substituting each letter of the plaintext according to the substitution table. [sent-37, score-1.199]
29 The table may be homophonic; that is, a single plaintext letter could map to more than one possible ciphertext letter. [sent-38, score-1.082]
30 Just as in running key ciphers, spaces in the plaintext are usually removed before encoding. [sent-39, score-1.002]
31 3 Vigenère Ciphers A scheme similar to the running key cipher is the Vigenère cipher, also known as the periodic key cipher. [sent-42, score-0.837]
32 Instead of a single long string spanning the length of the plaintext, the key is a short string (usually, but not always, a single word or phrase) repeated to the length of the plaintext. [sent-43, score-0.277]
33 Figure 2 shows an example Vigenère cipher encoding. [sent-44, score-0.202]
34 This cipher is less secure than the running key, since the short length of the key vastly reduces the size of the search space, and the periodic repetition of the key leaks information. [sent-45, score-0.856]
35 Recent work on decoding periodic key ciphers performs Viterbi search on the key using letter n-gram models (Olsen et al. [sent-46, score-1.099]
36 , 2011), with the assumption that the length of the key is known. [sent-47, score-0.216]
37 If unknown, the key length can be inferred using the Kasiski Test (Kasiski, 1863) which takes advantage of repeated plaintext/key character pairs. [sent-48, score-0.259]
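A minimal sketch of a Kasiski-style key-length guess, based on distances between repeated ciphertext n-grams; this is only an illustration of the general idea, not a procedure taken from the paper.

```python
from functools import reduce
from math import gcd

def kasiski_key_length(ciphertext, n=3):
    # Repeated plaintext fragments that align with the same stretch of the periodic key
    # produce repeated ciphertext n-grams; the distances between such repeats tend to be
    # multiples of the key length.
    last_pos, distances = {}, []
    for i in range(len(ciphertext) - n + 1):
        gram = ciphertext[i:i + n]
        if gram in last_pos:
            distances.append(i - last_pos[gram])
        last_pos[gram] = i
    return reduce(gcd, distances) if distances else None
```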
38 3 Solution with Gibbs Sampling In this paper, we describe a search algorithm that uses Gibbs Sampling to break a running key cipher. [sent-49, score-0.386]
39 1 Choice of Language Model The main advantage of a sampling-based approach over Viterbi decoding is that it allows us to seamlessly use word-based language models. [sent-51, score-0.067]
40 Lower order letter n-grams may fail to decipher most ciphertexts even with perfect search, since an incorrect plaintext and key could have higher likelihood under a weak language model than the actual message. [sent-52, score-1.237]
41 2 Blocked Sampling One possible approach is to sample a plaintext letter at each position in the ciphertext. [sent-54, score-0.858]
42 The limitation of such a sampler for the running key problem is that it is extremely slow to mix, especially for longer ciphertexts: we found that in practice, it does not usually converge to the optimal solution in a reasonable number of iterations even with simulated annealing. [sent-55, score-0.409]
43 We therefore propose a blocked sampling algorithm that samples words rather than letters in the plaintext, as follows: 1. [sent-56, score-0.173]
44 Initialize the plaintext P to a random sequence of characters p1 . . . p|C|, and fix R as the key that corresponds to P, C. 2. [sent-60, score-0.179]
45 Figure 2: Plaintext – linguistics is fun, Periodic Key – green, tabula recta substitution. [sent-62, score-0.222]
46 Remove spaces and return P, R. Note that every time a word in P is sampled, it induces a change in R that may not be a word or a sequence of words, and vice versa. [sent-64, score-0.077]
47 For this reason, we use a word trigram model linearly interpolated with letter trigrams (including the space character). [sent-66, score-0.266]
48 The interpolation mainly serves to smooth the search space, with the added benefit of accounting for out-of-vocabulary, misspelled, or truncated words in the actual plaintext or key. [sent-67, score-0.699]
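Below is a minimal sketch of one such blocked sampling pass, reflecting one reading of the blocked sampler described above; the candidate vocabulary, the restriction to same-length word substitutions, and the scoring interface log_pr are simplifying assumptions of the sketch, not the paper's exact implementation.

```python
import math
import random

def induced_key(plaintext, ciphertext):
    # The key forced by a plaintext hypothesis under the tabula recta: r = (c - p) mod 26
    return "".join(chr((ord(c) - ord(p)) % 26 + 65) for p, c in zip(plaintext, ciphertext))

def blocked_pass(plain_words, ciphertext, vocab, log_pr):
    # plain_words: current hypothesis as a list of words whose letters concatenate
    #              to len(ciphertext) characters
    # vocab:       candidate replacement words (assumed given)
    # log_pr(s):   interpolated word/letter language model score (assumed given)
    for i, current in enumerate(plain_words):
        candidates = [w for w in vocab if len(w) == len(current)] or [current]
        scores = []
        for w in candidates:
            hyp = plain_words[:i] + [w] + plain_words[i + 1:]
            letters = "".join(hyp)
            # Score the plaintext and the key it induces (Equation 1)
            scores.append(log_pr(" ".join(hyp)) + log_pr(induced_key(letters, ciphertext)))
        # Sample a replacement word in proportion to exp(score)
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        plain_words[i] = random.choices(candidates, weights=weights, k=1)[0]
    return plain_words
```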
49 Table 1 shows an example of one sampling iteration on the ciphertext shown in Figure 1. [sent-68, score-0.337]
50 Table 1: First sampling iteration on the ciphertext NWYULTWLAIJMWSCQ. 4 Experiments 4. [sent-69, score-0.337]
51 1 Data We randomly select passages from the Project Gutenberg and Wall Street Journal Corpus extracts that are included in the NLTK toolkit (Bird et al. [sent-70, score-0.03]
52 The passages are used as plaintext and key pairs, and combined to generate synthetic ciphertext data. [sent-72, score-1.088]
53 Unlike previous work, which used constant-length ciphertexts, we study the effect of message length on decipherment by varying the ciphertext length (10, 100, and 1000 characters). [sent-73, score-0.569]
54 Our language model is an interpolation of word trigrams and letter trigrams trained on the Brown Corpus. (Footnote 1: Pr(P) = λ Pr(P|word LM) + (1 − λ) Pr(P|letter LM), and similarly for Pr(R).) [sent-74, score-0.334]
55 We fixed the word language model interpolation weight to λ = 0. [sent-76, score-0.033]
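A minimal sketch of that interpolation in log space; the two component scorers are assumed callables, and the weight lam is left as a parameter.

```python
import math

def interpolated_log_pr(text, log_pr_word_lm, log_pr_letter_lm, lam):
    # Pr(P) = lam * Pr(P | word LM) + (1 - lam) * Pr(P | letter LM), computed via log-sum-exp
    a = math.log(lam) + log_pr_word_lm(text)
    b = math.log(1.0 - lam) + log_pr_letter_lm(text)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))
```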
56 2 Baseline and Evaluation For comparison with the previous work, we reimplement Viterbi decoding over letter 6-grams (Griffing, 2006) trained on the Brown Corpus. [sent-79, score-0.298]
57 In addition to decipherment accuracy, we compare the running time in seconds of the two algorithms. [sent-80, score-0.37]
58 Both decipherment programs were implemented in Python and run on the same machines. [sent-81, score-0.211]
59 3 Results Table 2 shows the average decipherment accuracy of our algorithm and the baseline on each dataset. [sent-85, score-0.25]
60 Also shown is the number of times that the Gibbs Sampling search failed, that is, when the algorithm did not hypothesize a solution that had a probability at least as high as the reference within 10000 iterations. [sent-86, score-0.103]
61 Performance on the short (length 10) ciphertexts is poor under both algorithms. [sent-88, score-0.222]
62 This is expected, since the degree of message uncertainty, or message equivocation as defined by Shannon, is high for short ciphertexts: there are several possible plaintexts and keys besides the original that are likely under an English language model. [sent-89, score-0.16]
63 Consider the ciphertext WAEEXFPROV which was generated by the plaintext segment ON A REFEREN and key INENTAL AKI. [sent-90, score-1.03]
64 The algorithm hypothesizes that the plaintext is THE STRAND S and key DTAME OPELD, which both receive high language model probability. [sent-91, score-0.803]
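This ambiguity is easy to verify with the tabula recta arithmetic: both the original plaintext/key pair and the hypothesized one encrypt to the same ciphertext.

```python
def encode(p, r):
    return "".join(chr((ord(a) - 65 + ord(b) - 65) % 26 + 65) for a, b in zip(p, r))

print(encode("ONAREFEREN", "INENTALAKI"))  # WAEEXFPROV (original segment and key)
print(encode("THESTRANDS", "DTAMEOPELD"))  # WAEEXFPROV (hypothesized plaintext and key)
```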
65 Table 2: Decipherment accuracy (proportion of correctly deciphered characters). [sent-92, score-0.02]
66 Plaintext and key sources for the ciphertext test data were extracted by starting at random points in the corpora, and selecting the following n characters. [sent-93, score-0.425]
67 Table 3: Substitution function parameterized by the keyword, CIPHER. [sent-95, score-0.023]
68 However, on the long ciphertexts, our algorithm gets close to perfect decipherment, surpassing the Viterbi algorithm by a large margin. [sent-99, score-0.038]
69 Accuracies on the Wall Street Journal ciphertexts are higher than on the Gutenberg ciphertexts for our algorithm, which may be because the latter is more divergent from the Brown Corpus language model. [sent-100, score-0.462]
70 1 Unknown substitution functions Some running key ciphers also use a secret substitution function f rather than the tabula recta or another known function. [sent-102, score-1.226]
71 In typical cases, these functions are not arbitrary, but are parameterized by a secret keyword that mutates the tabula recta table. [sent-103, score-0.356]
72 For example, the function with the keyword CIPHER would be the substitution table shown in Table 3. [sent-104, score-0.158]
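As an illustration, here is a sketch of one common keyword-based construction, in which the keyword (with duplicate letters removed) is prepended to the remaining alphabet, and this mixed alphabet replaces the standard one in the (p + r) mod 26 rule; this is only an assumption about how Table 3 is built, named here as a generic keyword-mixed alphabet rather than the paper's exact mutation.

```python
import string

def keyed_alphabet(keyword):
    # e.g. "CIPHER" -> "CIPHERABDFGJKLMNOQSTUVWXYZ"
    seen = []
    for ch in keyword.upper() + string.ascii_uppercase:
        if ch.isalpha() and ch not in seen:
            seen.append(ch)
    return "".join(seen)

def keyed_substitution(p, r, alphabet):
    # Same additive rule as the tabula recta, but indices are taken in the mixed alphabet
    return alphabet[(alphabet.index(p) + alphabet.index(r)) % 26]

alpha = keyed_alphabet("CIPHER")
print(alpha)                               # CIPHERABDFGJKLMNOQSTUVWXYZ
print(keyed_substitution("L", "C", alpha))
```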
73 Decoding a running key ciphertext under a latent substitution function is an open line of research. [sent-105, score-0.701]
74 One possibility is to extend our approach by sampling the keyword or function in addition to the plaintext. [sent-106, score-0.132]
75 or other exact decoding algorithms like A* to use word-level language models. [sent-109, score-0.067]
76 A naive implementation of Viterbi word-based decoding results in computationally inefficient search spaces for large vocabularies, so more sophisticated methods or heuristics will be required. [sent-110, score-0.173]
77 6 Conclusion We propose a decipherment algorithm for running key ciphers that uses Blocked Gibbs Sampling and word-based language models, which shows significant speed and accuracy improvements over previous research into this problem. [sent-113, score-0.931]
78 Solving the running key cipher with the Viterbi algorithm. [sent-130, score-0.54]
79 A speech recognition solution to an ancient cryptography problem. [sent-144, score-0.054]
80 Evolutionary algorithm for decryption of monoalphabetic homophonic substitution ciphers encoded as constraint satisfaction problems. [sent-148, score-0.55]
81 Attacking decipherment problems optimally with low-order n-gram models. [sent-152, score-0.211]
wordName wordTfidf (topN-words)
[('plaintext', 0.605), ('ciphers', 0.343), ('ciphertext', 0.246), ('letter', 0.231), ('ciphertexts', 0.222), ('decipherment', 0.211), ('cipher', 0.202), ('key', 0.179), ('running', 0.159), ('tabula', 0.121), ('substitution', 0.117), ('ravi', 0.108), ('viterbi', 0.103), ('recta', 0.101), ('sampling', 0.091), ('pr', 0.081), ('griffing', 0.081), ('vigen', 0.081), ('periodic', 0.071), ('secret', 0.07), ('decoding', 0.067), ('gibbs', 0.061), ('cryptologia', 0.061), ('plaintexts', 0.061), ('sujith', 0.06), ('spaces', 0.059), ('homophonic', 0.053), ('shannon', 0.053), ('knight', 0.045), ('ere', 0.042), ('keyword', 0.041), ('corlett', 0.04), ('kasiski', 0.04), ('olsen', 0.04), ('hypothesized', 0.039), ('brown', 0.039), ('message', 0.038), ('blocked', 0.037), ('length', 0.037), ('solution', 0.036), ('sender', 0.035), ('trigrams', 0.035), ('sampler', 0.035), ('interpolation', 0.033), ('die', 0.032), ('truncated', 0.032), ('gutenberg', 0.032), ('genetic', 0.032), ('fun', 0.03), ('attacking', 0.03), ('bauer', 0.03), ('nelson', 0.03), ('mod', 0.03), ('passages', 0.03), ('und', 0.03), ('search', 0.029), ('street', 0.029), ('scheme', 0.028), ('synthetic', 0.028), ('decoded', 0.028), ('letters', 0.026), ('bird', 0.026), ('evolutionary', 0.026), ('accuracies', 0.025), ('repeated', 0.024), ('kevin', 0.024), ('parameterized', 0.023), ('keys', 0.023), ('updating', 0.023), ('encoding', 0.023), ('classical', 0.022), ('green', 0.022), ('sample', 0.022), ('chicago', 0.021), ('wall', 0.021), ('accuracy', 0.02), ('reference', 0.019), ('character', 0.019), ('algorithm', 0.019), ('known', 0.019), ('ri', 0.018), ('lengths', 0.018), ('vice', 0.018), ('pi', 0.018), ('steven', 0.018), ('divergent', 0.018), ('deciphering', 0.018), ('admiralty', 0.018), ('claude', 0.018), ('cryptography', 0.018), ('decryption', 0.018), ('ewan', 0.018), ('flatter', 0.018), ('friedrich', 0.018), ('inefficient', 0.018), ('kucera', 0.018), ('peder', 0.018), ('reddy', 0.018), ('reserve', 0.018), ('security', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 68 acl-2012-Decoding Running Key Ciphers
Author: Sravana Reddy ; Kevin Knight
Abstract: There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.
2 0.085998602 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
Author: Malte Nuhn ; Arne Mauser ; Hermann Ney
Abstract: In this paper we show how to train statistical machine translation systems on reallife tasks using only non-parallel monolingual data from two languages. We present a modification of the method shown in (Ravi and Knight, 2011) that is scalable to vocabulary sizes of several thousand words. On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model. The efficiency improvement of our method allows us to run experiments with vocabulary sizes of around 5,000 words, such as a non-parallel version of the VERBMOBIL corpus. We also report results using data from the monolingual French and English GIGAWORD corpora.
3 0.085102737 121 acl-2012-Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
Author: Zhiheng Huang ; Yi Chang ; Bo Long ; Jean-Francois Crespo ; Anlei Dong ; Sathiya Keerthi ; Su-Lin Wu
Abstract: Sequential modeling has been widely used in a variety of important applications including named entity recognition and shallow parsing. However, as more and more real time large-scale tagging applications arise, decoding speed has become a bottleneck for existing sequential tagging algorithms. In this paper we propose 1-best A*, 1-best iterative A*, k-best A* and k-best iterative Viterbi A* algorithms for sequential decoding. We show the efficiency of these proposed algorithms for five NLP tagging tasks. In particular, we show that iterative Viterbi A* decoding can be several times or orders of magnitude faster than the state-of-the-art algorithm for tagging tasks with a large number of labels. This algorithm makes real-time large-scale tagging applications with thousands of labels feasible.
4 0.072656006 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language
Author: Fei Liu ; Fuliang Weng ; Xiao Jiang
Abstract: Social media language contains huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitivelydriven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated on both word- and messagelevel using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a . 10% absolute increase compared to state-ofthe-art); the broad word-coverage can also successfully translate into message-level performance gain, yielding 6% absolute increase compared to the best prior approach.
5 0.052118 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation
Author: Nicola Cancedda
Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user ofthe system cannot disclose what should be translated. We propose a simple and practical encryption-based method addressing this barrier.
6 0.046834815 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
8 0.03915102 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition
9 0.037591167 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
10 0.035475817 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
11 0.034600563 79 acl-2012-Efficient Tree-Based Topic Modeling
12 0.033449568 18 acl-2012-A Probabilistic Model for Canonicalizing Named Entity Mentions
13 0.030737784 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
14 0.030183692 78 acl-2012-Efficient Search for Transformation-based Inference
15 0.030123759 154 acl-2012-Native Language Detection with Tree Substitution Grammars
16 0.029629886 140 acl-2012-Machine Translation without Words through Substring Alignment
17 0.028396919 111 acl-2012-How Are Spelling Errors Generated and Corrected? A Study of Corrected and Uncorrected Spelling Errors Using Keystroke Logs
18 0.027127592 96 acl-2012-Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection
19 0.027007647 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment
20 0.02528541 107 acl-2012-Heuristic Cube Pruning in Linear Time
topicId topicWeight
[(0, -0.079), (1, -0.017), (2, -0.004), (3, 0.005), (4, -0.006), (5, 0.047), (6, 0.035), (7, 0.013), (8, 0.024), (9, 0.007), (10, -0.054), (11, -0.039), (12, -0.032), (13, -0.019), (14, 0.005), (15, -0.007), (16, 0.034), (17, 0.042), (18, 0.032), (19, 0.024), (20, 0.019), (21, -0.047), (22, -0.002), (23, 0.011), (24, -0.052), (25, 0.039), (26, 0.073), (27, 0.045), (28, -0.108), (29, -0.084), (30, -0.012), (31, -0.11), (32, -0.02), (33, 0.084), (34, 0.069), (35, -0.046), (36, 0.011), (37, 0.151), (38, -0.163), (39, 0.048), (40, -0.06), (41, -0.124), (42, -0.046), (43, -0.06), (44, -0.076), (45, -0.04), (46, 0.13), (47, 0.028), (48, -0.11), (49, -0.085)]
simIndex simValue paperId paperTitle
same-paper 1 0.9594242 68 acl-2012-Decoding Running Key Ciphers
Author: Sravana Reddy ; Kevin Knight
Abstract: There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.
2 0.65462029 121 acl-2012-Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
Author: Zhiheng Huang ; Yi Chang ; Bo Long ; Jean-Francois Crespo ; Anlei Dong ; Sathiya Keerthi ; Su-Lin Wu
Abstract: Sequential modeling has been widely used in a variety of important applications including named entity recognition and shallow parsing. However, as more and more real time large-scale tagging applications arise, decoding speed has become a bottleneck for existing sequential tagging algorithms. In this paper we propose 1-best A*, 1-best iterative A*, k-best A* and k-best iterative Viterbi A* algorithms for sequential decoding. We show the efficiency of these proposed algorithms for five NLP tagging tasks. In particular, we show that iterative Viterbi A* decoding can be several times or orders of magnitude faster than the state-of-the-art algorithm for tagging tasks with a large number of labels. This algorithm makes real-time large-scale tagging applications with thousands of labels feasible.
3 0.46318895 107 acl-2012-Heuristic Cube Pruning in Linear Time
Author: Andrea Gesmundo ; Giorgio Satta ; James Henderson
Abstract: We propose a novel heuristic algorithm for Cube Pruning running in linear time in the beam size. Empirically, we show a gain in running time of a standard machine translation system, at a small loss in accuracy.
4 0.45803544 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language
Author: Fei Liu ; Fuliang Weng ; Xiao Jiang
Abstract: Social media language contains huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitivelydriven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated on both word- and messagelevel using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a . 10% absolute increase compared to state-ofthe-art); the broad word-coverage can also successfully translate into message-level performance gain, yielding 6% absolute increase compared to the best prior approach.
5 0.43408996 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation
Author: Nicola Cancedda
Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user ofthe system cannot disclose what should be translated. We propose a simple and practical encryption-based method addressing this barrier.
6 0.39983845 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition
9 0.35488656 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords
10 0.33215374 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
11 0.32151416 160 acl-2012-Personalized Normalization for a Multilingual Chat System
12 0.29288003 218 acl-2012-You Had Me at Hello: How Phrasing Affects Memorability
13 0.27985454 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
14 0.27820998 197 acl-2012-Tokenization: Returning to a Long Solved Problem A Survey, Contrastive Experiment, Recommendations, and Toolkit
15 0.2713851 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
16 0.23282927 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts
17 0.23085994 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
18 0.21313368 42 acl-2012-Bootstrapping via Graph Propagation
19 0.21106832 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction
20 0.20879795 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
topicId topicWeight
[(25, 0.027), (26, 0.019), (28, 0.031), (37, 0.02), (39, 0.058), (59, 0.021), (74, 0.022), (82, 0.014), (84, 0.484), (85, 0.025), (90, 0.096), (92, 0.037), (94, 0.013), (99, 0.037)]
simIndex simValue paperId paperTitle
1 0.9098984 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
Author: Reut Tsarfaty ; Joakim Nivre ; Evelina Andersson
Abstract: We present novel metrics for parse evaluation in joint segmentation and parsing scenarios where the gold sequence of terminals is not known in advance. The protocol uses distance-based metrics defined for the space of trees over lattices. Our metrics allow us to precisely quantify the performance gap between non-realistic parsing scenarios (assuming gold segmented and tagged input) and realistic ones (not assuming gold segmentation and tags). Our evaluation of segmentation and parsing for Modern Hebrew sheds new light on the performance ofthe best parsing systems to date in the different scenarios.
same-paper 2 0.88679653 68 acl-2012-Decoding Running Key Ciphers
Author: Sravana Reddy ; Kevin Knight
Abstract: There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.
3 0.86976969 195 acl-2012-The Creation of a Corpus of English Metalanguage
Author: Shomir Wilson
Abstract: Metalanguage is an essential linguistic mechanism which allows us to communicate explicit information about language itself. However, it has been underexamined in research in language technologies, to the detriment of the performance of systems that could exploit it. This paper describes the creation of the first tagged and delineated corpus of English metalanguage, accompanied by an explicit definition and a rubric for identifying the phenomenon in text. This resource will provide a basis for further studies of metalanguage and enable its utilization in language technologies.
4 0.85678166 135 acl-2012-Learning to Temporally Order Medical Events in Clinical Text
Author: Preethi Raghavan ; Albert Lai ; Eric Fosler-Lussier
Abstract: We investigate the problem of ordering medical events in unstructured clinical narratives by learning to rank them based on their time of occurrence. We represent each medical event as a time duration, with a corresponding start and stop, and learn to rank the starts/stops based on their proximity to the admission date. Such a representation allows us to learn all of Allen’s temporal relations between medical events. Interestingly, we observe that this methodology performs better than a classification-based approach for this domain, but worse on the relationships found in the Timebank corpus. This finding has important implications for styles of data representation and resources used for temporal relation learning: clinical narratives may have different language attributes corresponding to temporal ordering relative to Timebank, implying that the field may need to look at a wider range ofdomains to fully understand the nature of temporal ordering.
5 0.79794586 93 acl-2012-Fast Online Lexicon Learning for Grounded Language Acquisition
Author: David Chen
Abstract: Learning a semantic lexicon is often an important first step in building a system that learns to interpret the meaning of natural language. It is especially important in language grounding where the training data usually consist of language paired with an ambiguous perceptual context. Recent work by Chen and Mooney (201 1) introduced a lexicon learning method that deals with ambiguous relational data by taking intersections of graphs. While the algorithm produced good lexicons for the task of learning to interpret navigation instructions, it only works in batch settings and does not scale well to large datasets. In this paper we introduce a new online algorithm that is an order of magnitude faster and surpasses the stateof-the-art results. We show that by changing the grammar of the formal meaning represen- . tation language and training on additional data collected from Amazon’s Mechanical Turk we can further improve the results. We also include experimental results on a Chinese translation of the training data to demonstrate the generality of our approach.
6 0.43929651 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
7 0.41204879 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
8 0.40460595 194 acl-2012-Text Segmentation by Language Using Minimum Description Length
9 0.40361059 34 acl-2012-Automatically Learning Measures of Child Language Development
10 0.40312517 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese
12 0.39975283 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction
13 0.3940554 99 acl-2012-Finding Salient Dates for Building Thematic Timelines
14 0.39123166 139 acl-2012-MIX Is Not a Tree-Adjoining Language
15 0.38444233 59 acl-2012-Corpus-based Interpretation of Instructions in Virtual Environments
16 0.38403541 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
17 0.38058725 104 acl-2012-Graph-based Semi-Supervised Learning Algorithms for NLP
18 0.37626371 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
19 0.3723737 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing
20 0.37228566 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT