acl acl2013 acl2013-108 knowledge-graph by maker-knowledge-mining

108 acl-2013-Decipherment

Source: pdf

Author: Kevin Knight

Abstract: The first natural language processing systems had a straightforward goal: decipher coded messages sent by the enemy. This tutorial explores connections between early decipherment research and today’s NLP work. We cover classic military and diplomatic ciphers, automatic decipherment algorithms, unsolved ciphers, language translation as decipherment, and analyzing ancient writing as decipherment. 1 Tutorial Overview The first natural language processing systems had a straightforward goal: decipher coded messages sent by the enemy. Sixty years later, we have many more applications, including web search, question answering, summarization, speech recognition, and language translation. This tutorial explores connections between early decipherment research and today’s NLP work. We find that many ideas from the earlier era have become core to the field, while others still remain to be picked up and developed. We first cover classic military and diplomatic cipher types, including complex substitution ciphers implemented in the first electro-mechanical encryption machines. We look at mathematical tools (language recognition, frequency counting, smoothing) developed to decrypt such ciphers on proto-computers. We show algorithms and extensive empirical results for solving different types of ciphers, and we show the role of algorithms in recent decipherments of historical documents. We then look at how foreign language can be viewed as a code for English, a concept developed by Alan Turing and Warren Weaver. We describe recently published work on building automatic translation systems from non-parallel data. We also demonstrate how some of the same algorithmic tools can be applied to natural language tasks like part-of-speech tagging and word alignment. Turning back to historical ciphers, we explore a number of unsolved ciphers, giving results of initial computer experiments on several of them. Finally, we look briefly at writing as a way to encipher phoneme sequences, covering ancient scripts and modern applications. 2 Outline 1. Classical military/diplomatic ciphers (15 minutes) • 60 cipher types (ACA) • Ciphers vs. codes • Enigma cipher: the mother of natural language processing computer analysis of text language recognition Good-Turing smoothing – – – 2. Foreign language as a code (10 minutes) • • Alan Turing’s ”Thinking Machines” Warren Weaver’s Memorandum 3. Automatic decipherment (55 minutes) • Cipher type detection • Substitution ciphers (simple, homophonic, polyalphabetic, etc) plaintext language recognition ∗ how much plaintext knowledge is – nheowede mdu 3 Proce diSnogfsia, of B thuleg5a r1iast, A Anungu aslt M4-9e t2in01g3 o.f ? tc he20 A1s3so Acsiasoticoinat fio rn C fo rm Cpoumtaptuiotantaioln Lainlg Luinisgtuicis ,tpi casges 3–4, – ∗ index of coincidence, unicity distance, oanf dc oointhceidr measures navigating a difficult search space ∗ frequencies of letters and words ∗ pattern words and cribs ∗ pElMin,g ILP, Bayesian models, sam– recent decipherments ∗ Jefferson cipher, Copiale cipher, cJievfifle war ciphers, n Caovaplia Enigma • • • • Application to part-of-speech tagging, Awopprdli alignment Application to machine translation withoAuptp parallel t teoxtm Parallel development of cryptography aPnarda ltrleanls dlaetvioenlo Recently released NSA internal nReewcselnettlyter (1974-1997) 4. *** Break *** (30 minutes) 5. Unsolved ciphers (40 minutes) • Zodiac 340 (1969), including computatZioodnaial cw 3o4r0k • Voynich Manuscript (early 1400s), including computational ewarolyrk • Beale (1885) • Dorabella (1897) • Taman Shud (1948) • Kryptos (1990), including computatKiorynaplt owsor (k1 • McCormick (1999) • Shoeboxes in attics: DuPonceau jour- nal, Finnerana, SYP, Mopse, diptych 6. Writing as a code (20 minutes) • Does writing encode ideas, or does it encDoodees phonemes? • Ancient script decipherment Egyptian hieroglyphs Linear B Mayan glyphs – – – – wUgoarkritic, including computational Chinese N ¨ushu, including computational work • Automatic phonetic decipherment • Application to transliteration 7. Undeciphered writing systems (15 minutes) • Indus Valley Script (3300BC) • Linear A (1900BC) • Phaistos disc (1700BC?) • Rongorongo (1800s?) – 8. Conclusion and further questions (15 minutes) 3 About the Presenter Kevin Knight is a Senior Research Scientist and Fellow at the Information Sciences Institute of the University of Southern California (USC), and a Research Professor in USC’s Computer Science Department. He received a PhD in computer science from Carnegie Mellon University and a bachelor’s degree from Harvard University. Professor Knight’s research interests include natural language processing, machine translation, automata theory, and decipherment. In 2001, he co-founded Language Weaver, Inc., and in 2011, he served as President of the Association for Computational Linguistics. Dr. Knight has taught computer science courses at USC for more than fifteen years and co-authored the widely adopted textbook Artificial Intelligence. 4

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Decipherment Kevin Knight USC/ISI 4676 Admiralty Way Marina del Rey CA 90292 knight @ i i edu s . [sent-1, score-0.144]

2 Abstract The first natural language processing systems had a straightforward goal: decipher coded messages sent by the enemy. [sent-2, score-0.337]

3 This tutorial explores connections between early decipherment research and today’s NLP work. [sent-3, score-0.586]

4 We cover classic military and diplomatic ciphers, automatic decipherment algorithms, unsolved ciphers, language translation as decipherment, and analyzing ancient writing as decipherment. [sent-4, score-1.003]

5 1 Tutorial Overview The first natural language processing systems had a straightforward goal: decipher coded messages sent by the enemy. [sent-5, score-0.337]

6 Sixty years later, we have many more applications, including web search, question answering, summarization, speech recognition, and language translation. [sent-6, score-0.038]

7 This tutorial explores connections between early decipherment research and today’s NLP work. [sent-7, score-0.586]

8 We find that many ideas from the earlier era have become core to the field, while others still remain to be picked up and developed. [sent-8, score-0.088]

9 We first cover classic military and diplomatic cipher types, including complex substitution ciphers implemented in the first electro-mechanical encryption machines. [sent-9, score-1.301]

10 We look at mathematical tools (language recognition, frequency counting, smoothing) developed to decrypt such ciphers on proto-computers. [sent-10, score-0.619]

11 We show algorithms and extensive empirical results for solving different types of ciphers, and we show the role of algorithms in recent decipherments of historical documents. [sent-11, score-0.16]

12 We then look at how foreign language can be viewed as a code for English, a concept developed by Alan Turing and Warren Weaver. [sent-12, score-0.158]

13 We describe recently published work on building automatic translation systems from non-parallel data. [sent-13, score-0.032]

14 We also demonstrate how some of the same algorithmic tools can be applied to natural language tasks like part-of-speech tagging and word alignment. [sent-14, score-0.034]

15 Turning back to historical ciphers, we explore a number of unsolved ciphers, giving results of initial computer experiments on several of them. [sent-15, score-0.201]

16 Finally, we look briefly at writing as a way to encipher phoneme sequences, covering ancient scripts and modern applications. [sent-16, score-0.366]

17 Classical military/diplomatic ciphers (15 minutes) • 60 cipher types (ACA) • Ciphers vs. [sent-18, score-0.87]

18 codes • Enigma cipher: the mother of natural language processing computer analysis of text language recognition Good-Turing smoothing – – – 2. [sent-19, score-0.12]

19 Foreign language as a code (10 minutes) • • Alan Turing’s ”Thinking Machines” Warren Weaver’s Memorandum 3. [sent-20, score-0.054]

20 Automatic decipherment (55 minutes) • Cipher type detection • Substitution ciphers (simple, homophonic, polyalphabetic, etc) plaintext language recognition ∗ how much plaintext knowledge is – nheowede mdu 3 Proce diSnogfsia, of B thuleg5a r1iast, A Anungu aslt M4-9e t2in01g3 o. [sent-21, score-1.18]

21 Writing as a code (20 minutes) • Does writing encode ideas, or does it encDoodees phonemes? [sent-26, score-0.155]

22 • Ancient script decipherment Egyptian hieroglyphs Linear B Mayan glyphs – – – – wUgoarkritic, including computational Chinese N ¨ushu, including computational work • Automatic phonetic decipherment • Application to transliteration 7. [sent-27, score-0.827]

23 Undeciphered writing systems (15 minutes) • Indus Valley Script (3300BC) • Linear A (1900BC) • Phaistos disc (1700BC? [sent-28, score-0.148]

24 Professor Knight’s research interests include natural language processing, machine translation, automata theory, and decipherment. [sent-33, score-0.072]

25 , and in 2011, he served as President of the Association for Computational Linguistics. [sent-35, score-0.035]

26 Knight has taught computer science courses at USC for more than fifteen years and co-authored the widely adopted textbook Artificial Intelligence. [sent-37, score-0.147]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ciphers', 0.568), ('decipherment', 0.321), ('cipher', 0.302), ('minutes', 0.281), ('unsolved', 0.142), ('usc', 0.129), ('enigma', 0.123), ('ancient', 0.12), ('weaver', 0.109), ('knight', 0.104), ('writing', 0.101), ('turing', 0.101), ('decipherments', 0.101), ('plaintext', 0.101), ('diplomatic', 0.101), ('warren', 0.095), ('professor', 0.095), ('tutorial', 0.094), ('decipher', 0.09), ('coded', 0.083), ('military', 0.083), ('classic', 0.067), ('sent', 0.065), ('today', 0.064), ('explores', 0.06), ('connections', 0.06), ('script', 0.059), ('historical', 0.059), ('messages', 0.058), ('encryption', 0.054), ('apnarda', 0.054), ('encipher', 0.054), ('mccormick', 0.054), ('oanf', 0.054), ('senior', 0.054), ('sixty', 0.054), ('syp', 0.054), ('textbook', 0.054), ('undeciphered', 0.054), ('code', 0.054), ('foreign', 0.053), ('substitution', 0.052), ('ideas', 0.052), ('look', 0.051), ('early', 0.051), ('courses', 0.05), ('cryptography', 0.05), ('bachelor', 0.05), ('glyphs', 0.05), ('admiralty', 0.05), ('presenter', 0.05), ('alan', 0.048), ('scientist', 0.047), ('fellow', 0.047), ('homophonic', 0.047), ('zodiac', 0.047), ('disc', 0.047), ('navigating', 0.047), ('rey', 0.047), ('phonemes', 0.045), ('acsiasoticoinat', 0.045), ('anungu', 0.045), ('aslt', 0.045), ('casges', 0.045), ('cpoumtaptuiotantaioln', 0.045), ('egyptian', 0.045), ('fio', 0.045), ('lainlg', 0.045), ('luinisgtuicis', 0.045), ('recognition', 0.044), ('marina', 0.043), ('fifteen', 0.043), ('manuscript', 0.043), ('cw', 0.042), ('valley', 0.042), ('straightforward', 0.041), ('smoothing', 0.041), ('disnogfsia', 0.04), ('phoneme', 0.04), ('del', 0.04), ('southern', 0.039), ('automata', 0.038), ('turning', 0.038), ('war', 0.038), ('including', 0.038), ('sam', 0.037), ('thinking', 0.037), ('nal', 0.037), ('cover', 0.036), ('etc', 0.036), ('era', 0.036), ('harvard', 0.035), ('served', 0.035), ('mother', 0.035), ('kevin', 0.035), ('algorithmic', 0.034), ('interests', 0.034), ('phd', 0.033), ('president', 0.032), ('translation', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 108 acl-2013-Decipherment

Author: Kevin Knight

2 0.4769572 66 acl-2013-Beam Search for Solving Substitution Ciphers

Author: Malte Nuhn ; Julian Schamper ; Hermann Ney

Abstract: In this paper we address the problem of solving substitution ciphers using a beam search approach. We present a conceptually consistent and easy to implement method that improves the current state of the art for decipherment of substitution ciphers and is able to use high order n-gram language models. We show experiments with 1:1 substitution ciphers in which the guaranteed optimal solution for 3-gram language models has 38.6% decipherment error, while our approach achieves 4.13% decipherment error in a fraction of time by using a 6-gram language model. We also apply our approach to the famous Zodiac-408 cipher and obtain slightly bet- ter (and near to optimal) results than previously published. Unlike the previous state-of-the-art approach that uses additional word lists to evaluate possible decipherments, our approach only uses a letterbased 6-gram language model. Furthermore we use our algorithm to solve large vocabulary substitution ciphers and improve the best published decipherment error rate based on the Gigaword corpus of 7.8% to 6.0% error rate.

3 0.31208256 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers

Author: Malte Nuhn ; Hermann Ney

Abstract: In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard. We show that in this case the decipherment problem is equivalent to the quadratic assignment problem (QAP). To the best of our knowledge, this connection between the QAP and the decipherment problem has not been known in the literature before.

4 0.15936449 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

Author: Sujith Ravi

Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocab- ulary/corpora sizes. We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report for the first time—BLEU score results for a largescale MT task using only non-parallel data (EMEA corpus).

5 0.085511848 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we performs posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsu- pervised accuracy of 89% across the same set of languages.

6 0.054125108 382 acl-2013-Variational Inference for Structured NLP Models

7 0.043246463 146 acl-2013-Exploiting Social Media for Natural Language Processing: Bridging the Gap between Language-centric and Real-world Applications

8 0.037676845 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

9 0.03564449 313 acl-2013-Semantic Parsing with Combinatory Categorial Grammars

10 0.035292979 370 acl-2013-Unsupervised Transcription of Historical Documents

11 0.03485854 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

12 0.034450617 359 acl-2013-Translating Dialectal Arabic to English

13 0.032594033 257 acl-2013-Natural Language Models for Predicting Programming Comments

14 0.032444615 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

15 0.027882926 240 acl-2013-Microblogs as Parallel Corpora

16 0.027664399 329 acl-2013-Statistical Machine Translation Improves Question Retrieval in Community Question Answering via Matrix Factorization

17 0.026955741 324 acl-2013-Smatch: an Evaluation Metric for Semantic Feature Structures

18 0.026579723 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

19 0.02639875 349 acl-2013-The mathematics of language learning

20 0.025886418 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.074), (1, -0.027), (2, 0.031), (3, 0.008), (4, 0.005), (5, -0.022), (6, 0.045), (7, -0.01), (8, -0.07), (9, -0.036), (10, -0.056), (11, -0.109), (12, -0.082), (13, -0.255), (14, 0.019), (15, -0.391), (16, -0.045), (17, -0.156), (18, -0.108), (19, 0.264), (20, -0.006), (21, 0.03), (22, -0.146), (23, -0.159), (24, 0.08), (25, 0.051), (26, 0.12), (27, -0.053), (28, -0.078), (29, -0.056), (30, -0.031), (31, 0.043), (32, 0.031), (33, -0.066), (34, -0.054), (35, -0.019), (36, 0.018), (37, -0.017), (38, 0.111), (39, -0.041), (40, -0.003), (41, 0.049), (42, 0.043), (43, 0.064), (44, 0.038), (45, 0.02), (46, -0.025), (47, -0.061), (48, -0.012), (49, -0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97340935 108 acl-2013-Decipherment

Author: Kevin Knight

2 0.91766411 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers

Author: Malte Nuhn ; Hermann Ney

3 0.90808779 66 acl-2013-Beam Search for Solving Substitution Ciphers

Author: Malte Nuhn ; Julian Schamper ; Hermann Ney

4 0.36552215 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

Author: Sujith Ravi

5 0.25593805 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

6 0.23881643 370 acl-2013-Unsupervised Transcription of Historical Documents

7 0.18292348 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations

8 0.16117714 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

9 0.15929872 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts

10 0.15513237 171 acl-2013-Grammatical Error Correction Using Integer Linear Programming

11 0.14276822 324 acl-2013-Smatch: an Evaluation Metric for Semantic Feature Structures

12 0.13647847 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?

13 0.13475318 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation

14 0.13337445 382 acl-2013-Variational Inference for Structured NLP Models

15 0.1319809 1 acl-2013-"Let Everything Turn Well in Your Wife": Generation of Adult Humor Using Lexical Constraints

16 0.12836336 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

17 0.1279764 161 acl-2013-Fluid Construction Grammar for Historical and Evolutionary Linguistics

18 0.12538055 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study

19 0.12432951 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment

20 0.12121025 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.037), (6, 0.014), (11, 0.012), (24, 0.035), (26, 0.026), (28, 0.019), (35, 0.046), (42, 0.033), (48, 0.039), (56, 0.427), (70, 0.049), (90, 0.018), (95, 0.141)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8637833 108 acl-2013-Decipherment

Author: Kevin Knight

2 0.71633089 190 acl-2013-Implicatures and Nested Beliefs in Approximate Decentralized-POMDPs

Author: Adam Vogel ; Christopher Potts ; Dan Jurafsky

Abstract: Conversational implicatures involve reasoning about multiply nested belief structures. This complexity poses significant challenges for computational models of conversation and cognition. We show that agents in the multi-agent DecentralizedPOMDP reach implicature-rich interpretations simply as a by-product of the way they reason about each other to maximize joint utility. Our simulations involve a reference game of the sort studied in psychology and linguistics as well as a dynamic, interactional scenario involving implemented artificial agents.

3 0.61964577 65 acl-2013-BRAINSUP: Brainstorming Support for Creative Sentence Generation

Author: Gozde Ozbal ; Daniele Pighin ; Carlo Strapparava

Abstract: Daniele Pighin Google Inc. Z ¨urich, Switzerland danie le . pighin@ gmai l com . Carlo Strapparava FBK-irst Trento, Italy st rappa@ fbk . eu you”. As another scenario, creative sentence genWe present BRAINSUP, an extensible framework for the generation of creative sentences in which users are able to force several words to appear in the sentences and to control the generation process across several semantic dimensions, namely emotions, colors, domain relatedness and phonetic properties. We evaluate its performance on a creative sentence generation task, showing its capability of generating well-formed, catchy and effective sentences that have all the good qualities of slogans produced by human copywriters.

4 0.58397079 178 acl-2013-HEADY: News headline abstraction through event pattern clustering

Author: Enrique Alfonseca ; Daniele Pighin ; Guillermo Garrido

Abstract: This paper presents HEADY: a novel, abstractive approach for headline generation from news collections. From a web-scale corpus of English news, we mine syntactic patterns that a Noisy-OR model generalizes into event descriptions. At inference time, we query the model with the patterns observed in an unseen news collection, identify the event that better captures the gist of the collection and retrieve the most appropriate pattern to generate a headline. HEADY improves over a state-of-theart open-domain title abstraction method, bridging half of the gap that separates it from extractive methods using humangenerated titles in manual evaluations, and performs comparably to human-generated headlines as evaluated with ROUGE.

5 0.56623286 258 acl-2013-Neighbors Help: Bilingual Unsupervised WSD Using Context

Author: Sudha Bhingardive ; Samiulla Shaikh ; Pushpak Bhattacharyya

Abstract: Word Sense Disambiguation (WSD) is one of the toughest problems in NLP, and in WSD, verb disambiguation has proved to be extremely difficult, because of high degree of polysemy, too fine grained senses, absence of deep verb hierarchy and low inter annotator agreement in verb sense annotation. Unsupervised WSD has received widespread attention, but has performed poorly, specially on verbs. Recently an unsupervised bilingual EM based algorithm has been proposed, which makes use only of the raw counts of the translations in comparable corpora (Marathi and Hindi). But the performance of this approach is poor on verbs with accuracy level at 25-38%. We suggest a modifica- tion to this mentioned formulation, using context and semantic relatedness of neighboring words. An improvement of 17% 35% in the accuracy of verb WSD is obtained compared to the existing EM based approach. On a general note, the work can be looked upon as contributing to the framework of unsupervised WSD through context aware expectation maximization.

6 0.36864984 37 acl-2013-Adaptive Parser-Centric Text Normalization

7 0.36850166 66 acl-2013-Beam Search for Solving Substitution Ciphers

8 0.36542299 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection

9 0.36476374 336 acl-2013-Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews

10 0.36160952 255 acl-2013-Name-aware Machine Translation

11 0.36131412 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks

12 0.3599973 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information

13 0.35956812 289 acl-2013-QuEst - A translation quality estimation framework

14 0.35911271 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example

15 0.3582094 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

16 0.35819393 359 acl-2013-Translating Dialectal Arabic to English

17 0.35378176 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art

18 0.35317481 135 acl-2013-English-to-Russian MT evaluation campaign

19 0.35287046 240 acl-2013-Microblogs as Parallel Corpora

20 0.35070014 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers