acl acl2010 acl2010-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eric Corlett ; Gerald Penn
Abstract: Letter-substitution ciphers encode a document from a known or hypothesized language into an unknown writing system or an unknown encoding of a known writing system. It is a problem that can occur in a number of practical applications, such as in the problem of determining the encodings of electronic documents in which the language is known, but the encoding standard is not. It has also been used in relation to OCR applications. In this paper, we introduce an exact method for deciphering messages using a generalization of the Viterbi algorithm. We test this model on a set of ciphers developed from various web sites, and find that our algorithm has the potential to be a viable, practical method for efficiently solving decipherment problems.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Letter-substitution ciphers encode a document from a known or hypothesized language into an unknown writing system or an unknown encoding of a known writing system. [sent-3, score-0.69]
2 It is a problem that can occur in a number of practical applications, such as in the problem of determining the encodings of electronic documents in which the language is known, but the encoding standard is not. [sent-4, score-0.184]
3 We test this model on a set of ciphers developed from various web sites, and find that our algorithm has the potential to be a viable, practical method for efficiently solving decipherment problems. [sent-7, score-0.622]
4 1 Introduction Letter-substitution ciphers encode a document from a known language into an unknown writing system or an unknown encoding of a known writing system. [sent-8, score-0.69]
5 While this is not a problem in languages like English and Chinese, which have a small set of well known standard encodings such as ASCII, Big5 and Unicode, there are other languages such as Hindi in which there is no dominant encoding standard for the writing system. [sent-10, score-0.143]
6 In these languages, we would like to be able to automatically retrieve and display the information in electronic documents which use unknown encodings when we find them. [sent-11, score-0.19]
7 The purpose of this paper, then, is to simplify the problem of reading documents in unknown encodings by presenting a new algorithm to be used in their decipherment. [sent-14, score-0.196]
8 Our algorithm operates by running a search over the n-gram probabilities of possible solutions to the cipher, using a generalization of the Viterbi algorithm that is wrapped in an A* search, which determines at each step which partial solutions to expand. [sent-15, score-0.694]
9 We specifically consider the problem of finding decodings of electronic documents drawn from the internet, and we test our algorithm on ciphers drawn from randomly selected pages of Wikipedia. [sent-17, score-0.584]
10 It may seem at first that automatically decoding (as opposed to deciphering) a document is a simple matter, but studies have shown that simple algorithms such as letter frequency counting do not always produce optimal solutions (Bauer, 2007). [sent-19, score-0.311]
11 If the text from which a language model is trained is of a different genre than the plaintext of a cipher, the unigraph letter frequencies may differ substantially from those of the language model, and so frequency counting will be misleading. [sent-20, score-0.514]
12 Because of the perceived simplicity of the problem, however, little work was performed to understand its computational properties until Peleg and Rosenfeld (1979), who developed a method that repeatedly swaps letters in a cipher to find a maximum probability solution. [sent-21, score-0.463]
13 Since then, several different approaches to this problem have been suggested, some of which use word counts in the language to arrive at a solution (Hart, 1994), and some of ... [sent-22, score-0.138]
14 Unlike the present method, however, Ravi and Knight (2008) treat the decipherment of letter-substitution ciphers as an integer programming problem. [sent-28, score-0.575]
15 Clever though this constraint-based encoding is, their paper does not quantify the massive running times required to decode even very short documents with this sort of approach. [sent-29, score-0.187]
16 In any case, an exact method is available with a much more efficient A* search that is linear-time in the length of the cipher (though still horribly exponential in the size of the cipher and plaintext alphabets), and has the additional advantage of being massively parallelizable. [sent-31, score-0.78]
17 Ravi and Knight (2008) also seem to believe that short cipher texts are somehow inherently more difficult to solve than long cipher texts. [sent-32, score-0.731]
18 Uniform character models equivocate regardless of the length of the cipher, and sharp character models with many zeroes can quickly converge even on short ciphers of only a few characters. [sent-34, score-0.853]
19 In fact, we must use add-one smoothing to decipher texts of even modest lengths because even one unseen plain-text letter sequence is enough to knock out the correct solution. [sent-36, score-0.201]
20 Applications of decipherment are also explored by Nagy et al. [sent-38, score-0.117]
21 (1987), who use decipherment in the context of optical character recognition (OCR). [sent-39, score-0.182]
22 2 Terminology Substitution ciphers are defined by some permutation of a plaintext alphabet. [sent-42, score-1.298]
23 Every character of a plaintext string is consistently mapped to a single character of an output string using this permutation. [sent-43, score-0.872]
24 For example, if we took the string "hello world" to be the plaintext, then the string "ifmmp xpsme" would be a cipher that maps e to f, l to m, and so on. [sent-44, score-0.489]
25 It is easy to extend this kind of cipher so that the plaintext alphabet is different from the ciphertext alphabet, but still stands in a one-to-one correspondence to it. [sent-45, score-1.124]
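To make the definition concrete, here is a minimal Python sketch of such a cipher that reproduces the "hello world" to "ifmmp xpsme" example above; the shift-by-one permutation and all names are illustrative assumptions, not code from the paper.

    import string

    def make_shift_permutation():
        # Map each lowercase letter to its successor (z wraps to a);
        # the space maps to itself, as in the example above.
        letters = string.ascii_lowercase
        table = {a: b for a, b in zip(letters, letters[1:] + letters[0])}
        table[' '] = ' '
        return table

    def encipher(plaintext, table):
        # A substitution cipher consistently replaces every character
        # according to the permutation.
        return ''.join(table[ch] for ch in plaintext)

    table = make_shift_permutation()
    assert encipher('hello world', table) == 'ifmmp xpsme'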
26 Given a ciphertext C, we say that the set of characters used in C is the ciphertext alphabet ΣC, and that its size is nC. [sent-46, score-0.858]
27 Similarly, the entire possible plaintext alphabet is ΣP, and its size is nP. [sent-47, score-0.495]
28 Since nC is the number of letters actually used in the cipher, rather than the entire alphabet it is sampled from, we may find that nC < nP even when the two alphabets are the same. [sent-48, score-0.187]
29 We refer to the length of the cipher string C as clen. [sent-49, score-0.432]
30 Given the ciphertext C, we say that a partial solution of size k is a map σ = {p1 : c1, . . . , pk : ck}. [sent-54, score-0.708]
31 If for a partial solution σ0 we have that σ ⊂ σ0, then we say that σ0 extends σ. [sent-64, score-0.174]
32 If we have that σ ⊂ σ0, the size of σ0 is k+1, and σ is size k, we say that σ0 is an immediate extension of σ. [sent-65, score-0.113]
33 A full solution is a partial solution of size nC. [sent-66, score-0.497]
34 In the above example, σ1 = { : , d : e} would be a partial solution of size 2, and σ2 = { : , d : e, g : m} would be a partial solution of size 3 that immediately extends σ1. [sent-67, score-0.533]
35 A partial solution σT = { : , d : e, e : f, h : i, l : m, o : p, r : s, w : x} would be both a full solution and the correct one. [sent-68, score-0.45]
36 Every possible full solution to a cipher C will produce a plaintext string with some associated language model probability, and we will consider the best possible solution to be the one that gives the highest probability. [sent-70, score-1.059]
37 This plaintext can be found by treating all of the length clen strings S as being the output of different character mappings from C. [sent-72, score-0.698]
38 A string S that results from such a mapping is consistent with a partial solution σ iff, for every pi : ci ∈ σ, the character positions of C that map to pi are exactly the character positions with ci in C. [sent-73, score-0.675]
39 In our above example, we had C = "ifmmp xpsme", in which case we had clen = 11. [sent-74, score-0.114]
40 So mappings from C to "hhhhh hhhhh" or "hhhhhhhhhhh" would be consistent with a partial solution of size 0, while "hhhhh hhhhn" would be consistent with the size 2 partial solution σ = { : , n : e}. [sent-75, score-0.906]
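The consistency test can be sketched directly from this definition; representing σ as a Python dict from plaintext to ciphertext characters is an assumption made here for illustration.

    def consistent(S, C, sigma):
        # S is consistent with sigma iff, for every pair p : c in sigma,
        # the positions holding p in S are exactly those holding c in C.
        assert len(S) == len(C)
        for p, c in sigma.items():
            if ({i for i, s in enumerate(S) if s == p} !=
                    {i for i, x in enumerate(C) if x == c}):
                return False
        # S must also come from a character mapping: equal ciphertext
        # characters must receive equal plaintext characters.
        seen = {}
        for s, x in zip(S, C):
            if seen.setdefault(x, s) != s:
                return False
        return True

    C = 'ifmmp xpsme'
    assert consistent('hhhhh hhhhh', C, {})                    # size-0 solution
    assert consistent('hhhhh hhhhn', C, {' ': ' ', 'n': 'e'})  # size-2 solution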
41 If we start with an empty solution and iteratively choose the most likely remaining partial solution in this way, storing the extensions obtained in a priority heap as we go, we will eventually reach a solution of size nC. [sent-78, score-0.701]
42 Every extension of σ has a probability that is, at best, equal to that of σ, and every partial solution receives, at worst, a score equal to its best extension, because the score is potentially based on an inconsistent mapping that does not qualify as an extension. [sent-79, score-0.411]
43 Thus the first solution of size nC will be the best solution of size nC. [sent-81, score-0.37]
44 The order in which we add the letters c to partial solutions is the order of the distinct ciphertext letters in right-to-left order of their final occurrence in C. [sent-82, score-0.489]
45 Create a priority queue Q for partial solutions, ordered by highest probability. [sent-91, score-0.27]
46 while Q is not empty do Pop the best partial solution σ from Q. [sent-93, score-0.312]
47 if s = nC then return σ else For all p not in the range of σ, push the immediate extension σp onto Q with the score assigned to table cell G(rs+1, p, p) by GVit(σ, cs+1, rs+1) if it is non-zero. [sent-95, score-0.156]
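A minimal sketch of this outer A* loop is given below, assuming a scoring function gvit that returns an upper-bound probability for a one-letter extension (or 0 if the extension is impossible); all names are illustrative, not the authors' code.

    import heapq
    import itertools

    def a_star_decipher(cipher_order, plaintext_alphabet, gvit):
        # cipher_order is the fixed ordering c1, ..., cnC of the distinct
        # ciphertext letters; partial solutions are dicts p -> c.
        n_c = len(cipher_order)
        counter = itertools.count()  # tie-breaker so dicts are never compared
        queue = [(-1.0, next(counter), {})]  # min-heap over negated scores
        while queue:
            neg_score, _, sigma = heapq.heappop(queue)
            if len(sigma) == n_c:
                return sigma  # the first full solution popped is the best one
            c = cipher_order[len(sigma)]
            for p in plaintext_alphabet:
                if p in sigma:  # p is already used by the partial solution
                    continue
                extension = {**sigma, p: c}
                score = gvit(extension)
                if score > 0:
                    heapq.heappush(queue, (-score, next(counter), extension))
        return None  # no consistent full solution exists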
48 Unlike the real Viterbi algorithm, we must also observe the constraints of the input partial solution’s mapping. [sent-98, score-0.2]
49 A least-frequent-first regimen has the opposite problem, in which the rare occurrence of such letters in the ciphertext provides too few opportunities to potentially reduce the score of a candidate. [sent-101, score-0.358]
50 A typical decipherment involves multiple runs of this algorithm, each of which scores all of the immediate extensions, both tightening and lowering their scores relative to the score of the input partial solution. [sent-102, score-0.353]
51 The real Viterbi algorithm lacks these final two constraints, and would only store a single cell at G(i, l). [sent-104, score-0.137]
52 The table is completed by filling in the columns from i = 1 to clen in order. [sent-107, score-0.134]
53 Because we are using a trigram character model, the cells in the first and second columns must be primed with unigram and bigram probabilities. [sent-109, score-0.435]
54 The remaining probabilities are calculated by searching through the cells from the previous two columns, using the entry at the earlier column to indicate the probability of the best string up to that point, and searching through the trigram probabilities over two additional letters. [sent-110, score-0.371]
55 In order to decrease the search space, we add the further restriction that the solutions of every three character sequence must be consistent: if the ciphertext indicates that two adjacent letters are the same, then only the plaintext strings that map the same letter to each will be considered. [sent-113, score-1.337]
56 The number of letters that are forced to be consistent is three because consistency is enforced by removing inconsistent strings from consideration during trigram model evaluation. [sent-114, score-0.262]
57 Because every partial solution is only obtained by extending a solution of size one less, and extensions are only made in a predetermined order of cipher alphabet letters, every partial solution is only considered / extended once. [sent-115, score-1.341]
58 The nP × nP cells of every column i do not depend on each other, only on the cells of the previous two columns i−1 and i−2, as well as on the language model. [sent-117, score-0.289]
59 4 Experiment The above algorithm is designed for application to the transliteration of electronic documents, specifically the transliteration of websites, and it has been tested with this in mind. [sent-120, score-0.136]
60 A rough search over internet articles has shown that a length of 1000 to 11000 characters is a realistic length for many articles, although this can vary according to the genre of the page. [sent-124, score-0.157]
61 In the first set of tests, we chose the mean of the above lengths to be our sample size, and we created and decoded 10 ciphers of this size (i.e., 6000 characters). [sent-127, score-0.54]
62 We made these cipher texts by appending the contents of randomly chosen Wikipedia pages until they contained at least 6000 characters, and then using the first 6000 characters of the resulting files as the plaintexts of the cipher. [sent-130, score-0.475]
63 The plaintext for this set of tests was developed in the same way as the first set, and the input ciphertext lengths considered were 1000, 3500, 6000, 8500, 11000, and 13500 characters. [sent-135, score-0.792]
64 Each cell in the greenhouse is indexed by a plaintext letter and a character from the cipher. [sent-137, score-0.842]
65 The cells in the array give the best probabilities of any path passing through the greenhouse cell, given that the index character of the array maps to the character in column c, where c is the next ciphertext character to be fixed in the solution. [sent-139, score-1.191]
66 The cell at (d) is filled using the trigram probabilities and the probability of the path starting at (a). [sent-142, score-0.24]
67 In all of the data considered, the frequency of spaces was far higher than that of any other character, and so in any real application the character corresponding to the space can likely be guessed without difficulty. [sent-143, score-0.28]
68 The ciphers we have considered have therefore been simplified by assuming knowledge of which character corresponds to the space. [sent-144, score-0.64]
69 In the event that a trigram or bigram was found in the plaintext that was not counted in the language model, add-one smoothing was used. [sent-147, score-0.495]
70 The characters used in the language model were the upper and lower case letters, spaces, and full stops; other characters were skipped when counting the frequencies. [sent-150, score-0.223]
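The language model just described can be sketched in a few lines; the toy alphabet handling and every name below are simplifying assumptions rather than the authors' implementation.

    from collections import Counter

    # Upper- and lower-case letters, the space, and the full stop, as above.
    ALPHABET = set('abcdefghijklmnopqrstuvwxyz'
                   'ABCDEFGHIJKLMNOPQRSTUVWXYZ .')

    def train_counts(text):
        # Skip characters outside the alphabet when counting frequencies.
        kept = [ch for ch in text if ch in ALPHABET]
        trigrams = Counter(zip(kept, kept[1:], kept[2:]))
        bigrams = Counter(zip(kept, kept[1:]))
        return trigrams, bigrams

    def trigram_prob(trigrams, bigrams, a, b, c):
        # P(c | a b) with add-one smoothing, so one unseen plaintext
        # sequence does not knock out the correct solution.
        return (trigrams[(a, b, c)] + 1) / (bigrams[(a, b)] + len(ALPHABET))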
71 As discussed in the previous paragraph, the space character is assumed to be known. [sent-152, score-0.182]
72 These latter numbers give us insight into the quality of trigram probabilities as a heuristic for the A* search. [sent-154, score-0.149]
73 We judged the quality of the decoding by measuring the percentage of characters in the cipher alphabet that were correctly guessed, and also the word error rate of the plaintext generated by our solution. [sent-155, score-0.928]
74 The second metric is useful because a low-probability character in the ciphertext may be guessed wrongly without changing much of the actual plaintext. [sent-156, score-0.58]
75 Counting the actual number of word errors is meant as an estimate of how useful or readable the plaintext will be. [sent-157, score-0.382]
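Both measures are straightforward to compute; the sketch below assumes solutions are represented as dicts keyed by ciphertext character, and computes the word error rate as word-level edit distance, a standard definition that the text does not spell out.

    def char_accuracy(solution, truth):
        # Fraction of cipher-alphabet characters assigned their true letter.
        assert solution.keys() == truth.keys()
        return sum(solution[c] == truth[c] for c in truth) / len(truth)

    def word_error_rate(hypothesis, reference):
        h, r = hypothesis.split(), reference.split()
        # Word-level Levenshtein distance via dynamic programming.
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                d[i][j] = min(d[i - 1][j] + 1,
                              d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
        return d[len(r)][len(h)] / len(r)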
76 We would have liked to compare our results with those of Ravi and Knight (2008), but the method presented there was simply not feasible on texts and (case-sensitive) alphabets of this size with the computing hardware at our disposal. [sent-159, score-0.819]
77 Algorithm 2 Generalized Viterbi Algorithm GVit(σ, c, r); Input: partial solution σ, ciphertext character c, and index r into C. [sent-164, score-0.14]
78 for i = 4 to r do: for (l, k) such that σ ∪ {k : c, l : Ci} is consistent do: for j1, j2 such that σ ∪ {k : c, j2 : C[i−2], j1 : C[i−1], l : Ci} is consistent do: G(i, l, k) = max(G(i, l, k), G(i−2, j2, k) × P(j1 | j2 j2(back)) × P(l | j2 j1)); end for; end for; end for. [sent-165, score-0.211]
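A simplified sketch of this table fill follows. To avoid Algorithm 2's back-pointer bookkeeping, each cell is keyed here on the last two plaintext letters plus the hypothesized letter k for the cipher character c being fixed; prime2 and P3 stand in for the unigram/bigram priming and the smoothed trigram model, and every name is an illustrative assumption rather than the authors' implementation.

    def ok(sigma, pairs):
        # The union of sigma with the given (plaintext, ciphertext) pairs
        # must remain a one-to-one mapping for the hypothesis to survive.
        fwd, bwd = dict(sigma), {c: p for p, c in sigma.items()}
        for p, c in pairs:
            if fwd.setdefault(p, c) != c or bwd.setdefault(c, p) != p:
                return False
        return True

    def gvit(C, c, r, sigma, alphabet, prime2, P3):
        # Cells are keyed (previous letter, current letter, k), where k is
        # the hypothesized plaintext letter for the cipher character c.
        col = {}
        for j in alphabet:
            for l in alphabet:
                for k in alphabet:
                    if ok(sigma, [(k, c), (j, C[0]), (l, C[1])]):
                        col[(j, l, k)] = prime2(j, l)
        # Fill the remaining columns one trigram step at a time; three
        # adjacent positions are kept consistent at every step.
        for i in range(2, r):
            nxt = {}
            for (j0, j, k), prob in col.items():
                for l in alphabet:
                    pairs = [(k, c), (j0, C[i - 2]), (j, C[i - 1]), (l, C[i])]
                    if not ok(sigma, pairs):
                        continue
                    cand = prob * P3(l, j0, j)
                    if cand > nxt.get((j, l, k), 0.0):
                        nxt[(j, l, k)] = cand
            col = nxt
        # The heuristic score for extending sigma with k : c is the best
        # surviving cell value for that k.
        best = {}
        for (_, _, k), prob in col.items():
            best[k] = max(best.get(k, 0.0), prob)
        return best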
79 5 Results In our first set of tests, we measured the time consumption and accuracy of our algorithm over 10 ciphers taken from random texts that were 6000 characters long. [sent-166, score-0.686]
80 All running times reported in this section were obtained on a computer running Ubuntu Linux 8. [sent-171, score-0.15]
81 Column-level subcomputations in the greenhouse were dispatched to an NVIDIA Quadro FX 1700 GPU card that is attached through a 16-lane PCI Express adapter. [sent-174, score-0.138]
82 In our second set of tests, we measured the time consumption and accuracy of our algorithm over several prefixes of different lengths of a single 13500-character ciphertext. [sent-176, score-0.192]
83 This is a Zipf’s Law effect: misclassified characters come from poorly attested character trigrams, which are in turn found only in longer, rarer words. [sent-182, score-0.277]
84 The overall high accuracy is probably due to the large size of the texts relative to the unicity distance of an English letter-substitution cipher (Bauer, 2007). [sent-183, score-0.427]
85 The results do show, however, that character trigram probabilities are an effective indicator of the most likely solution, even when the language model and test data are from very different genres (here, the Wall Street Journal and Wikipedia, respectively). [sent-184, score-0.331]
86 As far as the running time of the algorithm goes, we see a substantial variance: from a few minutes to several hours for most of the longer ciphers, and some take longer than the threshold we gave in the experiment. [sent-191, score-0.155]
87 Desiring to reduce the variance of the running time, we look at the second set of tests for possible causes. [sent-193, score-0.125]
88 In the second test set, there is a general decrease in both the running time and the number of solutions expanded as the length of the ciphers increases. [sent-194, score-0.737]
89 In particular, the length 8500 cipher generates more solutions than the length 3500 or 6000 ones. [sent-198, score-0.558]
90 Recall that the ciphers in this section are all prefixes of the same string. [sent-199, score-0.511]
91 Because the algorithm fixes characters starting from the end of the cipher, these prefixes have very different character orderings c1, . [sent-200, score-0.41]
92 . . , cnC, and thus a very different order of partial solutions. [sent-203, score-0.174]
93 The running time of our algorithm depends very crucially on these initial conditions. [sent-204, score-0.122]
94 Perhaps most interestingly, we note that the number of enqueued partial solutions is in every case identical or nearly identical to the number of partial solutions expanded. [sent-205, score-0.728]
95 A more complex heuristic that can additionally rank non-zero probability solutions with more prescience would likely make a very great difference to the running time of this method. [sent-209, score-0.247]
96 1046 6 Conclusions In the above paper, we have presented an algorithm for solving letter-substitution ciphers, with an eye towards discovering unknown encoding standards in electronic documents on the fly. [sent-210, score-0.228]
97 In a test of our algorithm over ciphers drawn from Wikipedia, we found its accuracy to be 100% on the ciphers that it solved within a threshold of 12 hours, this being 80% of the total attempted. [sent-211, score-0.963]
98 There is, however, a great deal of room for improvement in the trigram model’s ability to rank partial solutions that are not eliminated outright. [sent-213, score-0.432]
99 Our approach makes no explicit attempt to account for noisy ciphers, in which characters are erroneously mapped, nor any attempt to account for more general substitution ciphers in which a single plaintext (resp. [sent-217, score-0.977]
100 ciphertext) letter can map to multiple ciphertext (resp. plaintext) letters, nor for ciphers in which ciphertext units correspond to larger units of plaintext such as syllables or words. [sent-219, score-1.165]
wordName wordTfidf (topN-words)
[('ciphers', 0.458), ('plaintext', 0.382), ('cipher', 0.351), ('ciphertext', 0.325), ('character', 0.182), ('partial', 0.174), ('solutions', 0.145), ('solution', 0.138), ('decipherment', 0.117), ('greenhouse', 0.115), ('trigram', 0.113), ('cells', 0.109), ('letter', 0.099), ('characters', 0.095), ('letters', 0.085), ('ci', 0.084), ('ravi', 0.079), ('clen', 0.076), ('gvit', 0.076), ('running', 0.075), ('alphabet', 0.066), ('viterbi', 0.065), ('queue', 0.065), ('cell', 0.064), ('knight', 0.062), ('consumption', 0.057), ('encodings', 0.057), ('hhhhh', 0.057), ('nc', 0.056), ('unknown', 0.054), ('prefixes', 0.053), ('trigrams', 0.052), ('tests', 0.05), ('string', 0.05), ('enqueued', 0.05), ('encoding', 0.048), ('size', 0.047), ('algorithm', 0.047), ('deciphering', 0.046), ('guessed', 0.046), ('substitution', 0.042), ('wikipedia', 0.042), ('electronic', 0.041), ('every', 0.04), ('writing', 0.038), ('aisltlen', 0.038), ('corlett', 0.038), ('dko', 0.038), ('fcoorns', 0.038), ('ifmmp', 0.038), ('knock', 0.038), ('mhz', 0.038), ('nagy', 0.038), ('nvidia', 0.038), ('peleg', 0.038), ('xpsme', 0.038), ('documents', 0.038), ('tl', 0.037), ('consistent', 0.037), ('alphabets', 0.036), ('probabilities', 0.036), ('lengths', 0.035), ('extensions', 0.035), ('immediate', 0.034), ('decoding', 0.034), ('bauer', 0.033), ('regimen', 0.033), ('end', 0.033), ('counting', 0.033), ('hours', 0.033), ('extension', 0.032), ('length', 0.031), ('priority', 0.031), ('columns', 0.031), ('clever', 0.031), ('restarts', 0.031), ('array', 0.03), ('texts', 0.029), ('zeros', 0.029), ('decrease', 0.028), ('runs', 0.028), ('badly', 0.027), ('ocr', 0.027), ('filling', 0.027), ('strings', 0.027), ('probability', 0.027), ('spaces', 0.026), ('decode', 0.026), ('mapped', 0.026), ('real', 0.026), ('onto', 0.026), ('generalization', 0.025), ('orderings', 0.025), ('pk', 0.025), ('transliteration', 0.024), ('bosch', 0.024), ('map', 0.024), ('relaxation', 0.023), ('card', 0.023), ('cutoff', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers
Author: Eric Corlett ; Gerald Penn
Abstract: Letter-substitution ciphers encode a document from a known or hypothesized language into an unknown writing system or an unknown encoding of a known writing system. It is a problem that can occur in a number of practical applications, such as in the problem of determining the encodings of electronic documents in which the language is known, but the encoding standard is not. It has also been used in relation to OCR applications. In this paper, we introduce an exact method for deciphering messages using a generalization of the Viterbi algorithm. We test this model on a set of ciphers developed from various web sites, and find that our algorithm has the potential to be a viable, practical method for efficiently solving decipherment problems.
2 0.12028427 16 acl-2010-A Statistical Model for Lost Language Decipherment
Author: Benjamin Snyder ; Regina Barzilay ; Kevin Knight
Abstract: In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and highlevel morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew.
3 0.082859822 112 acl-2010-Extracting Social Networks from Literary Fiction
Author: David Elson ; Nicholas Dames ; Kathleen McKeown
Abstract: We present a method for extracting social networks from literature, namely, nineteenth-century British novels and serials. We derive the networks from dialogue interactions, and thus our method depends on the ability to determine when two characters are in conversation. Our approach involves character name chunking, quoted speech attribution and conversation detection given the set of quotes. We extract features from the social networks and examine their correlation with one another, as well as with metadata such as the novel’s setting. Our results provide evidence that the majority of novels in this time period do not fit two characterizations provided by literacy scholars. Instead, our results suggest an alternative explanation for differences in social networks.
4 0.081446446 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices
Author: Matthew Skala ; Victoria Krakovna ; Janos Kramar ; Gerald Penn
Abstract: Constructing an encoding of a concept lattice using short bit vectors allows for efficient computation of join operations on the lattice. Join is the central operation any unification-based parser must support. We extend the traditional bit vector encoding, which represents join failure using the zero vector, to count any vector with less than a fixed number of one bits as failure. This allows non-joinable elements to share bits, resulting in a smaller vector size. A constraint solver is used to construct the encoding, and a variety of techniques are employed to find near-optimal solutions and handle timeouts. An evaluation is provided comparing the extended representation of failure with traditional bit vector techniques.
5 0.077482417 170 acl-2010-Letter-Phoneme Alignment: An Exploration
Author: Sittichai Jiampojamarn ; Grzegorz Kondrak
Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.
6 0.069197856 13 acl-2010-A Rational Model of Eye Movement Control in Reading
7 0.065399177 133 acl-2010-Hierarchical Search for Word Alignment
9 0.052580524 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization
10 0.049538888 96 acl-2010-Efficient Optimization of an MDL-Inspired Objective Function for Unsupervised Part-Of-Speech Tagging
11 0.049168121 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results
12 0.048692472 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing
13 0.047299746 169 acl-2010-Learning to Translate with Source and Target Syntax
14 0.045716528 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
15 0.044840228 220 acl-2010-Syntactic and Semantic Factors in Processing Difficulty: An Integrated Measure
16 0.044325609 217 acl-2010-String Extension Learning
17 0.044044428 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out
18 0.0416415 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar
19 0.040096298 119 acl-2010-Fixed Length Word Suffix for Factored Statistical Machine Translation
20 0.040055778 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
topicId topicWeight
[(0, -0.129), (1, -0.01), (2, -0.015), (3, -0.035), (4, 0.015), (5, -0.038), (6, 0.035), (7, -0.003), (8, 0.078), (9, -0.014), (10, -0.07), (11, 0.015), (12, 0.026), (13, -0.047), (14, -0.122), (15, -0.01), (16, -0.037), (17, 0.038), (18, -0.015), (19, 0.034), (20, -0.024), (21, 0.015), (22, -0.034), (23, -0.073), (24, 0.025), (25, -0.037), (26, -0.042), (27, -0.053), (28, -0.01), (29, 0.031), (30, -0.072), (31, -0.04), (32, 0.081), (33, -0.051), (34, -0.195), (35, -0.063), (36, 0.003), (37, -0.044), (38, -0.062), (39, -0.074), (40, 0.069), (41, 0.034), (42, -0.04), (43, -0.061), (44, -0.04), (45, -0.091), (46, 0.016), (47, 0.022), (48, 0.037), (49, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.92729712 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers
Author: Eric Corlett ; Gerald Penn
Abstract: Letter-substitution ciphers encode a document from a known or hypothesized language into an unknown writing system or an unknown encoding of a known writing system. It is a problem that can occur in a number of practical applications, such as in the problem of determining the encodings of electronic documents in which the language is known, but the encoding standard is not. It has also been used in relation to OCR applications. In this paper, we introduce an exact method for deciphering messages using a generalization of the Viterbi algorithm. We test this model on a set of ciphers developed from various web sites, and find that our algorithm has the potential to be a viable, practical method for efficiently solving decipherment problems.
2 0.70362836 16 acl-2010-A Statistical Model for Lost Language Decipherment
Author: Benjamin Snyder ; Regina Barzilay ; Kevin Knight
Abstract: In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and highlevel morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew.
3 0.59268576 116 acl-2010-Finding Cognate Groups Using Phylogenies
Author: David Hall ; Dan Klein
Abstract: A central problem in historical linguistics is the identification of historically related cognate words. We present a generative phylogenetic model for automatically inducing cognate group structure from unaligned word lists. Our model represents the process of transformation and transmission from ancestor word to daughter word, as well as the alignment between the word lists of the observed languages. We also present a novel method for simplifying complex weighted automata created during inference to counteract the otherwise exponential growth of message sizes. On the task of identifying cognates in a dataset of Romance words, our model significantly outperforms a baseline approach, increasing accuracy by as much as 80%. Finally, we demonstrate that our automatically induced groups can be used to successfully reconstruct ancestral words.
Author: Dong Yang ; Paul Dixon ; Sadaoki Furui
Abstract: This paper presents a joint optimization method of a two-step conditional random field (CRF) model for machine transliteration and a fast decoding algorithm for the proposed method. Our method lies in the category of direct orthographical mapping (DOM) between two languages without using any intermediate phonemic mapping. In the two-step CRF model, the first CRF segments an input word into chunks and the second one converts each chunk into one unit in the target language. In this paper, we propose a method to jointly optimize the two-step CRFs and also a fast algorithm to realize it. Our experiments show that the proposed method outperforms the well-known joint source channel model (JSCM) and our proposed fast algorithm decreases the decoding time significantly. Furthermore, combination of the proposed method and the JSCM gives further improvement, which outperforms state-of-the-art results in terms of top-1 accuracy.
5 0.5517413 112 acl-2010-Extracting Social Networks from Literary Fiction
Author: David Elson ; Nicholas Dames ; Kathleen McKeown
Abstract: We present a method for extracting social networks from literature, namely, nineteenth-century British novels and serials. We derive the networks from dialogue interactions, and thus our method depends on the ability to determine when two characters are in conversation. Our approach involves character name chunking, quoted speech attribution and conversation detection given the set of quotes. We extract features from the social networks and examine their correlation with one another, as well as with metadata such as the novel’s setting. Our results provide evidence that the majority of novels in this time period do not fit two characterizations provided by literacy scholars. Instead, our results suggest an alternative explanation for differences in social networks.
6 0.54811394 68 acl-2010-Conditional Random Fields for Word Hyphenation
7 0.49442118 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration
8 0.48188627 250 acl-2010-Untangling the Cross-Lingual Link Structure of Wikipedia
9 0.46729425 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices
10 0.45969084 13 acl-2010-A Rational Model of Eye Movement Control in Reading
11 0.45538014 98 acl-2010-Efficient Staggered Decoding for Sequence Labeling
12 0.43216524 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers
14 0.39184082 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization
15 0.38524824 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms
16 0.37209249 170 acl-2010-Letter-Phoneme Alignment: An Exploration
17 0.36681637 97 acl-2010-Efficient Path Counting Transducers for Minimum Bayes-Risk Decoding of Statistical Machine Translation Lattices
18 0.35567576 95 acl-2010-Efficient Inference through Cascades of Weighted Tree Transducers
19 0.33462322 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures
20 0.32759842 103 acl-2010-Estimating Strictly Piecewise Distributions
topicId topicWeight
[(14, 0.016), (25, 0.042), (39, 0.035), (42, 0.025), (52, 0.011), (59, 0.105), (62, 0.301), (72, 0.011), (73, 0.062), (78, 0.035), (83, 0.094), (84, 0.038), (98, 0.116)]
simIndex simValue paperId paperTitle
same-paper 1 0.74079251 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers
Author: Eric Corlett ; Gerald Penn
Abstract: Letter-substitution ciphers encode a document from a known or hypothesized language into an unknown writing system or an unknown encoding of a known writing system. It is a problem that can occur in a number of practical applications, such as in the problem of determining the encodings of electronic documents in which the language is known, but the encoding standard is not. It has also been used in relation to OCR applications. In this paper, we introduce an exact method for deciphering messages using a generalization of the Viterbi algorithm. We test this model on a set of ciphers developed from various web sites, and find that our algorithm has the potential to be a viable, practical method for efficiently solving decipherment problems.
2 0.54975414 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans
Author: Fei Huang ; Alexander Yates
Abstract: Most supervised language processing systems show a significant drop-off in performance when they are tested on text that comes from a domain significantly different from the domain of the training data. Semantic role labeling techniques are typically trained on newswire text, and in tests their performance on fiction is as much as 19% worse than their performance on newswire text. We investigate techniques for building open-domain semantic role labeling systems that approach the ideal of a train-once, use-anywhere system. We leverage recently-developed techniques for learning representations of text using latent-variable language models, and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling. In experiments, our novel system reduces error by 16% relative to the previous state of the art on out-of-domain text.
3 0.54731429 158 acl-2010-Latent Variable Models of Selectional Preference
Author: Diarmuid O Seaghdha
Abstract: This paper describes the application of so-called topic models to selectional preference induction. Three models related to Latent Dirichlet Allocation, a proven method for modelling document-word cooccurrences, are presented and evaluated on datasets of human plausibility judgements. Compared to previously proposed techniques, these models perform very competitively, especially for infrequent predicate-argument combinations where they exceed the quality of Web-scale predictions while using relatively little data.
4 0.54716009 65 acl-2010-Complexity Metrics in an Incremental Right-Corner Parser
Author: Stephen Wu ; Asaf Bachrach ; Carlos Cardenas ; William Schuler
Abstract: Hierarchical HMM (HHMM) parsers make promising cognitive models: while they use a bounded model of working memory and pursue incremental hypotheses in parallel, they still achieve parsing accuracies competitive with chart-based techniques. This paper aims to validate that a right-corner HHMM parser is also able to produce complexity metrics, which quantify a reader’s incremental difficulty in understanding a sentence. Besides defining standard metrics in the HHMM framework, a new metric, embedding difference, is also proposed, which tests the hypothesis that HHMM store elements represents syntactic working memory. Results show that HHMM surprisal outperforms all other evaluated metrics in predicting reading times, and that embedding difference makes a significant, independent contribution.
5 0.54608071 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
Author: Ivan Titov ; Mikhail Kozhevnikov
Abstract: We argue that groups of unannotated texts with overlapping and non-contradictory semantics represent a valuable source of information for learning semantic representations. A simple and efficient inference method recursively induces joint semantic representations for each group and discovers correspondence between lexical entries and latent semantic concepts. We consider the generative semantics-text correspondence model (Liang et al., 2009) and demonstrate that exploiting the noncontradiction relation between texts leads to substantial improvements over natural baselines on a problem of analyzing human-written weather forecasts.
6 0.54490852 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web
7 0.54358256 39 acl-2010-Automatic Generation of Story Highlights
8 0.54263079 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition
9 0.54231036 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields
10 0.54223496 148 acl-2010-Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
11 0.54124856 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data
12 0.54091823 120 acl-2010-Fully Unsupervised Core-Adjunct Argument Classification
13 0.54065919 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries
14 0.54065001 56 acl-2010-Bridging SMT and TM with Translation Recommendation
15 0.54036427 195 acl-2010-Phylogenetic Grammar Induction
16 0.54004997 162 acl-2010-Learning Common Grammar from Multilingual Corpus
17 0.53946537 42 acl-2010-Automatically Generating Annotator Rationales to Improve Sentiment Classification
18 0.53905666 244 acl-2010-TrustRank: Inducing Trust in Automatic Translations via Ranking
19 0.53844345 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns
20 0.53806412 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons