acl acl2010 acl2010-16 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benjamin Snyder ; Regina Barzilay ; Kevin Knight
Abstract: In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and highlevel morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper we propose a method for the automatic decipherment of lost languages. [sent-3, score-0.428]
2 Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. [sent-4, score-0.233]
3 We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and highlevel morphemic correspondences. [sent-5, score-0.214]
4 When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew. [sent-7, score-0.527]
5 1 Introduction Dozens of lost languages have been deciphered by humans in the last two centuries. [sent-8, score-0.237]
6 In each case, the decipherment has been considered a major intellectual breakthrough, often the culmination of decades of scholarly efforts. [sent-9, score-0.242]
7 Computers have played no role in the decipherment of any of these languages. [sent-10, score-0.242]
8 1 In this paper, we demonstrate that at least some of this logic and intuition can be successfully modeled, allowing computational tools to be used in the decipherment process. [sent-12, score-0.275]
9 1“Successful archaeological decipherment has turned out to require a synthesis of logic and intuition …” [sent-13, score-0.275]
10 Our definition of the computational decipherment task closely follows the setup typically faced by human decipherers (Robinson, 2002). [sent-19, score-0.276]
11 Our input consists of texts in a lost language and a corpus of non-parallel data in a known related language. [sent-20, score-0.233]
12 The decipherment itself involves two related subtasks: (i) finding the mapping between alphabets of the known and lost languages, and (ii) translating words in the lost language into corresponding cognates of the known language. [sent-21, score-0.921]
13 A common starting point is to compare letter and word frequencies between the lost and known languages. [sent-23, score-0.282]
14 In the presence of cognates the correct mapping between the languages will reveal similarities in frequency, both at the character and lexical level. [sent-24, score-0.373]
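The frequency comparison described above can be sketched in a few lines. This is a minimal illustration, not the paper's model: the toy corpora, the `char_frequencies` helper, and the absolute-difference `mapping_score` are hypothetical stand-ins for the idea that a correct alphabet mapping pairs characters of similar relative frequency.

```python
from collections import Counter

def char_frequencies(words):
    """Relative frequency of each character across a list of word types."""
    counts = Counter(c for w in words for c in w)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def mapping_score(mapping, lost_freq, known_freq):
    """Score a candidate letter mapping: 0 is best; more negative means
    the paired characters have less similar relative frequencies."""
    return -sum(abs(lost_freq[u] - known_freq.get(h, 0.0))
                for u, h in mapping.items())

# Toy stand-ins: digits play the role of unknown lost-language symbols.
lost_corpus = ["121", "13", "2212"]
known_corpus = ["aba", "ac", "bbab"]
lost_freq = char_frequencies(lost_corpus)
known_freq = char_frequencies(known_corpus)
```

Under this toy data the mapping 1→a, 2→b, 3→c matches frequencies exactly, so it scores higher than any permuted mapping.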
15 In addition, morphological analysis plays a crucial role here, as highly frequent morpheme correspondences can be particularly revealing. [sent-25, score-0.27]
16 In fact, these three strands of analysis (character frequency, morphology, and lexical frequency) are intertwined throughout the human decipherment process. [sent-26, score-0.242]
17 This model assumes that each word in the lost language is composed of morphemes which were generated with latent counterparts in the known language. [sent-29, score-0.365]
18 This allows us to assign probabilities based both on character-level correspondences (using a character-edit base distribution) as well as higher-level morpheme correspondences. [sent-31, score-0.261]
19 In addition, our model carries out an implicit morphological analysis of the lost language, utilizing the known morphological structure of the related language. [sent-32, score-0.385]
20 … and morpheme-level correspondences that humans have used in the manual decipherment process. [sent-35, score-0.294]
21 We assume that an accurate alphabetic mapping between related languages will be sparse in the following way: each letter will map to a very limited subset of letters in the other language. [sent-37, score-0.311]
22 For each pair of characters in the two languages, we posit an indicator variable which controls the prior likelihood of character substitutions. [sent-39, score-0.338]
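These indicator variables can be sketched as one binary λ per character pair, with a prior that penalizes fan-out. A minimal sketch: the toy alphabets and the exponential penalty in `log_prior` are hypothetical stand-ins, since this excerpt does not give the paper's exact prior form.

```python
from itertools import product

lost_alphabet = ["u1", "u2"]
known_alphabet = ["h1", "h2", "h3"]

# One binary indicator per character pair; lam[(u, h)] = 1 means the
# substitution u -> h is admitted by the prior.
lam = {pair: 0 for pair in product(lost_alphabet, known_alphabet)}
lam[("u1", "h1")] = 1
lam[("u2", "h2")] = 1
lam[("u2", "h3")] = 1

def fanout(u, lam):
    """c(u) = sum over h of lambda_(u,h): how many known-language letters
    the lost-language letter u is allowed to map to."""
    return sum(v for (uu, _), v in lam.items() if uu == u)

def log_prior(lam, penalty=2.0):
    """Hypothetical sparsity prior: every mapping beyond the first for a
    given lost letter costs `penalty` in log space."""
    return -penalty * sum(max(0, fanout(u, lam) - 1) for u in lost_alphabet)
```

Here u1 maps to one letter (no cost) and u2 to two (one unit of penalty), so sparse configurations are preferred a priori.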
23 We compare our method against the only existing decipherment baseline, an HMM-based character substitution cipher (Knight and Yamada, 1999; Knight et al. [sent-43, score-0.439]
24 2 Related Work Our work on decipherment has connections to three lines of work in statistical NLP. [sent-48, score-0.242]
25 First, our work relates to research on cognate identification (Lowe and Mazaudon, 1994; Guy, 1994; Kondrak, 2001 ; Bouchard et al. [sent-49, score-0.222]
26 For instance, some methods employ a hand-coded similarity function (Kondrak, 2001), while others assume knowledge of the phonetic mapping or require parallel cognate pairs to learn a similarity function (Bouchard et al. [sent-52, score-0.286]
27 While this research has similar goals, it typically builds on information or resources unavailable for ancient texts, such as comparable corpora, a seed lexicon, and cognate information (Fung and McKeown, 1997; Rapp, 1999; Koehn and Knight, 2002; Haghighi et al. [sent-55, score-0.277]
28 Moreover, distributional methods that rely on co-occurrence analysis operate over large corpora, which are typically unavailable for a lost language. [sent-57, score-0.186]
29 This method “makes the text speak” by gleaning character-to-sound mappings from non-parallel character and sound sequences. [sent-60, score-0.214]
30 While lost languages are gaining increasing interest in the NLP community (Knight and Sproat, 2009), there have been no successful attempts at their automatic decipherment. [sent-63, score-0.214]
31 Charles Virolleaud, who led the initial decipherment effort, recognized that the script was likely alphabetic, since the inscribed words consisted of only thirty distinct symbols. [sent-66, score-0.291]
32 Bootstrapping from this finding, Bauer found words in the tablets that were likely to serve as cognates to Hebrew words— e. [sent-70, score-0.25]
33 What made the final decipherment possible was a sheer stroke of luck— Bauer guessed that a word inscribed on an ax discovered in the Ras Shamra excavations was the Ugaritic word for ax. [sent-74, score-0.265]
34 These differences result in significant divergence between Hebrew and Ugaritic cognates, thereby complicating the decipherment process. [sent-85, score-0.242]
35 – 4 Problem Formulation We are given a corpus in a lost language and a nonparallel corpus in a related language from the same language family. [sent-86, score-0.186]
36 Our primary goal is to translate words in the unknown language by mapping them to cognates in the known language. [sent-87, score-0.283]
37 As part of this process, we induce a lower-level mapping between the letters of the two alphabets, capturing the regular phonetic correspondences found in cognates. [sent-88, score-0.185]
38 We make several assumptions about the writing system of the lost language. [sent-89, score-0.186]
39 We also make a mild assumption about the morphology of the lost language. [sent-94, score-0.186]
40 While the correct morphological analysis of words in the lost language must be learned, we assume that the inventory and frequencies of prefixes and suffixes in the known language are given. [sent-97, score-0.341]
41 In summary, the observed input to the model consists of two elements: (i) a list of unanalyzed word types derived from a corpus in the lost language, and (ii) a morphologically analyzed lexicon in a known related language derived from a separate corpus, in our case non-parallel. [sent-98, score-0.258]
42 If we know this lost language to be closely related to English, we can surmise that these two endings correspond to the English verbal suffixes -ed and -s. [sent-102, score-0.186]
43 As this example illustrates, human decipherment efforts proceed by discovering both character-level and morpheme-level correspondences. [sent-106, score-0.242]
44 This interplay implicitly relies on a morphological analysis of words in the lost language, while utilizing knowledge of the known language’s lexicon and morphology. [sent-107, score-0.334]
45 One final intuition our model should capture is the sparsity of the alphabetic correspondence between related languages. [sent-108, score-0.206]
46 As a result, each character in one language will map to a small number of characters in the other language (typically one, but sometimes two or three). [sent-110, score-0.196]
47 2 Model Structure Our model posits that every observed word in the lost language is composed of a sequence of morphemes (prefix, stem, suffix). [sent-114, score-0.244]
48 Furthermore we posit that each morpheme was probabilistically generated jointly with a latent counterpart in the known language. [sent-115, score-0.264]
49 Our goal is to find those counterparts that lead to high frequency correspondences both at the character and morpheme level. [sent-116, score-0.36]
50 We resolve this tension by employing a non-parametric Bayesian model: the distributions over bilingual morpheme pairs assign probability based on recurrent patterns at the morpheme level. [sent-119, score-0.374]
51 Morpheme-pair distributions: draw a set of distributions on bilingual morpheme pairs Gstm, Gpre|stm, Gsuf|stm. [sent-127, score-0.24]
52 Word generation: draw pairs of cognates in the lost and known language, as well as words in the lost language with no cognate counterpart. [sent-129, score-0.853]
53 Figure 1: Plate diagram of the decipherment model. [sent-132, score-0.242]
54 The structural sparsity indicator variables λ⃗ determine the values v⃗ of the hyperparameters of the base distribution G0. [sent-133, score-0.341]
55 The morpheme-pair distributions Gstm, Gpre|stm, Gsuf|stm directly assign probabilities to highly frequent morpheme pairs. [sent-135, score-0.212]
56 Structural Sparsity The first step of the generative process provides a control on the sparsity of edit-operation probabilities, encoding the linguistic intuition that the correct character-level mappings should be sparse. [sent-137, score-0.184]
57 The set of edit operations includes character substitutions, insertions, and deletions, as well as a special end symbol: {(u, h), (ϵ, h), (u, ϵ), END} (where u and h range over characters in the lost and known languages, respectively). [sent-138, score-0.191]
58 For each edit operation e we posit a corresponding indicator variable λe. [sent-139, score-0.185]
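The edit-operation inventory from the sentences above can be enumerated directly. This sketch only builds the set {(u, h), (ϵ, h), (u, ϵ), END}; the function and constant names are mine.

```python
EPS = "ϵ"  # the empty character used in insertions and deletions

def edit_operations(lost_alphabet, known_alphabet):
    """Enumerate {(u, h), (ϵ, h), (u, ϵ), END}: substitutions,
    insertions, deletions, and the special end symbol."""
    ops = [(u, h) for u in lost_alphabet for h in known_alphabet]  # substitutions
    ops += [(EPS, h) for h in known_alphabet]                      # insertions
    ops += [(u, EPS) for u in lost_alphabet]                       # deletions
    ops.append("END")
    return ops
```

The inventory has |U|·|H| + |H| + |U| + 1 operations, one indicator variable λe per operation.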
59 We then define a joint prior over these variables to encourage sparse character mappings. [sent-141, score-0.21]
60 More precisely, for each character u in the lost language, we count the number of mappings c(u) = ∑_h λ_(u,h). [sent-143, score-0.4]
61 Probabilities over edit sequences (and consequently on bilingual morpheme pairs) are then defined according to G0 as: P(e⃗) = ∏_i ϕ_{e_i} · q(#ins(e⃗), #del(e⃗)). We observe that the average Ugaritic word is over two letters longer than the average Hebrew word. [sent-150, score-0.296]
62 Thus, occurrences of Hebrew character insertions are a priori likely, and Ugaritic character deletions are very unlikely. [sent-151, score-0.324]
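In log space, the edit-sequence probability above is a sum of per-operation terms plus the insertion/deletion penalty. A minimal sketch, where `log_phi` and `log_q` are supplied by the caller (in the model, ϕ⃗ is drawn from a Dirichlet whose hyperparameters the sparsity indicators control; both pieces are abstracted away here):

```python
import math

def log_prob_edit_sequence(edits, log_phi, log_q):
    """log P(e) = sum_i log phi_{e_i} + log q(#ins(e), #del(e)).
    `log_phi`: log probability of each edit operation;
    `log_q`: penalty on the insertion and deletion counts."""
    n_ins = sum(1 for (u, h) in edits if u == "ϵ")  # (ϵ, h) insertions
    n_del = sum(1 for (u, h) in edits if h == "ϵ")  # (u, ϵ) deletions
    return sum(log_phi[e] for e in edits) + log_q(n_ins, n_del)
```

A `log_q` that rewards insertions and heavily penalizes deletions would encode the Ugaritic/Hebrew length observation above.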
63 Morpheme-pair Distributions Next we draw a series of distributions which directly assign prob- ability to morpheme pairs. [sent-166, score-0.214]
64 First the model draws a boolean variable ci to determine whether word i in the lost language has a cognate in the known language, according to some prior P(ci). [sent-174, score-0.579]
65 If ci = 1, then a cognate word pair (u, h) is produced: (ustm, hstm) ∼ Gstm, (upre, hpre) ∼ Gpre|stm, (usuf, hsuf) ∼ Gsuf|stm, with u = upre ustm usuf and h = hpre hstm hsuf. Otherwise, a lone word u is generated according to a uniform character-level language model. [sent-175, score-0.3]
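The word-generation step can be mocked up as follows. The `draw_*` callables are hypothetical stand-ins for samples from Gstm, Gpre|stm, and Gsuf|stm; the point is the control flow: a Bernoulli cognate indicator c, then either a jointly generated (u, h) pair assembled from morpheme pairs, or a lone word.

```python
import random

def generate_word(p_cognate, draw_stem_pair, draw_prefix_pair,
                  draw_suffix_pair, draw_lone_word, rng):
    """One word-generation draw: with probability p_cognate, emit a cognate
    pair (u, h) assembled from prefix/stem/suffix morpheme pairs; otherwise
    emit a lone lost-language word with no counterpart."""
    c = 1 if rng.random() < p_cognate else 0
    if c:
        u_stm, h_stm = draw_stem_pair()
        u_pre, h_pre = draw_prefix_pair(u_stm)  # prefix drawn given the stem
        u_suf, h_suf = draw_suffix_pair(u_stm)  # suffix drawn given the stem
        return c, u_pre + u_stm + u_suf, h_pre + h_stm + h_suf
    return c, draw_lone_word(), None
```

With deterministic stub distributions, the cognate branch concatenates the three lost-language morphemes into u and the three known-language morphemes into h.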
66 In summary, this model structure captures both character and lexical level correspondences, while utilizing morphological knowledge of the known language. [sent-176, score-0.255]
67 An additional feature of this multilayered model structure is that each distribution over morpheme pairs is derived from the single character-level base distribution G0. [sent-177, score-0.25]
68 As a result, any character-level mappings learned from one type of morphological correspondence will be propagated to all other morpheme distributions. [sent-178, score-0.3]
69 6 Inference For each word ui in our undeciphered language we predict a morphological segmentation (upreustmusuf)i and corresponding cognate in the known language (hprehstmhsuf)i. [sent-180, score-0.544]
70 We then approximate the marginal probabilities for undeciphered word ui by summing over all the samples, and predicting the analysis with highest probability. [sent-185, score-0.196]
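Prediction from Gibbs samples reduces to counting: approximate each word's marginal by the empirical distribution of its sampled analyses, then take the mode. A minimal sketch under that reading:

```python
from collections import Counter

def predict_from_samples(samples):
    """Approximate a word's marginal over analyses by the empirical
    frequency of each analysis across Gibbs samples; predict the mode."""
    analysis, _ = Counter(samples).most_common(1)[0]
    return analysis
```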
71 In our sampling algorithm, we avoid sampling the base distribution G0 and the derived morpheme-pair distributions (Gstm etc. [sent-186, score-0.198]
72 We explicitly sample the sparsity indicator variables λ⃗, the cognate indicator variables ci, and latent word analyses (segmentations and Hebrew counterparts). [sent-188, score-0.559]
73 1 Sampling Word Analyses For each undeciphered word, we need to sample a morphological segmentation (upre, ustm, usuf)i along with latent morphemes in the known language (hpre, hstm, hsuf)i. [sent-192, score-0.384]
74 First we sample the morphological segmentation of ui, along with the part-of-speech pos of the latent stem cognate. [sent-195, score-0.214]
75 2 Sampling Sparsity Indicators Recall that each sparsity indicator λe determines the value of the corresponding hyperparameter ve of the Dirichlet prior for the character-edit base distribution G0. [sent-205, score-0.281]
76 For example, for each type of edit-sequence which has been sampled (and may now occur many times throughout the data), we consider a single joint move to another edit-sequence e⃗′ (both of which yield the same lost language morpheme u). [sent-216, score-0.355]
77 5 Implementation Details Many of the steps detailed above involve the consideration of all possible edit-sequences consistent with (i) a particular undeciphered word ui and (ii) the entire lexicon of words in the known language (or some subset of words with a particular part-of-speech). [sent-221, score-0.239]
78 Table 1: Accuracy of cognate translations, measured with respect to complete word-forms and morphemes, for the HMM-based substitution cipher baseline, our complete model, and our model without the structural sparsity priors. [sent-246, score-0.382]
79 To evaluate the output of our model, we annotated the words in the Ugaritic lexicon with the corresponding Hebrew cognates found in the standard reference dictionary (del Olmo Lete and Sanmartín, 2004). [sent-248, score-0.206]
80 Overall, we identified Hebrew cognates for 2,155 word forms, covering almost 1/3 of the Ugaritic vocabulary. [sent-255, score-0.181]
81 4 8 Evaluation Tasks and Results We evaluate our model on four separate decipherment tasks: (i) Learning alphabetic mappings, (ii) translating cognates, (iii) identifying cognates, and (iv) morphological segmentation. [sent-256, score-0.422]
82 As a baseline for the first three of these tasks (learning alphabetic mappings and translating and identifying cognates), we adapt the HMM-based method of Knight et al. [sent-257, score-0.186]
83 Hebrew character trigram transition probabilities are estimated using the Hebrew Bible, and Hebrew to Ugaritic character emission probabilities are learned using EM. [sent-261, score-0.322]
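The baseline's known-language model is a character trigram model estimated from monolingual text. A sketch with add-one smoothing (the excerpt does not say which smoothing the baseline used, so that choice, and the names, are mine):

```python
from collections import Counter

def trigram_lm(text, alphabet):
    """Add-one-smoothed character trigram model P(c | two-character
    history), estimated from monolingual text (the baseline uses the
    Hebrew Bible; here a toy string)."""
    padded = "##" + text                                   # '#' pads the initial history
    tri = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    bi = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    vocab = len(alphabet)

    def prob(c, history):
        return (tri[history + c] + 1) / (bi[history] + vocab)
    return prob
```

The emission side of the baseline (Hebrew letter to Ugaritic letter) would then be trained with EM against such a transition model.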
84 Finally, the highest prob4We are confident that a large majority of Ugaritic words with known Hebrew cognates were thus identified. [sent-262, score-0.228]
85 Alphabetic Mapping The first essential step towards successful decipherment is recovering the mapping between the symbols of the lost language and the alphabet of a known language. [sent-265, score-0.507]
86 We recover our model’s predicted alphabetic mappings by simply examining the sampled values of the binary indicator variables λu,h for each Ugaritic-Hebrew letter pair (u, h) . [sent-270, score-0.362]
87 Due to our structural sparsity prior the predicted mappings are sparse: each Ugaritic letter maps to only a single Hebrew letter, and most Hebrew letters map to only a single Ugaritic letter. [sent-271, score-0.369]
88 To recover alphabetic mappings from the HMM substitution cipher baseline, we predict the Hebrew letter h which maximizes the model’s probability P(h|u), for each Ugaritic letter u. [sent-272, score-0.3]
89 Cognate Decipherment We compare the decipherment accuracy for Ugaritic words that have corresponding Hebrew cognates. [sent-278, score-0.242]
90 As Table 1 shows, our method correctly translates over 60% of all distinct Ugaritic word-forms with Hebrew cognates and over 71% of the individual morphemes that compose them, outperforming the baseline by significant margins. [sent-280, score-0.265]
91 Accuracy improves when the frequency of the word-forms is taken into account (token-level evaluation), indicating that the model is able to decipher frequent words more accurately than infrequent ones. Figure 2: ROC curve for cognate identification. [sent-281, score-0.222]
92 We also measure the average Levenshtein distance between predicted and actual cognate word-forms. [sent-283, score-0.222]
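Levenshtein distance itself is standard; for completeness, a compact dynamic-programming implementation of the metric used to compare predicted and gold cognate forms:

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]
```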
93 As Table 1shows, performance degrades significantly in the absence of these priors, indicating the importance of modeling the sparsity of character mappings. [sent-288, score-0.201]
94 Cognate identification We evaluate our model’s ability to identify cognates using the sampled indicator variables ci. [sent-289, score-0.308]
95 For both our model and the baseline, we can vary the threshold for cognate identification by raising or lowering the cognate prior P(ci). [sent-292, score-0.489]
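Threshold sweeping for the ROC curve can be sketched directly: score each word (e.g. by its sampled posterior of ci = 1), then sweep a cutoff and record false and true positive rates. Function and argument names here are hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """Sweep a decision threshold over per-word cognate scores and return
    (false positive rate, true positive rate) pairs. `labels` are gold
    cognate/non-cognate indicators; `scores` might be the sampled
    posterior probability that c_i = 1."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points
```

Raising the prior (or the cutoff here) trades recall for precision, tracing out the curve in Figure 2.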
96 In practice for our model, we use a high cognate prior, thus only ruling out … [sent-296, score-0.222]
97 9 Conclusion and Future Work In this paper we proposed a method for the automatic decipherment of lost languages. [sent-309, score-0.428]
98 Finally, we intend to explore our model’s predictive power when the family of the lost language is unknown. [sent-316, score-0.186]
99 An algorithm for identifying cognates in bilingual wordlists and its applicability to machine translation. [sent-362, score-0.207]
100 Identification of cognates and recurrent sound correspondences in word lists. [sent-403, score-0.256]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 16 acl-2010-A Statistical Model for Lost Language Decipherment
2 0.22179411 116 acl-2010-Finding Cognate Groups Using Phylogenies
Author: David Hall ; Dan Klein
Abstract: A central problem in historical linguistics is the identification of historically related cognate words. We present a generative phylogenetic model for automatically inducing cognate group structure from unaligned word lists. Our model represents the process of transformation and transmission from ancestor word to daughter word, as well as the alignment between the word lists of the observed languages. We also present a novel method for simplifying complex weighted automata created during inference to counteract the otherwise exponential growth of message sizes. On the task of identifying cognates in a dataset of Romance words, our model significantly outperforms a baseline approach, increasing accuracy by as much as 80%. Finally, we demonstrate that our automatically induced groups can be used to successfully reconstruct ancestral words.
3 0.12028427 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers
Author: Eric Corlett ; Gerald Penn
Abstract: Letter-substitution ciphers encode a document from a known or hypothesized language into an unknown writing system or an unknown encoding of a known writing system. It is a problem that can occur in a number of practical applications, such as in the problem of determining the encodings of electronic documents in which the language is known, but the encoding standard is not. It has also been used in relation to OCR applications. In this paper, we introduce an exact method for deciphering messages using a generalization of the Viterbi algorithm. We test this model on a set of ciphers developed from various web sites, and find that our algorithm has the potential to be a viable, practical method for efficiently solving decipherment problems.
4 0.11314882 11 acl-2010-A New Approach to Improving Multilingual Summarization Using a Genetic Algorithm
Author: Marina Litvak ; Mark Last ; Menahem Friedman
Abstract: Automated summarization methods can be defined as “language-independent,” if they are not based on any languagespecific knowledge. Such methods can be used for multilingual summarization defined by Mani (2001) as “processing several languages, with summary in the same language as input.” In this paper, we introduce MUSE, a languageindependent approach for extractive summarization based on the linear optimization of several sentence ranking measures using a genetic algorithm. We tested our methodology on two languages—English and Hebrew—and evaluated its performance with ROUGE-1 Recall vs. state-of-the-art extractive summarization approaches. Our results show that MUSE performs better than the best known multilingual approach (TextRank1) in both languages. Moreover, our experimental results on a bilingual (English and Hebrew) document collection suggest that MUSE does not need to be retrained on each language and the same model can be used across at least two different languages.
5 0.09466742 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures
Author: Daphna Shezaf ; Ari Rappoport
Abstract: Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.
6 0.067122139 170 acl-2010-Letter-Phoneme Alignment: An Exploration
7 0.064745255 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation
8 0.061766502 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery
10 0.055253856 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages
11 0.052846666 112 acl-2010-Extracting Social Networks from Literary Fiction
12 0.050923664 221 acl-2010-Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
13 0.049573708 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration
14 0.047582693 119 acl-2010-Fixed Length Word Suffix for Factored Statistical Machine Translation
15 0.046541136 13 acl-2010-A Rational Model of Eye Movement Control in Reading
16 0.044653103 158 acl-2010-Latent Variable Models of Selectional Preference
17 0.044473238 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
18 0.043255981 213 acl-2010-Simultaneous Tokenization and Part-Of-Speech Tagging for Arabic without a Morphological Analyzer
19 0.04301047 46 acl-2010-Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
20 0.040726054 214 acl-2010-Sparsity in Dependency Grammar Induction
simIndex simValue paperId paperTitle
same-paper 1 0.92070115 16 acl-2010-A Statistical Model for Lost Language Decipherment
2 0.71941876 116 acl-2010-Finding Cognate Groups Using Phylogenies
Author: Sebastian Spiegler ; Peter A. Flach
Abstract: This paper demonstrates that the use of ensemble methods and carefully calibrating the decision threshold can significantly improve the performance of machine learning methods for morphological word decomposition. We employ two algorithms which come from a family of generative probabilistic models. The models consider segment boundaries as hidden variables and include probabilities for letter transitions within segments. The advantage of this model family is that it can learn from small datasets and easily generalises to larger datasets. The first algorithm PROMODES, which participated in the Morpho Challenge 2009 (an international competition for unsupervised morphological analysis) employs a lower order model whereas the second algorithm PROMODES-H is a novel development of the first using a higher order model. We present the mathematical description for both algorithms, conduct experiments on the morphologically rich language Zulu and compare characteristics of both algorithms based on the experimental results.
4 0.62872463 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers
5 0.54624635 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers
Author: Vipul Mittal
Abstract: In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string either encoded as a Unicode string or as a Roman transliterated string, and the output is a set of possible splits with weights associated with each of them. We followed two different approaches to segment a Sanskrit text using sandhi rules extracted from a parallel corpus of manually sandhi-split text. While the first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word, the second approach generates all possible segmentations and validates each constituent using a morph analyzer.
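The second approach, generating all possible segmentations and validating each constituent, can be sketched in miniature. Here simple lexicon membership stands in for the morph analyzer, and the toy entries are made up; real sandhi splitting must also undo sound changes at segment boundaries, which this sketch ignores:

```python
def segmentations(s, is_valid_word):
    """Enumerate every way to split s into segments, keeping only splits
    in which the validator accepts every segment."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if is_valid_word(prefix):
            for rest in segmentations(s[i:], is_valid_word):
                results.append([prefix] + rest)
    return results

# Toy lexicon (illustrative strings, not a real Sanskrit analysis).
lexicon = {"nava", "iti", "na", "va", "it", "i"}
print(segmentations("navaiti", lexicon.__contains__))
```

Weights could then be attached to each surviving split, as the abstract describes, to rank the candidates.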
6 0.47729188 68 acl-2010-Conditional Random Fields for Word Hyphenation
7 0.45363215 234 acl-2010-The Use of Formal Language Models in the Typology of the Morphology of Amerindian Languages
10 0.37933108 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery
11 0.35255161 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures
12 0.34159958 13 acl-2010-A Rational Model of Eye Movement Control in Reading
13 0.31036845 112 acl-2010-Extracting Social Networks from Literary Fiction
14 0.30750316 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation
15 0.30355683 95 acl-2010-Efficient Inference through Cascades of Weighted Tree Transducers
16 0.29847863 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration
17 0.29513672 11 acl-2010-A New Approach to Improving Multilingual Summarization Using a Genetic Algorithm
18 0.2945039 213 acl-2010-Simultaneous Tokenization and Part-Of-Speech Tagging for Arabic without a Morphological Analyzer
19 0.29024988 61 acl-2010-Combining Data and Mathematical Models of Language Change
20 0.28212389 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages
topicId topicWeight
[(4, 0.011), (7, 0.013), (14, 0.024), (16, 0.011), (25, 0.054), (36, 0.285), (39, 0.013), (42, 0.025), (44, 0.018), (59, 0.077), (62, 0.019), (69, 0.03), (71, 0.011), (73, 0.046), (78, 0.026), (80, 0.035), (83, 0.066), (84, 0.029), (98, 0.098)]
simIndex simValue paperId paperTitle
same-paper 1 0.76503778 16 acl-2010-A Statistical Model for Lost Language Decipherment
Author: Benjamin Snyder ; Regina Barzilay ; Kevin Knight
Abstract: In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and highlevel morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew.
2 0.74277037 231 acl-2010-The Prevalence of Descriptive Referring Expressions in News and Narrative
Author: Raquel Hervas ; Mark Finlayson
Abstract: Generating referring expressions is a key step in Natural Language Generation. Researchers have focused almost exclusively on generating distinctive referring expressions, that is, referring expressions that uniquely identify their intended referent. While undoubtedly one of their most important functions, referring expressions can be more than distinctive. In particular, descriptive referring expressions, those that provide additional information not required for distinction, are critical to fluent, efficient, well-written text. We present a corpus analysis in which approximately one-fifth of 7,207 referring expressions in 24,422 words of news and narrative are descriptive. These data show that if we are ever to fully master natural language generation, especially for the genres of news and narrative, researchers will need to devote more attention to understanding how to generate descriptive, and not just distinctive, referring expressions.

Mark Alan Finlayson, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, markaf@mit.edu

1 A Distinctive Focus

Generating referring expressions is a key step in Natural Language Generation (NLG). From early treatments in seminal papers by Appelt (1985) and Reiter and Dale (1992) to the recent set of Referring Expression Generation (REG) Challenges (Gatt et al., 2009) through different corpora available for the community (Eugenio et al., 1998; van Deemter et al., 2006; Viethen and Dale, 2008), generating referring expressions has become one of the most studied areas of NLG. Researchers studying this area have, almost without exception, focused exclusively on how to generate distinctive referring expressions, that is, referring expressions that unambiguously identify their intended referent. Referring expressions, however, may be more than distinctive.
It is widely acknowledged that they can be used to achieve multiple goals, above and beyond distinction. Here we focus on descriptive referring expressions, that is, referring expressions that are not only distinctive, but provide additional information not required for identifying their intended referent. Consider the following text, in which some of the referring expressions have been underlined: Once upon a time there was a man, who had three daughters. They lived in a house and their dresses were made of fabric. While a bit strange, the text is perfectly well-formed. All the referring expressions are distinctive, in that we can properly identify the referents of each expression. But the real text, the opening lines to the folktale The Beauty and the Beast, is actually much more lyrical: Once upon a time there was a rich merchant, who had three daughters. They lived in a very fine house and their gowns were made of the richest fabric sewn with jewels. All the boldfaced portions, namely the choice of head nouns, the addition of adjectives, and the use of appositive phrases, serve to perform a descriptive function, and, importantly, are all unnecessary for distinction! In all of these cases, the author is using the referring expressions as a vehicle for communicating information about the referents. This descriptive information is sometimes new, sometimes necessary for understanding the text, and sometimes just for added flavor. But when the expression is descriptive, as opposed to distinctive, this additional information is not required for identifying the referent of the expression, and it is these sorts of referring expressions that we will be concerned with here.
Proceedings of the ACL 2010 Conference Short Papers, pages 49-54, Uppsala, Sweden, 11-16 July 2010. (c) 2010 Association for Computational Linguistics.

Although these sorts of referring expressions have been mostly ignored by researchers in this area, we show in this corpus study that descriptive expressions are in fact quite prevalent: nearly one-fifth of referring expressions in news and narrative are descriptive. In particular, our data, the trained judgments of native English speakers, show that 18% of all referring expressions in news and 17% of those in narrative folktales are descriptive. With this as motivation, we argue that descriptive referring expressions must be studied more carefully, especially as the field progresses from referring in a physical, immediate context (like that in the REG Challenges) to generating more literary forms of text.

2 Corpus Annotation

This is a corpus study; our procedure was therefore to define our annotation guidelines (Section 2.1), select texts to annotate (2.2), create an annotation tool for our annotators (2.3), and, finally, train annotators, have them annotate referring expressions' constituents and function, and then adjudicate the double-annotated texts into a gold standard (2.4).

2.1 Definitions

We wrote an annotation guide explaining the difference between distinctive and descriptive referring expressions. We used the guide when training annotators, and it was available to them while annotating. With limited space here we can only give an outline of what is contained in the guide; for full details see (Finlayson and Hervás, 2010a). Referring Expressions: We defined referring expressions as referential noun phrases and their coreferential expressions, e.g., "John kissed Mary. She blushed.". This included referring expressions to generics (e.g., "Lions are fierce"), dates, times, and numbers, as well as events if they were referred to using a noun phrase.
We included in each referring expression all the determiners, quantifiers, adjectives, appositives, and prepositional phrases that syntactically attached to that expression. When referring expressions were nested, all the nested referring expressions were also marked separately.

Nuclei vs. Modifiers: In the only previous corpus study of descriptive referring expressions, on museum labels, Cheng et al. (2001) noted that descriptive information is often integrated into referring expressions using modifiers to the head noun. (Footnote 1: With the exception of a small amount of work, discussed in Section 4.) To study this, and to allow our results to be more closely compared with Cheng's, we had our annotators split referring expressions into their constituents, portions called either nuclei or modifiers. The nuclei were the portions of the referring expression that performed the 'core' referring function; the modifiers were those portions that could be varied, syntactically speaking, independently of the nuclei. Annotators then assigned a distinctive or descriptive function to each constituent, rather than the referring expression as a whole. Normally, the nuclei corresponded to the head of the noun phrase. In (1), the nucleus is the token king, which we have here surrounded with square brackets. The modifiers, surrounded by parentheses, are The and old.

(1) (The) (old) [king] was wise.

Phrasal modifiers were marked as single modifiers, for example, in (2).

(2) (The) [roof] (of the house) collapsed.

It is significant that we had our annotators mark and tag the nuclei of referring expressions. Cheng and colleagues only mentioned the possibility that additional information could be introduced in the modifiers. However, O'Donnell et al. (1998) observed that often the choice of head noun can also influence the function of a referring expression. Consider (3), in which the word villain is used to refer to the King.

(3) The King assumed the throne today.
I don't trust (that) [villain] one bit.

The speaker could have merely used him to refer to the King; the choice of that particular head noun, villain, gives us additional information about the disposition of the speaker. Thus villain is descriptive.

Function: Distinctive vs. Descriptive: As already noted, instead of tagging the whole referring expression, annotators tagged each constituent (nuclei and modifiers) as distinctive or descriptive. The two main tests for determining descriptiveness were (a) if presence of the constituent was unnecessary for identifying the referent, or (b) if the constituent was expressed using unusual or ostentatious word choice. If either was true, the constituent was considered descriptive; otherwise, it was tagged as distinctive. In cases where the constituent was completely irrelevant to identifying the referent, it was tagged as descriptive. For example, in the folktale The Princess and the Pea, from which (1) was extracted, there is only one king in the entire story. Thus, in that story, the king is sufficient for identification, and therefore the modifier old is descriptive. This points out the importance of context in determining distinctiveness or descriptiveness; if there had been a roomful of kings, the tags on those modifiers would have been reversed. There is some question as to whether copular predicates, such as the plumber in (4), are actually referring expressions.

(4) John is the plumber.

Our annotators marked and tagged these constructions as normal referring expressions, but they added an additional flag to identify them as copular predicates. We then excluded these constructions from our final analysis. Note that copular predicates were treated differently from appositives: in appositives the predicate was included in the referring expression, and in most cases (again, depending on context) was marked descriptive (e.g., John, the plumber, slept.).
2.2 Text Selection

Our corpus comprised 62 texts, all originally written in English, from two different genres, news and folktales. We began with 30 folktales of different sizes, totaling 12,050 words. These texts were used in a previous work on the influence of dialogues on anaphora resolution algorithms (Aggarwal et al., 2009); they were assembled with an eye toward including different styles, different authors, and different time periods. Following this, we matched, approximately, the number of words in the folktales by selecting 32 texts from the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993). These texts were selected at random from the first 200 texts in the corpus.

2.3 The Story Workbench

We used the Story Workbench application (Finlayson, 2008) to actually perform the annotation. The Story Workbench is a semantic annotation program that, among other things, includes the ability to annotate referring expressions and coreferential relationships. We added the ability to annotate nuclei, modifiers, and their functions by writing a workbench "plugin" in Java that could be installed in the application. The Story Workbench is not yet available to the public at large, being in a limited-distribution beta testing phase. The developers plan to release it as free software within the next year. At that time, we also plan to release our plugin as free, downloadable software.

2.4 Annotation & Adjudication

The main task of the study was the annotation of the constituents of each referring expression, as well as the function (distinctive or descriptive) of each constituent. The system generated a first pass of constituent analysis, but did not mark functions. We hired two native English annotators, neither of whom had any linguistics background, who corrected these automatically-generated constituent analyses, and tagged each constituent as descriptive or distinctive. Every text was annotated by both annotators.
Adjudication of the differences was conducted by discussion between the two annotators; the second author moderated these discussions and settled irreconcilable disagreements. We followed a "train-as-you-go" paradigm, where there was no distinct training period, but rather adjudication proceeded in step with annotation, and annotators received feedback during those sessions. We calculated two measures of inter-annotator agreement: a kappa statistic and an f-measure, shown in Table 1. All of our f-measures indicated that annotators agreed almost perfectly on the location of referring expressions and their breakdown into constituents. These agreement calculations were performed on the annotators' original corrected texts. All the kappa statistics were calculated for two tags (nuclei vs. modifier for the constituents, and distinctive vs. descriptive for the functions) over both each token assigned to a nucleus or modifier and each referring expression pair. Our kappas indicate moderate to good agreement, especially for the folktales. These results are expected because of the inherent subjectivity of language. During the adjudication sessions it became clear that different people do not consider the same information as obvious or descriptive for the same concepts, and even the contexts deduced by each annotator from the texts were sometimes substantially different.

3 Results

Table 2 lists the primary results of the study. We considered a referring expression descriptive if any of its constituents were descriptive. Thus, 18% of the referring expressions in the corpus added additional information beyond what was required to unambiguously identify their referent. The results were similar in both genres.

                     Tales    Articles     Total
  Texts                 30          32        62
  Words             12,050      12,372    24,422
  Sentences            904         571     1,475
  Ref. Exp.          3,681       3,526     7,207
  Dist. Ref. Exp.    3,057       2,830     5,887
  Desc. Ref. Exp.      609         672     1,281
  % Dist. Ref.         83%         81%       82%
  % Desc. Ref.         17%         19%       18%

  Table 2: Primary results.

Table 3 contains the percentages of descriptive and distinctive tags broken down by constituent. Like Cheng's results, our analysis shows that descriptive referring expressions make up a significant fraction of all referring expressions. Although Cheng did not examine nuclei, our results show that the use of descriptive nuclei is small but not negligible.

                     Tales    Articles     Total
  Nuclei             3,666       3,502     7,168
  Max. Nuc/Ref           1           1         1
  Dist. Nuc.           95%         97%       96%
  Desc. Nuc.            5%          3%        4%
  Modifiers          2,277       3,627     5,904
  Avg. Mod/Ref         0.6         1.0       0.8
  Max. Mod/Ref           4           6         6
  Dist. Mod.           78%         81%       80%
  Desc. Mod.           22%         19%       20%

  Table 3: Breakdown of constituent tags.

4 Relation to the Field

Researchers working on generating referring expressions typically acknowledge that referring expressions can perform functions other than distinction. Despite this widespread acknowledgment, researchers have, for the most part, explicitly ignored these functions. Exceptions to this trend are three. First is the general study of aggregation in the process of referring expression generation. Second and third are corpus studies by Cheng et al. (2001) and Jordan (2000a) that bear on the prevalence of descriptive referring expressions. The NLG subtask of aggregation can be used to imbue referring expressions with a descriptive function (Reiter and Dale, 2000, §5.3). There is a specific kind of aggregation, called embedding, that moves information from one clause to another inside the structure of a separate noun phrase. This type of aggregation can be used to transform two sentences such as "The princess lived in a castle. She was pretty" into "The pretty princess lived in a castle". The adjective pretty, previously a copular predicate, becomes a descriptive modifier of the reference to the princess, making the second text more natural and fluent. This kind of aggregation is widely used by humans for making the discourse more compact and efficient.
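The headline percentages in Table 2 follow directly from the raw counts. A quick arithmetic check, taking the distinctive plus descriptive counts as the denominator:

```python
# Raw counts from Table 2: (distinctive, descriptive) referring expressions.
counts = {"Tales": (3057, 609), "Articles": (2830, 672)}

def pct_descriptive(dist, desc):
    """Percentage of referring expressions that are descriptive, rounded."""
    return round(100 * desc / (dist + desc))

per_genre = {name: pct_descriptive(d, s) for name, (d, s) in counts.items()}
total = pct_descriptive(sum(d for d, _ in counts.values()),
                        sum(s for _, s in counts.values()))
print(per_genre, total)  # recovers the 17% / 19% / 18% figures
```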
In order to create NLG systems with this ability, we must take into account the caveat, noted by Cheng (1998), that any non-distinctive information in a referring expression must not lead to confusion about the distinctive function of the referring expression. This is by no means a trivial problem: this sort of aggregation interferes with referring and coherence planning at both a local and a global level (Cheng and Mellish, 2000; Cheng et al., 2001). It is clear, from the current state of the art of NLG, that we have not yet obtained a deep enough understanding of aggregation to enable us to handle these interactions. More research on the topic is needed. Two previous corpus studies have looked at the use of descriptive referring expressions. The first showed explicitly that people craft descriptive referring expressions to accomplish different goals. Jordan and colleagues (Jordan, 2000a; Jordan, 2000b) examined the use of referring expressions using the COCONUT corpus (Eugenio et al., 1998). They tested how domain and discourse goals can influence the content of non-pronominal referring expressions in a dialogue context, checking whether or not a subject's goals led them to include non-referring information in a referring expression. Their results are intriguing because they point toward heretofore unexamined constraints, utilities, and expectations (possibly genre- or style-dependent) that may underlie the use of descriptive information to perform different functions, and are not yet captured by aggregation modules in particular or NLG systems in general. In the other corpus study, which partially inspired this work, Cheng and colleagues analyzed a set of museum descriptions, the GNOME corpus (Poesio, 2004), for the pragmatic functions of referring expressions. They had three functions in their study, in contrast to our two. Their first function (marked by their uniq tag) was equivalent to our distinctive function.
The other two were specializations of our descriptive tag, where they differentiated between additional information that helped to understand the text (int), and additional information not necessary for understanding (attr). Despite their annotators seeming to have trouble distinguishing between the latter two tags, they did achieve good overall inter-annotator agreement. They identified 1,863 modifiers to referring expressions in their corpus, of which 47.3% fulfilled a descriptive (attr or int) function. This is supportive of our main assertion, namely, that descriptive referring expressions are not only crucial for efficient and fluent text, but are actually a significant phenomenon. It is interesting, though, that Cheng's fraction of descriptive referring expressions was so much higher than ours (47.3% versus our 18%). We attribute this substantial difference to genre, in that Cheng studied museum labels, in which the writer is space-constrained, having to pack a lot of information into a small label. The issue bears further study, and perhaps will lead to insights into differences in writing style that may be attributed to author or genre.

5 Contributions

We make two contributions in this paper. First, we assembled, double-annotated, and adjudicated into a gold standard a corpus of 24,422 words. We marked all referring expressions, coreferential relations, and referring expression constituents, and tagged each constituent as having a descriptive or distinctive function. We wrote an annotation guide and created software that allows the annotation of this information in free text. The corpus and the guide are available on-line in a permanent digital archive (Finlayson and Hervás, 2010a; Finlayson and Hervás, 2010b). The software will also be released in the same archive when the Story Workbench annotation application is released to the public.
This corpus will be useful for the automatic generation and analysis of both descriptive and distinctive referring expressions. Any kind of system intended to generate text as humans do must take into account that identification is not the only function of referring expressions. Many analysis applications would benefit from the automatic recognition of descriptive referring expressions. Second, we demonstrated that descriptive referring expressions comprise a substantial fraction (18%) of the referring expressions in news and narrative. Along with museum descriptions, studied by Cheng, it seems that news and narrative are genres where authors naturally use a large number of descriptive referring expressions. Given that so little work has been done on descriptive referring expressions, this indicates that the field would be well served by focusing more attention on this phenomenon.

Acknowledgments

This work was supported in part by the Air Force Office of Scientific Research under grant number A9550-05-1-0321, as well as by the Office of Naval Research under award number N00014091059. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the Office of Naval Research. This research is also partially funded by the Spanish Ministry of Education and Science (TIN2009-14659-C03-01) and Universidad Complutense de Madrid (GR58/08). We also thank Whitman Richards, Ozlem Uzuner, Peter Szolovits, Patrick Winston, Pablo Gervás, and Mark Seifter for their helpful comments and discussion, and thank our annotators Saam Batmanghelidj and Geneva Trotter.

References

Alaukik Aggarwal, Pablo Gervás, and Raquel Hervás. 2009. Measuring the influence of errors induced by the presence of dialogues in reference clustering of narrative text. In Proceedings of ICON-2009: 7th International Conference on Natural Language Processing, India. Macmillan Publishers.

Douglas E. Appelt. 1985.
Planning English referring expressions. Artificial Intelligence, 26:1-33.

Hua Cheng and Chris Mellish. 2000. Capturing the interaction between aggregation and text planning in two generation systems. In INLG '00: First International Conference on Natural Language Generation, pages 186-193, Morristown, NJ, USA. Association for Computational Linguistics.

Hua Cheng, Massimo Poesio, Renate Henschel, and Chris Mellish. 2001. Corpus-based NP modifier generation. In NAACL '01: Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1-8, Morristown, NJ, USA. Association for Computational Linguistics.

Hua Cheng. 1998. Embedding new information into referring expressions. In ACL-36: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 1478-1480, Morristown, NJ, USA. Association for Computational Linguistics.

Barbara Di Eugenio, Johanna D. Moore, Pamela W. Jordan, and Richmond H. Thomason. 1998. An empirical investigation of proposals in collaborative dialogues. In Proceedings of the 17th International Conference on Computational Linguistics, pages 325-329, Morristown, NJ, USA. Association for Computational Linguistics.

Mark A. Finlayson and Raquel Hervás. 2010a. Annotation guide for the UCM/MIT indications, referring expressions, and coreference corpus (UMIREC corpus). Technical Report MIT-CSAIL-TR-2010-025, MIT Computer Science and Artificial Intelligence Laboratory. http://hdl.handle.net/1721.1/54765.

Mark A. Finlayson and Raquel Hervás. 2010b. UCM/MIT indications, referring expressions, and coreference corpus (UMIREC corpus). Work product, MIT Computer Science and Artificial Intelligence Laboratory. http://hdl.handle.net/1721.1/54766.

Mark A. Finlayson. 2008. Collecting semantics in the wild: The Story Workbench.
In Proceedings of the AAAI Fall Symposium on Naturally-Inspired Artificial Intelligence, pages 46-53, Menlo Park, CA, USA. AAAI Press.

Albert Gatt, Anja Belz, and Eric Kow. 2009. The TUNA-REG challenge 2009: overview and evaluation results. In ENLG '09: Proceedings of the 12th European Workshop on Natural Language Generation, pages 174-182, Morristown, NJ, USA. Association for Computational Linguistics.

Pamela W. Jordan. 2000a. Can nominal expressions achieve multiple goals?: an empirical study. In ACL '00: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 142-149, Morristown, NJ, USA. Association for Computational Linguistics.

Pamela W. Jordan. 2000b. Influences on attribute selection in redescriptions: A corpus study. In Proceedings of CogSci2000, pages 250-255.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Michael O'Donnell, Hua Cheng, and Janet Hitzeman. 1998. Integrating referring and informing in NP planning. In Proceedings of COLING-ACL'98 Workshop on the Computational Treatment of Nominals, pages 46-56.

Massimo Poesio. 2004. Discourse annotation and semantic annotation in the GNOME corpus. In DiscAnnotation '04: Proceedings of the 2004 ACL Workshop on Discourse Annotation, pages 72-79, Morristown, NJ, USA. Association for Computational Linguistics.

Ehud Reiter and Robert Dale. 1992. A fast algorithm for the generation of referring expressions. In Proceedings of the 14th Conference on Computational Linguistics, Nantes, France.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.

Kees van Deemter, Ielka van der Sluis, and Albert Gatt. 2006. Building a semantically transparent corpus for the generation of referring expressions.
In Proceedings of the 4th International Conference on Natural Language Generation (Special Session on Data Sharing and Evaluation), INLG-06.

Jette Viethen and Robert Dale. 2008. The use of spatial relations in referring expressions. In Proceedings of the 5th International Conference on Natural Language Generation.
3 0.73529506 168 acl-2010-Learning to Follow Navigational Directions
Author: Adam Vogel ; Dan Jurafsky
Abstract: We present a system that learns to follow navigational natural language directions. Where traditional models learn from linguistic annotation or word distributions, our approach is grounded in the world, learning by apprenticeship from routes through a map paired with English descriptions. Lacking an explicit alignment between the text and the reference path makes it difficult to determine what portions of the language describe which aspects of the route. We learn this correspondence with a reinforcement learning algorithm, using the deviation of the route we follow from the intended path as a reward signal. We demonstrate that our system successfully grounds the meaning of spatial terms like above and south into geometric properties of paths.
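The reward signal the abstract describes, the deviation of the followed route from the intended path, can be illustrated with a toy function. Representing routes as equal-length lists of (x, y) waypoints is an assumption for illustration, not the paper's representation:

```python
import math

def route_reward(followed, intended):
    """Negative mean pointwise deviation between the route actually followed
    and the intended path; higher (closer to zero) is better."""
    assert len(followed) == len(intended), "toy version assumes aligned routes"
    deviation = sum(math.dist(p, q) for p, q in zip(followed, intended))
    return -deviation / len(followed)

path = [(0, 0), (1, 0), (2, 0)]
print(route_reward(path, path))                        # perfect following
print(route_reward([(0, 1), (1, 1), (2, 1)], path))    # consistently off course
```

A reinforcement learner would use such a reward to credit the interpretations of the language that kept it on the intended path.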
Author: Mark Johnson
Abstract: This paper establishes a connection between two apparently very different kinds of probabilistic models. Latent Dirichlet Allocation (LDA) models are used as "topic models" to produce a low-dimensional representation of documents, while Probabilistic Context-Free Grammars (PCFGs) define distributions over trees. The paper begins by showing that LDA topic models can be viewed as a special kind of PCFG, so Bayesian inference for PCFGs can be used to infer topic models as well. Adaptor Grammars (AGs) are a hierarchical, non-parametric Bayesian extension of PCFGs. Exploiting the close relationship between LDA and PCFGs just described, we propose two novel probabilistic models that combine insights from LDA and AG models. The first replaces the unigram component of LDA topic models with multi-word sequences or collocations generated by an AG. The second extension builds on the first one to learn aspects of the internal structure of proper names.
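One way to make the LDA-as-PCFG view concrete is to write out the rule skeleton: a document nonterminal that picks a topic, and topic nonterminals that rewrite to words. This is a drastically simplified sketch of the correspondence the abstract alludes to, not the paper's full construction, which must also track which document is being generated; the rule probabilities (left out here) would come from the topic distribution theta and the topic-word distributions phi:

```python
def lda_topic_pcfg(n_topics, vocab):
    """Build the unweighted rule skeleton of a PCFG encoding an LDA-style
    topic model: Doc -> Topic_t (one rule per topic, probability theta_t)
    and Topic_t -> w (one rule per topic-word pair, probability phi_{t,w})."""
    rules = []
    for t in range(n_topics):
        rules.append(("Doc", (f"Topic{t}",)))
        for w in vocab:
            rules.append((f"Topic{t}", (w,)))
    return rules

rules = lda_topic_pcfg(2, ["cat", "dog", "fish"])
print(len(rules))  # 2 topic-choice rules + 2 * 3 emission rules
```

Because the grammar is a PCFG, any Bayesian PCFG inference procedure can in principle estimate the rule probabilities, which is the point of the reduction.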
5 0.49238604 116 acl-2010-Finding Cognate Groups Using Phylogenies
Author: David Hall ; Dan Klein
Abstract: A central problem in historical linguistics is the identification of historically related cognate words. We present a generative phylogenetic model for automatically inducing cognate group structure from unaligned word lists. Our model represents the process of transformation and transmission from ancestor word to daughter word, as well as the alignment between the word lists of the observed languages. We also present a novel method for simplifying complex weighted automata created during inference to counteract the otherwise exponential growth of message sizes. On the task of identifying cognates in a dataset of Romance words, our model significantly outperforms a baseline approach, increasing accuracy by as much as 80%. Finally, we demonstrate that our automatically induced groups can be used to successfully reconstruct ancestral words.
6 0.48476484 25 acl-2010-Adapting Self-Training for Semantic Role Labeling
7 0.48394281 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
8 0.48085129 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar
9 0.48070383 244 acl-2010-TrustRank: Inducing Trust in Automatic Translations via Ranking
10 0.4804492 167 acl-2010-Learning to Adapt to Unknown Users: Referring Expression Generation in Spoken Dialogue Systems
11 0.47978884 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans
12 0.47964892 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers
13 0.47902656 71 acl-2010-Convolution Kernel over Packed Parse Forest
14 0.47882557 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation
15 0.47819203 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
16 0.47779837 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data
17 0.47755411 162 acl-2010-Learning Common Grammar from Multilingual Corpus
18 0.47747257 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition
19 0.47700474 202 acl-2010-Reading between the Lines: Learning to Map High-Level Instructions to Commands
20 0.47692275 121 acl-2010-Generating Entailment Rules from FrameNet