emnlp emnlp2012 emnlp2012-132 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: We consider the problem of inducing grapheme-to-phoneme mappings for unknown languages written in a Latin alphabet. First, we collect a data-set of 107 languages with known grapheme-phoneme relationships, along with a short text in each language. We then cast our task in the framework of supervised learning, where each known language serves as a training example, and predictions are made on unknown languages. We induce an undirected graphical model that learns phonotactic regularities, thus relating textual patterns to plausible phonemic interpretations across the entire range of languages. Our model correctly predicts grapheme-phoneme pairs with over 88% F1-measure.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We consider the problem of inducing grapheme-to-phoneme mappings for unknown languages written in a Latin alphabet. [sent-3, score-0.272]
2 First, we collect a data-set of 107 languages with known grapheme-phoneme relationships, along with a short text in each language. [sent-4, score-0.144]
3 We induce an undirected graphical model that learns phonotactic regularities, thus relating textual patterns to plausible phonemic interpretations across the entire range of languages. [sent-6, score-0.514]
4 In other ways, written language harkens further to the past, reflecting aspects of languages long since gone from their spoken forms. [sent-11, score-0.231]
5 In this paper, we argue that this imperfect relationship between written symbol and spoken sound can be automatically inferred from textual patterns. [sent-12, score-0.157]
6 By examining data for over 100 languages, we train a statistical model to automatically relate graphemic patterns in text to phonemic sequences for never-before-seen languages. [sent-13, score-0.285]
7 In an idealized alphabetic system, each phoneme in the language is unambiguously represented by a single grapheme. [sent-15, score-0.372]
8 The joint analysis of several languages can increase model accuracy, and enable the development of computational tools for languages with minimal linguistic resources. [sent-20, score-0.288]
9 Previous work has focused on settings where just a handful of languages are available. [sent-21, score-0.144]
10 On a more practical note, accurately relating graphemes and phonemes to one another is crucial for tasks such as automatic speech recognition and text-to-speech generation. [sent-23, score-0.514]
11 Characters in the Roman alphabet can take a wide range of phonemic values across the world’s languages. [sent-31, score-0.181]
12 Our task is thus to select a subset of phonemes for each language’s graphemes. [sent-34, score-0.19]
13 We develop a probabilistic undirected graphical model for this prediction problem, where a large set of languages serve as training data and a single heldout language serves as test data. [sent-37, score-0.26]
14 As we are dealing with texts (written in roughly phonemic writing systems), the true contextual phonetic realizations are unavailable, and even using IPA symbols to relate symbols across languages is somewhat theoretically suspect. [sent-40, score-0.558]
15 The node is labeled with a binary value to indicate whether grapheme g can represent phoneme p in the language. [sent-43, score-0.676]
16 The node and edge features are derived from textual co-occurrence statistics for the graphemes of each language, as well as general information about the language’s family and region. [sent-45, score-0.488]
17 Parameters are jointly optimized over the training languages to maximize the likelihood of the node labelings given the observed feature values. [sent-46, score-0.196]
18 We apply our model to a novel data-set consisting of grapheme-phoneme mappings for 107 languages with Roman alphabets and short texts. [sent-48, score-0.287]
19 Without any edges, our model yields perfect mappings for only 10% of test languages. [sent-56, score-0.121]
20 By employing structure learning and including the induced edges, we more than double the number of test languages with perfect predictions. [sent-57, score-0.221]
21 (iii) Finally, an analysis of our grapheme-phoneme predictions shows that they do not exhibit certain global characteristics observed across true phoneme inventories. [sent-58, score-0.407]
22 Sparse edges are automatically induced to allow joint training and prediction over related inventory decisions. [sent-62, score-0.146]
23 2 Background and Related Work. In this section, we provide some background on phonetics and phoneme inventories. [sent-63, score-0.331]
24 2.1 Phoneme Inventories. The sounds of the world’s languages are produced through a wide variety of articulatory mechanisms. [sent-66, score-0.356]
25 Consonants are sounds produced through a partial or complete stricture of the vocal tract, and can be roughly categorized along three independent dimensions: (i) Voicing: whether or not oscillation of the vocal folds accompanies the sound. [sent-67, score-0.251]
26 For example, /p/ is a bilabial (the lips touching one another) while /k/ is a velar (the tongue touching the soft palate). [sent-70, score-0.278]
27 In contrast, vowels are voiced sounds produced with an open vocal tract. [sent-73, score-0.323]
28 They are categorized primarily based on the position of the tongue and lips, along three dimensions: (i) Roundedness: whether or not the lips are rounded during production of the sound; (ii) Height: the vertical position of the tongue; (iii) Backness: the horizontal (front-to-back) position of the tongue. [sent-74, score-0.187]
29 Linguists have noted several statistical regularities found in phoneme inventories throughout the world. [sent-75, score-0.514]
30 Feature economy refers to the idea that languages tend to minimize the number of differentiating characteristics (e.g., different kinds of voicing, manner, and place) that are used to distinguish consonant phonemes from one another (Clements, 2003). [sent-76, score-0.246] [sent-78, score-0.269]
32 In other words, once an articulatory feature is used to mark off one phoneme from another, it will likely be used again to differentiate other phoneme pairs in the same language. [sent-79, score-0.804]
33 The principle of maximal perceptual contrast refers to the idea that the set of vowels employed by a language will be located in phonetic space so as to maximize their perceptual distances from one another, thus relieving the perceptual burden of the listener (Liljencrants and Lindblom, 1972). [sent-80, score-0.391]
34 Finally, researchers have noted that languages exhibit set patterns in how they sequence their phonemes (Kenstowicz and Kisseberth, 1979). [sent-82, score-0.376]
35 These phonotactic regularities and constraints are mirrored in graphemic patterns, and as our experiments show, can be explicitly modeled to achieve high accuracy in our task. [sent-85, score-0.296]
36 The focus has been on developing techniques for dealing with the phonemic ambiguity present both in annotated and unseen words. [sent-89, score-0.142]
37 Often this involves some level of phonetic analysis in one or both languages. [sent-93, score-0.141]
38 When a grapheme maps to more than one phoneme, we do not attempt to disambiguate particular instances of that grapheme in words. [sent-97, score-0.69]
39 As our data-set consists entirely of Latin-based writing systems, our work can be viewed as a more fine-grained computational exploration of the space of writing systems, with a focus on phonographic systems with the Latin pedigree. [sent-99, score-0.182]
40 2.3 Multilingual Analysis. An influential thread of previous multilingual work starts with the observation that rich linguistic resources exist for some languages but not others. [sent-101, score-0.208]
41 , 2005; Padó and Lapata, 2006). In these cases, the existence of a bilingual parallel text, along with highly accurate predictions for one of the languages, was assumed. [sent-105, score-0.22]
42 An even more recent line of work does away with the assumption of parallel texts and performs joint unsupervised induction for various languages through the use of coupled priors in the context of grammar induction (Cohen and Smith, 2009; Berg-Kirkpatrick and Klein, 2010). [sent-115, score-0.144]
43 using a structured nearest neighbor approach for 8 languages (Kim et al. [sent-118, score-0.23]
44 To start, we downloaded and transcribed image files containing grapheme-phoneme mappings for several hundred languages from an online encyclopedia. [sent-125, score-0.226]
45 We then cross-referenced the languages with the World Atlas of Language Structures (WALS) database (Haspelmath and Bibiko, 2005), as well as the translations available for the Universal Declaration of Human Rights (UDHR). [sent-129, score-0.144]
46 Our final set of 107 languages includes those which appeared consistently in all three sources and that employ a Latin alphabet. [sent-130, score-0.144]
47 As seen from the figure, these languages cover a wide array of language families and regions. [sent-132, score-0.144]
48 We then analyzed the phoneme inventories for the 107 languages. [sent-133, score-0.441]
49 We decided to focus our attention on graphemes which are widely used across these languages with a diverse set of phonemic values. [sent-134, score-0.61]
50 We measured the ambiguity of each grapheme by calculating the entropy of its phoneme sets across the languages, and found that 17 graphemes had entropy > 0. [sent-135, score-1.104]
51 Table 1 lists these graphemes, the set of phonemes that they can represent, the number of languages in our data-set which employ them, and the entropy of their phoneme-sets across these languages. [sent-137, score-0.386]
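A minimal sketch of this ambiguity measure (the input layout, function name, and toy values below are hypothetical, not from the paper): each language contributes one phoneme set for the grapheme, and we take the Shannon entropy of the resulting distribution over distinct sets.

```python
import math
from collections import Counter

def grapheme_entropy(phoneme_sets):
    """Entropy of a grapheme's phoneme sets across languages.

    phoneme_sets: hypothetical dict mapping language -> frozenset of phonemes
    the grapheme can represent there; each distinct set is one outcome.
    """
    counts = Counter(phoneme_sets.values())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy example for grapheme 'c': three distinct sets over four languages -> 1.5 bits.
print(grapheme_entropy({
    "lang1": frozenset({"k"}), "lang2": frozenset({"ts"}),
    "lang3": frozenset({"k", "s"}), "lang4": frozenset({"k"}),
}))
```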
52 3.2 Features. The key intuition underlying this work is that graphemic patterns in text can reveal the phonemes which they represent. [sent-140, score-0.293]
53 Text Context Features: These features represent the textual environment of each grapheme in a language. [sent-143, score-0.435]
54 For each grapheme g, we consider counts of graphemes to the immediate left and right of g in the UDHR text. [sent-144, score-0.669]
55 We define five feature templates, including counts of (1) single graphemes to the left of g, (2) single graphemes to the right of g, (3) pairs of graphemes to the left of g, (4) pairs of graphemes to the right of g, and (5) pairs of graphemes surrounding g. [sent-145, score-1.62]
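A minimal sketch of these five count templates, assuming the UDHR text is available as a plain string (the function name and the "<s>"/"</s>" boundary sentinels are our own choices):

```python
from collections import Counter

def context_count_features(text, g):
    """Counts for the five textual context templates of grapheme g (a sketch)."""
    feats = Counter()
    for i, ch in enumerate(text):
        if ch != g:
            continue
        l1 = text[i - 1] if i >= 1 else "<s>"
        l2 = text[i - 2] if i >= 2 else "<s>"
        r1 = text[i + 1] if i + 1 < len(text) else "</s>"
        r2 = text[i + 2] if i + 2 < len(text) else "</s>"
        feats[("L", l1)] += 1        # (1) single grapheme to the left
        feats[("R", r1)] += 1        # (2) single grapheme to the right
        feats[("LL", l2, l1)] += 1   # (3) pair of graphemes to the left
        feats[("RR", r1, r2)] += 1   # (4) pair of graphemes to the right
        feats[("LR", l1, r1)] += 1   # (5) pair of graphemes surrounding g
    return feats
```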
56 [Figure 2: Map and language families of languages in our data-set.] These features are too language-specific and not abstract enough to yield effective cross-lingual generalization. [sent-150, score-0.182]
57 In other words, we would ideally map all the graphemes in our text to phonemes, and then consider the plausibility of the resulting phoneme sequences. [sent-153, score-0.655]
58 As an imperfect proxy for this idea, we made the following observation: for most Latin graphemes, the most common phonemic value across languages is the identical IPA symbol of that grapheme (e.g., the most common phoneme for g is /g/, the most common phoneme for t is /t/, etc.). [sent-155, score-0.631] [sent-157, score-0.662]
60 Using this observation, we again consider all contexts in which a grapheme appears, but this time map the surrounding graphemes to their IPA phoneme equivalents. [sent-158, score-1.0]
61 We then consider various linguistic properties of these surrounding “phonemes” (whether they are vowels or consonants, whether they are voiced or not, and their manner and place of articulation) and create phonetic context features. [sent-159, score-0.416]
62 The intuition here is that these features can (noisily) capture the phonotactic context of a grapheme, allowing our model to learn general phonotactic constraints. [sent-161, score-0.362]
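A sketch of the noisy phonetic abstraction this describes; the tiny `IPA_PROPS` table below is a hypothetical stand-in for a full symbol-to-property mapping:

```python
from collections import Counter

# Hypothetical mini property table: each neighbor grapheme is read as the
# identical IPA symbol, then abstracted to coarse phonetic properties
# (a real table would cover the whole alphabet).
IPA_PROPS = {
    "a": ("vowel",), "e": ("vowel",), "i": ("vowel",),
    "p": ("consonant", "unvoiced", "plosive"),
    "b": ("consonant", "voiced", "plosive"),
    "s": ("consonant", "unvoiced", "fricative"),
    "n": ("consonant", "voiced", "nasal"),
}

def phonetic_context_features(text, g):
    """Noisy phonetic context features for grapheme g (a sketch)."""
    feats = Counter()
    for i, ch in enumerate(text):
        if ch != g:
            continue
        for side, j in (("L", i - 1), ("R", i + 1)):
            if 0 <= j < len(text):
                for prop in IPA_PROPS.get(text[j], ()):
                    feats[(side, prop)] += 1
    return feats
```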
63 These features allow our model to capture family and region specific phonetic biases. [sent-170, score-0.307]
64 For example, African languages are more likely to use c and q to represent clicks than are European languages. [sent-171, score-0.144]
65 Thus, a language family feature can combine with a phonetic context feature to represent a family specific phonotactic constraint. [sent-173, score-0.451]
66 4 Model Using the features described above, we develop an undirected graphical model approach to our predic338 Table# 2o:PTFtaehNoxmlutnieFlymbaFeitrcuaoFetfusaretusre s25i87nR6,a49e6w748ac haFt791iel,gt3687oe249re8ydbfore and after discretization/filtering. [sent-194, score-0.154]
67 We learn weights over our features which optimally relate the input features of the training languages to their observed labels. [sent-198, score-0.26]
68 For each node i, we also obtain a feature vector by examining the language’s text and extracting textual and noisy phonetic patterns (as detailed in the previous section). [sent-206, score-0.235]
69 Likewise for the graph edges (j, k): we extract a feature vector and, depending on the labels of the two vertices yj and yk, take a dot product with the relevant parameters. [sent-211, score-0.151]
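Reading these two sentences as a standard pairwise log-linear (CRF-style) model over binary node labels suggests the following form; this exact parameterization is our assumption, a sketch rather than the paper's stated equation:

```latex
% y_i \in \{0,1\}: does grapheme-phoneme pair i hold in the language?
% f_i, g_{jk}: node and edge feature vectors; E: the induced edge set.
p(\mathbf{y} \mid \mathbf{x}) \propto
  \exp\!\Big( \sum_{i} \theta_{y_i}^{\top} f_i(\mathbf{x})
          + \sum_{(j,k) \in E} \phi_{y_j, y_k}^{\top} g_{jk}(\mathbf{x}) \Big)
```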
70 This procedure starts with a fully connected undirected graph structure, and iteratively removes edges between nodes that are conditionally independent given other neighboring nodes in the graph according to a statistical independence test over all training languages. [sent-215, score-0.309]
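A sketch of that pruning loop in the spirit of PC-style structure learning (the `cond_indep` argument is a hypothetical stand-in for the statistical independence test pooled over all training languages):

```python
from itertools import combinations

def prune_structure(nodes, cond_indep):
    """Remove edges between conditionally independent nodes (a sketch).

    cond_indep(a, b, given): hypothetical test returning True when the labels
    of a and b are independent given the node set `given`, over all languages.
    """
    edges = {frozenset(pair) for pair in combinations(nodes, 2)}

    def neighbors(x):
        return {n for e in edges if x in e for n in e} - {x}

    changed = True
    while changed:
        changed = False
        for edge in sorted(edges, key=sorted):  # snapshot, deterministic order
            a, b = sorted(edge)
            if cond_indep(a, b, (neighbors(a) | neighbors(b)) - {a, b}):
                edges.discard(edge)
                changed = True
    return edges
```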
71 Besides our primary undirected graphical model, we also consider several baselines and variants, in order to assess the contribution of our model’s graph structure as well as the features used. [sent-230, score-0.203]
72 In all cases, we perform leave-one-out cross-validation over the 107 languages in our data-set. [sent-231, score-0.144]
73 For this metric, we consider each grapheme g and examine its predicted labels over all its possible phonemes: (g : p1), (g : p2), ... [sent-249, score-0.345]
74 We report the percentage of all graphemes with such correct predictions (micro-averaged over all graphemes in all test language scenarios). [sent-254, score-0.724]
75 For this metric, we report the percentage of test languages for which our model achieves perfect predictions on all grapheme-phoneme pairs, yielding a perfect mapping. [sent-257, score-0.298]
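A sketch of the grapheme- and language-level aggregation as described (the gold/pred layout is our hypothetical choice):

```python
def aggregate_metrics(gold, pred):
    """Grapheme- and language-level accuracy (a sketch).

    gold/pred: dicts mapping (language, grapheme) -> {phoneme: 0/1} labels.
    A grapheme is correct only if all its phoneme labels match; a language
    only if every one of its graphemes is correct.
    """
    grapheme_acc = sum(pred[k] == gold[k] for k in gold) / len(gold)
    langs = {lang for lang, _ in gold}
    lang_acc = sum(
        all(pred[k] == gold[k] for k in gold if k[0] == lang) for lang in langs
    ) / len(langs)
    return grapheme_acc, lang_acc
```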
76 The majority baseline yields 67% F1-measure on the phoneme-level binary prediction task, with 56% grapheme accuracy, and about 3% language accuracy. [sent-260, score-0.345]
77 Using undiscretized raw count features, the SVM improves phoneme-level performance to about 80% F1, but fails to provide any improvement on grapheme or language performance. [sent-261, score-0.345]
78 In contrast, the SVM using discretized and filtered features achieves performance gains in all three categories, achieving 71% grapheme accuracy and 8% language accuracy. [sent-262, score-0.446]
79 Phoneme-level F1 reaches 87%, grapheme accuracy hits 79%, and language accuracy more than doubles, achieving 22%. [sent-266, score-0.345]
80 (1) language family and region features, (2) textual context features, and (3) phonetic context features. [sent-269, score-0.321]
81 First, it appears that dropping the region and language family features actually improves performance. [sent-272, score-0.205]
82 We next observe that dropping the textual context features leads to a small drop in performance. [sent-275, score-0.129]
83 Finally, we see that dropping the phonetic context features seriously degrades our model’s accuracy. [sent-276, score-0.218]
84 In this section we analyze the predicted phoneme inventories and ask whether they display the statistical properties observed in the gold-standard mappings. [sent-279, score-0.441]
85 As outlined in Section 2, consonant phonemes can be represented by the three articulatory features of voicing, manner, and place. [sent-280, score-0.449]
86 The principle of feature economy states that phoneme inventories will be organized to minimize the number of distinct articulatory features used in the language, while maximizing the number of resulting phonemes. [sent-281, score-0.723]
87 First, we can measure the economy index of a consonant system by computing the ratio of the number of consonantal phonemes to the number of articulatory features used in their production, i.e., economy = #phonemes / #features (Clements, 2003). [sent-283, score-0.551]
88 The higher this value, the more economical the sound system. [sent-284, score-0.14]
89 Secondly, for each articulatory dimension we can calculate the empirical distribution over values observed across the consonants of the language. [sent-285, score-0.281]
90 Since consonants are produced as combinations of the three articulatory dimensions, the greatest number of consonants (for a given set of utilized feature values) will be produced when the distributions are close to uniform. [sent-286, score-0.42]
91 Thus, we can measure how economical each feature dimension is by computing the entropy of its distribution over consonants. [sent-287, score-0.133]
92 For example, in an economical system, we would expect roughly half the consonants to be voiced, and half to be unvoiced. [sent-288, score-0.22]
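A sketch of both diagnostics over a hypothetical triple encoding of consonants as (voicing, manner, place); in the toy /p b t d s z/ inventory, six phonemes reuse six feature values (economy index 1.0) and voicing splits half/half (entropy 1.0 bit, the economical optimum):

```python
import math
from collections import Counter

def economy_index(consonants):
    """Phonemes per distinct articulatory feature value, per the definition above."""
    used = {value for triple in consonants for value in triple}
    return len(consonants) / len(used)

def dimension_entropy(consonants, dim):
    """Entropy of one articulatory dimension (0=voicing, 1=manner, 2=place)."""
    counts = Counter(c[dim] for c in consonants)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

inv = [("unvoiced", "plosive", "bilabial"), ("voiced", "plosive", "bilabial"),
       ("unvoiced", "plosive", "alveolar"), ("voiced", "plosive", "alveolar"),
       ("unvoiced", "fricative", "alveolar"), ("voiced", "fricative", "alveolar")]
print(economy_index(inv), dimension_entropy(inv, 0))  # 1.0 1.0
```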
93 First, we notice that the average entropy of voiced vs. [sent-290, score-0.174]
94 unvoiced consonants is nearly identical in both cases, close to the optimal value. [sent-291, score-0.261]
95 However, when we examine the dimensions of place and manner, we notice that the entropy induced by our model is not as high as that of the true consonant inventories, implying a suboptimal allocation of consonants. [sent-292, score-0.169]
96 In fact, when we examine the economy index (the ratio of consonants to features), we indeed find that on average our model’s predictions are not as economical as the gold standard. [Table: predicted and true consonant inventories, averaged over all 107 languages.] [sent-293, score-0.43] [sent-294, score-0.157]
98 7 Conclusions. In this paper, we considered a novel problem: that of automatically relating written symbols to spoken sounds for an unknown language using a known writing system, the Latin alphabet. [sent-296, score-0.207]
99 Our model automatically learns how to relate textual patterns of the unknown language to plausible phonemic interpretations using induced phonotactic regularities. [sent-299, score-0.476]
wordName wordTfidf (topN-words)
[('grapheme', 0.345), ('phoneme', 0.331), ('graphemes', 0.324), ('phonemes', 0.19), ('snyder', 0.164), ('phonotactic', 0.162), ('languages', 0.144), ('articulatory', 0.142), ('phonemic', 0.142), ('phonetic', 0.141), ('consonants', 0.139), ('unvoiced', 0.122), ('voiced', 0.122), ('inventories', 0.11), ('latin', 0.11), ('economy', 0.102), ('writing', 0.091), ('yarowsky', 0.083), ('mappings', 0.082), ('discretization', 0.081), ('economical', 0.081), ('fricative', 0.081), ('gjk', 0.081), ('consonant', 0.079), ('predictions', 0.076), ('undirected', 0.076), ('family', 0.074), ('regularities', 0.073), ('wals', 0.07), ('vocal', 0.07), ('ipa', 0.07), ('sounds', 0.07), ('multilingual', 0.064), ('kondrak', 0.063), ('tongue', 0.063), ('discretized', 0.063), ('perceptual', 0.063), ('sproat', 0.063), ('naseem', 0.061), ('affricated', 0.061), ('alphabets', 0.061), ('dwyer', 0.061), ('graphemephoneme', 0.061), ('graphemic', 0.061), ('lips', 0.061), ('plosive', 0.061), ('udhr', 0.061), ('velar', 0.061), ('voicing', 0.061), ('vowels', 0.061), ('edges', 0.059), ('sound', 0.059), ('universal', 0.058), ('region', 0.054), ('labelings', 0.052), ('jiampojamarn', 0.052), ('reddy', 0.052), ('articulation', 0.052), ('touching', 0.052), ('textual', 0.052), ('entropy', 0.052), ('benjamin', 0.052), ('graph', 0.049), ('inventory', 0.049), ('atlas', 0.047), ('haspelmath', 0.047), ('written', 0.046), ('yk', 0.045), ('neighbor', 0.044), ('morphological', 0.044), ('tahira', 0.044), ('yj', 0.043), ('patterns', 0.042), ('nearest', 0.042), ('alphabetic', 0.041), ('alveolar', 0.041), ('bibiko', 0.041), ('daniels', 0.041), ('dougherty', 0.041), ('fayyad', 0.041), ('goldsmith', 0.041), ('gone', 0.041), ('kenstowicz', 0.041), ('liljencrants', 0.041), ('noisily', 0.041), ('phonemically', 0.041), ('postalveolar', 0.041), ('spirtes', 0.041), ('stricture', 0.041), ('relate', 0.04), ('graphical', 0.04), ('manner', 0.04), ('dropping', 0.039), ('alphabet', 0.039), ('perfect', 0.039), ('features', 0.038), ('nodes', 0.038), ('induced', 0.038), ('transliteration', 0.037), ('world', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: We consider the problem of inducing grapheme-to-phoneme mappings for unknown languages written in a Latin alphabet. First, we collect a data-set of 107 languages with known grapheme-phoneme relationships, along with a short text in each language. We then cast our task in the framework of supervised learning, where each known language serves as a training example, and predictions are made on unknown languages. We induce an undirected graphical model that learns phonotactic regularities, thus relating textual patterns to plausible phonemic interpretations across the entire range of languages. Our model correctly predicts grapheme-phoneme pairs with over 88% F1-measure.
2 0.25987402 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
Author: Jagadeesh Jagarlamudi ; Hal Daume III
Abstract: In this paper, we address the problem of building a multilingual transliteration system using an interlingual representation. Our approach uses international phonetic alphabet (IPA) to learn the interlingual representation and thus allows us to use any word and its IPA representation as a training example. Thus, our approach requires only monolingual resources: a phoneme dictionary that lists words and their IPA representations.1 By adding a phoneme dictionary of a new language, we can readily build a transliteration system into any of the existing previous languages, without the expense of all-pairs data or computation. We also propose a regularization framework for learning the interlingual representation, which accounts for language specific phonemic variability, and thus it can find better mappings between languages. Experimental results on the name transliteration task in five diverse languages show a maximum improvement of 29% accuracy and an average improvement of 17% accuracy compared to a state-of-the-art baseline system.
3 0.13096257 81 emnlp-2012-Learning to Map into a Universal POS Tagset
Author: Yuan Zhang ; Roi Reichart ; Regina Barzilay ; Amir Globerson
Abstract: We present an automatic method for mapping language-specific part-of-speech tags to a set of universal tags. This unified representation plays a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Until now, however, such conversion schemes have been created manually. Our central hypothesis is that a valid mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function that captures a range of distributional and typological characteristics of the derived mapping. Given the exponential size of the mapping space, we propose a novel method for optimizing over soft mappings, and use entropy regularization to drive those towards hard mappings. Our results demonstrate that automatically induced mappings rival the quality of their manually designed counterparts when evaluated in the . context of multilingual parsing.1
4 0.070996419 61 emnlp-2012-Grounded Models of Semantic Representation
Author: Carina Silberer ; Mirella Lapata
Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.
5 0.065960072 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
Author: Greg Durrett ; Adam Pauls ; Dan Klein
Abstract: We consider the problem of using a bilingual dictionary to transfer lexico-syntactic information from a resource-rich source language to a resource-poor target language. In contrast to past work that used bitexts to transfer analyses of specific sentences at the token level, we instead use features to transfer the behavior of words at a type level. In a discriminative dependency parsing framework, our approach produces gains across a range of target languages, using two different lowresource training methodologies (one weakly supervised and one indirectly supervised) and two different dictionary sources (one manually constructed and one automatically constructed).
6 0.064059444 69 emnlp-2012-Joining Forces Pays Off: Multilingual Joint Word Sense Disambiguation
7 0.062590703 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation
8 0.057182323 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
9 0.055225335 12 emnlp-2012-A Transition-Based System for Joint Part-of-Speech Tagging and Labeled Non-Projective Dependency Parsing
10 0.05396663 65 emnlp-2012-Improving NLP through Marginalization of Hidden Syntactic Structure
11 0.052559987 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
12 0.047924716 106 emnlp-2012-Part-of-Speech Tagging for Chinese-English Mixed Texts with Dynamic Features
13 0.047743849 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
14 0.047382601 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
15 0.045032695 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
16 0.044081565 126 emnlp-2012-Training Factored PCFGs with Expectation Propagation
17 0.043835208 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming
18 0.042905163 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
19 0.042610891 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules
20 0.042150173 131 emnlp-2012-Unified Dependency Parsing of Chinese Morphological and Syntactic Structures
topicId topicWeight
[(0, 0.189), (1, -0.026), (2, 0.035), (3, 0.012), (4, 0.002), (5, 0.086), (6, 0.08), (7, 0.058), (8, 0.106), (9, -0.107), (10, 0.161), (11, 0.01), (12, 0.009), (13, 0.09), (14, -0.215), (15, -0.073), (16, 0.208), (17, -0.183), (18, 0.058), (19, 0.039), (20, -0.259), (21, 0.137), (22, 0.023), (23, 0.048), (24, -0.043), (25, -0.376), (26, 0.217), (27, -0.024), (28, 0.057), (29, -0.016), (30, 0.025), (31, 0.004), (32, 0.153), (33, -0.124), (34, 0.011), (35, -0.026), (36, -0.094), (37, -0.008), (38, 0.062), (39, 0.07), (40, 0.048), (41, -0.034), (42, 0.081), (43, 0.025), (44, -0.027), (45, -0.012), (46, -0.008), (47, 0.028), (48, -0.013), (49, 0.021)]
simIndex simValue paperId paperTitle
1 0.91892856 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
Author: Jagadeesh Jagarlamudi ; Hal Daume III
Abstract: In this paper, we address the problem of building a multilingual transliteration system using an interlingual representation. Our approach uses international phonetic alphabet (IPA) to learn the interlingual representation and thus allows us to use any word and its IPA representation as a training example. Thus, our approach requires only monolingual resources: a phoneme dictionary that lists words and their IPA representations.1 By adding a phoneme dictionary of a new language, we can readily build a transliteration system into any of the existing previous languages, without the expense of all-pairs data or computation. We also propose a regularization framework for learning the interlingual representation, which accounts for language specific phonemic variability, and thus it can find better mappings between languages. Experimental results on the name transliteration task in five diverse languages show a maximum improvement of 29% accuracy and an average improvement of 17% accuracy compared to a state-of-the-art baseline system.
same-paper 2 0.91764092 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: We consider the problem of inducing grapheme-to-phoneme mappings for unknown languages written in a Latin alphabet. First, we collect a data-set of 107 languages with known grapheme-phoneme relationships, along with a short text in each language. We then cast our task in the framework of supervised learning, where each known language serves as a training example, and predictions are made on unknown languages. We induce an undirected graphical model that learns phonotactic regularities, thus relating textual patterns to plausible phonemic interpretations across the entire range of languages. Our model correctly predicts grapheme-phoneme pairs with over 88% F1-measure.
3 0.39145857 81 emnlp-2012-Learning to Map into a Universal POS Tagset
Author: Yuan Zhang ; Roi Reichart ; Regina Barzilay ; Amir Globerson
Abstract: We present an automatic method for mapping language-specific part-of-speech tags to a set of universal tags. This unified representation plays a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Until now, however, such conversion schemes have been created manually. Our central hypothesis is that a valid mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function that captures a range of distributional and typological characteristics of the derived mapping. Given the exponential size of the mapping space, we propose a novel method for optimizing over soft mappings, and use entropy regularization to drive those towards hard mappings. Our results demonstrate that automatically induced mappings rival the quality of their manually designed counterparts when evaluated in the . context of multilingual parsing.1
4 0.34901428 61 emnlp-2012-Grounded Models of Semantic Representation
Author: Carina Silberer ; Mirella Lapata
Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.
5 0.27323645 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
Author: Greg Durrett ; Adam Pauls ; Dan Klein
Abstract: We consider the problem of using a bilingual dictionary to transfer lexico-syntactic information from a resource-rich source language to a resource-poor target language. In contrast to past work that used bitexts to transfer analyses of specific sentences at the token level, we instead use features to transfer the behavior of words at a type level. In a discriminative dependency parsing framework, our approach produces gains across a range of target languages, using two different lowresource training methodologies (one weakly supervised and one indirectly supervised) and two different dictionary sources (one manually constructed and one automatically constructed).
6 0.26915592 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation
7 0.23485981 69 emnlp-2012-Joining Forces Pays Off: Multilingual Joint Word Sense Disambiguation
8 0.21658835 17 emnlp-2012-An "AI readability" Formula for French as a Foreign Language
9 0.20732357 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
10 0.19747566 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
11 0.19514138 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
12 0.19078922 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
13 0.18869676 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification
14 0.18579559 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections
15 0.17753556 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules
16 0.17717151 59 emnlp-2012-Generating Non-Projective Word Order in Statistical Linearization
17 0.17565508 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
18 0.17530316 83 emnlp-2012-Lexical Differences in Autobiographical Narratives from Schizophrenic Patients and Healthy Controls
19 0.16772056 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media
20 0.1673886 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure
topicId topicWeight
[(2, 0.018), (15, 0.345), (16, 0.024), (25, 0.035), (34, 0.071), (39, 0.023), (45, 0.012), (60, 0.087), (63, 0.061), (64, 0.023), (65, 0.022), (70, 0.023), (73, 0.01), (74, 0.05), (76, 0.048), (80, 0.017), (86, 0.028), (95, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.70940691 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: We consider the problem of inducing grapheme-to-phoneme mappings for unknown languages written in a Latin alphabet. First, we collect a data-set of 107 languages with known grapheme-phoneme relationships, along with a short text in each language. We then cast our task in the framework of supervised learning, where each known language serves as a training example, and predictions are made on unknown languages. We induce an undirected graphical model that learns phonotactic regularities, thus relating textual patterns to plausible phonemic interpretations across the entire range of languages. Our model correctly predicts grapheme-phoneme pairs with over 88% F1-measure.
2 0.40642491 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
Author: Jayant Krishnamurthy ; Tom Mitchell
Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms ofweak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependencyparsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-theart accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.
3 0.40449509 81 emnlp-2012-Learning to Map into a Universal POS Tagset
Author: Yuan Zhang ; Roi Reichart ; Regina Barzilay ; Amir Globerson
Abstract: We present an automatic method for mapping language-specific part-of-speech tags to a set of universal tags. This unified representation plays a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Until now, however, such conversion schemes have been created manually. Our central hypothesis is that a valid mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function that captures a range of distributional and typological characteristics of the derived mapping. Given the exponential size of the mapping space, we propose a novel method for optimizing over soft mappings, and use entropy regularization to drive those towards hard mappings. Our results demonstrate that automatically induced mappings rival the quality of their manually designed counterparts when evaluated in the . context of multilingual parsing.1
4 0.39855707 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts
Author: Lizhen Qu ; Rainer Gemulla ; Gerhard Weikum
Abstract: We propose the weakly supervised MultiExperts Model (MEM) for analyzing the semantic orientation of opinions expressed in natural language reviews. In contrast to most prior work, MEM predicts both opinion polarity and opinion strength at the level of individual sentences; such fine-grained analysis helps to understand better why users like or dislike the entity under review. A key challenge in this setting is that it is hard to obtain sentence-level training data for both polarity and strength. For this reason, MEM is weakly supervised: It starts with potentially noisy indicators obtained from coarse-grained training data (i.e., document-level ratings), a small set of diverse base predictors, and, if available, small amounts of fine-grained training data. We integrate these noisy indicators into a unified probabilistic framework using ideas from ensemble learning and graph-based semi-supervised learning. Our experiments indicate that MEM outperforms state-of-the-art methods by a significant margin.
5 0.39575949 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews
Author: Jianxing Yu ; Zheng-Jun Zha ; Tat-Seng Chua
Abstract: This paper proposes to generate appropriate answers for opinion questions about products by exploiting the hierarchical organization of consumer reviews. The hierarchy organizes product aspects as nodes following their parent-child relations. For each aspect, the reviews and corresponding opinions on this aspect are stored. We develop a new framework for opinion Questions Answering, which enables accurate question analysis and effective answer generation by making use the hierarchy. In particular, we first identify the (explicit/implicit) product aspects asked in the questions and their sub-aspects by referring to the hierarchy. We then retrieve the corresponding review fragments relevant to the aspects from the hierarchy. In order to gener- ate appropriate answers from the review fragments, we develop a multi-criteria optimization approach for answer generation by simultaneously taking into account review salience, coherence, diversity, and parent-child relations among the aspects. We conduct evaluations on 11 popular products in four domains. The evaluated corpus contains 70,359 consumer reviews and 220 questions on these products. Experimental results demonstrate the effectiveness of our approach.
6 0.3936896 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
7 0.39240569 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT
8 0.39210266 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
9 0.39207473 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
10 0.39198998 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling
11 0.3906979 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
12 0.39060032 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation
13 0.39047599 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games
14 0.39037085 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP
15 0.38915151 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
16 0.38896084 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?
17 0.38864845 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
18 0.3879554 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction
19 0.38758045 51 emnlp-2012-Extracting Opinion Expressions with semi-Markov Conditional Random Fields
20 0.38654804 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules