emnlp emnlp2012 emnlp2012-111 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jagadeesh Jagarlamudi ; Hal Daume III
Abstract: In this paper, we address the problem of building a multilingual transliteration system using an interlingual representation. Our approach uses the international phonetic alphabet (IPA) to learn the interlingual representation and thus allows us to use any word and its IPA representation as a training example. Thus, our approach requires only monolingual resources: a phoneme dictionary that lists words and their IPA representations. By adding a phoneme dictionary of a new language, we can readily build a transliteration system into any of the existing languages, without the expense of all-pairs data or computation. We also propose a regularization framework for learning the interlingual representation, which accounts for language-specific phonemic variability and can thus find better mappings between languages. Experimental results on the name transliteration task in five diverse languages show a maximum improvement of 29% accuracy and an average improvement of 17% accuracy compared to a state-of-the-art baseline system.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we address the problem of building a multilingual transliteration system using an interlingual representation. [sent-3, score-0.599]
2 Thus, our approach requires only monolingual resources: a phoneme dictionary that lists words and their IPA representations. [sent-5, score-0.349]
3 By adding a phoneme dictionary of a new language, we can readily build a transliteration system into any of the existing languages, without the expense of all-pairs data or computation. [sent-6, score-0.832]
4 We also propose a regularization framework for learning the interlingual representation, which accounts for language specific phonemic variability, and thus it can find better mappings between languages. [sent-7, score-0.514]
5 Experimental results on the name transliteration task in five diverse languages show a maximum improvement of 29% accuracy and an average improvement of 17% accuracy compared to a state-of-the-art baseline system. [sent-8, score-0.704]
6 Similarly, bilingual dictionaries and transliteration data sets are more accessible from a language into English than into a different language. [sent-14, score-0.616]
7 Previous studies in machine translation (Utiyama and Isahara, 2007; Paul and Sumita, 2011), transliteration (Khapra et al. [sent-16, score-0.505]
8 In this paper, we propose a regularization framework for bridge language approaches and show its effectiveness for the name transliteration task. [sent-19, score-0.806]
9 Named entity (NE) transliteration involves transliterating a name in one language into another language and has been shown to be crucial for machine translation (MT) (Knight and Graehl, 1998; Al-Onaizan and Knight, 2002; Hermjakob et al. [sent-26, score-0.614]
10 In this paper, we operate in the context of transliteration mining (Klementiev and Roth, 2006; Sproat et al. [sent-37, score-0.505]
11 Given a set of l languages, we address the problem of building a transliteration system between every pair of languages. [sent-39, score-0.505]
12 A straightforward supervised learning approach would require training data of name pairs between every pair of languages (Knight and Graehl, 1998) or a set of common names transliterated from every language into a pivot language. [sent-40, score-0.293]
13 Bridge language approaches overcome the need for common names and build transliteration systems for resource poor languages (Khapra et al. [sent-42, score-0.673]
14 However, such approaches still require training data consisting of bilingual name transliterations (orthographic name-to-name mappings). [sent-44, score-0.285]
15 In this paper, we relax the need for name transliterations by using the international phonetic alphabet (IPA) in a manner akin to a "bridge language". [sent-45, score-0.272]
16 We refer to the set of (word, IPA) pairs as a phoneme dictionary in this paper. [sent-49, score-0.327]
17 Since we only need a phoneme dictionary in each language, our approach does not require any bilingual resources to build the transliteration system. [sent-54, score-0.887]
18 the phoneme dictionaries obtained from Wiktionary contain at least 2000 words in 21 languages and we will see in Sec. [sent-57, score-0.427]
19 6 that we can build a decent transliteration system with 2000 words. [sent-58, score-0.505]
20 Finally, unlike other transliteration approaches, by simply adding a phoneme dictionary of the (l + 1)st language we can readily get a transliteration system into any of the existing l languages and thus avoid the need for all-pairs data or computation. [sent-59, score-1.337]
21 Using IPA as the bridge language poses some new challenges, such as the language-specific phonemic inventory. [sent-60, score-0.509]
22 Besides this language-specific phonemic inventory, names have different IPA representations in different languages. [sent-65, score-0.389]
23 In order to handle this phonemic diversity, our method explicitly models language-specific variability and attempts to minimize it. In our experiments, we consider languages with small (2000) and big (>30K) phoneme dictionaries. [sent-70, score-1.04]
24 At a high level, our approach uses the phoneme dictionaries of each language to learn mapping functions into an interlingual representation (also referred to as a common subspace). [sent-76, score-0.487]
25 Subsequently, given a pair of languages, a query name in one of the languages and a list of candidate transliterations in the other language, we use the mapping functions of those two languages to identify the correct name transliteration. [sent-77, score-0.485]
26 An important advantage of our approach is that it extends easily to more than two languages; in fact, adding a phoneme dictionary from a different, but related, language improves the accuracies of a given language pair. [sent-80, score-0.417]
27 Our main contributions are: 1) building a transliteration system using (word, IPA) pairs, and hence only monolingual resources, and 2) proposing a regularization framework which is more general and applies to other bridge language applications such as lexicon mining (Mann and Yarowsky, 2001). [sent-81, score-0.719]
28 3 Low Dimensional Projections Our approach is inspired by the Canonical Correlation Analysis (CCA) (Hotelling, 1936) and its application to transliteration mining (Udupa and Khapra, 2010). [sent-82, score-0.505]
29 First, we convert the phoneme dictionary of each language into feature vectors, i. [sent-83, score-0.327]
30 For brevity, we refer to character and IPA symbol sequences as character and phonemic spaces, respectively. [sent-88, score-0.477]
31 The character space is specific to each language while the phonemic space is shared across all the languages. [sent-89, score-0.454]
32 Then, for each language, we find mappings (Ai and Ui) from the character and phonemic spaces into a common k-dimensional subspace such that the correct transliterations lie closer to each other in this subspace. [sent-93, score-0.477]
33 of features of the character space of the language and c is the size of the common phonemic space. [sent-99, score-0.408]
34 During the testing stage, given a name xi in the source (ith) language, we find its transliteration in the target (jth) language xj by solving the following decoding problem: $\arg\min_{x_j} L(x_i, x_j)$ (1) (with $U \in \mathbb{R}^{c \times k}$ and $A_i \in \mathbb{R}^{d_i \times k}$). Figure 1: A single name (Gandhi) is shown in all the input feature spaces. [sent-105, score-0.785]
35 The alignment between the character and phonemic space is indicated with double-sided arrows. [sent-106, score-0.408]
36 Bridge-CCA uses a single mapping function U from the phonemic space into the common subspace (the 2-dimensional green space at the top), whereas our approach uses two mapping functions U1 and U2, one for each language, to map the IPA sequences into the common subspace. [sent-107, score-0.543]
37 shows the name Gandhi (represented as a point) in the character spaces of English and Hindi, both three-dimensional spaces, and its IPA sequences in the phonemic space (the two-dimensional space in the middle). [sent-114, score-0.636]
38 Notice that, because of the phonemic variation, the same name is represented by two distinct points in the common phonemic space. [sent-115, score-0.749]
39 As a result, our approach successfully handles the language-specific phonemic variation. [sent-118, score-0.341]
40 At the same time we constrain the projection directions such that they behave similarly for the phonemic sounds that are observed in the majority of the languages. [sent-119, score-0.482]
41 1, our model (called Regularized Projections) finds two different mapping functions U1 and U2, one for each language, from the phonemic space into the common two-dimensional space at the bottom. [sent-121, score-0.426]
42 Inspired by the Canonical Correlation Analysis (CCA) (Hotelling, 1936), we find projection directions in the character and phonemic spaces of each language such that, after projection, a word is closer to its aligned IPA sequence. [sent-128, score-0.522]
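As a concrete illustration of this CCA step, the following is a minimal sketch using scikit-learn's CCA implementation (an assumption; the paper does not specify its solver). The random matrices stand in for one language's character-space and phonemic-space feature vectors, row-aligned over the dictionary's words.

```python
# A hedged sketch of per-language CCA: project character features (X) and
# aligned IPA features (P) into a shared k-dimensional subspace so that a
# word and its IPA sequence land close to each other. The data here is
# random and only illustrates shapes and API usage.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
n, d, c, k = 200, 50, 30, 10    # words, char features, IPA features, subspace dim
X = rng.rand(n, d)              # character-space features, one row per word
P = rng.rand(n, c)              # phonemic-space features, row-aligned with X

cca = CCA(n_components=k)
cca.fit(X, P)                   # learn projection directions for both spaces
X_k, P_k = cca.transform(X, P)  # words and IPA sequences in the common subspace
```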
43 Then, we might try to find projection directions ai in each language and u in the common phonemic space such that: $\arg\min_{a_i, u} \sum_{i=1}^{l} \left( \langle x_i, a_i \rangle - \langle p_i, u \rangle \right)^2$ (3) where ⟨·, ·⟩ denotes the dot product between two vectors. [sent-133, score-0.485]
44 This model assumes that the projection direction u is the same for the phonemic space of all the languages. [sent-134, score-0.397]
45 In our model, intuitively, the parameters corresponding to the phonemic sounds that occur in the majority of the languages are shared across the languages, while the parameters of the language-specific sounds are modeled per language. [sent-137, score-0.667]
46 This is achieved by modeling the projection direction of the ith language's phonemic space as ui ← u + ri. [sent-138, score-0.515]
47 The vector u ∈ Rc is common to the phonemic spaces of all the languages and thus handles sounds that are observed in multiple languages, while ri ∈ Rc, the residual vector, is specific to each language and accounts for the language-specific phonemic variations. [sent-139, score-1.127]
48 By enforcing the residual vectors to be small, this formulation encourages the sounds that occur in the majority of the languages to be accounted for by u and the sounds that are specific to a given language by ri. [sent-142, score-0.38]
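A hedged numpy sketch of this shared-plus-residual parameterization is given below, for a single projection direction per space and plain gradient descent; the actual model learns k directions under additional constraints, which this toy version omits. The helper name and hyperparameters are illustrative assumptions.

```python
# Minimize sum_i ||X_i a_i - P_i (u + r_i)||^2 + lam * sum_i ||r_i||^2,
# so that shared sounds are explained by u while language-specific sounds
# are absorbed into the penalized residuals r_i.
import numpy as np

def regularized_directions(Xs, Ps, lam=1.0, lr=0.01, iters=1000, seed=0):
    """Xs[i]: (n_i, d_i) character features of language i;
    Ps[i]: (n_i, c) IPA features, row-aligned with Xs[i]."""
    rng = np.random.RandomState(seed)
    c = Ps[0].shape[1]
    a = [0.01 * rng.randn(X.shape[1]) for X in Xs]  # per-language char direction
    u = 0.01 * rng.randn(c)                          # shared phonemic direction
    r = [np.zeros(c) for _ in Ps]                    # per-language residuals
    for _ in range(iters):
        for i, (X, P) in enumerate(zip(Xs, Ps)):
            diff = X @ a[i] - P @ (u + r[i])  # per-word projection mismatch
            n = len(X)
            a[i] -= lr * (X.T @ diff) / n
            g = -(P.T @ diff) / n             # phonemic-side gradient (up to a constant)
            u -= lr * g
            r[i] -= lr * (g + lam * r[i])     # the lam term shrinks r_i toward 0
    return a, u, r
```

Larger values of lam push the residuals toward zero and recover the single shared direction of Eq. 3, while lam = 0 effectively decouples the languages.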
49 These mappings are used in predicting the transliteration of a name in one language into any other language, which will be described in the following section. [sent-173, score-0.669]
50 Formally, given a word xi in the ith language, we find its transliteration into the jth language, xj, by solving the optimization problem shown in Eq. [sent-177, score-0.553]
51 Similar to the previous case, the closed-form solution can be found by computing the first derivative with respect to the unknown phoneme sequence and the target language transliteration and setting it to zero. [sent-179, score-0.786]
52 2 and then solve for xj, the best transliteration in the jth language, as: $A_j (I - U_j^T C_{ij}^{-1} U_j) A_j^T x_j = A_j U_j^T C_{ij}^{-1} U_i A_i^T x_i$ (12) Since Ui and Uj are not full rank matrices, to increase the numerical stability of the prediction step, we use $C_{ij} = U_i U_i^T + U_j U_j^T + 0.$ [sent-182, score-0.505]
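Under the assumption that the truncated constant above is a small ε added for stability, the prediction step can be sketched as follows: given the learned projections and a source-language feature vector xi, score each candidate target-language vector by how well it satisfies the linear system in Eq. 12.

```python
# A hedged sketch of candidate scoring via Eq. 12. Shapes assumed:
# Ai: (d_i, k), Aj: (d_j, k), Ui and Uj: (c, k). The epsilon regularizer is
# an assumption; the exact constant is truncated in the text above.
import numpy as np

def score_candidates(Ai, Ui, Aj, Uj, xi, candidates, eps=1e-4):
    c, k = Ui.shape
    Cij = Ui @ Ui.T + Uj @ Uj.T + eps * np.eye(c)   # regularized C_ij
    Cinv = np.linalg.inv(Cij)
    M = Aj @ (np.eye(k) - Uj.T @ Cinv @ Uj) @ Aj.T  # left-hand-side operator
    rhs = Aj @ Uj.T @ Cinv @ Ui @ (Ai.T @ xi)       # right-hand-side vector
    # A lower residual means the candidate better satisfies Eq. 12.
    return [np.linalg.norm(M @ xj - rhs) for xj in candidates]
```

In the mining setting, the candidate with the smallest residual is returned as the predicted transliteration.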
53 In transliteration, generative approaches aim to generate the target language transliteration of a given source name (Knight and Graehl, 1998; Jung et al. [sent-186, score-0.614]
54 , 2004; Al-Onaizan and Knight, 2002) while discriminative approaches assume a list of target language names, obtained from other sources, and try to identify the correct transliteration (Klementiev and Roth, 2006; Sproat et al. [sent-188, score-0.505]
55 Nevertheless, all these approaches require either bilingual name pairs or phoneme sequences to learn to transliterate between two languages. [sent-193, score-0.517]
56 Thus, if we want to build a transliteration system between every pair of languages in a given set of languages, then these approaches need resources between every pair of languages, which can be prohibitive. [sent-194, score-0.775]
57 , 2010) uses a single mapping function for the phonemic space of all the languages and thus it cannot handle language-specific variability. [sent-198, score-0.49]
58 Approaches that map words in different languages into the common phonemic space have also been well studied. [sent-200, score-0.435]
59 Similar to our approach, these variants only require soundex mappings of a new language to build a transliteration system, but our model does not require an explicit mapping between n-gram characters and the IPA symbols; instead, it learns them automatically using phoneme dictionaries. [sent-206, score-0.911]
60 6 Experiments Our experiments are designed to evaluate the following three aspects of our model, and of our approach to transliteration in general: IPA as bridge: Unlike other phonemic-based approaches (Sec. [sent-208, score-0.825]
61 5), we do not explicitly model the phoneme modifications between pairs of languages. [sent-209, score-0.281]
62 Moreover, the phoneme dictionary in each language is crawled from Wiktionary (Sec. [sent-210, score-0.327]
63 Multilinguality: In our method, simply adding a phoneme dictionary of a new language allows us to extend our transliteration system into any of the existing languages. [sent-215, score-0.832]
64 We evaluate the effect of data from a different, but related, language on a transliteration system between a given pair. [sent-216, score-0.595]
65 Complementarity: Using IPA as a bridge language allows us to build a transliteration system for resource-poor languages. [sent-217, score-0.703]
66 But we also want to evaluate whether such an approach can help improve a transliteration system trained directly on bilingual name-pairs. [sent-218, score-0.56]
67 The phoneme dictionaries (list of words and their IPA representations as shown in Table 1) are obtained from Wiktionary. [sent-221, score-0.337]
68 In principle, our method allows us to build a transliteration system between any of these language pairs without any additional information. [sent-226, score-0.505]
69 The training data consists of monolingual phoneme dictionaries, while the development/test sets are bilingual name pairs between English and the respective language. [sent-234, score-0.523]
70 Table 3 shows the sizes of phoneme dictionaries used for training the models. [sent-236, score-0.337]
71 The phoneme dictionaries of English, Bulgarian, and Russian contain more than 30K (word,IPA) pairs while the remaining two languages have smaller phoneme dictionaries. [sent-237, score-0.708]
72 2 Experimental Setup We convert the phoneme dictionaries of each language into feature vectors. [sent-241, score-0.337]
73 We use unigram and bigram features in the phonemic space and unigram, bigram and trigram features in the character space. [sent-242, score-0.408]
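A minimal featurization sketch is shown below, using scikit-learn's CountVectorizer as a stand-in (the paper does not specify its feature-extraction code, and treating each Unicode codepoint as one IPA symbol is a simplification).

```python
# Character 1-3 grams for the language-specific character space, and
# IPA symbol 1-2 grams for the shared phonemic space, as described above.
from sklearn.feature_extraction.text import CountVectorizer

words = ["gandhi", "moscow"]      # character side of a toy phoneme dictionary
ipa = ["ˈɡɑːndi", "ˈmɒskoʊ"]      # row-aligned IPA side (illustrative strings)

char_vec = CountVectorizer(analyzer="char", ngram_range=(1, 3))
ipa_vec = CountVectorizer(analyzer="char", ngram_range=(1, 2))

X = char_vec.fit_transform(words)  # one sparse row of n-gram counts per word
P = ipa_vec.fit_transform(ipa)     # shared phonemic space across languages
```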
74 The phonemic space is common to all the languages and has 3777 features. [sent-248, score-0.435]
75 Though the phonemic features are common to all the languages, as discussed in Sec. [sent-249, score-0.32]
76 Figure 2: Performance of the transliteration system with residual parameter λ on the English-Bulgarian development data set. [sent-253, score-0.592]
77 This indicates the diversity in the phonemic inventory of different languages. [sent-255, score-0.342]
78 We compare our approach against Bridge-CCA, a state-of-the-art bridge language transliteration system which is known to perform competitively with other discriminative approaches (Khapra et al. [sent-256, score-0.695]
79 We use the phoneme dictionaries in each language to train our approach, as well as the baseline system. [sent-258, score-0.337]
80 The projection directions learnt during training are used to find the transliteration for a test name, as described in Sec. [sent-259, score-0.703]
81 We report the performance in terms of the accuracy (exact match) of the top-ranked transliteration and the mean reciprocal rank (MRR) of the correct transliteration. [sent-262, score-0.505]
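Under the assumption that the system returns a ranked candidate list per test name, these two metrics can be computed with a small helper like the following.

```python
# Top-1 exact-match accuracy and mean reciprocal rank (MRR) of the correct
# transliteration; a gold answer missing from the list contributes 0 to both.
def accuracy_and_mrr(ranked_lists, gold):
    acc = rr = 0.0
    for cands, g in zip(ranked_lists, gold):
        if cands and cands[0] == g:
            acc += 1.0
        if g in cands:
            rr += 1.0 / (cands.index(g) + 1)
    n = len(gold)
    return acc / n, rr / n

# Example: two test names, correct answers at ranks 1 and 2.
print(accuracy_and_mrr([["a", "b"], ["c", "d"]], ["a", "d"]))  # (0.5, 0.75)
```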
82 The second block shows the results when our approach is trained only on the phoneme dictionaries of the language pair, while the third block shows results when we include other language data as well. [sent-322, score-0.337]
83 To understand why the cosine angle between $A_i^T x_i$ and $A_j^T x_j$ is not the appropriate measure, assume that the vectors xi and xj are feature vectors of the same name in two languages and let p be its true IPA representation. [sent-337, score-0.355]
84 1, integrates over the best possible phoneme sequence and thus yields significant improvements. [sent-341, score-0.281]
85 Notice that even though our Russian phoneme dictionary has only 1141 (word, IPA) pairs, our approach is able to achieve an accuracy of 63. [sent-348, score-0.327]
86 47% and an MRR of 73%, indicating that the correct name transliteration is, on average, at rank 1 or 2. [sent-349, score-0.614]
87 3 Complementarity In the final experiment, we want to compare the performance of our approach, which uses only monolingual resources, with a transliteration system trained using bilingual name pairs. [sent-359, score-0.691]
88 We train a CCA based transliteration system (Udupa and Khapra, 2010). [sent-360, score-0.505]
89 15% and the average string edit distance of the returned transliteration to the true transliteration is about 3. [sent-389, score-1.01]
90 These accuracies are not directly comparable to the results shown in Table 5 because, presumably, it is a transliteration generation system, unlike CCA, which is a transliteration mining approach. [sent-391, score-1.01]
91 For lack of a fair comparison, we do not report the accuracies of the Google transliteration output in Table 5. [sent-392, score-0.505]
92 This experiment shows that a transliteration system trained on word and IPA representations can actually augment a system trained on bilingual name pairs, leading to improved performance. [sent-398, score-0.669]
93 7 Conclusion In this paper we proposed a regularization technique for bridge language approaches and showed its effectiveness on the name transliteration task. [sent-399, score-0.806]
94 Our approach learns an interlingual representation using only monolingual resources and hence can be used to build a transliteration system between resource-poor languages. [sent-400, score-0.651]
95 We show that, by accounting for the language-specific phonemic variation, we can get significant improvements. [sent-401, score-0.341]
96 Our experimental results suggest that a transliteration system built using IPA data can also help improve the accuracy of a transliteration system trained on bilingual name pairs. [sent-402, score-1.174]
97 Since the name transliteration problem has been studied for a considerable time, many resources already exist between English and other languages. [sent-406, score-0.614]
98 An English to Korean transliteration model of extended Markov window. [sent-448, score-0.532]
99 Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. [sent-465, score-0.505]
100 Unsupervised named entity transliteration using temporal and phonetic correlation. ACL-44, pages 73–80, Stroudsburg. [sent-521, score-0.547]
wordName wordTfidf (topN-words)
[('ipa', 0.541), ('transliteration', 0.505), ('phonemic', 0.32), ('phoneme', 0.281), ('bridge', 0.168), ('transliterations', 0.121), ('name', 0.109), ('khapra', 0.094), ('interlingual', 0.094), ('languages', 0.09), ('residual', 0.087), ('ui', 0.081), ('cca', 0.078), ('sounds', 0.073), ('aitxi', 0.073), ('character', 0.063), ('udupa', 0.062), ('ajtxj', 0.06), ('dictionaries', 0.056), ('bilingual', 0.055), ('ri', 0.055), ('mappings', 0.055), ('uj', 0.052), ('projection', 0.052), ('ai', 0.051), ('spaces', 0.05), ('bulgarian', 0.049), ('names', 0.048), ('sproat', 0.047), ('pivot', 0.046), ('dictionary', 0.046), ('sequences', 0.044), ('lagrangian', 0.043), ('knight', 0.043), ('phonetic', 0.042), ('russian', 0.041), ('subspace', 0.039), ('stroudsburg', 0.038), ('directions', 0.037), ('eigenvalue', 0.036), ('romanian', 0.036), ('soundex', 0.036), ('xitai', 0.036), ('xj', 0.036), ('vectors', 0.036), ('tao', 0.035), ('ni', 0.034), ('mapping', 0.034), ('matrices', 0.033), ('yoon', 0.031), ('multipliers', 0.031), ('klementiev', 0.031), ('raghavendra', 0.031), ('resource', 0.03), ('fr', 0.03), ('variability', 0.029), ('transliterate', 0.028), ('wiktionary', 0.028), ('gi', 0.028), ('projections', 0.028), ('english', 0.027), ('xi', 0.026), ('graehl', 0.026), ('di', 0.025), ('rc', 0.025), ('space', 0.025), ('haizhou', 0.024), ('mrr', 0.024), ('regularization', 0.024), ('abduljaleel', 0.024), ('appropriateness', 0.024), ('bridgecca', 0.024), ('ecir', 0.024), ('feiti', 0.024), ('fgii', 0.024), ('hermjakob', 0.024), ('mandl', 0.024), ('pipitri', 0.024), ('pitri', 0.024), ('pitu', 0.024), ('pixitai', 0.024), ('rdi', 0.024), ('saralegi', 0.024), ('uiuit', 0.024), ('notice', 0.024), ('ei', 0.023), ('functions', 0.022), ('monolingual', 0.022), ('cosine', 0.022), ('competitively', 0.022), ('inventory', 0.022), ('cos', 0.022), ('im', 0.022), ('french', 0.022), ('optimization', 0.022), ('specific', 0.021), ('gets', 0.021), ('variation', 0.021), ('kumaran', 0.021), ('relaxes', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
Author: Jagadeesh Jagarlamudi ; Hal Daume III
Abstract: In this paper, we address the problem of building a multilingual transliteration system using an interlingual representation. Our approach uses international phonetic alphabet (IPA) to learn the interlingual representation and thus allows us to use any word and its IPA representation as a training example. Thus, our approach requires only monolingual resources: a phoneme dictionary that lists words and their IPA representations.1 By adding a phoneme dictionary of a new language, we can readily build a transliteration system into any of the existing previous languages, without the expense of all-pairs data or computation. We also propose a regularization framework for learning the interlingual representation, which accounts for language specific phonemic variability, and thus it can find better mappings between languages. Experimental results on the name transliteration task in five diverse languages show a maximum improvement of 29% accuracy and an average improvement of 17% accuracy compared to a state-of-the-art baseline system.
2 0.25987402 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: We consider the problem of inducing grapheme-to-phoneme mappings for unknown languages written in a Latin alphabet. First, we collect a data-set of 107 languages with known grapheme-phoneme relationships, along with a short text in each language. We then cast our task in the framework of supervised learning, where each known language serves as a training example, and predictions are made on unknown languages. We induce an undirected graphical model that learns phonotactic regularities, thus relating textual patterns to plausible phonemic interpretations across the entire range of languages. Our model correctly predicts grapheme-phoneme pairs with over 88% F1-measure.
3 0.089882195 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation
Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze
Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to referto persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
4 0.071086951 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
Author: Liwei Chen ; Yansong Feng ; Lei Zou ; Dongyan Zhao
Abstract: In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. In literature, the latter receives less attention and remains more challenging. We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. This can alleviate the lack of clues where discriminative features can be reasonably weighted by taking their corpus level importance into account, not just relying on the current local context. We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. The experimental results show that the corpus level topic in- formation provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets.
5 0.068426967 81 emnlp-2012-Learning to Map into a Universal POS Tagset
Author: Yuan Zhang ; Roi Reichart ; Regina Barzilay ; Amir Globerson
Abstract: We present an automatic method for mapping language-specific part-of-speech tags to a set of universal tags. This unified representation plays a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Until now, however, such conversion schemes have been created manually. Our central hypothesis is that a valid mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function that captures a range of distributional and typological characteristics of the derived mapping. Given the exponential size of the mapping space, we propose a novel method for optimizing over soft mappings, and use entropy regularization to drive those towards hard mappings. Our results demonstrate that automatically induced mappings rival the quality of their manually designed counterparts when evaluated in the . context of multilingual parsing.1
6 0.064549193 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
7 0.059461314 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
8 0.05444194 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
9 0.038069528 116 emnlp-2012-Semantic Compositionality through Recursive Matrix-Vector Spaces
10 0.036304023 69 emnlp-2012-Joining Forces Pays Off: Multilingual Joint Word Sense Disambiguation
11 0.034978911 61 emnlp-2012-Grounded Models of Semantic Representation
12 0.033614632 58 emnlp-2012-Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs
13 0.032103453 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
14 0.031002283 86 emnlp-2012-Locally Training the Log-Linear Model for SMT
15 0.030845236 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
16 0.030490214 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
17 0.030179683 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs
18 0.030011658 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction
19 0.029911077 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
20 0.029513527 34 emnlp-2012-Do Neighbours Help? An Exploration of Graph-based Algorithms for Cross-domain Sentiment Classification
topicId topicWeight
[(0, 0.137), (1, -0.013), (2, -0.0), (3, 0.023), (4, -0.016), (5, 0.047), (6, 0.086), (7, 0.061), (8, 0.1), (9, -0.135), (10, 0.147), (11, -0.003), (12, 0.055), (13, 0.061), (14, -0.195), (15, -0.087), (16, 0.273), (17, -0.165), (18, 0.066), (19, 0.077), (20, -0.263), (21, 0.114), (22, 0.013), (23, 0.015), (24, -0.022), (25, -0.417), (26, 0.157), (27, 0.003), (28, 0.022), (29, -0.035), (30, 0.055), (31, -0.034), (32, 0.115), (33, -0.07), (34, -0.035), (35, -0.011), (36, -0.075), (37, -0.021), (38, 0.079), (39, 0.029), (40, -0.029), (41, -0.036), (42, 0.074), (43, 0.034), (44, -0.1), (45, -0.059), (46, 0.037), (47, 0.016), (48, -0.035), (49, -0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.95199746 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
Author: Jagadeesh Jagarlamudi ; Hal Daume III
Abstract: In this paper, we address the problem of building a multilingual transliteration system using an interlingual representation. Our approach uses international phonetic alphabet (IPA) to learn the interlingual representation and thus allows us to use any word and its IPA representation as a training example. Thus, our approach requires only monolingual resources: a phoneme dictionary that lists words and their IPA representations.1 By adding a phoneme dictionary of a new language, we can readily build a transliteration system into any of the existing previous languages, without the expense of all-pairs data or computation. We also propose a regularization framework for learning the interlingual representation, which accounts for language specific phonemic variability, and thus it can find better mappings between languages. Experimental results on the name transliteration task in five diverse languages show a maximum improvement of 29% accuracy and an average improvement of 17% accuracy compared to a state-of-the-art baseline system.
2 0.84202302 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: We consider the problem of inducing grapheme-to-phoneme mappings for unknown languages written in a Latin alphabet. First, we collect a data-set of 107 languages with known grapheme-phoneme relationships, along with a short text in each language. We then cast our task in the framework of supervised learning, where each known language serves as a training example, and predictions are made on unknown languages. We induce an undirected graphical model that learns phonotactic regularities, thus relating textual patterns to plausible phonemic interpretations across the entire range of languages. Our model correctly predicts grapheme-phoneme pairs with over 88% F1-measure.
3 0.33710665 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation
Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze
Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to referto persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
4 0.28909716 81 emnlp-2012-Learning to Map into a Universal POS Tagset
Author: Yuan Zhang ; Roi Reichart ; Regina Barzilay ; Amir Globerson
Abstract: We present an automatic method for mapping language-specific part-of-speech tags to a set of universal tags. This unified representation plays a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Until now, however, such conversion schemes have been created manually. Our central hypothesis is that a valid mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function that captures a range of distributional and typological characteristics of the derived mapping. Given the exponential size of the mapping space, we propose a novel method for optimizing over soft mappings, and use entropy regularization to drive those towards hard mappings. Our results demonstrate that automatically induced mappings rival the quality of their manually designed counterparts when evaluated in the . context of multilingual parsing.1
5 0.2407912 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
Author: Liwei Chen ; Yansong Feng ; Lei Zou ; Dongyan Zhao
Abstract: In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. In literature, the latter receives less attention and remains more challenging. We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. This can alleviate the lack of clues where discriminative features can be reasonably weighted by taking their corpus level importance into account, not just relying on the current local context. We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. The experimental results show that the corpus level topic in- formation provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets.
6 0.2258155 61 emnlp-2012-Grounded Models of Semantic Representation
7 0.21870519 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
8 0.21115217 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
9 0.17952958 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
10 0.17233515 69 emnlp-2012-Joining Forces Pays Off: Multilingual Joint Word Sense Disambiguation
11 0.17028971 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs
12 0.1630113 17 emnlp-2012-An "AI readability" Formula for French as a Foreign Language
13 0.15567668 118 emnlp-2012-Source Language Adaptation for Resource-Poor Machine Translation
14 0.13951229 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
15 0.13888144 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
16 0.13721362 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction
17 0.13592471 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules
18 0.13346241 59 emnlp-2012-Generating Non-Projective Word Order in Statistical Linearization
19 0.1304602 75 emnlp-2012-Large Scale Decipherment for Out-of-Domain Machine Translation
20 0.12973325 34 emnlp-2012-Do Neighbours Help? An Exploration of Graph-based Algorithms for Cross-domain Sentiment Classification
topicId topicWeight
[(2, 0.025), (11, 0.015), (16, 0.027), (25, 0.012), (34, 0.098), (39, 0.013), (45, 0.023), (60, 0.109), (63, 0.046), (64, 0.022), (65, 0.023), (70, 0.024), (71, 0.309), (74, 0.032), (76, 0.054), (80, 0.014), (86, 0.017), (95, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.74762839 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
Author: Jagadeesh Jagarlamudi ; Hal Daume III
Abstract: In this paper, we address the problem of building a multilingual transliteration system using an interlingual representation. Our approach uses international phonetic alphabet (IPA) to learn the interlingual representation and thus allows us to use any word and its IPA representation as a training example. Thus, our approach requires only monolingual resources: a phoneme dictionary that lists words and their IPA representations.1 By adding a phoneme dictionary of a new language, we can readily build a transliteration system into any of the existing previous languages, without the expense of all-pairs data or computation. We also propose a regularization framework for learning the interlingual representation, which accounts for language specific phonemic variability, and thus it can find better mappings between languages. Experimental results on the name transliteration task in five diverse languages show a maximum improvement of 29% accuracy and an average improvement of 17% accuracy compared to a state-of-the-art baseline system.
2 0.73472333 72 emnlp-2012-Joint Inference for Event Timeline Construction
Author: Quang Do ; Wei Lu ; Dan Roth
Abstract: This paper addresses the task of constructing a timeline of events mentioned in a given text. To accomplish that, we present a novel representation of the temporal structure of a news article based on time intervals. We then present an algorithmic approach that jointly optimizes the temporal structure by coupling local classifiers that predict associations and temporal relations between pairs of temporal entities with global constraints. Moreover, we present ways to leverage knowledge provided by event coreference to further improve the system performance. Overall, our experiments show that the joint inference model significantly outperformed the local classifiers by 9.2% of relative improvement in F1. The experiments also suggest that good event coreference could make remarkable contribution to a robust event timeline construction system.
3 0.47598714 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
Author: Greg Durrett ; Adam Pauls ; Dan Klein
Abstract: We consider the problem of using a bilingual dictionary to transfer lexico-syntactic information from a resource-rich source language to a resource-poor target language. In contrast to past work that used bitexts to transfer analyses of specific sentences at the token level, we instead use features to transfer the behavior of words at a type level. In a discriminative dependency parsing framework, our approach produces gains across a range of target languages, using two different lowresource training methodologies (one weakly supervised and one indirectly supervised) and two different dictionary sources (one manually constructed and one automatically constructed).
4 0.47348604 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?
Author: Mahesh Joshi ; Mark Dredze ; William W. Cohen ; Carolyn Rose
Abstract: We present a systematic analysis of existing multi-domain learning approaches with respect to two questions. First, many multidomain learning algorithms resemble ensemble learning algorithms. (1) Are multi-domain learning improvements the result of ensemble learning effects? Second, these algorithms are traditionally evaluated in a balanced class label setting, although in practice many multidomain settings have domain-specific class label biases. When multi-domain learning is applied to these settings, (2) are multidomain methods improving because they capture domain-specific class biases? An understanding of these two issues presents a clearer idea about where the field has had success in multi-domain learning, and it suggests some important open questions for improving beyond the current state of the art.
5 0.47293583 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
Author: Fei Huang ; Alexander Yates
Abstract: Representation learning is a promising technique for discovering features that allow supervised classifiers to generalize from a source domain dataset to arbitrary new domains. We present a novel, formal statement of the representation learning task. We argue that because the task is computationally intractable in general, it is important for a representation learner to be able to incorporate expert knowledge during its search for helpful features. Leveraging the Posterior Regularization framework, we develop an architecture for incorporating biases into representation learning. We investigate three types of biases, and experiments on two domain adaptation tasks show that our biased learners identify significantly better sets of features than unbiased learners, resulting in a relative reduction in error of more than 16% for both tasks, with respect to existing state-of-the-art representation learning techniques.
6 0.47182998 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP
7 0.47182146 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
8 0.47035486 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation
9 0.46917096 81 emnlp-2012-Learning to Map into a Universal POS Tagset
10 0.46910223 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
11 0.46889454 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
12 0.46870413 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing
13 0.46858862 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts
14 0.46821326 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
15 0.46746701 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
16 0.46720573 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
17 0.46596166 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM
18 0.46570706 45 emnlp-2012-Exploiting Chunk-level Features to Improve Phrase Chunking
19 0.46527335 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling
20 0.46500504 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT