emnlp emnlp2011 emnlp2011-122 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Taylor Berg-Kirkpatrick ; Dan Klein
Abstract: We present a simple objective function that when optimized yields accurate solutions to both decipherment and cognate pair identification problems. The objective simultaneously scores a matching between two alphabets and a matching between two lexicons, each in a different language. We introduce a simple coordinate descent procedure that efficiently finds effective solutions to the resulting combinatorial optimization problem. Our system requires only a list of words in both languages as input, yet it competes with and surpasses several state-of-the-art systems that are both substantially more complex and make use of more information.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a simple objective function that when optimized yields accurate solutions to both decipherment and cognate pair identification problems. [sent-3, score-0.638]
2 The objective simultaneously scores a matching between two alphabets and a matching between two lexicons, each in a different language. [sent-4, score-0.704]
3 We focus on the setting where a close correspondence between the alphabets of the two languages exists, but is unknown. [sent-8, score-0.317]
4 Given only two lists of words, the lexicons of both languages, we attempt to induce the correspondence between alphabets and identify the cognate pairs present in the lexicons. [sent-9, score-0.659]
5 The system we propose accomplishes this by defining a simple combinatorial optimization problem that is a function of both the alphabet and cognate matchings, and then induces correspondences by optimizing the objective using a block coordinate descent procedure. [sent-10, score-1.173]
6 There is a range of past work that has variously investigated cognate detection (Kondrak, 2001; Bouchard-Côté et al. [sent-11, score-0.407]
7 We consider a common element, which is a model wherein there are character-level correspondences and word-level correspondences, with the word matching parameterized by the character one. [sent-17, score-0.386]
8 2 Decipherment as Two-Level Optimization Our method represents two matchings, one at the alphabet level and one at the lexicon level. [sent-20, score-0.452]
9 For each character i in the source alphabet and each character j in the target alphabet we define an indicator variable xij that is on if and only if character i is mapped to character j. [sent-22, score-1.106]
10 For word u in the source lexicon and word v in the target lexicon, the indicator variable yuv denotes that u maps to v. [sent-24, score-0.403]
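To make the variable setup concrete, here is a minimal Python sketch on a toy instance that mirrors the running example; the digit strings in the target lexicon are hypothetical stand-ins invented for illustration, not taken from the paper's figure.

```python
import numpy as np

# Hypothetical toy instance echoing the running example below.
src_alphabet = ["a", "b", "c", "r", "t"]
tgt_alphabet = ["1", "2", "3", "4", "5"]
src_lexicon = ["cat", "bat", "cart", "rat", "cab"]
tgt_lexicon = ["243", "143", "2453", "543", "241"]  # invented digit strings

I, J = len(src_alphabet), len(tgt_alphabet)
U, V = len(src_lexicon), len(tgt_lexicon)

x = np.zeros((I, J), dtype=bool)  # x[i, j] = 1 iff source char i maps to target char j
y = np.zeros((U, V), dtype=bool)  # y[u, v] = 1 iff source word u maps to target word v
```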
11 We define an objective function on the matching variables as follows. [sent-26, score-0.325]
12 Let EDITDIST(u, v; x) denote the edit distance between source word u and target word v given alphabet matching x. [sent-27, score-0.812]
13 Now, the objective that we will minimize can be stated simply: Σu Σv yuv · EDITDIST(u, v; x), the sum of the edit distances between the matched words, where the edit distance function is parameterized by the alphabet matching. [sent-40, score-0.44]
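A minimal sketch of this implicit objective, assuming an `editdist(u, v)` callable that already closes over the current alphabet matching x (such as the dynamic program sketched further below):

```python
def matching_objective(y_pairs, editdist):
    """Implicit objective: sum over matched pairs of EDITDIST(u, v; x).

    y_pairs: iterable of matched (u, v) word pairs, i.e. the pairs with y_uv = 1.
    editdist: a callable editdist(u, v) assumed to close over the alphabet matching x.
    """
    return sum(editdist(u, v) for u, v in y_pairs)
```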
14 Without restrictions on the matchings x and y this objective can always be driven to zero by either mapping all characters to all characters, or matching none of the words. [sent-41, score-0.421]
15 Let I be the size of the source alphabet and J be the size of the target alphabet. [sent-43, score-0.405]
16 We allow the alphabet matching x to be many-to-many but require that each character participate in no more than two mappings and that the total number of mappings be max(I, J), a constraint we refer to as restricted-many-to-many. [sent-44, score-0.704]
17 The requirements can be encoded with the following linear constraints on x: ∀i Σj xij ≤ 2, ∀j Σi xij ≤ 2, Σi Σj xij = max(I, J). The lexicon matching y is required to be τ-one-to-one. [sent-45, score-0.379]
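As a sketch, both constraint families can be checked on 0/1 matrices as below; reading τ-one-to-one as "exactly τ matched pairs, each word used at most once" is our interpretation of the text, not a definition quoted from the paper.

```python
import numpy as np

def is_restricted_many_to_many(x):
    # Each character participates in at most two mappings; max(I, J) mappings total.
    I, J = x.shape
    return (x.sum(axis=1) <= 2).all() and (x.sum(axis=0) <= 2).all() \
        and x.sum() == max(I, J)

def is_tau_one_to_one(y, tau):
    # Our reading (an assumption): exactly tau matched pairs, with each
    # source and target word participating in at most one pair.
    return (y.sum(axis=1) <= 1).all() and (y.sum(axis=0) <= 1).all() \
        and y.sum() == tau
```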
18 x is restricted-many-to-many, y is τ-one-to-one. In order to get a better handle on the shape of the objective and to develop an efficient optimization procedure, we decompose each edit distance computation and re-formulate the optimization problem in Section 2. [sent-52, score-0.369]
19 Here, the source lexicon consists of the English words (cat, bat, cart, rat, cab), and the source alphabet consists of the characters (a, b, c, r, t). [sent-56, score-0.632]
20 We have used digits as symbols in the target alphabet to make it clear that we treat the two alphabets as disjoint. [sent-58, score-0.551]
21 The matchings shown achieve an edit distance of zero between all matched word pairs except for the pair ( cat , 2 3 ) . [sent-63, score-0.303]
22 However, by writing the objective in an explicit form that refers to these edit variables, we are able to describe an efficient block coordinate descent procedure that can be used for optimization. [sent-69, score-0.489]
23 EDITDIST(u, v; x) is computed by minimizing over the set of monotonic alignments between the characters of the source word u and the characters of the target word v. [sent-70, score-0.346]
24 x are the alphabet matching indicator variables, y are the lexicon matching indicator variables, and z are the edit alignment indicator variables. [sent-76, score-1.137]
25 The index u refers to a word in the source lexicon, v refers to a word in the target lexicon, i refers to a character in the source alphabet, and j refers to a character in the target alphabet. [sent-77, score-0.462]
26 Let zuv be the vector of alignment variables for the edit distance computation between source word u and target word v, where entry zuv,nm indicates whether the character at position n of source word u is aligned to the character at position m of target word v. [sent-80, score-1.019]
27 The requirement that zuv be a monotonic alignment can be expressed using linear constraints, but in our optimization procedure (described in Section 3) these constraints need not be explicitly rep- resented. [sent-92, score-0.508]
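A minimal sketch of EDITDIST(u, v; x) as the standard dynamic program over monotonic alignments; the unit costs for unmapped substitutions and for insertions/deletions are illustrative assumptions of this sketch, not the paper's actual operation costs.

```python
def edit_dist(u, v, mapped, sub_cost=1.0, indel_cost=1.0):
    """EDITDIST(u, v; x) via the standard DP over monotonic alignments.

    mapped: set of (source_char, target_char) pairs with x_ij = 1.
    The 0/1-vs-unit cost scheme is an illustrative assumption.
    """
    n, m = len(u), len(v)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for a in range(1, n + 1):
        d[a][0] = a * indel_cost            # delete u[:a]
    for b in range(1, m + 1):
        d[0][b] = b * indel_cost            # insert v[:b]
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            s = 0.0 if (u[a - 1], v[b - 1]) in mapped else sub_cost
            d[a][b] = min(d[a - 1][b - 1] + s,        # substitute / match
                          d[a - 1][b] + indel_cost,   # delete
                          d[a][b - 1] + indel_cost)   # insert
    return d[n][m]  # the argmin alignment z_uv is recoverable via backpointers
```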
28 Now we can substitute the explicit edit distance equation into the implicit matching objective (1). [sent-93, score-0.535]
29 Noticing that the mins and sums commute, we arrive at the explicit form of the matching optimization problem. [sent-94, score-0.305]
30 x is restricted-many-to-many, y is τ-one-to-one, ∀uv zuv is monotonic. The implicit and explicit optimizations are the same, apart from the fact that the explicit optimization now explicitly represents the edit alignment variables z. [sent-100, score-0.791]
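In symbols, the commutation step reads as follows (a hedged rendering, not the paper's numbered equations verbatim, with cost(zuv; x) standing for the total substitution, insertion, and deletion cost of alignment zuv under alphabet matching x):

```latex
\min_{x,y}\ \sum_{u}\sum_{v} y_{uv}\,\mathrm{EDITDIST}(u,v;x)
\;=\; \min_{x,y}\ \sum_{u}\sum_{v} y_{uv}\,\min_{z_{uv}}\mathrm{cost}(z_{uv};x)
\;=\; \min_{x,y,z}\ \sum_{u}\sum_{v} y_{uv}\,\mathrm{cost}(z_{uv};x)
\quad \text{s.t. } x \text{ restricted-many-to-many},\ y\ \tau\text{-one-to-one},\ z_{uv} \text{ monotonic } \forall u,v
```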
31 Let the explicit matching objective (2) be denoted as J(x, y, z). [sent-101, score-0.313]
32 Note that the function EDITDIST returns both the min edit distance euv and the argmin edit alignments zuv. [sent-107, score-0.446]
33 1 Lexicon Matching Update Let x, the alphabet matching variable, be fixed. [sent-111, score-0.513]
34 We consider the problem of optimizing J(x, y, z) over the lexicon matching variable y and the edit alignments z under the constraint that y is τ-one-to-one and each zuv is monotonic. [sent-112, score-0.946]
35 Algorithm 1 Block Coordinate Descent: Randomly initialize alphabet matching x. [sent-118, score-0.298]
36 The zuv in each of these edit distance problems can be optimized independently. [sent-121, score-0.52]
37 zuv that do not have yuv active have no effect on the objective, and zuv with yuv active can be optimized using the standard edit distance dynamic program. [sent-122, score-1.119]
38 Thus, in a first step we compute the U · V edit distances euv and best monotonic alignment variables zuv between all pairs of source and target words using U · V calls to the standard edit distance dynamic program. [sent-123, score-0.857]
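A heuristic sketch of this y-step in code: given the precomputed distance matrix E of all euv, solve a one-to-one assignment and keep the τ cheapest pairs. The truncation is an assumption of this sketch and is not guaranteed to give the exact optimal τ-one-to-one matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def update_lexicon_matching(E, tau):
    """y-step (x fixed): E[u, v] holds the precomputed edit distance e_uv.

    Heuristic sketch: solve a full one-to-one assignment on E, then keep
    the tau cheapest matched pairs to enforce a tau-one-to-one y.
    """
    rows, cols = linear_sum_assignment(E)
    pairs = sorted(zip(rows.tolist(), cols.tolist()),
                   key=lambda uv: E[uv[0], uv[1]])
    return pairs[:tau]
```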
39 2 Alphabet Matching Update Let y and z, the lexicon matching variables and the edit alignments, be fixed. [sent-135, score-0.58]
40 Now, we find the optimal alphabet matching variables x subject to the constraint that x is restricted-many-to-many. [sent-136, score-0.559]
41 It makes sense that to optimize J(x, y, z) with respect to x we should prioritize mappings xij that would mitigate the largest substitution costs in the active edit distance problems. [sent-137, score-0.316]
42 Indeed, with a little algebra it can be shown that solving a maximum weighted matching problem with weights cij that count potential substitution costs gives the correct update for x. [sent-138, score-0.319]
43 In particular, cij is the total cost of substitution edits in the active edit alignment problems that would result if source character i were not mapped to target character j in the alphabet matching x. [sent-139, score-1.078]
44 Since we have instead allowed restricted-many-to-many alphabet matchings we turn to linear programming for optimizing x. [sent-144, score-0.396]
45 ∀i Σj xij ≤ 2, ∀j Σi xij ≤ 2, Σi Σj xij = max(I, J). In experiments we used the GNU Linear Programming Kit (GLPK) to solve the LP and update the alphabet matching x. [sent-147, score-0.541]
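A sketch of the x-step as an LP over the restricted-many-to-many polytope, using scipy's linprog in place of GLPK; the weight matrix C holds the cij tallies described above, and the final 0/1 thresholding of the relaxed solution is an assumption of this sketch (an LP need not return an integral vertex in general).

```python
import numpy as np
from scipy.optimize import linprog

def update_alphabet_matching(C):
    """x-step (y, z fixed): maximize sum_ij c_ij * x_ij subject to the
    restricted-many-to-many constraints, as an LP relaxation of x in {0,1}.
    C[i, j] is the substitution cost avoided by mapping i to j."""
    I, J = C.shape
    n = I * J  # variables x_ij flattened row-major: index i*J + j
    A_ub = np.zeros((I + J, n))
    for i in range(I):                     # row sums: sum_j x_ij <= 2
        A_ub[i, i * J:(i + 1) * J] = 1.0
    for j in range(J):                     # column sums: sum_i x_ij <= 2
        A_ub[I + j, j::J] = 1.0
    b_ub = np.full(I + J, 2.0)
    A_eq = np.ones((1, n))                 # total mappings = max(I, J)
    b_eq = [float(max(I, J))]
    res = linprog(-C.ravel(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0.0, 1.0), method="highs")
    # Threshold the relaxed solution back to a 0/1 matching (an assumption).
    return res.x.reshape(I, J) > 0.5
```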
46 To find better optima, we run the coordinate descent procedure multiple times, initialized each time with a random alphabet matching. [sent-151, score-0.456]
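Putting the pieces together, a sketch of the restart loop around the two block updates; random_alphabet_matching (drawing a random feasible x) and substitution_counts (tallying the cij weights from the active alignments) are hypothetical helpers not defined here.

```python
import numpy as np

def decipher(src_lex, tgt_lex, src_alpha, tgt_alpha, tau,
             restarts=10, iters=20, seed=0):
    """Block coordinate descent with random restarts, composed from the
    sketches above; random_alphabet_matching and substitution_counts are
    hypothetical helpers."""
    rng = np.random.default_rng(seed)
    best_obj, best_x, best_y = float("inf"), None, None
    for _ in range(restarts):
        x = random_alphabet_matching(len(src_alpha), len(tgt_alpha), rng)
        for _ in range(iters):
            mapped = {(src_alpha[i], tgt_alpha[j]) for i, j in np.argwhere(x)}
            E = np.array([[edit_dist(u, v, mapped) for v in tgt_lex]
                          for u in src_lex])
            y = update_lexicon_matching(E, tau)              # y-step (and z-step)
            C = substitution_counts(y, src_lex, tgt_lex, x)  # hypothetical c_ij tally
            x = update_alphabet_matching(C)                  # x-step
        obj = sum(E[u, v] for u, v in y)  # objective after the last y-step
        if obj < best_obj:
            best_obj, best_x, best_y = obj, x, y
    return best_obj, best_x, best_y
```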
47 1 Phonetic Cognate Lexicons The first data set we evaluate on consists of 583 triples of phonetic transcriptions of cognates in Spanish, Portuguese, and Italian. [sent-159, score-0.379]
48 For a given pair of languages the task is to determine the mapping between lexicons that correctly maps each source word to its cognate in the target lexicon. [sent-162, score-0.647]
49 Hall and Klein (2010) presented a state-of-the-art system for the task of cognate identification and evaluated on this data set. [sent-164, score-0.433]
50 They estimate parameters and infer the pairs of cognates present in all three languages jointly, while we consider each pair of languages in turn. [sent-166, score-0.385]
51 Notice that the phonetic alphabets for the three languages are actually the same. [sent-170, score-0.346]
52 Our model, on the other hand, is unaware of any prior correspondence between alphabets and does not make use of this additional information about phonetic change. [sent-173, score-0.36]
53 Hall and Klein (2010) also evaluate their model on lexicons that do not have a perfect cognate mapping. [sent-174, score-0.486]
54 This scenario, where not every word in one language has a cognate in another, is more realistic. [sent-175, score-0.382]
55 They produced a data set with this property by pruning words from the ROMANCE data set until only about 75% of the words in each source lexicon have cognates in each target lexicon. [sent-176, score-0.558]
56 To generate a gold cognate matching we used the intersected HMM alignment model of Liang et al. [sent-183, score-0.629]
57 To reduce this translation lexicon down to a cognate matching we went through the translation lexicon by hand and removed any pair of words that we judged to not be cognates. [sent-186, score-0.92]
58 The resulting gold matching contains cognate mappings in the English lexicon for 1,026 of the words in the Spanish lexicon. [sent-187, score-0.812]
59 This means that only about 50% of the words in the English lexicon have cognates in the Spanish lexicon. [sent-188, score-0.446]
60 Since our system specializes in matching cognates and does not take into account additional information from corpus statistics, we compare against the version of their system that only takes into account orthographic features and is thus best suited for cognate detection. [sent-195, score-1.059]
61 Their system requires a small seed of correct cognate pairs. [sent-196, score-0.49]
62 Once in this canonical space, similarity metrics can be computed and words can be matched using a bipartite matching algorithm. [sent-198, score-0.308]
63 The process is iterative, adding cognate pairs to the seed lexicon gradually and each time re-computing a refined projection. [sent-199, score-0.598]
64 Again, for this dataset it is reasonable to expect that many characters will map to themselves in the best alphabet matching. [sent-205, score-0.355]
65 We attempt to learn a mapping from the alphabet of the ancient Semitic language Ugaritic to the alphabet of Hebrew, and at the same time learn a matching between Hebrew words in a Hebrew lexicon and their cognates in a Ugaritic lexicon. [sent-212, score-1.252]
66 The data set consists of a Ugaritic lexicon of 2,214 words, each of which has a Hebrew cognate, the lexicon of their 2,214 Hebrew cognates, and a gold cognate dictionary for evaluation. [sent-215, score-0.7]
67 It attempts to learn a correspondence between the morphology of Ugaritic and that of Hebrew while reconstructing cognates for Ugaritic words. [sent-219, score-0.382]
68 (2010) run their system on a set of 7,386 Ugaritic words, the same set that we extracted our 2,214 Ugaritic words with Hebrew cognates from. [sent-222, score-0.338]
69 We evaluate the accuracy of the lexicon matching produced by our system on these 2,214 Ugaritic words, and so do they, measuring the number of correctly reconstructed cognates. [sent-223, score-0.43]
70 By restricting the source and target lexicons to sets of cognates we have made the task easier. [sent-224, score-0.503]
71 We compare against the phylogenetic cognate detection system of Hall and Klein (2010). [sent-230, score-0.541]
72 We show the pairwise cognate accuracy across all pairs of languages from the following set: Spanish, Portuguese, and Italian. [sent-231, score-0.431]
73 The two settings are thus not directly comparable: only a small proportion of the words in the Ugaritic lexicon have cognates in the lexicon composed of the most frequent Hebrew words. [sent-232, score-0.605]
74 The Ugaritic alphabet contains 30 characters, the Hebrew alphabet contains 22 characters, and the gold matching contains 33 entries. [sent-237, score-0.806]
75 We evaluate the learned alphabet matching by counting the number of recovered entries from the gold matching. [sent-238, score-0.513]
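A small sketch of this alphabet-matching evaluation: count how many learned character mappings appear among the gold entries (33 entries for Ugaritic-Hebrew).

```python
def alphabet_matching_recall(learned_pairs, gold_pairs):
    """Count recovered gold alphabet entries, as in the evaluation above."""
    recovered = len(set(learned_pairs) & set(gold_pairs))
    return recovered, recovered / len(gold_pairs)
```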
76 9% of the correct cognate mappings on the pair Spanish and Italian, 85. [sent-250, score-0.433]
77 We compare against the phylogenetic cognate detection system of Hall and Klein (2010). [sent-259, score-0.541]
78 We show the pairwise cognate precision, recall, and F1 across all pairs of languages from the following set: Spanish, Portuguese, and Italian. [sent-260, score-0.431]
79 Note that approximately 75% of the source words in each of the source lexicons have cognates in each of the target lexicons. [sent-261, score-0.562]
80 Our system achieves accuracy comparable to that of the phylogenetic system, despite the fact that the phylogenetic system is substantially more complex and makes use of an informed prior on alphabet correspondences. [sent-266, score-0.632]
81 The alphabet matching learned by our system is interesting to analyze. [sent-267, score-0.564]
82 Our system learns the correct cognate pairing of Spanish /bino/ to Portuguese /vinu/. [sent-269, score-0.453]
83 Our system, which allows many-to-many alphabet correspondences, correctly identifies the mappings /o/ → /u/ and /b/ → /v/, as well as the identity mappings /o/ → /o/ and /b/ → /b/, which are also common. [sent-271, score-0.395]
84 In this data set, only approximately 75% of the source words in each of the source lexicons have cognates in each of the target lexicons. [sent-274, score-0.562]
85 We show the cognate precision, recall, and F1 for the pair of languages English and Spanish using lexicons extracted from corpora. [sent-292, score-0.535]
86 Note that approximately 50% of the words in the English lexicon have cognates in the Spanish lexicon. [sent-293, score-0.446]
87 The phylogenetic system observes the phylogenetic tree of ancestry for the three languages and explicitly models cognate evolution and survival in a ‘survival’ tree. [sent-300, score-0.745]
88 In this data set, only approximately 50% of the source words have cognates in the target lexicon. [sent-310, score-0.399]
89 Using a seed matching of 50 word pairs, the orthographic system of Haghighi et al. [sent-315, score-0.396]
90 Using a seed matching of 20 word pairs, it achieves a best F1 of 44. [sent-318, score-0.298]
91 In particular, the MATCHER system assumes the inventories of cognates in both Hebrew and Ugaritic are known, while the system of Snyder et al. [sent-330, score-0.411]
92 (2010) reconstructs cognates assuming only that the morphology of Hebrew is known, which is a harder task. [sent-331, score-0.346]
93 We show cognate pair identification accuracy and alphabet matching accuracy for Ugaritic and Hebrew. [sent-332, score-0.895]
94 Their system requires additional information: a seed matching of correct cognate pairs. [sent-333, score-0.659]
95 We evaluate both accuracy of the lexicon matching learned by our system, and the accuracy of the alphabet matching. [sent-337, score-0.672]
96 In particular, our system assumes the inventories of cognates in both Hebrew and Ugaritic are known, while the system of Snyder et al. [sent-343, score-0.411]
97 (2010) reconstructs cognates assuming only that the morphology of Hebrew is known, which is a harder task. [sent-344, score-0.346]
98 Even so, the results show that our system is effective at decipherment when semantically similar lexicons are available. [sent-345, score-0.329]
99 6 Conclusion We have presented a simple combinatorial model that simultaneously incorporates both a matching between alphabets and a matching between lexicons. [sent-346, score-0.682]
100 Our system is effective at both the tasks of cognate identification and alphabet decipherment, requiring only lists of words in both languages as input. [sent-347, score-0.775]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 122 emnlp-2011-Simple Effective Decipherment via Combinatorial Optimization
2 0.31767011 77 emnlp-2011-Large-Scale Cognate Recovery
3 0.10554527 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
4 0.07510557 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers
5 0.072717547 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction
6 0.070127524 3 emnlp-2011-A Correction Model for Word Alignments
7 0.067359805 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement
8 0.06397707 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
9 0.058537479 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
10 0.05270464 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
11 0.049728621 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
12 0.048839718 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation
13 0.04712816 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
14 0.044027191 39 emnlp-2011-Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
15 0.042845424 121 emnlp-2011-Semi-supervised CCG Lexicon Extension
16 0.041955397 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms
17 0.041252229 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
18 0.040168911 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
19 0.039215472 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
20 0.038159024 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources
simIndex simValue paperId paperTitle
same-paper 1 0.95250946 122 emnlp-2011-Simple Effective Decipherment via Combinatorial Optimization
2 0.9146474 77 emnlp-2011-Large-Scale Cognate Recovery
3 0.2724756 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
4 0.23026666 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction
5 0.22922289 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement
6 0.20258899 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
7 0.18689807 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers
8 0.18454561 3 emnlp-2011-A Correction Model for Word Alignments
9 0.17736378 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
10 0.16871534 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
11 0.1681935 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
12 0.16223757 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
13 0.15465413 121 emnlp-2011-Semi-supervised CCG Lexicon Extension
14 0.14833735 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
15 0.14647071 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms
16 0.14449121 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts
17 0.14242128 129 emnlp-2011-Structured Sparsity in Structured Prediction
18 0.1361104 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification
19 0.13492408 39 emnlp-2011-Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
20 0.13404563 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.
simIndex simValue paperId paperTitle
same-paper 1 0.7052201 122 emnlp-2011-Simple Effective Decipherment via Combinatorial Optimization
3 0.45332205 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
4 0.44603869 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers
5 0.43857083 45 emnlp-2011-Dual Decomposition with Many Overlapping Components
6 0.43191513 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
7 0.42875886 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction
8 0.42766038 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
9 0.42002484 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
10 0.41799995 138 emnlp-2011-Tuning as Ranking
11 0.41593495 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
12 0.41288361 136 emnlp-2011-Training a Parser for Machine Translation Reordering
13 0.41248885 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction
14 0.41227302 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
15 0.41107243 79 emnlp-2011-Lateen EM: Unsupervised Training with Multiple Objectives, Applied to Dependency Grammar Induction
16 0.41035131 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
17 0.40917212 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
18 0.40783453 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification
19 0.40609378 11 emnlp-2011-A Simple Word Trigger Method for Social Tag Suggestion
20 0.40560082 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation