emnlp emnlp2011 emnlp2011-72 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ali El Kahki ; Kareem Darwish ; Ahmed Saad El Din ; Mohamed Abd El-Wahab ; Ahmed Hefny ; Waleed Ammar
Abstract: Mining of transliterations from comparable or parallel text can enhance natural language processing applications such as machine translation and cross language information retrieval. This paper presents an enhanced transliteration mining technique that uses a generative graph reinforcement model to infer mappings between source and target character sequences. An initial set of mappings is learned through automatic alignment of transliteration pairs at the character sequence level. Then, these mappings are modeled using a bipartite graph. A graph reinforcement algorithm is then used to enrich the graph by inferring additional mappings. During graph reinforcement, appropriate link reweighting is used to promote good mappings and to demote bad ones. The enhanced transliteration mining technique is tested in the context of mining transliterations from parallel Wikipedia titles in 4 alphabet-based language pairs, namely English-Arabic, English-Russian, English-Hindi, and English-Tamil. The improvements in F1-measure over the baseline system were 18.7, 1.0, 4.5, and 32.5 basis points for the four language pairs respectively. The results herein outperform the best reported results in the literature by 2.6, 4.8, 0.8, and 4.1 basis points for the four language pairs respectively.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Mining of transliterations from comparable or parallel text can enhance natural language processing applications such as machine translation and cross language information retrieval. [sent-9, score-0.204]
2 This paper presents an enhanced transliteration mining technique that uses a generative graph reinforcement model to infer mappings between source and target character sequences. [sent-10, score-1.759]
3 An initial set of mappings are learned through automatic alignment of transliteration pairs at character sequence level. [sent-11, score-1.197]
4 Then, these mappings are modeled using a bipartite graph. [sent-12, score-0.327]
5 A graph reinforcement algorithm is then used to enrich the graph by inferring additional mappings. [sent-13, score-0.624]
6 During graph reinforcement, appropriate link reweighting is used to promote good mappings and to demote bad ones. [sent-14, score-0.715]
7 The enhanced transliteration mining technique is tested in the context of mining transliterations from parallel Wikipedia titles in 4 alphabet-based language pairs, namely English-Arabic, English-Russian, English-Hindi, and English-Tamil. [sent-15, score-0.991]
8 First, a generative model is trained by performing automatic character level alignment of parallel transliterated word pairs to find character segment mappings between source and target languages. [sent-31, score-1.034]
9 Second, given comparable or parallel text, the trained generative model is used to generate possible transliterations of a word in the source language while constraining the transliterations to words that exist in the target language. [sent-32, score-0.444]
10 Many possible character sequence mappings between source and target languages may not be observed in training data, particularly when limited training data is available – hurting recall. [sent-34, score-0.751]
11 Conditional probability estimates of obtained mappings may be inaccurate, because some mappings and some character sequences may not appear a sufficient number of times in training to properly estimate their probabilities – hurting precision. [sent-36, score-1.019]
12 To address the first problem, we modeled the automatically obtained character sequence mappings (from alignment) as a bipartite graph and then we performed graph reinforcement to enrich the graph and predict possible mappings that were not directly obtained from training data. [sent-38, score-1.649]
13 In this case, there are multiple paths that would lead to a mapping between the Arabic letter "ق" and the English letter "c", namely ق → q → c and ق → k → c. [sent-42, score-0.275]
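A minimal sketch of this path-based deduction (illustrative only; the romanized keys "qaf" and "kaf" stand in for the Arabic letters, and the edge set is invented for the example):

```python
# Sketch (not the paper's code): deducing unseen mappings by traversing a
# bipartite source/target mapping graph in the order S -> T -> S -> T.

def deduce_mappings(edges):
    """edges: set of (source_seq, target_seq) pairs seen in training.
    Returns mappings reachable via a source->target->source->target path."""
    targets_of = {}
    sources_of = {}
    for s, t in edges:
        targets_of.setdefault(s, set()).add(t)
        sources_of.setdefault(t, set()).add(s)
    deduced = set()
    for s, t in edges:
        for s2 in sources_of[t]:          # another source sharing target t
            for t2 in targets_of[s2]:     # that source's other targets
                if (s, t2) not in edges:
                    deduced.add((s, t2))
    return deduced

seen = {("qaf", "q"), ("qaf", "k"), ("kaf", "k"), ("kaf", "c")}
new = deduce_mappings(seen)
# The path qaf -> k -> kaf -> c yields the unseen mapping qaf -> c.
```

This only finds candidate edges; assigning probabilities to them is the role of the reinforcement update discussed later in the paper.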
14 Another method for overcoming the missing mappings problem entails assigning small smoothing probabilities to unseen mappings. [sent-44, score-0.45]
15 However, from looking at the graph, it is evident that some mappings could be inferred and should be assigned probabilities that are higher than a small smoothing probability. [sent-45, score-0.39]
16 The second problem has to do primarily with some characters in one language, typically vowels, mapping to many character sequences in the other language, with some of these mappings assuming very high probabilities (due to limited training data). [sent-46, score-0.805]
17 To overcome this problem, we used link reweighting in graph reinforcement to scale down the likelihood of mappings to target character sequences in proportion to how many source sequences map to them. [sent-47, score-1.542]
18 For each language pair, the standard ACL 2010 NEWS workshop data contained a base set of 1,000 transliteration pairs for training, and a set of 1,000 parallel Wikipedia titles for testing. [sent-50, score-0.659]
19 Employing graph reinforcement to improve the coverage of automatically aligned data, as it applies to transliteration mining. [sent-52, score-1.051]
20 Applying link reweighting to overcome situations where certain tokens (character sequences in the case of transliteration) tend to have many mappings, which are often erroneous. [sent-55, score-1.129]
21 Figure 1: Example mappings seen in training. Background: Much work has been done on TM for different language pairs such as English-Chinese (Kuo et al. [sent-58, score-0.329]
22 TM typically involves two main tasks, namely: finding character mappings between two languages, and, given the mappings, ascertaining whether two words are transliterations or not. [sent-66, score-1.025]
23 When training with a limited number of transliteration pairs, two additional problems appear: many possible character sequence mappings between source and target languages may not be observed in training data, and conditional probability estimates of obtained mappings may be inaccurate. [sent-67, score-1.554]
24 1 Finding Character Mappings To find character sequence mappings between two languages, the most common approach entails using automatic letter alignment of transliteration pairs. [sent-70, score-1.286]
25 In this paper, we use automatic character alignment between transliteration pairs using an HMM aligner. [sent-76, score-0.864]
26 Another method is to use automatic speech recognition confusion tables to extract phonetically equivalent character sequences to discover monolingual and cross lingual pronunciation variations (Kuo and Yang, 2005). [sent-77, score-0.361]
27 Alternatively, letters can be mapped into a common character set using a predefined transliteration scheme (Oh and Choi, 2006). [sent-78, score-0.85]
28 Noeman and Madkour (2010) implemented this technique using a finite state automaton, generating all possible transliterations along with weighted edit distances and then filtering them using appropriate thresholds and target language words. [sent-84, score-0.199]
29 A related alternative is to use back-transliteration to determine if one sequence could have been generated by successively mapping character sequences from one language into another (Brill et al. [sent-87, score-0.459]
30 Udupa and Khapra (2010) proposed a method in which transliteration candidates are mapped into a "low-dimensional common representation space". [sent-89, score-0.554]
31 They used a variety of features including edit distance between an English token and the Romanized versions of the foreign token, forward and backward transliteration probabilities, and character n-gram similarity. [sent-95, score-0.792]
32 3 Training with Limited Training Data When only limited training data is available to train a character mapping model, the resultant mappings are typically incomplete (due to sparseness in the training data). [sent-101, score-0.681]
33 Further, resultant mappings may not be observed a sufficient number of times and hence their mapping probabilities may be inaccurate. [sent-102, score-0.48]
34 SOUNDEX is used to convert English words into a simplified phonetic representation, in which vowels are removed and phonetically similar characters are conflated. [sent-106, score-0.17]
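For illustration, here is a compact implementation of the classic Soundex scheme (a sketch of the standard algorithm; the exact variant used in the cited work may differ):

```python
# Sketch of standard Soundex: keep the first letter, drop vowels, map
# phonetically similar consonants to shared digit codes, collapse
# adjacent duplicates, and pad/truncate to a 4-character code.

def soundex(word):
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    code = {c: d for letters, d in groups.items() for c in letters}
    word = word.lower()
    result = word[0].upper()
    prev = code.get(word[0], "")
    for c in word[1:]:
        d = code.get(c, "")
        if d and d != prev:
            result += d
        if c not in "hw":       # h and w do not reset the previous code
            prev = d
    return (result + "000")[:4]
```

For example, `soundex("Robert")` and `soundex("Rupert")` both yield `"R163"`, conflating the two spellings, which is exactly the kind of phonetic collapsing described above.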
35 Another method involved expanding character sequence maps by automatically mining transliteration pairs and then aligning these pairs to generate an expanded set of character sequence maps (Fei et al. [sent-109, score-1.219]
36 In this work we proposed graph reinforcement with link reweighting to address this problem. [sent-111, score-0.753]
37 Graph reinforcement was used in the context of different problems such as mining paraphrases (Zhao et al. [sent-112, score-0.451]
38 4 Description of Baseline System The basic TM setup that we employed in this work used a generative transliteration model, which was trained on a set of transliteration pairs. [sent-116, score-1.108]
39 Alignment produced mappings of source character sequences to target character sequences along with the probability of source given target and vice versa. [sent-119, score-1.112]
40 Source character sequences were restricted to be 1 to 3 characters long. [sent-120, score-0.353]
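The count-based estimation of these conditional probabilities can be sketched as follows (a toy illustration assuming the aligner outputs segment pairs; not the paper's code, and the segment pairs below are invented):

```python
# Illustrative count-based estimation of the conditional mapping
# probabilities produced by alignment: p(target|source) and
# p(source|target) over aligned character-segment pairs.
from collections import Counter

def estimate_mapping_probs(aligned_segments):
    """aligned_segments: list of (source_seg, target_seg) pairs."""
    pair_counts = Counter(aligned_segments)
    source_counts = Counter(s for s, _ in aligned_segments)
    target_counts = Counter(t for _, t in aligned_segments)
    p_ts, p_st = {}, {}
    for (s, t), c in pair_counts.items():
        p_ts.setdefault(s, {})[t] = c / source_counts[s]   # p(t|s)
        p_st.setdefault(t, {})[s] = c / target_counts[t]   # p(s|t)
    return p_ts, p_st

# Toy aligned segments (romanized placeholders).
pairs = [("q", "q"), ("q", "k"), ("q", "q"), ("k", "k")]
p_ts, p_st = estimate_mapping_probs(pairs)
```

With limited training data these relative-frequency estimates are exactly what becomes unreliable, motivating the reinforcement and reweighting steps that follow.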
41 For all the work reported herein, given an English-foreign language transliteration candidate pair, English was treated as the target language and the foreign language as the source. [sent-121, score-0.598]
42 Given a foreign source language word sequence and an English target word sequence, the English sequence could be a potential transliteration of the foreign one. [sent-122, score-0.697]
43 An example of a word sequence pair is the Tamil-English pair (முதலாம் ஹைலி செலாெி, Haile Selassie I of Ethiopia), where "முதலாம்" could be a transliteration for any or none of the English words {"Haile", "Selassie", "I", "of", "Ethiopia"}. [sent-123, score-0.635]
44 The pseudo code below describes how transliteration mining generates candidates. [sent-124, score-0.659]
45 Basically, given a source language word, all possible segmentations, where each segment has a maximum length of 3 characters, are produced along with their associated mappings into the target language. [sent-125, score-0.389]
46 Given all mapping combinations, combinations producing valid target words are retained and sorted according to the product of their mapping probabilities. [sent-126, score-0.266]
47 If the product of the mapping probabilities for the top combination is above a certain threshold, then it is chosen as the transliteration candidate. [sent-127, score-0.702]
Given the Arabic word "من", the target words {the, best, man}, and the possible mappings for the segments and their probabilities: { (m, 0. [sent-130, score-0.348]
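The candidate-generation and thresholding procedure described above can be sketched as follows (the mapping table, vocabulary, and threshold are toy values invented for illustration, not model output):

```python
# Sketch of candidate generation: enumerate segmentations of the source
# word (segments of 1-3 characters), map each segment via the learned
# table, keep combinations forming valid target words, and accept the
# best one if its probability product clears the threshold.

def generate_candidates(source, mappings, target_vocab):
    """Return {target_word: best score} over all reachable target words."""
    results = {}

    def expand(pos, built, score):
        if pos == len(source):
            if built in target_vocab:
                results[built] = max(results.get(built, 0.0), score)
            return
        for seg_len in range(1, 4):            # segments of length 1..3
            seg = source[pos:pos + seg_len]
            for target_seg, p in mappings.get(seg, []):
                expand(pos + seg_len, built + target_seg, score * p)

    expand(0, "", 1.0)
    return results

def best_transliteration(source, mappings, target_vocab, threshold):
    cands = generate_candidates(source, mappings, target_vocab)
    if not cands:
        return None
    best = max(cands, key=cands.get)
    return best if cands[best] >= threshold else None

toy_mappings = {"m": [("m", 0.9)], "a": [("a", 0.6), ("u", 0.3)],
                "n": [("n", 0.9)]}
vocab = {"the", "best", "man"}
```

Here `best_transliteration("man", toy_mappings, vocab, 0.4)` returns "man" (score 0.9 * 0.6 * 0.9 = 0.486), while a threshold of 0.5 rejects it.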
49 5 Smoothing and Thresholding We implemented the baseline system with and without assigning small smoothing probabilities for unseen source character to target character mappings. [sent-146, score-0.647]
50 We used a threshold on the minimum acceptable transliteration score to filter out unreliable transliterations. [sent-148, score-0.579]
51 We assumed a threshold d for single character mappings, and the transliteration threshold for a target word of length l was computed as d^l. [sent-151, score-1.19]
52 We selected d by sorting the mapping probabilities, removing the lowest 10% of mapping probabilities (which we assumed to be outliers), and then selecting the smallest observed probability to be the character threshold d. [sent-152, score-0.522]
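A sketch of this threshold-selection heuristic (assuming, as described, a per-character threshold d; the word-level threshold of d raised to the word length is also an assumption consistent with the surrounding text):

```python
# Sketch: sort all mapping probabilities, drop the lowest 10% as
# outliers, take the smallest remaining value as the per-character
# threshold d, and score a word of length l against d ** l.

def character_threshold(probabilities, outlier_fraction=0.1):
    ranked = sorted(probabilities)
    cutoff = int(len(ranked) * outlier_fraction)
    return ranked[cutoff]        # smallest probability after trimming

def word_threshold(d, length):
    return d ** length

probs = [0.01, 0.05, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
d = character_threshold(probs)   # lowest 10% (one value here) removed
```

With these toy values, d is 0.05 and the threshold for a 3-character word is 0.05 ** 3.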
53 The threshold was then applied to filter out transliterations scoring below it.
Figure 2: Pseudo code for transliteration mining (the listing takes the source-given-target mappings with associated probability scores as input and maintains a priority queue of candidate transliteration tuples, (TargetTransliteration, TransliterationScore), ordered by transliteration score).
56 For each pair, a dataset included a list of 1,000 parallel transliteration word pairs to train a transliteration model, and a list of 1,000 parallel word sequences to test TM. [sent-174, score-1.312]
57 The score of each mapping m(v1|v2), where m(v1|v2) ∈ M, was initially set to the conditional probability of target given source p(v1|v2). [sent-194, score-0.196]
58 Graph reinforcement was performed by traversing the graph from S → T → S → T in order to deduce new mappings. [sent-195, score-0.497]
59 Given a source sequence s' ∈ S and a target sequence t' ∈ T, the deduced mapping probabilities were computed as follows (Eq. 1): p(t'|s') = 1 − ∏_{t ∈ T, s ∈ S} (1 − p(t'|s) · p(s|t) · p(t|s')). [sent-196, score-0.291]
60 We were able to apply reinforcement iteratively on all mappings from S to T to deduce previously unseen mappings (graph edges) and to update the probabilities of existing mappings. [sent-212, score-1.015]
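One reinforcement iteration can be sketched as below. The noisy-or combination over paths s' → t → s → t' is an assumption consistent with the traversal order described above (the paper's exact update may differ in detail), and the probability tables are invented toy values:

```python
# Sketch of one graph-reinforcement iteration. p_ts[s][t] = p(t|s) and
# p_st[t][s] = p(s|t). A path s' -> t -> s -> t' contributes probability
# p(t|s') * p(s|t) * p(t'|s), and paths are combined with noisy-or
# (an assumption, not necessarily the paper's exact rule).

def reinforce(p_ts, p_st):
    """Return updated p(t|s) tables after one reinforcement pass."""
    new_p = {}
    for s_prime in p_ts:
        new_p[s_prime] = {}
        for t_prime in p_st:
            miss = 1.0
            for t, p1 in p_ts[s_prime].items():              # s' -> t
                for s, p2 in p_st.get(t, {}).items():        # t  -> s
                    p3 = p_ts.get(s, {}).get(t_prime, 0.0)   # s  -> t'
                    miss *= 1.0 - p1 * p2 * p3
            new_p[s_prime][t_prime] = 1.0 - miss
    return new_p

# Toy tables: "qaf"/"kaf" stand in for Arabic letters; q/k/c are English.
p_ts = {"qaf": {"q": 0.6, "k": 0.4}, "kaf": {"k": 0.7, "c": 0.3}}
p_st = {"q": {"qaf": 1.0}, "k": {"qaf": 0.4, "kaf": 0.6}, "c": {"kaf": 1.0}}
updated = reinforce(p_ts, p_st)
# The previously unseen mapping qaf -> c now receives probability mass
# via the path qaf -> k -> kaf -> c.
```

Iterating this update is what saturates recall after a few passes, as the experiments below show.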
61 8 Link Reweighting The basic graph reinforcement algorithm is prone to producing irrelevant mappings by using character sequences with many different possible mappings as a bridge. [sent-214, score-1.424]
62 Vowels were the most obvious examples of such character sequences. [sent-215, score-0.238]
63 For example, automatic alignment produced 26 Hindi character sequences that map to the English letter "a", most of which were erroneous, such as the mapping between "a" and "व" (pronounced va). [sent-216, score-0.559]
64 After successive iterations, such character sequences would cause the graph to be fully connected and eventually the link weights will tend to be uniform in their values. [sent-218, score-0.54]
65 To illustrate this effect, we experimented with basic graph reinforcement on the NEWS dataset. [sent-219, score-0.497]
66 Figures 3, 4, 5, and 6 show reinforcement results for Arabic, Russian, Hindi, and Tamil respectively. [sent-221, score-0.37]
67 The figures show that: recall increased quickly and nearly saturated after several iterations; precision continued to drop with more iterations; and F1-measure peaked after a few iterations and began to drop afterwards. [sent-222, score-0.168]
68 To avoid this effect and to improve precision, we applied link reweighting after each iteration. [sent-224, score-0.256]
69 Link reweighting had the effect of decreasing the weights of target character sequences that have many source character sequences mapping to them and hence reducing the effect of incorrectly inducing mappings. [sent-225, score-0.996]
70 (|) ∑ ( |()|) Where si S is a source character sequence that maps to t. [sent-228, score-0.308]
71 So in the case of "a" mapping to the "व" character in Hindi, the link weight from "a" to "व" is divided by the sum of link weights from "a" to all 26 characters to which "a" maps. [sent-229, score-0.571]
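Link reweighting itself reduces to a per-target normalization, sketched here with invented weights ("va", "aa", and "ka" are placeholder source sequences, not real alignment output):

```python
# Sketch of link reweighting: each edge weight p(t|s) is divided by the
# total weight of all source sequences mapping to the same target t, so
# targets with many incoming links (e.g. vowels) are down-weighted.

def reweight(p_ts):
    """p_ts[s][t] = p(t|s). Return weights normalized per target."""
    totals = {}
    for edges in p_ts.values():
        for t, p in edges.items():
            totals[t] = totals.get(t, 0.0) + p
    return {s: {t: p / totals[t] for t, p in edges.items()}
            for s, edges in p_ts.items()}

weights = {"va": {"a": 0.5}, "aa": {"a": 0.9}, "ka": {"a": 0.1, "k": 0.8}}
rw = reweight(weights)
# "a" receives 1.5 in total incoming weight, so every link into "a"
# shrinks, while "k" (a single incoming link) normalizes to 1.0.
```

Applying this after each reinforcement pass is what counterbalances the precision loss that unchecked reinforcement causes.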
72 We performed multiple experiments on the NEWS dataset to test the effect of graph reinforcement with link reweighting with varying number of reinforcement iterations. [sent-230, score-1.123]
73 Figures 7, 8, 9, and 10 compare baseline results with smoothing to results with graph reinforcement at different iterations. [sent-231, score-0.546]
74 As can be seen in the figures, the F1-measure values stabilized as we performed multiple graph reinforcement iterations. [sent-232, score-0.497]
75 For Russian, graph reinforcement marginally affected TM F1-measure, as precision and recall marginally changed. [sent-234, score-0.552]
76 English and Russian do not share the same alphabet, and the number of initial mappings was larger than for the other language pairs. [sent-237, score-0.304]
77 Careful inspection of the English-Russian test set, with the help of a Russian speaker, suggests that: 1) the test set reference contained many false negatives; 2) Russian names often have multiple phonetic forms (or spellings) in Russian with a single standard transliteration in English. [sent-238, score-0.616]
78 For example, the Russian name "Olga" is often written and pronounced as both "Ola" and "Olga" in Russian; and 3) certain English phones do not exist in Russian, leading to inconsistent character mappings in Russian. [sent-239, score-0.712]
79 For the other languages, graph reinforcement yielded steadily improving recall and consequently steadily improving F1-measure. [sent-241, score-0.527]
80 This fact makes transliteration between Hindi and English easier than Arabic and Tamil. [sent-251, score-0.554]
81 As for Arabic, multiple letter sequences in English can map to single letters in Arabic and vice versa. [sent-253, score-0.197]
82 9 When Graph Reinforcement Worked An example where reinforcement worked entails the English-Arabic transliteration pair (Seljuq, سالجقه). [sent-257, score-0.956]
In the baseline runs with 1,000 training examples, the two were not mapped to each other because there were no mappings between the letter "q" and the Arabic letter sequence "قه" (pronounced as "qah"). [sent-258, score-0.497]
84 The only mappings that were available for "q" were "كه" (pronounced as "kah"), "ق" (pronounced as "q"), and "ك" (pronounced as "k"), with probabilities 54. [sent-259, score-0.341]
85 After 3 graph reinforcement iterations, the top 5 mappings for "q" were "ق" (pronounced as "q"), "قه" (pronounced as "qah"), "كه" (pronounced as "kah"), "ك" (pronounced as "k"), and "الق" (pronounced as "alq"), with mapping probabilities of 0. [sent-263, score-0.949]
86 In this case, graph reinforcement was able to find the missing mapping and properly reorder the mappings. [sent-269, score-0.608]
87 Performing 10 iterations with link reweighting for Arabic led to 17 false positives. [sent-270, score-0.293]
88 This seems to indicate that graph reinforcement generally introduced more proper mappings than improper ones. [sent-272, score-0.801]
89 We used the improved technique to extract transliteration pairs from parallel Wikipedia titles. [sent-289, score-0.628]
90 The proposed technique solves two problems in transliteration mining, namely: some mappings may not be seen in training data, hurting recall, and certain mappings may not be seen a sufficient number of times to appropriately estimate mapping probabilities, hurting precision. [sent-290, score-1.42]
91 The results showed that graph reinforcement yielded improved transliteration mining from parallel Wikipedia titles for all four languages on which the technique was tested. [sent-291, score-1.252]
92 Generally, iterative graph reinforcement was able to induce mappings unseen in training data, improving recall. [sent-292, score-0.801]
93 Link reweighting favored precision over recall, counterbalancing the effect of graph reinforcement. [sent-293, score-0.344]
94 To extend the work, we would like to try transliteration mining from large comparable texts. [sent-295, score-0.635]
95 For future work, graph reinforcement could be extended to MT to improve the coverage of aligned phrase tables. [sent-297, score-0.497]
96 Using graph reinforcement can help discover such translations even though they may never be seen in training data. [sent-299, score-0.497]
97 Using link reweighting in graph reinforcement can help demote unlikely translations while promoting likely ones. [sent-300, score-0.781]
98 Further, when dealing with transliteration, graph reinforcement can help find phonetic variations within a single language, which can have interesting applications in spelling correction and information retrieval. [sent-302, score-0.559]
99 A phonetic similarity model for automatic extraction of transliteration pairs. [sent-357, score-0.616]
100 Acquisition of English-Chinese transliterated word pairs from parallel-aligned texts using a statistical machine transliteration model. [sent-369, score-0.627]
wordName wordTfidf (topN-words)
[('transliteration', 0.554), ('reinforcement', 0.37), ('mappings', 0.304), ('character', 0.238), ('pronounced', 0.17), ('reweighting', 0.162), ('transliterations', 0.155), ('russian', 0.15), ('tm', 0.132), ('arabic', 0.131), ('graph', 0.127), ('udupa', 0.119), ('hindi', 0.113), ('mapping', 0.111), ('tamil', 0.111), ('kuo', 0.107), ('alef', 0.097), ('link', 0.094), ('letter', 0.082), ('mining', 0.081), ('sequences', 0.081), ('sourcefragment', 0.069), ('phonetic', 0.062), ('oh', 0.062), ('kumaran', 0.06), ('news', 0.058), ('letters', 0.058), ('hurting', 0.055), ('saravanan', 0.055), ('sourceword', 0.055), ('parallel', 0.049), ('smoothing', 0.049), ('raghavendra', 0.048), ('transliterated', 0.048), ('alignment', 0.047), ('target', 0.044), ('dfs', 0.042), ('noeman', 0.042), ('phonetically', 0.042), ('qatar', 0.042), ('sourcesubstring', 0.042), ('source', 0.041), ('languages', 0.04), ('haizhou', 0.04), ('iterations', 0.037), ('probabilities', 0.037), ('khapra', 0.036), ('cairo', 0.036), ('jiampojamarn', 0.036), ('characters', 0.034), ('man', 0.034), ('soundex', 0.032), ('isahara', 0.032), ('entails', 0.032), ('vowels', 0.032), ('fei', 0.032), ('titles', 0.031), ('recall', 0.03), ('sequence', 0.029), ('brill', 0.028), ('ahmed', 0.028), ('resultant', 0.028), ('microsoft', 0.028), ('figures', 0.028), ('darwish', 0.028), ('demote', 0.028), ('egypt', 0.028), ('ethiopia', 0.028), ('fragmentscore', 0.028), ('haile', 0.028), ('hamza', 0.028), ('kah', 0.028), ('kareem', 0.028), ('madkour', 0.028), ('overcoming', 0.028), ('patricia', 0.028), ('qah', 0.028), ('selassie', 0.028), ('startsymbol', 0.028), ('targetfragment', 0.028), ('targettransliteration', 0.028), ('targetwords', 0.028), ('transliterationscore', 0.028), ('threshold', 0.025), ('pairs', 0.025), ('precision', 0.025), ('pseudo', 0.024), ('choi', 0.024), ('trie', 0.024), ('phonetics', 0.024), ('talip', 0.024), ('mitesh', 0.024), ('ascertaining', 0.024), ('bilac', 0.024), ('mahesh', 0.024), ('wikipedia', 0.024), ('drop', 0.024), ('hmm', 0.023), ('bipartite', 0.023), ('basis', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement
Author: Ali El Kahki ; Kareem Darwish ; Ahmed Saad El Din ; Mohamed Abd El-Wahab ; Ahmed Hefny ; Waleed Ammar
Abstract: Mining of transliterations from comparable or parallel text can enhance natural language processing applications such as machine translation and cross language information retrieval. This paper presents an enhanced transliteration mining technique that uses a generative graph reinforcement model to infer mappings between source and target character sequences. An initial set of mappings is learned through automatic alignment of transliteration pairs at the character sequence level. Then, these mappings are modeled using a bipartite graph. A graph reinforcement algorithm is then used to enrich the graph by inferring additional mappings. During graph reinforcement, appropriate link reweighting is used to promote good mappings and to demote bad ones. The enhanced transliteration mining technique is tested in the context of mining transliterations from parallel Wikipedia titles in 4 alphabet-based language pairs, namely English-Arabic, English-Russian, English-Hindi, and English-Tamil. The improvements in F1-measure over the baseline system were 18.7, 1.0, 4.5, and 32.5 basis points for the four language pairs respectively. The results herein outperform the best reported results in the literature by 2.6, 4.8, 0.8, and 4.1 basis points for the four language pairs respectively.
Author: Nobuhiro Kaji ; Masaru Kitsuregawa
Abstract: Word boundaries within noun compounds are not marked by white spaces in a number of languages, unlike in English, and it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds made up of katakana words (i.e., transliterated foreign words) are particularly difficult to split, because katakana words are highly productive and are often outof-vocabulary. To overcome this difficulty, we propose using monolingual and bilingual paraphrases of katakana noun compounds for identifying word boundaries. Experiments demonstrated that splitting accuracy is substantially improved by extracting such paraphrases from unlabeled textual data, the Web in our case, and then using that information for constructing splitting models.
3 0.079155169 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
Author: Weiwei Sun ; Jia Xu
Abstract: This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data to a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea about transductive, document-level segmentation, which is designed to improve the system recall for out-ofvocabulary (OOV) words which appear more than once inside a document. Novel features1 result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.
4 0.074833892 13 emnlp-2011-A Word Reordering Model for Improved Machine Translation
Author: Karthik Visweswariah ; Rajakrishnan Rajkumar ; Ankur Gandhe ; Ananthakrishnan Ramanathan ; Jiri Navratil
Abstract: Preordering of source side sentences has proved to be useful in improving statistical machine translation. Most work has used a parser in the source language along with rules to map the source language word order into the target language word order. The requirement to have a source language parser is a major drawback, which we seek to overcome in this paper. Instead of using a parser and then using rules to order the source side sentence we learn a model that can directly reorder source side sentences to match target word order using a small parallel corpus with highquality word alignments. Our model learns pairwise costs of a word immediately preced- ing another word. We use the Lin-Kernighan heuristic to find the best source reordering efficiently during training and testing and show that it suffices to provide good quality reordering. We show gains in translation performance based on our reordering model for translating from Hindi to English, Urdu to English (with a public dataset), and English to Hindi. For English to Hindi we show that our technique achieves better performance than a method that uses rules applied to the source side English parse.
5 0.073299661 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.
Author: Ashish Venugopal ; Jakob Uszkoreit ; David Talbot ; Franz Och ; Juri Ganitkevitch
Abstract: We propose a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method is robust to local editing operations and provides well defined trade-offs between the ability to identify algorithm outputs and the quality of the watermarked output. Unlike previous work in the field, our approach does not rely on controlling the inputs to the algorithm and provides probabilistic guarantees on the ability to identify collections of results from one’s own algorithm. We present an application in statistical machine translation, where machine translated output is watermarked at minimal loss in translation quality and detected with high recall. 1 Motivation Machine learning algorithms provide structured results to input queries by simulating human behavior. Examples include automatic machine translation (Brown et al. , 1993) or automatic text and rich media summarization (Goldstein et al. , 1999) . These algorithms often estimate some portion of their models from publicly available human generated data. As new services that output structured results are made available to the public and the results disseminated on the web, we face a daunting new challenge: Machine generated structured results contaminate the pool of naturally generated human data. For example, machine translated output 1363 2Center for Language and Speech Processing Johns Hopkins University Baltimore, MD 21218, USA juri@cs.jhu.edu and human generated translations are currently both found extensively on the web, with no automatic way of distinguishing between them. Algorithms that mine data from the web (Uszkoreit et al. , 2010) , with the goal of learning to simulate human behavior, will now learn models from this contaminated and potentially selfgenerated data, reinforcing the errors committed by earlier versions of the algorithm. 
It is beneficial to be able to identify a set of encountered structured results as having been generated by one’s own algorithm, with the purpose of filtering such results when building new models. Problem Statement: We define a structured result of a query q as r = {z1 · · · zL} where tthuree odr rdeesru latn dof identity qof a sele rm =en {tzs zi are important to the quality of the result r. The structural aspect of the result implies the existence of alternative results (across both the order of elements and the elements themselves) that might vary in their quality. Given a collection of N results, CN = r1 · · · rN, where each result ri has k rankedC alterna·t·iv·ers Dk (qi) of relatively similar quality and queries q1 · · · qN are arbitrary and not controlled by the watermarking algorithm, we define the watermarking task as: Task. Replace ri with ri0 ∈ Dk (qi) for some subset of results in CN to produce a watermarked sceoltle ocfti orens CN0 slleucchti otnha Ct: • CN0 is probabilistically identifiable as having bCeen generated by one’s own algorithm. Proce dEindgisnb oufr tgh e, 2 S0c1o1tl Canodn,f eUrKen,c Jeuol yn 2 E7m–3p1ir,ic 2a0l1 M1.e ?tc ho2d0s1 in A Nsasotucira tlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinag uesis 1ti3c6s3–1372, • • • 2 the degradation in quality from CN to the wthaete dremgarrakdeadt CN0 isnho quulda bitye analytically controllable, trading quality for detection performance. CN0 should not be detectable as watermCarked content without access to the generating algorithms. the detection of CN0 should be robust to simple eddeitte operations performed on individual results r ∈ CN0. Impact on Statistical Machine Translation Recent work(Resnik and Smith, 2003; Munteanu and Marcu, 2005; Uszkoreit et al. , 2010) has shown that multilingual parallel documents can be efficiently identified on the web and used as training data to improve the quality of statistical machine translation. 
The availability of free translation services (Google Translate, Bing Translate) and tools (Moses, Joshua) , increase the risk that the content found by parallel data mining is in fact generated by a machine, rather than by humans. In this work, we focus on statistical machine translation as an application for watermarking, with the goal of discarding documents from training if they have been generated by one’s own algorithms. To estimate the magnitude of the problem, we used parallel document mining (Uszkoreit et al. , 2010) to generate a collection of bilingual document pairs across several languages. For each document, we inspected the page content for source code that indicates the use of translation modules/plug-ins that translate and publish the translated content. We computed the proportion of the content within our corpus that uses these modules. We find that a significant proportion of the mined parallel data for some language pairs is generated via one of these translation modules. The top 3 languages pairs, each with parallel translations into English, are Tagalog (50.6%) , Hindi (44.5%) and Galician (41.9%) . While these proportions do not reflect impact on each language’s monolingual web, they are certainly high 1364 enough to affect machine translations systems that train on mined parallel data. In this work, we develop a general approach to watermark structured outputs and apply it to the outputs of a statistical machine translation system with the goal of identifying these same outputs on the web. In the context of the watermarking task defined above, we output selecting alternative translations for input source sentences. These translations often undergo simple edit and formatting operations such as case changes, sentence and word deletion or post editing, prior to publishing on the web. We want to ensure that we can still detect watermarked translations despite these edit operations. 
Given the rapid pace of development within machine translation, it is also important that the watermark be robust to improvements in underlying translation quality. Results from several iterations of the system within a single collection of documents should be identifiable under probabilistic bounds. While we present evaluation results for statistical machine translation, our proposed approach and associated requirements are applicable to any algorithm that produces structured results with several plausible alternatives. The alternative results can arise as a result of inherent task ambiguity (for example, there are multiple correct translations for a given input source sentence) or modeling uncertainty (for example, a model assigning equal probability to two competing results).
3 Watermarking Structured Results
Selecting an alternative r′ from the space of alternatives Dk(q) can be stated as:
r′ = argmax_{r ∈ Dk(q)} w(r, Dk(q), h)   (1)
where w ranks r ∈ Dk(q) based on r’s presentation of a watermarking signal computed by a hashing operation h. In this approach, w and its component operation h are the only secrets held by the watermarker. This selection criterion is applied to all system outputs, ensuring that watermarked and non-watermarked versions of a collection will never be available for comparison. A specific implementation of w within our watermarking approach can be evaluated by the following metrics:
• False Positive Rate: how often non-watermarked collections are falsely identified as watermarked.
• Recall Rate: how often watermarked collections are correctly identified as watermarked.
• Quality Degradation: how significantly does CN′ differ from CN when evaluated by task-specific quality metrics.
While identification is performed at the collection level, we can scale these metrics based on the size of each collection to provide more task-sensitive metrics.
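A minimal sketch of the selection criterion in Equation (1). The scoring function `w` and hashing operation `h` are placeholders supplied by the caller (they are the watermarker's secrets in the paper's setting):

```python
def select_watermarked(alternatives, w, h):
    """Eq. (1): r' = argmax over Dk(q) of w(r, Dk(q), h).

    Because this criterion is applied to every system output, an
    unwatermarked version of a collection never exists for comparison.
    """
    return max(alternatives, key=lambda r: w(r, alternatives, h))
```

For instance, with a toy scorer `lambda r, D, h: len(r)` the selector simply returns the longest alternative; a real `w` would score the hashed bit sequence of each candidate.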
For example, in machine translation, we count the number of words in the collection towards the false positive and recall rates. In Section 3.1, we define a random hashing operation h and a task-independent implementation of the selector function w. Section 3.2 describes how to classify a collection of watermarked results. Sections 3.3 and 3.4 describe refinements to the selection and classification criteria that mitigate quality degradation. Following a comparison to related work in Section 4, we present experimental results for several languages in Section 5.
3.1 Watermarking: CN → CN′
We define a random hashing operation h that is applied to result r. It consists of two components:
• A hash function applied to a structured result r to generate a bit sequence of a fixed length.
• An optional mapping that maps a single candidate result r to a set of sub-results. Each sub-result is then hashed to generate a concatenated bit sequence for r.
A good hash function produces outputs whose bits are independent. This implies that we can treat the bits for any input structured result as having been generated by a binomial distribution with equal probability of generating 1s vs. 0s. This condition also holds when accumulating the bit sequences over a collection of results, as long as its elements are selected uniformly from the space of possible results. Therefore, the bits generated from a collection of unwatermarked results will follow a binomial distribution with parameter p = 0.5. This result provides a null hypothesis for a statistical test on a given bit sequence, testing whether it is likely to have been generated from a binomial distribution binomial(n, p) where p = 0.5 and n is the length of the bit sequence.
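The first component of h can be sketched with a keyed cryptographic hash standing in for the paper's secret random hash function; the key string and bit length below are illustrative assumptions:

```python
import hashlib

def h_bits(result: str, secret: str = "watermark-key", n_bits: int = 64) -> str:
    """Hash a structured result to a fixed-length bit string.

    A keyed SHA-256 plays the role of the secret hash function. A good
    hash yields bits that behave like fair coin flips, which is exactly
    what the binomial(n, p=0.5) null hypothesis relies on.
    """
    digest = hashlib.sha256((secret + "\x00" + result).encode("utf-8")).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)  # 256 bits total
    return bits[:n_bits]
```

The function is deterministic for a fixed secret, so the same result always maps to the same bit sequence, a property the detector depends on.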
For a collection CN = r1 · · · rN, we can define a watermark ranking function w to systematically select alternatives ri′ ∈ Dk(qi), such that the resulting CN′ is unlikely to produce bit sequences that follow the p = 0.5 binomial distribution. A straightforward biasing criterion would be to select the candidate whose bit sequence exhibits the highest ratio of 1s. w can be defined as:
w(r, Dk(q), h) = #(1, h(r)) / |h(r)|   (2)
where h(r) returns the randomized bit sequence for result r, and #(x, y) counts the number of occurrences of x in sequence y. Selecting alternative results to exhibit this bias will result in watermarked collections that exhibit this same bias.
3.2 Detecting the Watermark
To classify a collection CN as watermarked or non-watermarked, we apply the hashing operation h on each element in CN and concatenate the sequences. This sequence is tested against the null hypothesis that it was generated by a binomial distribution with parameter p = 0.5. We can apply a Fisherian test of statistical significance to determine whether the observed distribution of bits is unlikely to have occurred by chance under the null hypothesis (binomial with p = 0.5). We consider a collection of results that rejects the null hypothesis to be watermarked results generated by our own algorithms. The p-value under the null hypothesis is efficiently computed by:
p-value = Pn(X ≥ x) = Σ_{i=x}^{n} C(n, i) p^i (1 − p)^{n−i}   (3)
where x is the number of 1s observed in the collection, and n is the total number of bits in the sequence. Comparing this p-value against a desired significance level α, we reject the null hypothesis for collections that have Pn(X ≥ x) < α, thus deciding that such collections were generated by our own system. This classification criterion has a fixed false positive rate. Setting α = 0.05, we know that 5% of non-watermarked bit sequences will be falsely labeled as watermarked.
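The detection test in Equation (3) translates directly into code; `alpha` defaults to the 0.05 significance level used above:

```python
from math import comb

def p_value(ones: int, n: int, p: float = 0.5) -> float:
    """Eq. (3): Pn(X >= ones) under binomial(n, p), the null hypothesis."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(ones, n + 1))

def is_watermarked(bit_seq: str, alpha: float = 0.05) -> bool:
    """Reject the null (unwatermarked) hypothesis when 1s are improbably many."""
    return p_value(bit_seq.count("1"), len(bit_seq)) < alpha
```

A balanced sequence such as "1010…" is retained as unwatermarked, while a heavily 1-biased sequence of the same length is rejected.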
This parameter α can be controlled on an application-specific basis. By biasing the selection of candidate results to produce more 1s than 0s, we have defined a watermarking approach that exhibits a fixed false positive rate, a probabilistically bounded detection rate, and task-independent hashing and selection criteria. In the next sections, we will deal with the question of robustness to edit operations and quality degradation.
3.3 Robustness and Inherent Bias
We would like the ability to identify watermarked collections to be robust to simple edit operations. Even slight modifications to the elements within an item r would yield (by construction of the hash function) completely different bit sequences that no longer preserve the biases introduced by the watermark selection function. To ensure that the distributional biases introduced by the watermark selector are preserved, we can optionally map individual results into a set of sub-results, each one representing some local structure of r. h is then applied to each sub-result and the results concatenated to represent r. This mapping is defined as a component of the h operation. While a particular edit operation might affect a small number of sub-results, the majority of the bits in the concatenated bit sequence for r would remain untouched, thereby limiting the damage to the biases selected during watermarking. This is of course no defense against edit operations that are applied globally across the result; our expectation is that such edits would either significantly degrade the quality of the result or be straightforward to identify directly. For example, a sequence of words r = z1 · · · zL can be mapped into a set of consecutive n-gram sequences. Operations to edit a word zi in r will only affect events that consider the word zi.
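The sub-result mapping for word sequences can be sketched as below, using the 3–5-gram range that the experiments later adopt:

```python
def subresults(words, n_min=3, n_max=5):
    """Map a word sequence into its consecutive n-gram sub-results.

    Hashing each n-gram separately localizes the effect of an edit:
    changing one word only perturbs the n-grams that contain it, so
    the rest of the concatenated bit sequence keeps its bias.
    """
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return out
```

For example, editing the last word of a four-word sentence leaves the leading 3-gram, and hence its hash bits, untouched.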
To account for the fact that alternatives in Dk(q) might now result in bit sequences of different lengths, we can generalize the biasing criteria to directly reflect the expected contribution to the watermark by defining:
w(r, Dk(q), h) = Pn(X ≥ #(1, h(r)))   (5)
where Pn gives probabilities from binomial(n = |h(r)|, p = 0.5).
Inherent collection-level biases: Our null hypothesis is based on the assumption that collections of results draw uniformly from the space of possible results. This assumption might not always hold and depends on the type of the results and collection. For example, considering a text document as a collection of sentences, we can expect that some sentences might repeat more frequently than others. This scenario is even more likely when applying a mapping into sub-results. n-gram sequences follow long-tailed or Zipfian distributions, with a small number of n-grams contributing heavily toward the total number of n-grams in a document. A random hash function guarantees that inputs are distributed uniformly at random over the output range. However, the same input will be assigned the same output deterministically. Therefore, if the distribution of inputs is heavily skewed toward certain elements of the input space, the output distribution will not be uniformly distributed. The bit sequences resulting from the high-frequency sub-results have the potential to generate inherently biased distributions when accumulated at the collection level. We want to choose a mapping that tends towards generating uniformly from the space of sub-results. We can empirically measure the quality of a sub-result mapping for a specific task by computing the false positive rate on non-watermarked collections. For a given significance level α, an ideal mapping would result in false positive rates close to α as well. Figure 1 shows false positive rates from 4 alternative mappings, computed on a large corpus of French documents (see Table 1 for statistics).
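The length-aware criterion of Equation (5) can be sketched as a binomial tail probability. Whether the selector maximizes or minimizes this quantity is a sign convention of the ranking function; the sketch only computes it:

```python
from math import comb

def binom_tail(ones: int, n: int, p: float = 0.5) -> float:
    """Pn(X >= ones) for binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(ones, n + 1))

def eq5_signal(bit_seq: str) -> float:
    """Eq. (5): the tail probability of a candidate's bit sequence.

    Unlike the raw ratio of 1s in Eq. (2), this makes candidates with
    bit sequences of different lengths directly comparable.
    """
    return binom_tail(bit_seq.count("1"), len(bit_seq))
```

For instance, ten 1s out of twelve bits (ratio 0.83) is a far less probable event under the null than three 1s out of four bits (ratio 0.75), which the tail probability reflects while the ratio does not.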
Classification decisions are made at the collection level (documents), but the contribution to the false positive rate is based on the number of words in the classified document. We consider mappings from a result (sentence) into its 1-grams, 1−5-grams and 3−5-grams, as well as the non-mapping case, where the full result is hashed. Figure 1 shows that the 1-gram and 1−5-gram mappings generate sub-results that result in heavily biased false positive rates. The 3−5-gram mapping yields false positive rates close to their theoretically expected values.1 Small deviations are expected since documents make different contributions to the false positive rate as a function of the number of words that they represent. For the remainder of this work, we use the 3−5-gram mapping and the full sentence mapping, since the alternatives generate inherently biased distributions with very high false positive rates.
3.4 Considering Quality
The watermarking described in Equation 3 chooses alternative results on a per-result basis, with the goal of influencing collection-level bit sequences. The selection criteria as described will choose the most biased candidates available in Dk(q). The parameter k determines the extent to which lesser-quality alternatives can be chosen. If all the alternatives in each Dk(q) are of relatively similar quality, we expect minimal degradation due to watermarking. Specific tasks however can be particularly sensitive to choosing alternative results. Discriminative approaches that optimize for argmax selection like (Och, 2003; Liang et al., 2006; Chiang et al., 2009) train model parameters such that the top-ranked result is well separated from its competing alternatives.
1 In the final version of this paper we will perform sampling to create a more reliable estimate of the false positive rate that is not overly influenced by document length distributions.
Different queries also differ in the inherent ambiguity expected from their results; sometimes there really is just one correct result for a query, while for other queries, several alternatives might be equally good. By generalizing the definition of the w function to interpolate the estimated loss in quality and the gain in the watermarking signal, we can trade off the ability to identify the watermarked collections against quality degradation:
w(r, Dk(q), fw) = λ ∗ gain(r, Dk(q), fw) − (1 − λ) ∗ loss(r, Dk(q))   (6)
Loss: The loss(r, Dk(q)) function reflects the quality degradation that results from selecting alternative r as opposed to the best-ranked candidate in Dk(q). We will experiment with two variants:
lossrank(r, Dk(q)) = (rank(r) − k)/k
losscost(r, Dk(q)) = (cost(r) − cost(r1))/cost(r1)
where:
• rank(r): returns the rank of r within Dk(q).
• cost(r): a weighted sum of features (not normalized over the search space) in a log-linear model such as those mentioned in (Och, 2003).
• r1: the highest-ranked alternative in Dk(q).
lossrank provides a generally applicable criterion to select alternatives, penalizing selection from deep within Dk(q). This estimate of the quality degradation does not reflect the generating model’s opinion on relative quality. losscost considers the relative increase in the generating model’s cost assigned to the alternative translation.
Gain: The gain(r, Dk(q), fw) function reflects the gain in the watermarking signal from selecting candidate r. We simply define the gain as the Pn(X ≥ #(1, h(r))) from Equation 5.
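A sketch of the interpolated criterion in Equation (6) using a rank-based loss. Two simplifications are assumptions of this sketch, not the paper's definitions: the gain is the simpler ratio-of-1s from Equation (2), and the rank loss is normalized to zero for the top-ranked candidate:

```python
def interpolated_score(rank, k, bit_seq, lam=0.5):
    """Eq. (6), sketched: trade watermark gain against quality loss.

    rank is 1-based model rank; larger gain = stronger watermark signal;
    loss is 0 for the model's top choice and grows with depth in Dk(q).
    """
    gain = bit_seq.count("1") / len(bit_seq)   # simplified gain (Eq. 2 style)
    loss = (rank - 1) / k                       # assumed normalization
    return lam * gain - (1 - lam) * loss

def select_interpolated(candidate_bits, lam=0.5):
    """candidate_bits: bit sequences in model rank order; returns chosen index."""
    k = len(candidate_bits)
    scores = [interpolated_score(i + 1, k, b, lam) for i, b in enumerate(candidate_bits)]
    return max(range(k), key=scores.__getitem__)
```

With λ = 1 the selector behaves like the pure max K-best baseline; with λ = 0 it always keeps the model's top-ranked candidate.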
Figure 1: Comparison of expected false positive rates against observed false positive rates for different sub-result mappings: (a) 1-gram mapping, (b) 1−5-gram mapping, (c) 3−5-gram mapping, (d) full result hashing.
4 Related Work
Using watermarks with the goal of transmitting a hidden message within images, video, audio and monolingual text media is common. For structured text content, linguistic approaches like (Chapman et al., 2001; Gupta et al., 2006) use language-specific linguistic and semantic expansions to introduce hidden watermarks. These expansions provide alternative candidates within which messages can be encoded. Recent publications have extended this idea to machine translation, using multiple systems and expansions to generate alternative translations. (Stutsman et al., 2006) uses a hashing function to select alternatives that encode the hidden message in the lower-order bits of the translation. In each of these approaches, the watermarker has control over the collection of results into which the watermark is to be embedded. These approaches seek to embed a hidden message into a collection of results that is selected by the watermarker. In contrast, we address the condition where the input queries are not in the watermarker’s control. The goal is therefore to introduce the watermark into all generated results, so that such outputs can be probabilistically identified. Our approach is also task-independent, avoiding the need for templates to generate additional alternatives. By addressing the problem directly within the search space of a dynamic programming algorithm, we have access to high-quality alternatives with well-defined models of quality loss. Finally, our approach is robust to local word editing. By using a sub-result mapping, we increase the level of editing required to obscure the watermark signal; at high levels of editing, the quality of the results themselves would be significantly degraded.
5 Experiments
We evaluate our watermarking approach applied to the outputs of statistical machine translation under the following experimental setup. A repository of parallel (aligned source and target language) web documents is sampled to produce a large corpus on which to evaluate the watermarking classification performance. The corpora represent translations into 4 diverse target languages, using English as the source language. Each document in this corpus can be considered a collection of un-watermarked structured results, where source sentences are queries and each target sentence represents a structured result. Using a state-of-the-art phrase-based statistical machine translation system (Och and Ney, 2004) trained on parallel documents identified by (Uszkoreit et al., 2010), we generate a set of 100 alternative translations for each source sentence. We apply the proposed watermarking approach, along with the proposed refinements that address task-specific loss (Section 3.4) and robustness to edit operations (Section 3.3), to generate watermarked corpora. Each method is controlled via a single parameter (k or λ) which is varied to generate alternative watermarked collections. For each parameter value, we evaluate the Recall Rate and Quality Degradation with the goal of finding a setting that yields a high recall rate with minimal quality degradation. False positive rates are evaluated based on a fixed classification significance level of α = 0.05. The false positive and recall rates are evaluated at the word level; a document that is misclassified or correctly identified contributes its length in words towards the error calculation. In this work, we use α = 0.05 during classification, corresponding to an expected 5% false positive rate. The false positive rate is a function of h and the significance level α and is therefore constant across the parameter values k and λ.
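The word-level evaluation above can be sketched as follows; the document tuple layout and field names are illustrative:

```python
def word_level_rates(docs):
    """Word-weighted recall and false positive rates (Section 5).

    docs: iterable of (num_words, truly_watermarked, classified_watermarked).
    A document contributes its length in words to whichever cell of the
    confusion matrix it falls into.
    """
    tp = fn = fp = tn = 0
    for n_words, truth, pred in docs:
        if truth and pred:
            tp += n_words
        elif truth and not pred:
            fn += n_words
        elif not truth and pred:
            fp += n_words
        else:
            tn += n_words
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr
```

A long misclassified document thus hurts the rates more than a short one, which matches how the paper reports its results.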
We evaluate quality degradation on human-translated test corpora that are more typical for machine translation evaluation. Each test corpus consists of 5000 source sentences randomly selected from the web and translated into each respective language. We chose to evaluate quality on test corpora to ensure that degradations are not hidden by imperfectly matched web corpora and are consistent with the kind of results often reported for machine translation systems. As with the classification corpora, we create watermarked versions at each parameter value. For a given parameter value, we measure false positive and recall rates on the classification corpora and quality degradation on the evaluation corpora. Table 1 shows corpus statistics for the classification and test corpora and non-watermarked BLEU scores for each target language. All source texts are in English.
Figure 2: BLEU loss against recall of watermarked content for the baseline approach (max K-best), rank and cost interpolation.
5.1 Loss Interpolated Experiments
Our first set of experiments demonstrates baseline performance using the watermarking criteria in Equation 5 versus the refinements suggested in Section 3.4 to mitigate quality degradation. The h function is computed on the full sentence result r with no sub-event mapping. The following methods are evaluated in Figure 2.
• Baseline method (labeled “max K-best”): selects r′ purely based on gain in watermarking signal (Equation 5) and is parameterized by k, the number of alternatives considered for each result.
• Rank interpolation: incorporates rank into w, varying the interpolation parameter λ.
• Cost interpolation: incorporates cost into w, varying the interpolation parameter λ.
The observed false positive rate on the French classification corpora is 1.9%.
Table 1: Word, sentence and document counts for the classification corpora and the quality-degradation corpora, with non-watermarked BLEU scores for each target language (Arabic, French, Hindi, Turkish). BLEU scores are reported for the quality corpora.
We consider 0.2% BLEU loss as a threshold for acceptable quality degradation. Each method is judged by its ability to achieve high recall below this quality degradation threshold. Applying cost interpolation yields the best results in Figure 2, achieving a recall of 85% at 0.2% BLEU loss, while rank interpolation achieves a recall of 76%. The baseline approach of selecting the highest-gain candidate within a depth of k candidates does not provide sufficient parameterization to yield low quality degradation. At k = 2, this method yields almost 90% recall, but with approximately 0.4% BLEU loss.
5.2 Robustness Experiments
In Section 3.3, we proposed mapping results into sub-events or features. We considered alternative feature mappings in Figure 1, finding that mapping sentence results into a collection of 3−5 grams yields acceptable false positive rates at varied levels of α. Figure 3 presents results that compare moving from result-level hashing to the 3−5-gram sub-result mapping. We show the impact of the mapping on the baseline max K-best method as well as for cost interpolation. There are substantial reductions in recall rate at the 0.2% BLEU loss level when applying sub-result mappings in both cases. The cost interpolation method’s recall drops from 85% to 77% when using the 3−5-gram event mapping. The observed false positive rate of the 3−5-gram mapping is 4.7%. By using the 3−5-gram mapping, we expect to increase robustness against local word edit operations, but we have sacrificed recall rate due to the inherent distributional bias discussed in Section 3.3.
Figure 3: BLEU loss against recall of watermarked content for the baseline and cost interpolation methods using both result-level and 3−5-gram mapped events.
5.3 Multilingual Experiments
The watermarking approach proposed here introduces no language-specific watermarking operations and is thus broadly applicable to translating into all languages. In Figure 4, we report results for the baseline and cost interpolation methods, considering both the result-level and 3−5-gram mappings. We set α = 0.05 and measure recall at 0.2% BLEU degradation for translation from English into Arabic, French, Hindi and Turkish. The observed false positive rates for full sentence hashing are: Arabic: 2.4%, French: 1.8%, Hindi: 5.6% and Turkish: 5.5%, while for the 3−5-gram mapping, they are: Arabic: 5.8%, French: 7.5%, Hindi: 3.5% and Turkish: 6.2%. Underlying translation quality plays an important role in translation quality degradation when watermarking.
Figure 4: Loss of recall when using the 3−5-gram mapping vs. sentence-level mapping for Arabic, French, Hindi and Turkish translations.
Without a sub-result mapping, French (BLEU: 26.45%) achieves recall of 85% at 0.2% BLEU loss, while the other languages achieve over 90% recall at the same BLEU loss threshold. Using a sub-result mapping degrades quality for each language pair, but changes the relative performance. Turkish experiences the highest relative drop in recall, unlike French and Arabic, where results are relatively more robust to using sub-sentence mappings. This is likely a result of differences in n-gram distributions across these languages. The languages considered here all use space-separated words. For languages that do not, like Chinese or Thai, our approach can be applied at the character level.
6 Conclusions
In this work we proposed a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms.
Our method provides probabilistic bounds on detection ability, analytic control on quality degradation, and is robust to local editing operations. Our method is applicable to any task where structured outputs are generated with ambiguities or ties in the results. We applied this method to the outputs of statistical machine translation, evaluating each refinement to our approach with false positive and recall rates against BLEU score quality degradation. Our results show that it is possible, across several language pairs, to achieve high recall rates (over 80%) with low false positive rates (between 5 and 8%) at minimal quality degradation (0.2% BLEU), while still allowing for local edit operations on the translated output. In future work we will continue to investigate methods to mitigate quality loss.
References
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311.
Mark Chapman, George Davida, and Marc Rennhard. 2001. A practical and effective approach to large-scale automated linguistic steganography. In Proceedings of the Information Security Conference.
David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).
Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval, pages 121–128.
Gaurav Gupta, Josef Pieprzyk, and Hua Xiong Wang. 2006. An attack-localizing watermarking scheme for natural language documents. In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, ASIACCS ’06, pages 157–165, New York, NY, USA. ACM.
Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the Joint International Conference on Computational Linguistics and Association of Computational Linguistics (COLING/ACL), pages 761–768.
Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics.
Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 2003 Meeting of the Association of Computational Linguistics.
Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics.
Ryan Stutsman, Mikhail Atallah, Christian Grothoff, and Krista Grothoff. 2006. Lost in just the translation. In Proceedings of the 2006 ACM Symposium on Applied Computing.
Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 2010 COLING.
Author: Ashish Venugopal ; Jakob Uszkoreit ; David Talbot ; Franz Och ; Juri Ganitkevitch
Abstract: We propose a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method is robust to local editing operations and provides well defined trade-offs between the ability to identify algorithm outputs and the quality of the watermarked output. Unlike previous work in the field, our approach does not rely on controlling the inputs to the algorithm and provides probabilistic guarantees on the ability to identify collections of results from one’s own algorithm. We present an application in statistical machine translation, where machine translated output is watermarked at minimal loss in translation quality and detected with high recall. 1 Motivation Machine learning algorithms provide structured results to input queries by simulating human behavior. Examples include automatic machine translation (Brown et al. , 1993) or automatic text and rich media summarization (Goldstein et al. , 1999) . These algorithms often estimate some portion of their models from publicly available human generated data. As new services that output structured results are made available to the public and the results disseminated on the web, we face a daunting new challenge: Machine generated structured results contaminate the pool of naturally generated human data. For example, machine translated output 1363 2Center for Language and Speech Processing Johns Hopkins University Baltimore, MD 21218, USA juri@cs.jhu.edu and human generated translations are currently both found extensively on the web, with no automatic way of distinguishing between them. Algorithms that mine data from the web (Uszkoreit et al. , 2010) , with the goal of learning to simulate human behavior, will now learn models from this contaminated and potentially selfgenerated data, reinforcing the errors committed by earlier versions of the algorithm. 
It is beneficial to be able to identify a set of encountered structured results as having been generated by one's own algorithm, with the purpose of filtering such results when building new models.

Problem Statement: We define a structured result of a query q as r = {z1 · · · zL}, where the order and identity of the elements zi are important to the quality of the result r. The structural aspect of the result implies the existence of alternative results (across both the order of elements and the elements themselves) that might vary in their quality. Given a collection of N results, CN = r1 · · · rN, where each result ri has k ranked alternatives Dk(qi) of relatively similar quality, and queries q1 · · · qN are arbitrary and not controlled by the watermarking algorithm, we define the watermarking task as:

Task. Replace ri with ri0 ∈ Dk(qi) for some subset of results in CN to produce a watermarked collection CN0 such that:
• CN0 is probabilistically identifiable as having been generated by one's own algorithm.
• the degradation in quality from CN to the watermarked CN0 should be analytically controllable, trading quality for detection performance.
• CN0 should not be detectable as watermarked content without access to the generating algorithms.
• the detection of CN0 should be robust to simple edit operations performed on individual results r ∈ CN0.

(Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1363–1372, Edinburgh, Scotland, UK, July 27–31, 2011. ©2011 Association for Computational Linguistics.)

2 Impact on Statistical Machine Translation

Recent work (Resnik and Smith, 2003; Munteanu and Marcu, 2005; Uszkoreit et al., 2010) has shown that multilingual parallel documents can be efficiently identified on the web and used as training data to improve the quality of statistical machine translation.
The availability of free translation services (Google Translate, Bing Translate) and tools (Moses, Joshua) increases the risk that the content found by parallel data mining is in fact generated by a machine, rather than by humans. In this work, we focus on statistical machine translation as an application for watermarking, with the goal of discarding documents from training if they have been generated by one's own algorithms. To estimate the magnitude of the problem, we used parallel document mining (Uszkoreit et al., 2010) to generate a collection of bilingual document pairs across several languages. For each document, we inspected the page content for source code that indicates the use of translation modules/plug-ins that translate and publish the translated content. We computed the proportion of the content within our corpus that uses these modules. We find that a significant proportion of the mined parallel data for some language pairs is generated via one of these translation modules. The top 3 language pairs, each with parallel translations into English, are Tagalog (50.6%), Hindi (44.5%) and Galician (41.9%). While these proportions do not reflect impact on each language's monolingual web, they are certainly high enough to affect machine translation systems that train on mined parallel data. In this work, we develop a general approach to watermark structured outputs and apply it to the outputs of a statistical machine translation system with the goal of identifying these same outputs on the web. In the context of the watermarking task defined above, we watermark output by selecting alternative translations for input source sentences. These translations often undergo simple edit and formatting operations, such as case changes, sentence and word deletion, or post editing, prior to publishing on the web. We want to ensure that we can still detect watermarked translations despite these edit operations.
Given the rapid pace of development within machine translation, it is also important that the watermark be robust to improvements in underlying translation quality. Results from several iterations of the system within a single collection of documents should be identifiable under probabilistic bounds. While we present evaluation results for statistical machine translation, our proposed approach and associated requirements are applicable to any algorithm that produces structured results with several plausible alternatives. The alternative results can arise as a result of inherent task ambiguity (for example, there are multiple correct translations for a given input source sentence) or modeling uncertainty (for example, a model assigning equal probability to two competing results).

3 Watermark Structured Results

Selecting an alternative r0 from the space of alternatives Dk(q) can be stated as:

r0 = argmax_{r ∈ Dk(q)} w(r, Dk(q), h)   (1)

where w ranks r ∈ Dk(q) based on r's presentation of a watermarking signal computed by a hashing operation h. In this approach, w and its component operation h are the only secrets held by the watermarker. This selection criterion is applied to all system outputs, ensuring that watermarked and non-watermarked versions of a collection will never be available for comparison. A specific implementation of w within our watermarking approach can be evaluated by the following metrics:
• False Positive Rate: how often non-watermarked collections are falsely identified as watermarked.
• Recall Rate: how often watermarked collections are correctly identified as watermarked.
• Quality Degradation: how significantly does CN0 differ from CN when evaluated by task-specific quality metrics.

While identification is performed at the collection level, we can scale these metrics based on the size of each collection to provide more task sensitive metrics.
For example, in machine translation, we count the number of words in the collection towards the false positive and recall rates. In Section 3.1, we define a random hashing operation h and a task independent implementation of the selector function w. Section 3.2 describes how to classify a collection of watermarked results. Sections 3.3 and 3.4 describe refinements to the selection and classification criteria that mitigate quality degradation. Following a comparison to related work in Section 4, we present experimental results for several languages in Section 5.

3.1 Watermarking: CN → CN0

We define a random hashing operation h that is applied to result r. It consists of two components:
• A hash function applied to a structured result r to generate a bit sequence of a fixed length.
• An optional mapping that maps a single candidate result r to a set of sub-results. Each sub-result is then hashed to generate a concatenated bit sequence for r.

A good hash function produces outputs whose bits are independent. This implies that we can treat the bits for any input structured result as having been generated by a binomial distribution with equal probability of generating 1s vs 0s. This condition also holds when accumulating the bit sequences over a collection of results as long as its elements are selected uniformly from the space of possible results. Therefore, the bits generated from a collection of unwatermarked results will follow a binomial distribution with parameter p = 0.5. This result provides a null hypothesis for a statistical test on a given bit sequence, testing whether it is likely to have been generated from a binomial distribution binomial(n, p) where p = 0.5 and n is the length of the bit sequence.
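The hashing component above can be sketched with a cryptographic hash standing in for the random hash function. This is an illustrative sketch, not the authors' implementation; the function names and the 64-bit truncation are arbitrary choices:

```python
import hashlib

def hash_bits(result: str, n_bits: int = 64) -> str:
    """Hash a structured result to a fixed-length bit sequence.

    A cryptographic hash approximates the paper's random hash:
    its output bits behave (near-)independently, so bit sequences
    of unwatermarked results look like draws from binomial(n, 0.5).
    """
    digest = hashlib.sha256(result.encode("utf-8")).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)
    return bits[:n_bits]

def ones_ratio(bits: str) -> float:
    """Fraction of 1s -- the quantity the watermark selector later biases."""
    return bits.count("1") / len(bits)
```

Over many unwatermarked inputs, `ones_ratio` averages close to 0.5, which is exactly the null hypothesis the detector tests against.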
For a collection CN = r1 · · · rN, we can define a watermark ranking function w to systematically select alternatives ri0 ∈ Dk(q), such that the resulting CN0 is unlikely to produce bit sequences that follow the p = 0.5 binomial distribution. A straightforward biasing criterion would be to select the candidate whose bit sequence exhibits the highest ratio of 1s. w can be defined as:

w(r, Dk(q), h) = #(1, h(r)) / |h(r)|   (2)

where h(r) returns the randomized bit sequence for result r, and #(x, y) counts the number of occurrences of x in sequence y. Selecting alternative results to exhibit this bias will result in watermarked collections that exhibit this same bias.

3.2 Detecting the Watermark

To classify a collection CN as watermarked or non-watermarked, we apply the hashing operation h on each element in CN and concatenate the sequences. This sequence is tested against the null hypothesis that it was generated by a binomial distribution with parameter p = 0.5. We can apply a Fisherian test of statistical significance to determine whether the observed distribution of bits is unlikely to have occurred by chance under the null hypothesis (binomial with p = 0.5). We consider a collection of results that rejects the null hypothesis to be watermarked results generated by our own algorithms. The p-value under the null hypothesis is efficiently computed by:

p-value = Pn(X ≥ x) = Σ_{i=x}^{n} (n choose i) p^i (1 − p)^(n−i)   (3)

where x is the number of 1s observed in the collection, and n is the total number of bits in the sequence. Comparing this p-value against a desired significance level α, we reject the null hypothesis for collections that have Pn(X ≥ x) < α, thus deciding that such collections were generated by our own system. This classification criterion has a fixed false positive rate. Setting α = 0.05, we know that 5% of non-watermarked bit sequences will be falsely labeled as watermarked.
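The binomial tail test of Equation 3 can be written directly. This is a hedged sketch using exact summation; for very long bit sequences a normal approximation or a library routine would be more practical:

```python
from math import comb

def binomial_pvalue(ones: int, n: int, p: float = 0.5) -> float:
    """P(X >= ones) for X ~ binomial(n, p): the one-sided tail of Eq. 3."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(ones, n + 1))

def is_watermarked(bits: str, alpha: float = 0.05) -> bool:
    """Reject the p = 0.5 null hypothesis: classify as one's own output."""
    return binomial_pvalue(bits.count("1"), len(bits)) < alpha
```

For example, a 20-bit sequence of all 1s has tail probability 0.5^20, far below α = 0.05, while a balanced sequence is retained as unwatermarked.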
This parameter α can be controlled on an application-specific basis. By biasing the selection of candidate results to produce more 1s than 0s, we have defined a watermarking approach that exhibits a fixed false positive rate, a probabilistically bounded detection rate, and a task independent hashing and selection criterion. In the next sections, we will deal with the question of robustness to edit operations and quality degradation.

3.3 Robustness and Inherent Bias

We would like the ability to identify watermarked collections to be robust to simple edit operations. Even slight modifications to the elements within an item r would yield (by construction of the hash function) completely different bit sequences that no longer preserve the biases introduced by the watermark selection function. To ensure that the distributional biases introduced by the watermark selector are preserved, we can optionally map individual results into a set of sub-results, each one representing some local structure of r. h is then applied to each sub-result and the results concatenated to represent r. This mapping is defined as a component of the h operation. While a particular edit operation might affect a small number of sub-results, the majority of the bits in the concatenated bit sequence for r would remain untouched, thereby limiting the damage to the biases selected during watermarking. This is of course no defense against edit operations that are applied globally across the result; our expectation is that such edits would either significantly degrade the quality of the result or be straightforward to identify directly. For example, a sequence of words r = z1 · · · zL can be mapped into a set of consecutive n-gram sequences. Operations to edit a word zi in r will only affect events that consider the word zi.
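The n-gram sub-result mapping can be sketched as follows (the function name is a hypothetical helper; the paper later settles on the 3–5 gram range):

```python
def ngram_subresults(sentence: str, n_min: int = 3, n_max: int = 5):
    """Map a sentence into its consecutive 3-5 gram sub-results.

    An edit to word zi only perturbs the n-grams containing zi,
    so most of the concatenated bit sequence survives local edits.
    """
    words = sentence.split()
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]
```

Hashing each sub-result and concatenating the bits yields the robust representation of r. A six-word sentence contributes 4 + 3 + 2 = 9 sub-results, and editing its last word leaves 6 of the 9 untouched.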
To account for the fact that alternatives in Dk(q) might now result in bit sequences of different lengths, we can generalize the biasing criterion to directly reflect the expected contribution to the watermark by defining:

w(r, Dk(q), h) = Pn(X ≥ #(1, h(r)))   (5)

where Pn gives probabilities from binomial(n = |h(r)|, p = 0.5).

Inherent collection-level biases: Our null hypothesis is based on the assumption that collections of results draw uniformly from the space of possible results. This assumption might not always hold and depends on the type of the results and collection. For example, considering a text document as a collection of sentences, we can expect that some sentences might repeat more frequently than others. This scenario is even more likely when applying a mapping into sub-results: n-gram sequences follow long-tailed or Zipfian distributions, with a small number of n-grams contributing heavily toward the total number of n-grams in a document. A random hash function guarantees that inputs are distributed uniformly at random over the output range. However, the same input will be assigned the same output deterministically. Therefore, if the distribution of inputs is heavily skewed to certain elements of the input space, the output distribution will not be uniformly distributed. The bit sequences resulting from the high frequency sub-results have the potential to generate inherently biased distributions when accumulated at the collection level. We want to choose a mapping that tends towards generating uniformly from the space of sub-results. We can empirically measure the quality of a sub-result mapping for a specific task by computing the false positive rate on non-watermarked collections. For a given significance level α, an ideal mapping would result in false positive rates close to α as well. Figure 1 shows false positive rates from 4 alternative mappings, computed on a large corpus of French documents (see Table 1 for statistics).
Classification decisions are made at the collection level (documents), but the contribution to the false positive rate is based on the number of words in the classified document. We consider mappings from a result (sentence) into its 1-grams, 1–5-grams and 3–5-grams, as well as the non-mapping case, where the full result is hashed. Figure 1 shows that the 1-gram and 1–5-gram mappings generate sub-results that result in heavily biased false positive rates. The 3–5-gram mapping yields false positive rates close to their theoretically expected values.1 Small deviations are expected since documents make different contributions to the false positive rate as a function of the number of words that they represent. For the remainder of this work, we use the 3–5-gram mapping and the full sentence mapping, since the alternatives generate distributions with inherently very high false positive rates.

3.4 Considering Quality

The watermarking described in Equation 3 chooses alternative results on a per result basis, with the goal of influencing collection level bit sequences. The selection criteria as described will choose the most biased candidates available in Dk(q). The parameter k determines the extent to which lesser quality alternatives can be chosen. If all the alternatives in each Dk(q) are of relatively similar quality, we expect minimal degradation due to watermarking. Specific tasks, however, can be particularly sensitive to choosing alternative results. Discriminative approaches that optimize for arg max selection, like (Och, 2003; Liang et al., 2006; Chiang et al., 2009), train model parameters such that the top-ranked result is well separated from its competing alternatives.

1 In the final version of this paper we will perform sampling to create a more reliable estimate of the false positive rate that is not overly influenced by document length distributions.
Different queries also differ in the inherent ambiguity expected from their results; sometimes there really is just one correct result for a query, while for other queries, several alternatives might be equally good. By generalizing the definition of the w function to interpolate the estimated loss in quality and the gain in the watermarking signal, we can trade off the ability to identify the watermarked collections against quality degradation:

w(r, Dk(q), fw) = λ ∗ gain(r, Dk(q), fw) − (1 − λ) ∗ loss(r, Dk(q))   (6)

Loss: The loss(r, Dk(q)) function reflects the quality degradation that results from selecting alternative r as opposed to the best ranked candidate in Dk(q). We will experiment with two variants:

loss_rank(r, Dk(q)) = (rank(r) − k)/k
loss_cost(r, Dk(q)) = (cost(r) − cost(r1))/cost(r1)

where:
• rank(r): returns the rank of r within Dk(q).
• cost(r): a weighted sum of features (not normalized over the search space) in a log-linear model such as those mentioned in (Och, 2003).
• r1: the highest ranked alternative in Dk(q).

loss_rank provides a generally applicable criterion to select alternatives, penalizing selection from deep within Dk(q). This estimate of the quality degradation does not reflect the generating model's opinion on relative quality. loss_cost considers the relative increase in the generating model's cost assigned to the alternative translation.

Gain: The gain(r, Dk(q), fw) function reflects the gain in the watermarking signal by selecting candidate r. We simply define the gain as Pn(X ≥ #(1, h(r))) from Equation 5.
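The interpolated selection of Equation 6 can be sketched as follows. This is an illustrative sketch under stated assumptions: the gain is the simple ratio-of-1s criterion from Equation 2, and the loss is a simplified rank penalty rank(r)/k rather than the paper's exact loss_rank or loss_cost, which would slot in the same way:

```python
def select_alternative(ranked_bits, lam=0.8):
    """Return the index of the candidate maximizing
    lam * gain - (1 - lam) * loss, in the spirit of Eq. 6.

    ranked_bits: candidates' bit sequences in model-quality order
    (index 0 = best-ranked translation).
    """
    k = len(ranked_bits)
    best_idx, best_score = 0, float("-inf")
    for rank, bits in enumerate(ranked_bits, start=1):
        gain = bits.count("1") / len(bits)   # ratio-of-1s gain (Eq. 2)
        loss = rank / k                      # simplified rank-based loss
        score = lam * gain - (1 - lam) * loss
        if score > best_score:
            best_idx, best_score = rank - 1, score
    return best_idx
```

With λ near 1, the selector chases the watermark signal deep into the k-best list; with λ near 0, it sticks to the top-ranked translation.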
Figure 1: Comparison of expected false positive rates against observed false positive rates (proportion of documents vs. p-value threshold) for different sub-result mappings: (a) 1-grams, (b) 1–5-grams, (c) 3–5-grams, (d) full result hashing.

4 Related Work

Using watermarks with the goal of transmitting a hidden message within images, video, audio and monolingual text media is common. For structured text content, linguistic approaches like (Chapman et al., 2001; Gupta et al., 2006) use language specific linguistic and semantic expansions to introduce hidden watermarks. These expansions provide alternative candidates within which messages can be encoded. Recent publications have extended this idea to machine translation, using multiple systems and expansions to generate alternative translations. (Stutsman et al., 2006) uses a hashing function to select alternatives that encode the hidden message in the lower order bits of the translation. In each of these approaches, the watermarker has control over the collection of results into which the watermark is to be embedded. These approaches seek to embed a hidden message into a collection of results that is selected by the watermarker. In contrast, we address the condition where the input queries are not in the watermarker's control. The goal is therefore to introduce the watermark into all generated results, with the goal of probabilistically identifying such outputs. Our approach is also task independent, avoiding the need for templates to generate additional alternatives. By addressing the problem directly within the search space of a dynamic programming algorithm, we have access to high quality alternatives with well defined models of quality loss. Finally, our approach is robust to local word editing. By using a sub-result mapping, we increase the level of editing required to obscure the watermark signal; at high levels of editing, the quality of the results themselves would be significantly degraded.
5 Experiments

We evaluate our watermarking approach applied to the outputs of statistical machine translation under the following experimental setup. A repository of parallel (aligned source and target language) web documents is sampled to produce a large corpus on which to evaluate the watermarking classification performance. The corpora represent translations into 4 diverse target languages, using English as the source language. Each document in this corpus can be considered a collection of un-watermarked structured results, where source sentences are queries and each target sentence represents a structured result. Using a state-of-the-art phrase-based statistical machine translation system (Och and Ney, 2004) trained on parallel documents identified by (Uszkoreit et al., 2010), we generate a set of 100 alternative translations for each source sentence. We apply the proposed watermarking approach, along with the proposed refinements that address task specific loss (Section 3.4) and robustness to edit operations (Section 3.3), to generate watermarked corpora. Each method is controlled via a single parameter (like k or λ), which is varied to generate alternative watermarked collections. For each parameter value, we evaluate the Recall Rate and Quality Degradation with the goal of finding a setting that yields a high recall rate and minimal quality degradation. False positive rates are evaluated based on a fixed classification significance level of α = 0.05. The false positive and recall rates are evaluated at the word level; a document that is misclassified or correctly identified contributes its length in words towards the error calculation. In this work, we use α = 0.05 during classification, corresponding to an expected 5% false positive rate. The false positive rate is a function of h and the significance level α, and is therefore constant across the parameter values k and λ.
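The word-level evaluation described above can be sketched as follows (the data layout and function name are hypothetical; each document contributes its word count to the rate it affects):

```python
def word_weighted_rates(docs):
    """Compute word-weighted false positive and recall rates.

    docs: list of (word_count, truly_watermarked, predicted_watermarked).
    A misclassified or correctly identified document contributes its
    length in words, as in the paper's Section 5 evaluation.
    """
    fp = sum(w for w, truth, pred in docs if not truth and pred)
    neg = sum(w for w, truth, _ in docs if not truth)
    tp = sum(w for w, truth, pred in docs if truth and pred)
    pos = sum(w for w, truth, _ in docs if truth)
    fpr = fp / neg if neg else 0.0
    recall = tp / pos if pos else 0.0
    return fpr, recall
```

For example, a 50-word unwatermarked document flagged as watermarked counts 50 words toward the false positive numerator, regardless of how many sentences it contains.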
We evaluate quality degradation on human translated test corpora that are more typical for machine translation evaluation. Each test corpus consists of 5000 source sentences randomly selected from the web and translated into each respective language. We chose to evaluate quality on test corpora to ensure that degradations are not hidden by imperfectly matched web corpora and are consistent with the kind of results often reported for machine translation systems. As with the classification corpora, we create watermarked versions at each parameter value. For a given parameter value, we measure false positive and recall rates on the classification corpora and quality degradation on the evaluation corpora. Table 1 shows corpus statistics for the classification and test corpora and non-watermarked BLEU scores for each target language. All source texts are in English.

Figure 2: BLEU loss against recall of watermarked content for the baseline approach (max K-best), rank and cost interpolation.

5.1 Loss Interpolated Experiments

Our first set of experiments demonstrates baseline performance using the watermarking criterion in Equation 5 versus the refinements suggested in Section 3.4 to mitigate quality degradation. The h function is computed on the full sentence result r with no sub-event mapping. The following methods are evaluated in Figure 2:
• Baseline method (labeled "max K-best"): selects r0 purely based on gain in watermarking signal (Equation 5) and is parameterized by k, the number of alternatives considered for each result.
• Rank interpolation: incorporates rank into w, varying the interpolation parameter λ.
• Cost interpolation: incorporates cost into w, varying the interpolation parameter λ.

The observed false positive rate on the French classification corpora is 1.9%.
Table 1: Corpus statistics (numbers of documents, sentences and words) for the classification and quality-degradation corpora for Arabic, French, Hindi and Turkish, together with non-watermarked BLEU scores; BLEU scores are reported for the quality corpora.

We consider 0.2% BLEU loss as a threshold for acceptable quality degradation. Each method is judged by its ability to achieve high recall below this quality degradation threshold. Applying cost interpolation yields the best results in Figure 2, achieving a recall of 85% at 0.2% BLEU loss, while rank interpolation achieves a recall of 76%. The baseline approach of selecting the highest gain candidate within a depth of k candidates does not provide sufficient parameterization to yield low quality degradation. At k = 2, this method yields almost 90% recall, but with approximately 0.4% BLEU loss.

5.2 Robustness Experiments

In Section 3.3, we proposed mapping results into sub-events or features. We considered alternative feature mappings in Figure 1, finding that mapping sentence results into a collection of 3–5 grams yields acceptable false positive rates at varied levels of α. Figure 3 presents results that compare moving from result level hashing to the 3–5 gram sub-result mapping. We show the impact of the mapping on the baseline max K-best method as well as for cost interpolation. There are substantial reductions in recall rate at the 0.2% BLEU loss level when applying sub-result mappings in both cases. The cost interpolation method's recall drops from 85% to 77% when using the 3–5 grams event mapping. The observed false positive rate of the 3–5 gram mapping is 4.7%. By using the 3–5 gram mapping, we expect to increase robustness against local word edit operations, but we have sacrificed recall rate due to the inherent distributional bias discussed in Section 3.3.
Figure 3: BLEU loss against recall of watermarked content for the baseline and cost interpolation methods using both result level and 3–5 gram mapped events.

5.3 Multilingual Experiments

The watermarking approach proposed here introduces no language specific watermarking operations and is thus broadly applicable to translation into all languages. In Figure 4, we report results for the baseline and cost interpolation methods, considering both the result level and 3–5 gram mapping. We set α = 0.05 and measure recall at 0.2% BLEU degradation for translation from English into Arabic, French, Hindi and Turkish. The observed false positive rates for full sentence hashing are: Arabic: 2.4%, French: 1.8%, Hindi: 5.6% and Turkish: 5.5%, while for the 3–5 gram mapping, they are: Arabic: 5.8%, French: 7.5%, Hindi: 3.5% and Turkish: 6.2%. Underlying translation quality plays an important role in translation quality degradation when watermarking. Without a sub-result mapping, French (BLEU: 26.45%) achieves recall of 85% at 0.2% BLEU loss, while the other languages achieve over 90% recall at the same BLEU loss threshold. Using a sub-result mapping degrades quality for each language pair, but changes the relative performance. Turkish experiences the highest relative drop in recall, unlike French and Arabic, where results are relatively more robust to using sub-sentence mappings. This is likely a result of differences in n-gram distributions across these languages. The languages considered here all use space separated words. For languages that do not, like Chinese or Thai, our approach can be applied at the character level.

Figure 4: Loss of recall when using the 3–5 gram mapping vs. sentence level mapping for Arabic, French, Hindi and Turkish translations.

6 Conclusions

In this work we proposed a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms.
Our method provides probabilistic bounds on detection ability, analytic control on quality degradation, and is robust to local editing operations. Our method is applicable to any task where structured outputs are generated with ambiguities or ties in the results. We applied this method to the outputs of statistical machine translation, evaluating each refinement to our approach with false positive and recall rates against BLEU score quality degradation. Our results show that it is possible, across several language pairs, to achieve high recall rates (over 80%) with low false positive rates (between 5 and 8%) at minimal quality degradation (0.2% BLEU), while still allowing for local edit operations on the translated output. In future work we will continue to investigate methods to mitigate quality loss.

References

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311.

Mark Chapman, George Davida, and Marc Rennhard. 2001. A practical and effective approach to large-scale automated linguistic steganography. In Proceedings of the Information Security Conference.

David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval, pages 121–128.
Gaurav Gupta, Josef Pieprzyk, and Hua Xiong Wang. 2006. An attack-localizing watermarking scheme for natural language documents. In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, ASIACCS '06, pages 157–165, New York, NY, USA. ACM.

Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the Joint International Conference on Computational Linguistics and Association of Computational Linguistics (COLING/ACL), pages 761–768.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 2003 Meeting of the Association of Computational Linguistics.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics.

Ryan Stutsman, Mikhail Atallah, Christian Grothoff, and Krista Grothoff. 2006. Lost in just the translation. In Proceedings of the 2006 ACM Symposium on Applied Computing.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of COLING 2010.
4 0.29458368 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
Author: Jagadeesh Jagarlamudi ; Raghavendra Udupa ; Hal Daume III ; Abhijit Bhole
Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of cross-lingual corpora. Many existing approaches are based on word co-occurrences extracted from aligned training data, represented as a covariance matrix. In theory, such a covariance matrix should represent semantic equivalence, and should be highly sparse. Unfortunately, the presence of noise leads to dense covariance matrices which in turn leads to suboptimal document representations. In this paper, we explore techniques to recover the desired sparsity in covariance matrices in two ways. First, we explore word association measures and bilingual dictionaries to weigh the word pairs. Later, we explore different selection strategies to remove the noisy pairs based on the association scores. Our experimental results on the task of aligning comparable documents shows the efficacy of sparse covariance matrices on two data sets from two different language pairs.
5 0.26642886 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
Author: Weiwei Sun ; Jia Xu
Abstract: This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data to a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea about transductive, document-level segmentation, which is designed to improve the system recall for out-ofvocabulary (OOV) words which appear more than once inside a document. Novel features1 result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.
6 0.2628777 3 emnlp-2011-A Correction Model for Word Alignments
7 0.24213485 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
8 0.23834899 122 emnlp-2011-Simple Effective Decipherment via Combinatorial Optimization
9 0.22127172 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification
10 0.2124407 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
11 0.20332986 13 emnlp-2011-A Word Reordering Model for Improved Machine Translation
12 0.19294934 32 emnlp-2011-Computing Logical Form on Regulatory Texts
13 0.18860017 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms
14 0.16317613 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
15 0.15564847 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts
16 0.15196098 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
17 0.14608668 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
18 0.14498505 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
19 0.14313194 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning
20 0.14122084 94 emnlp-2011-Modelling Discourse Relations for Arabic
topicId topicWeight
[(15, 0.015), (23, 0.091), (36, 0.021), (37, 0.014), (45, 0.064), (53, 0.024), (54, 0.021), (57, 0.017), (62, 0.015), (64, 0.031), (66, 0.016), (69, 0.029), (70, 0.351), (72, 0.021), (79, 0.039), (82, 0.019), (85, 0.021), (96, 0.037), (98, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.75269055 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement
Author: Ali El Kahki ; Kareem Darwish ; Ahmed Saad El Din ; Mohamed Abd El-Wahab ; Ahmed Hefny ; Waleed Ammar
Abstract: Mining of transliterations from comparable or parallel text can enhance natural language processing applications such as machine translation and cross-language information retrieval. This paper presents an enhanced transliteration mining technique that uses a generative graph reinforcement model to infer mappings between source and target character sequences. An initial set of mappings is learned through automatic alignment of transliteration pairs at the character sequence level. Then, these mappings are modeled using a bipartite graph. A graph reinforcement algorithm is then used to enrich the graph by inferring additional mappings. During graph reinforcement, appropriate link reweighting is used to promote good mappings and to demote bad ones. The enhanced transliteration mining technique is tested in the context of mining transliterations from parallel Wikipedia titles in 4 alphabet-based language pairs, namely English-Arabic, English-Russian, English-Hindi, and English-Tamil. The improvements in F1-measure over the baseline system were 18.7, 1.0, 4.5, and 32.5 basis points for the four language pairs respectively. The results herein outperform the best reported results in the literature by 2.6, 4.8, 0.8, and 4.1 basis points for the four language pairs respectively.
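The core idea of the abstract (inferring new source-to-target character mappings through a bipartite graph) can be sketched as follows. This is a simplified illustration, not the paper's model: the dictionary representation, the two-hop path scheme (source s reaches a new target via s -> t' <- s' -> t), and the noisy-or combination of path scores are assumptions; the paper's link reweighting is omitted.

```python
def reinforce(weights):
    """One round of graph reinforcement on a bipartite mapping graph.
    weights[s][t] is the current probability that source sequence s
    maps to target sequence t. New mappings s -> t2 are inferred
    through paths s -> t1 <- s2 -> t2, combining path probabilities
    with a noisy-or so multiple paths reinforce the same mapping."""
    # Build the reverse index: target -> {source: weight}.
    rev = {}
    for s, ts in weights.items():
        for t, w in ts.items():
            rev.setdefault(t, {})[s] = w
    new = {}
    for s, ts in weights.items():
        scores = {}
        for t1, w1 in ts.items():            # s -> t1
            for s2, w2 in rev[t1].items():   # t1 <- s2
                for t2, w3 in weights[s2].items():  # s2 -> t2
                    p = w1 * w2 * w3
                    scores[t2] = 1 - (1 - scores.get(t2, 0.0)) * (1 - p)
        new[s] = scores
    return new

# Toy example: sources "a" and "b" share target "x", so "a" can
# inherit b's other mapping "y" through the path a -> x <- b -> y.
weights = {"a": {"x": 0.9}, "b": {"x": 0.8, "y": 0.5}}
enriched = reinforce(weights)
```

In the toy run, `enriched["a"]` gains a "y" entry that the initial alignment never produced, which is the enrichment effect the abstract describes; the paper's reweighting step would then damp such inferred links relative to directly observed ones.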
Author: Samuel Brody ; Nicholas Diakopoulos
Abstract: We present an automatic method which leverages word lengthening to adapt a sentiment lexicon specifically for Twitter and similar social messaging networks. The contributions of the paper are as follows. First, we call attention to lengthening as a widespread phenomenon in microblogs and social messaging, and demonstrate the importance of handling it correctly. We then show that lengthening is strongly associated with subjectivity and sentiment. Finally, we present an automatic method which leverages this association to detect domain-specific sentiment- and emotion-bearing words. We evaluate our method by comparison to human judgments, and analyze its strengths and weaknesses. Our results are of interest to anyone analyzing sentiment in microblogs and social networks, whether for research or commercial purposes.
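Detecting the lengthening phenomenon the abstract describes ("coooool" for "cool") is commonly done by collapsing runs of repeated letters. The sketch below is an assumed preprocessing step in that spirit, not the paper's method; the function name and the three-repeat trigger are illustrative choices.

```python
import re

def normalize_lengthening(word, max_repeat=2):
    """Collapse runs of 3+ identical characters, a lengthening signal
    in microblog text (e.g. 'coooool' -> 'cool'), and report whether
    the word was lengthened at all."""
    collapsed = re.sub(r'(.)\1{2,}', r'\1' * max_repeat, word)
    return collapsed, collapsed != word

normalized, was_lengthened = normalize_lengthening("cooooool")
```

A lexicon-adaptation pipeline could then treat words that frequently appear lengthened as candidate sentiment-bearing terms, which is the association the abstract exploits.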
3 0.36441723 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and Urdu-English translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
4 0.35623571 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
Author: Yang Gao ; Philipp Koehn ; Alexandra Birch
Abstract: Long-distance reordering remains one of the biggest challenges facing machine translation. We derive soft constraints from the source dependency parsing to directly address the reordering problem for the hierarchical phrasebased model. Our approach significantly improves Chinese–English machine translation on a large-scale task by 0.84 BLEU points on average. Moreover, when we switch the tuning function from BLEU to the LRscore which promotes reordering, we observe total improvements of 1.21 BLEU, 1.30 LRscore and 3.36 TER over the baseline. On average our approach improves reordering precision and recall by 6.9 and 0.3 absolute points, respectively, and is found to be especially effective for long-distance reodering.
5 0.35448501 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
Author: Burr Settles
Abstract: This paper describes DUALIST, an active learning annotation paradigm which solicits and learns from labels on both features (e.g., words) and instances (e.g., documents). We present a novel semi-supervised training algorithm developed for this setting, which is (1) fast enough to support real-time interactive speeds, and (2) at least as accurate as preexisting methods for learning with mixed feature and instance labels. Human annotators in user studies were able to produce near-state-of-the-art classifiers, on several corpora in a variety of application domains, with only a few minutes of effort.
6 0.35416648 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
8 0.35263655 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction
9 0.35251889 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
10 0.3522692 13 emnlp-2011-A Word Reordering Model for Improved Machine Translation
11 0.35032111 128 emnlp-2011-Structured Relation Discovery using Generative Models
12 0.34985799 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
13 0.34908506 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
14 0.34856123 136 emnlp-2011-Training a Parser for Machine Translation Reordering
15 0.3484605 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
16 0.34742272 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
17 0.34675443 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
18 0.34518921 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
19 0.34514385 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
20 0.34514251 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation