emnlp emnlp2011 emnlp2011-2 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lukas Michelbacher ; Alok Kothari ; Martin Forst ; Christina Lioma ; Hinrich Schutze
Abstract: Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because, unlike other work on MWUs, tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system.
Reference: text
sentIndex sentText sentNum sentScore
1 Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole unless a precompiled list of MWUs is available. [sent-6, score-0.656]
2 In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. [sent-7, score-0.452]
3 Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole. [sent-11, score-0.475]
4 For example, a hot dog is not a hot animal but a sausage in a bun and a black hole in astrophysics is a region of space with special properties, not a dark cavity. [sent-20, score-0.76]
5 For example, in information retrieval (IR) the query hot dog should not retrieve documents that only contain the words hot and dog individually, outside of the phrase hot dog. [sent-22, score-0.953]
6 In this study, we focus on noun phrases in the physics domain. [sent-23, score-0.288]
7 We chose noun phrases because domain-specific terminology is commonly encoded in noun phrase MWUs; other types of phrases e. [sent-25, score-0.501]
8 We cast the task of MWU tokenization as semantic head recognition in this paper. [sent-28, score-0.558]
9 For example, in coreference resolution identity of syntactic heads is predictive of coreference; in parse disambiguation, the syntactic head of a noun phrase is a powerful feature for resolving attachment ambiguities. [sent-30, score-0.731]
10 However, in all of these cases, the syntactic head is only an approximation of the information that is really needed; the underlying assumption made when using the syntactic head as a substitute for the entire phrase is that the syntactic head is representative of the phrase. [sent-31, score-1.113]
11 We define the semantic head of a noun phrase as the non-compositional part of a phrase. [sent-33, score-0.595]
12 ...termine possible coreference of a hot dog and [sent-37, score-0.29]
13 the dog in I first ate a hot dog and then fed the dog. [sent-40, score-0.423]
14 This is not the case for a system that makes the decision based on the semantic heads hot dog of a hot dog and dog of the dog. [sent-41, score-1.057]
15 We will show that semantic head recognition improves the performance of an information retrieval system. [sent-43, score-0.526]
16 We introduce a cascaded classification framework for recognizing semantic heads that allows us to treat noun phrases of arbitrary length. [sent-44, score-0.695]
17 First, we introduce the notion of semantic head, in analogy to syntactic head, and propose semantic head recognition as a new component of NLP preprocessing. [sent-48, score-0.64]
18 Second, we develop a cascaded classification framework for semantic head recognition. [sent-49, score-0.578]
19 Third, we investigate the utility of contextual similarity for detecting non-compositionality and show that it significantly enhances a baseline semantic head recognizer. [sent-50, score-0.414]
20 However, we also identify a number of challenges of using contextual similarity in high-confidence semantic head recognition. [sent-51, score-0.414]
21 Fourth, we show that our approach to semantic head recognition improves the performance of an IR system. [sent-52, score-0.464]
22 In Section 3 we introduce semantic heads and present our cascaded model for semantic head recognition. [sent-54, score-0.87]
23 Lin (1999) defines a decision criterion for noncompositional phrases based on the change in the mutual information of a phrase when one of its words is substituted with a similar word from an automatically constructed thesaurus. [sent-70, score-0.43]
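As a rough illustration of this substitution idea, the sketch below flags a two-word phrase as non-compositional when no thesaurus substitute yields a phrase with a similar score; the counts, the thesaurus, the restriction to two-word phrases and the threshold are illustrative assumptions, not Lin's exact formulation.

```python
import math

def pmi(c_uv, c_u, c_v, total):
    """Pointwise mutual information of a two-word phrase from raw counts."""
    return math.log((c_uv * total) / (c_u * c_v))

def noncompositional_by_substitution(phrase, phrase_counts, word_counts,
                                     total, thesaurus, threshold=2.0):
    """Return True if no thesaurus substitute of the first word produces a
    phrase whose PMI is within `threshold` of the original phrase's PMI.
    (Only the first word is substituted here, for brevity.)"""
    u, v = phrase
    base = pmi(phrase_counts[(u, v)], word_counts[u], word_counts[v], total)
    for u_alt in thesaurus.get(u, []):
        c = phrase_counts.get((u_alt, v), 0)
        if c and abs(base - pmi(c, word_counts[u_alt], word_counts[v], total)) < threshold:
            return False  # a similar phrase with a similar score exists
    return True
```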
24 These studies compute the similarity between words and phrases represented as semantic vectors in a word space model. [sent-81, score-0.312]
25 The underlying idea is similar to Lin’s: the meaning of a non-compositional phrase somehow deviates from what one would expect given the semantic vectors of parts of the phrase. [sent-84, score-0.295]
26 Regarding (i), Schone and Jurafsky (2001) compare the semantic vector of a phrase p and the vectors of its component words in two ways: one includes the contexts of p in the construction of the semantic vectors of the parts and one does not. [sent-88, score-0.544]
27 They address (i) by comparing the semantic vectors of phrases with the vectors of their parts individually to detect meaning changes; e. [sent-92, score-0.375]
28 With respect to (iii), the above-mentioned studies use ad hoc thresholds to separate compositional and non-compositional phrases but do not offer a principled decision criterion. [sent-98, score-0.271]
29 For example, our definition of alternative vector relies on the fact that most noun phrase MWUs are fixed and exhibit no syntactic variability. [sent-112, score-0.274]
30 – – 3 Semantic Heads and Cascaded Model We cast the task of MWU tokenization as semantic head recognition in this paper. [sent-116, score-0.558]
31 We define the semantic head of a noun phrase as the largest noncompositional part of the phrase that contains the syntactic head. [sent-117, score-0.9]
32 For example, black hole is the semantic head of unusual black hole and afterglow is the semantic head of bright optical afterglow; in the latter case syntactic and semantic heads coincide. [sent-118, score-2.063]
33 The attachment ambiguity of the last noun phrase in he bought the hot dogs in a packet can be easily resolved for the semantic head hot dogs (food is often in a packet), but not as easily for the syntactic head dogs (dogs are usually not in packets). [sent-120, score-1.433]
34 Indeed, we will show in Section 7 that semantic head recognition improves the performance of an IR system. [sent-121, score-0.464]
35 The semantic head is either a single noun or a noncompositional noun phrase. [sent-122, score-0.717]
36 In the latter case, the modifier(s) introduce(s) a non-compositional, unpredictable shift of meaning; hot shifts the meaning of dog from live animal to food. [sent-123, score-0.29]
37 The semantic head always contains the syntactic head; for compositional phrases, syntactic head and semantic head are identical. [sent-125, score-1.31]
38 To determine the semantic head of a phrase, we use a cascaded classification approach. [sent-126, score-0.578]
39 The cascade ... (Figure 1, example phrases with modifiers: (1) neutron star, (2) unusual black hole, (3) bright optical afterglow, (4) small moment of inertia.) [sent-127, score-1.196]
40 We need a cascade because we want to recognize the semantic head in noun phrases of arbitrary length. [sent-130, score-0.672]
41 We distinguish between the syntactic head of a phrase and the remaining words, the modifiers. [sent-135, score-0.441]
42 This means that in the phrase small moment of inertia, small (and not of inertia) is the peripheral element u. [sent-144, score-0.421]
43 In each iteration, the classifier decides whether the relation between the current peripheral element u and the rest v is compositional (C) or noncompositional (NC). [sent-146, score-0.531]
44 If the relation is NC, processing stops and uv is returned as the semantic head of p. [sent-147, score-0.51]
45 the new v is a single word, it is returned as the semantic head of p. [sent-151, score-0.414]
46 For the fully compositional phrase bright optical afterglow, the process continues until only the syntactic head afterglow remains. (We use the abstract representation p = uv even though u can appear after v in the surface form of p.) [sent-153, score-0.426]
47 In the second case, the process stops earlier, in step 2, because the classifier finds that the relation between moment and of inertia is NC. [sent-155, score-0.308]
48 This means that the semantic head of small moment of inertia is moment of inertia. [sent-156, score-0.755]
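A minimal sketch of this cascade; the classifier, the toy lexicon and the flat list representation of the phrase are assumptions made for illustration only.

```python
def semantic_head(units, is_noncompositional):
    """Cascaded semantic head recognition.

    units: the phrase in the abstract order u1 u2 ... head, i.e. from the
           most peripheral modifier inward; a unit may be a multi-word
           string such as "of inertia" (surface order can differ, since
           u may appear after v in the surface form of p = uv).
    is_noncompositional: callable(u, v) -> bool, the binary C/NC classifier.
    """
    v = list(units)
    while len(v) > 1:
        u, rest = v[0], v[1:]
        if is_noncompositional(u, rest):
            return v          # NC: stop, uv is the semantic head
        v = rest              # C: strip the peripheral element and continue
    return v                  # single word left: the syntactic head

# Toy classifier that only knows "black hole" and "hot dog" as non-compositional.
lexicon = {("black", ("hole",)), ("hot", ("dog",))}
clf = lambda u, v: (u, tuple(v)) in lexicon
print(semantic_head(["unusual", "black", "hole"], clf))        # ['black', 'hole']
print(semantic_head(["bright", "optical", "afterglow"], clf))  # ['afterglow']
```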
49 We extracted all noun phrases from the collection that consist of a head noun with up to four modifiers; almost all domain-specific terminology in our collection is captured by this pattern. [sent-162, score-0.616]
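One simplified way to approximate such an extraction pattern over POS-tagged text is sketched below; the tag set, the treatment of modifiers (no of-phrases) and the function name are assumptions, not the paper's actual pattern.

```python
def extract_candidate_nps(tagged, max_modifiers=4):
    """Collect spans consisting of a head noun preceded by up to
    `max_modifiers` adjective or noun modifiers."""
    phrases, i = [], 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][1] in ("JJ", "NN", "NNS"):
            j += 1
        span = tagged[i:j]
        if span and span[-1][1] in ("NN", "NNS") and len(span) <= max_modifiers + 1:
            phrases.append([word for word, _ in span])
        i = max(j, i + 1)
    return phrases

print(extract_candidate_nps([("the", "DT"), ("unusual", "JJ"),
                             ("black", "JJ"), ("hole", "NN"), ("emits", "VBZ")]))
# [['unusual', 'black', 'hole']]
```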
50 Method sj1 compares the semantic vector of a phrase p with the sum of the vectors of its parts. [sent-192, score-0.348]
51 Method sj2 is like sj1, except the contexts of p are not part of the semantic vectors of the parts. [sent-193, score-0.302]
52 Method alt compares the semantic vector of a phrase with its alternative vector. [sent-194, score-0.276]
53 In the definitions below, s represents a vector similarity measure, w(p) a general semantic vector of a phrase p and w∗ (wi) the semantic vector of a part wi of a phrase p that does not include the contexts of occurrences of wi that were part of p itself. [sent-195, score-0.523]
54 For a phrase p = uv with peripheral element u and rest v, we call the phrase p0 = u0v an alternative phrase if the rest v is the same and u0 ≠ u. [sent-197, score-0.662]
55 For example, giant star is an alternative phrase of neutron star, and isolated neutron star is an alternative of young neutron star. [sent-200, score-0.796]
56 The alternative vector of p is then the semantic vector that is computed from the contexts of all of p’s alternative phrases. [sent-201, score-0.274]
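A rough sketch of the three similarity features; sparse count vectors are represented as dictionaries, and building w(p), w*(wi) and the alternative vector from corpus contexts is omitted, so all inputs here are assumed to be precomputed.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def vec_sum(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def sim_sj1(w_phrase, w_parts):        # phrase vs. sum of part vectors
    return cosine(w_phrase, vec_sum(w_parts))

def sim_sj2(w_phrase, w_star_parts):   # as sj1, but with part vectors w*(wi)
    return cosine(w_phrase, vec_sum(w_star_parts))

def sim_alt(w_phrase, w_alternative):  # phrase vs. its alternative vector
    return cosine(w_phrase, w_alternative)

# Toy demonstration with made-up count vectors.
w_black_hole = Counter({"emits": 3, "supermassive": 2})
w_black, w_hole = Counter({"cat": 2, "emits": 1}), Counter({"ground": 2, "emits": 1})
print(sim_sj1(w_black_hole, [w_black, w_hole]))
```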
57 Previous work has compared the semantic vector of a phrase with the vectors of its components. [sent-205, score-0.295]
58 Our question is: is the typical context of the head hole, if it occurs with a modifier that is not black, different from when it occurs with the modifier black? [sent-207, score-0.688]
59 To add information about the variability of syntactic contexts in which phrases occur, we add the words immediately before and after the phrase with positional markers (−1 and +1, respectively) to the vectors. [sent-211, score-0.335]
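A small sketch of these positional features, assuming occurrences of a phrase are available as (left context, right context) token lists; the usual bag-of-words context features are omitted.

```python
from collections import Counter

def positional_features(occurrences):
    """Count the word immediately before a phrase (marker -1) and the word
    immediately after it (marker +1) over all of its occurrences."""
    vec = Counter()
    for left, right in occurrences:
        if left:
            vec[(left[-1], "-1")] += 1
        if right:
            vec[(right[0], "+1")] += 1
    return vec

print(positional_features([(["a", "supermassive"], ["was", "observed"]),
                           (["the"], ["is", "rotating"])]))
```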
60 In line with the cascaded model, the raters were asked to identify the semantic head of each candidate phrase. [sent-218, score-0.625]
61 If at least two raters agreed on a semantic head of a phrase, we made this choice the semantic head in the gold standard. [sent-219, score-0.853]
62 We computed raw agreement of each rater with the gold standard as the percentage of correctly recognized semantic heads; this is the task that the classifier addresses. [sent-221, score-0.44]
63 In other words, if the contexts of the candidate phrase are too dissimilar to the contexts of the sum of its parts or to the alternative phrases, then we suspect non-compositionality. [sent-236, score-0.267]
64 In mode dec-all, we evaluate all decisions that were made in the course of recognizing the semantic head. [sent-259, score-0.285]
65 This mode emphasizes the correct recognition of semantic heads in phrases where multiple correct decisions in a row are necessary. [sent-260, score-0.581]
66 There is no obvious baseline for dec-all because the number of decisions depends on the classifier: a classifier whose first decision on a four-word phrase is NC makes one decision, another one may make three. [sent-262, score-0.367]
67 The mode semh evaluates how many semantic heads were recognized correctly. [sent-263, score-0.511]
68 This mode directly evaluates the task of semantic head recognition. [sent-264, score-0.473]
69 When the semantic head recognizer processes a phrase, there are four possible results. [sent-284, score-0.469]
70 Table 5 (distribution of result types): rsemh 92, rsynth 85, r+ (too long) 48, r− (too short) 35, all 260. [sent-289, score-0.333]
71 Result rsemh: the semantic head is correctly recognized and it is distinct from the syntactic head. [sent-295, score-0.535]
72 Result rsynth: the semantic head is correctly recognized and it is identical to the syntactic head. [sent-296, score-0.535]
73 Result r+: the semantic head is not correctly recognized because the cascade was stopped too early, i. [sent-297, score-0.546]
74 Result r− : the semantic head is not correctly recognized because the cascade was stopped too late, i. [sent-300, score-0.546]
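The four result types can be assigned mechanically once heads are represented as token sequences; the sketch below assumes the predicted and gold semantic heads always contain the syntactic head, as defined above.

```python
def result_type(predicted, gold, syntactic_head):
    """Categorise one recognizer output (all arguments are token tuples).

    rsemh:  correct, and distinct from the syntactic head
    rsynth: correct, and identical to the syntactic head
    r+:     too long  (the cascade was stopped too early)
    r-:     too short (the cascade was stopped too late)
    """
    if predicted == gold:
        return "rsynth" if gold == syntactic_head else "rsemh"
    return "r+" if len(predicted) > len(gold) else "r-"

print(result_type(("black", "hole"), ("black", "hole"), ("hole",)))             # rsemh
print(result_type(("unusual", "black", "hole"), ("black", "hole"), ("hole",)))  # r+
```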
75 Table 6 shows the top 20 classifications where the semantic head was not the same as the syntactic head sorted by confidence in descending order. [sent-305, score-0.789]
76 ” we list the candidates with semantic heads in bold. [sent-309, score-0.292]
77 The columns to the right show the predicted semantic head and the feature values. [sent-310, score-0.414]
78 The two phrases are clearly compositional and the classifier failed even though the context feature points in the direction of compositionality with a value greater than . [sent-313, score-0.295]
79 Another incorrect classification occurs with the phrase massive star birth (i.e., the birth of a massive star, a certain type of star with very high mass), for which star birth was annotated as the semantic head. [sent-319, score-0.624]
80 Here we have a case where the peripheral element massive does not modify the syntactic head birth; rather, massive star is itself a complex modifier. [sent-320, score-0.262]
83 The remaining phrases are peculiar velocity and local group. [sent-353, score-0.297]
84 Context features further increase performance significantly, but surprisingly, they are not of clear benefit for a high-confidence classifier that is targeted towards recognizing a smaller subset of semantic heads with high confidence. [sent-382, score-0.409]
85 – 7 Information Retrieval Experiment Typically, IR systems do not process noncompositional phrases as one semantic entity, missing out on potentially important information captured by non-compositionality. [sent-383, score-0.4]
86 This section illustrates one way of adjusting the retrieval process so that non-compositional phrases are processed as semantic entities that may enhance retrieval performance. [sent-384, score-0.373]
87 Table 6: The 20 most confident classifications where the predicted semantic head differs from the syntactic head. [sent-458, score-0.414]
88 ... for a query that contains a non-compositional phrase, boosting the retrieval weight of documents that contain this phrase will improve overall retrieval performance. [sent-468, score-0.319]
89 To boost the ranking of documents containing noncompositional phrases, we increase wnc at the expense of wc. [sent-479, score-0.256]
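One illustrative way to realise such boosting is an Indri-style weighted query in which recognized non-compositional phrases are matched as exact phrases; the operators and the default weights below are assumptions and not the paper's exact belief-weight formulation.

```python
def boosted_query(terms, nc_phrases, w_c=0.8, w_nc=0.2):
    """Build an Indri-style query string.

    terms:      all query terms (the compositional part, weight w_c)
    nc_phrases: recognized non-compositional phrases, each a token list,
                matched as exact phrases (weight w_nc)
    Increasing w_nc at the expense of w_c boosts documents that contain
    the non-compositional phrases as units.
    """
    bag = "#combine(" + " ".join(terms) + ")"
    if not nc_phrases:
        return bag
    phrases = " ".join("#1(" + " ".join(p) + ")" for p in nc_phrases)
    return "#weight( %.2f %s %.2f #combine(%s) )" % (w_c, bag, w_nc, phrases)

print(boosted_query(["hot", "dog", "recipe"], [["hot", "dog"]]))
# #weight( 0.80 #combine(hot dog recipe) 0.20 #combine(#1(hot dog)) )
```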
90 We applied the preprocessing described ... [sent-488, score-0.56]
91 1423 Table 7: IR performance without considering noncompositionality (baseline), versus boosting real and pseudo non-compositionality (real NC, pseudo NCi). [sent-502, score-0.31]
92 in Section 4 to the queries and identified noncompositional phrases with the base AM classifier from Section 5. [sent-503, score-0.349]
93 Our approach for boosting the weight of these non-compositional phrases uses the same retrieval model enhanced with belief weights as described in Eq. [sent-504, score-0.266]
94 In addition, we include five runs that boost the weight of pseudo non-compositional phrases that were created randomly from the query text (pseudo NC runs). [sent-506, score-0.322]
95 These pseudo non-compositional phrases have exactly the same length as the observed noncompositional phrases for each query. [sent-507, score-0.507]
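How the pseudo phrases were sampled is not specified beyond matching the length of the real ones, so the sketch below, which picks a random contiguous span of the same length from the query text, is only one plausible reading.

```python
import random

def pseudo_nc_phrases(query_tokens, real_nc_phrases, rng=random):
    """For each real non-compositional phrase, pick a random span of the
    same length from the query text as a pseudo non-compositional phrase."""
    pseudo = []
    for phrase in real_nc_phrases:
        n = len(phrase)
        if len(query_tokens) >= n:
            start = rng.randrange(len(query_tokens) - n + 1)
            pseudo.append(query_tokens[start:start + n])
    return pseudo

query = "tunable vertical cavity surface emitting laser diodes".split()
print(pseudo_nc_phrases(query, [["laser", "diodes"]]))
```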
96 on making tunable vertical cavity surface emitting laser diodes ” and laser diodes was one of the non-compositional phrases recognized. [sent-532, score-0.262]
97 Semantic heads are, in analogy to syntactic heads, the core meaning units of phrases that cannot be further semantically decomposed. [sent-536, score-0.501]
98 To perform semantic head recognition for tokenization, we defined a novel cascaded model and implemented it as a statistical classifier that used previously proposed and new context features. [sent-537, score-0.704]
99 We reached an accuracy of 68% and argued that even a semantic head recognizer restricted to high-confidence decisions is useful because reliably recognizing a subset of semantic heads is better than recognizing none. [sent-539, score-0.901]
100 Finally, we showed that even in its preliminary current form the semantic head recognizer is able to improve the performance of an IR system. [sent-541, score-0.469]
wordName wordTfidf (topN-words)
[('rsemh', 0.298), ('head', 0.287), ('mwus', 0.245), ('hole', 0.181), ('mwu', 0.175), ('heads', 0.165), ('cascaded', 0.164), ('peripheral', 0.158), ('hot', 0.157), ('noncompositional', 0.151), ('afterglow', 0.14), ('ams', 0.14), ('dog', 0.133), ('black', 0.132), ('star', 0.131), ('semantic', 0.127), ('inertia', 0.123), ('simalt', 0.123), ('phrases', 0.122), ('multiword', 0.113), ('pseudo', 0.112), ('moment', 0.109), ('phrase', 0.105), ('velocity', 0.105), ('wnc', 0.105), ('compositional', 0.097), ('uv', 0.096), ('tokenization', 0.094), ('physics', 0.09), ('amf', 0.088), ('semh', 0.088), ('nc', 0.083), ('schone', 0.082), ('classifier', 0.076), ('noun', 0.076), ('birth', 0.075), ('recognized', 0.072), ('ir', 0.07), ('neutron', 0.07), ('peculiar', 0.07), ('optical', 0.068), ('pecina', 0.063), ('vectors', 0.063), ('retrieval', 0.062), ('bright', 0.06), ('cascade', 0.06), ('contexts', 0.059), ('mode', 0.059), ('decisions', 0.058), ('wc', 0.056), ('massive', 0.055), ('recognizer', 0.055), ('modifiers', 0.055), ('sj', 0.053), ('tokenizers', 0.053), ('decision', 0.052), ('dogs', 0.051), ('recognition', 0.05), ('syntactic', 0.049), ('query', 0.049), ('element', 0.049), ('raters', 0.047), ('expressions', 0.047), ('evert', 0.045), ('noncompositionality', 0.045), ('amt', 0.045), ('modifier', 0.044), ('alternative', 0.044), ('belief', 0.041), ('modes', 0.041), ('ramisch', 0.041), ('boosting', 0.041), ('recognizing', 0.041), ('runs', 0.039), ('confidence', 0.039), ('contingency', 0.038), ('stopwords', 0.038), ('amscp', 0.035), ('angular', 0.035), ('attia', 0.035), ('cook', 0.035), ('diodes', 0.035), ('ellipsoidal', 0.035), ('equilibrium', 0.035), ('forst', 0.035), ('imaging', 0.035), ('indri', 0.035), ('iraf', 0.035), ('isearch', 0.035), ('kaon', 0.035), ('laser', 0.035), ('lykke', 0.035), ('microscopy', 0.035), ('packet', 0.035), ('qnc', 0.035), ('resistance', 0.035), ('rsynth', 0.035), ('spectrograph', 0.035), ('vlba', 0.035), ('nlp', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition
Author: Lukas Michelbacher ; Alok Kothari ; Martin Forst ; Christina Lioma ; Hinrich Schutze
Abstract: Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because unlike other work on MWUs – tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system. – 1
2 0.097477555 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
Author: Jun Xie ; Haitao Mi ; Qun Liu
Abstract: Dependency structure, as a first step towards semantics, is believed to be helpful to improve translation quality. However, previous works on dependency structure based models typically resort to insertion operations to complete translations, which make it difficult to specify ordering information in translation rules. In our model of this paper, we handle this problem by directly specifying the ordering information in head-dependents rules which represent the source side as head-dependents relations and the target side as strings. The head-dependents rules require only substitution operation, thus our model requires no heuristics or separate ordering models of the previous works to control the word order of translations. Large-scale experiments show that our model performs well on long distance reordering, and outperforms the state- of-the-art constituency-to-string model (+1.47 BLEU on average) and hierarchical phrasebased model (+0.46 BLEU on average) on two Chinese-English NIST test sets without resort to phrases or parse forest. For the first time, a source dependency structure based model catches up with and surpasses the state-of-theart translation models.
3 0.088664927 78 emnlp-2011-Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus
Author: Su Nam Kim ; Preslav Nakov
Abstract: Responding to the need for semantic lexical resources in natural language processing applications, we examine methods to acquire noun compounds (NCs), e.g., orange juice, together with suitable fine-grained semantic interpretations, e.g., squeezed from, which are directly usable as paraphrases. We employ bootstrapping and web statistics, and utilize the relationship between NCs and paraphrasing patterns to jointly extract NCs and such patterns in multiple alternating iterations. In evaluation, we found that having one compound noun fixed yields both a higher number of semantically interpreted NCs and improved accuracy due to stronger semantic restrictions.
4 0.086001001 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
Author: Edward Grefenstette ; Mehrnoosh Sadrzadeh
Abstract: Modelling compositional meaning for sentences using empirical distributional methods has been a challenge for computational linguists. We implement the abstract categorical model of Coecke et al. (2010) using data from the BNC and evaluate it. The implementation is based on unsupervised learning of matrices for relational words and applying them to the vectors of their arguments. The evaluation is based on the word disambiguation task developed by Mitchell and Lapata (2008) for intransitive sentences, and on a similar new experiment designed for transitive sentences. Our model matches the results of its competitors . in the first experiment, and betters them in the second. The general improvement in results with increase in syntactic complexity showcases the compositional power of our model.
5 0.071096987 107 emnlp-2011-Probabilistic models of similarity in syntactic context
Author: Diarmuid O Seaghdha ; Anna Korhonen
Abstract: This paper investigates novel methods for incorporating syntactic information in probabilistic latent variable models of lexical choice and contextual similarity. The resulting models capture the effects of context on the interpretation of a word and in particular its effect on the appropriateness of replacing that word with a potentially related one. Evaluating our techniques on two datasets, we report performance above the prior state of the art for estimating sentence similarity and ranking lexical substitutes.
6 0.065345958 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation
7 0.065212511 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources
8 0.063677877 97 emnlp-2011-Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French
9 0.06284707 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
10 0.060480863 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
11 0.05929675 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
12 0.056106504 96 emnlp-2011-Multilayer Sequence Labeling
13 0.055345133 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
14 0.055256147 30 emnlp-2011-Compositional Matrix-Space Models for Sentiment Analysis
15 0.054846257 80 emnlp-2011-Latent Vector Weighting for Word Meaning in Context
16 0.051983133 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning
17 0.04873661 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
18 0.048382536 113 emnlp-2011-Relation Acquisition using Word Classes and Partial Patterns
19 0.047687508 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information
20 0.046430212 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing
topicId topicWeight
[(0, 0.19), (1, -0.041), (2, -0.054), (3, -0.011), (4, 0.019), (5, -0.031), (6, -0.038), (7, 0.067), (8, 0.039), (9, 0.011), (10, 0.131), (11, -0.1), (12, 0.025), (13, -0.039), (14, -0.07), (15, 0.014), (16, 0.065), (17, 0.028), (18, 0.071), (19, 0.008), (20, 0.085), (21, -0.101), (22, -0.064), (23, 0.049), (24, -0.052), (25, -0.007), (26, 0.076), (27, 0.087), (28, -0.105), (29, -0.035), (30, -0.059), (31, -0.005), (32, -0.068), (33, -0.188), (34, 0.093), (35, 0.012), (36, -0.068), (37, 0.074), (38, 0.157), (39, 0.027), (40, -0.178), (41, -0.057), (42, 0.011), (43, 0.052), (44, -0.057), (45, -0.096), (46, 0.31), (47, 0.014), (48, 0.015), (49, 0.196)]
simIndex simValue paperId paperTitle
same-paper 1 0.94340849 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition
Author: Lukas Michelbacher ; Alok Kothari ; Martin Forst ; Christina Lioma ; Hinrich Schutze
Abstract: Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because unlike other work on MWUs – tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system. – 1
2 0.53630888 78 emnlp-2011-Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus
Author: Su Nam Kim ; Preslav Nakov
Abstract: Responding to the need for semantic lexical resources in natural language processing applications, we examine methods to acquire noun compounds (NCs), e.g., orange juice, together with suitable fine-grained semantic interpretations, e.g., squeezed from, which are directly usable as paraphrases. We employ bootstrapping and web statistics, and utilize the relationship between NCs and paraphrasing patterns to jointly extract NCs and such patterns in multiple alternating iterations. In evaluation, we found that having one compound noun fixed yields both a higher number of semantically interpreted NCs and improved accuracy due to stronger semantic restrictions.
3 0.46198788 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
Author: Edward Grefenstette ; Mehrnoosh Sadrzadeh
Abstract: Modelling compositional meaning for sentences using empirical distributional methods has been a challenge for computational linguists. We implement the abstract categorical model of Coecke et al. (2010) using data from the BNC and evaluate it. The implementation is based on unsupervised learning of matrices for relational words and applying them to the vectors of their arguments. The evaluation is based on the word disambiguation task developed by Mitchell and Lapata (2008) for intransitive sentences, and on a similar new experiment designed for transitive sentences. Our model matches the results of its competitors . in the first experiment, and betters them in the second. The general improvement in results with increase in syntactic complexity showcases the compositional power of our model.
4 0.41469577 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information
Author: Richard Farkas
Abstract: Information-oriented document labeling is a special document multi-labeling task where the target labels refer to a specific information instead of the topic of the whole document. These kind oftasks are usually solved by looking up indicator phrases and analyzing their local context to filter false positive matches. Here, we introduce an approach for machine learning local content shifters which detects irrelevant local contexts using just the original document-level training labels. We handle content shifters in general, instead of learning a particular language phenomenon detector (e.g. negation or hedging) and form a single system for document labeling and content shift detection. Our empirical results achieved 24% error reduction compared to supervised baseline methods – on three document label– ing tasks.
5 0.37459639 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
Author: Jun Xie ; Haitao Mi ; Qun Liu
Abstract: Dependency structure, as a first step towards semantics, is believed to be helpful to improve translation quality. However, previous works on dependency structure based models typically resort to insertion operations to complete translations, which make it difficult to specify ordering information in translation rules. In our model of this paper, we handle this problem by directly specifying the ordering information in head-dependents rules which represent the source side as head-dependents relations and the target side as strings. The head-dependents rules require only substitution operation, thus our model requires no heuristics or separate ordering models of the previous works to control the word order of translations. Large-scale experiments show that our model performs well on long distance reordering, and outperforms the state- of-the-art constituency-to-string model (+1.47 BLEU on average) and hierarchical phrasebased model (+0.46 BLEU on average) on two Chinese-English NIST test sets without resort to phrases or parse forest. For the first time, a source dependency structure based model catches up with and surpasses the state-of-theart translation models.
6 0.35831442 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation
7 0.35242343 107 emnlp-2011-Probabilistic models of similarity in syntactic context
8 0.35202873 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
9 0.31420276 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association
10 0.31368619 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
11 0.30393746 113 emnlp-2011-Relation Acquisition using Word Classes and Partial Patterns
12 0.29633951 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
13 0.29344454 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation
14 0.28380787 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources
15 0.28328037 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models
16 0.2772668 91 emnlp-2011-Literal and Metaphorical Sense Identification through Concrete and Abstract Context
17 0.27717721 80 emnlp-2011-Latent Vector Weighting for Word Meaning in Context
18 0.27115119 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
19 0.26176605 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
20 0.2609961 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
topicId topicWeight
[(23, 0.076), (36, 0.042), (37, 0.02), (45, 0.071), (53, 0.019), (54, 0.023), (57, 0.017), (62, 0.021), (64, 0.014), (66, 0.039), (79, 0.031), (96, 0.037), (98, 0.491)]
simIndex simValue paperId paperTitle
same-paper 1 0.87245107 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition
Author: Lukas Michelbacher ; Alok Kothari ; Martin Forst ; Christina Lioma ; Hinrich Schutze
Abstract: Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because unlike other work on MWUs – tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system. – 1
Author: Ashish Venugopal ; Jakob Uszkoreit ; David Talbot ; Franz Och ; Juri Ganitkevitch
Abstract: We propose a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method is robust to local editing operations and provides well defined trade-offs between the ability to identify algorithm outputs and the quality of the watermarked output. Unlike previous work in the field, our approach does not rely on controlling the inputs to the algorithm and provides probabilistic guarantees on the ability to identify collections of results from one's own algorithm. We present an application in statistical machine translation, where machine translated output is watermarked at minimal loss in translation quality and detected with high recall.
Impact on Statistical Machine Translation Recent work(Resnik and Smith, 2003; Munteanu and Marcu, 2005; Uszkoreit et al. , 2010) has shown that multilingual parallel documents can be efficiently identified on the web and used as training data to improve the quality of statistical machine translation. The availability of free translation services (Google Translate, Bing Translate) and tools (Moses, Joshua) , increase the risk that the content found by parallel data mining is in fact generated by a machine, rather than by humans. In this work, we focus on statistical machine translation as an application for watermarking, with the goal of discarding documents from training if they have been generated by one’s own algorithms. To estimate the magnitude of the problem, we used parallel document mining (Uszkoreit et al. , 2010) to generate a collection of bilingual document pairs across several languages. For each document, we inspected the page content for source code that indicates the use of translation modules/plug-ins that translate and publish the translated content. We computed the proportion of the content within our corpus that uses these modules. We find that a significant proportion of the mined parallel data for some language pairs is generated via one of these translation modules. The top 3 languages pairs, each with parallel translations into English, are Tagalog (50.6%) , Hindi (44.5%) and Galician (41.9%) . While these proportions do not reflect impact on each language’s monolingual web, they are certainly high 1364 enough to affect machine translations systems that train on mined parallel data. In this work, we develop a general approach to watermark structured outputs and apply it to the outputs of a statistical machine translation system with the goal of identifying these same outputs on the web. In the context of the watermarking task defined above, we output selecting alternative translations for input source sentences. These translations often undergo simple edit and formatting operations such as case changes, sentence and word deletion or post editing, prior to publishing on the web. We want to ensure that we can still detect watermarked translations despite these edit operations. Given the rapid pace of development within machine translation, it is also important that the watermark be robust to improvements in underlying translation quality. Results from several iterations of the system within a single collection of documents should be identifiable under probabilistic bounds. While we present evaluation results for statistical machine translation, our proposed approach and associated requirements are applicable to any algorithm that produces structured results with several plausible alternatives. The alternative results can arise as a result of inherent task ambiguity (for example, there are multiple correct translations for a given input source sentence) or modeling uncertainty (for example, a model assigning equal probability to two competing results) . 3 Watermark Structured Results Selecting an alternative r0 from the space of alternatives Dk (q) can be stated as: r0= arr∈gDmk(aqx)w(r,Dk(q),h) (1) where w ranks r ∈ Dk (q) based on r’s presentwahtieorne owf a watermarking signal computed by a hashing operation h. In this approach, w and its component operation h are the only secrets held by the watermarker. 
This selection criterion is applied to all system outputs, ensuring that watermarked and non-watermarked version of a collection will never be available for comparison. A specific implementation of w within our watermarking approach can be evaluated by the following metrics: • • • False Positive Rate: how often nonFwaaltseermarked collections are falsely identified as watermarked. Recall Rate: how often watermarked collRecectiaolnls R are correctly inde wntaitfeierdm as wdat ceorl-marked. Quality Degradation: how significantly dQoueasl CN0 d Dieffegrr fdraotmio CN when evaluated by tdaoseks specific quality Cmetrics. While identification is performed at the collection level, we can scale these metrics based on the size of each collection to provide more task sensitive metrics. For example, in machine translation, we count the number of words in the collection towards the false positive and recall rates. In Section 3.1, we define a random hashing operation h and a task independent implementation of the selector function w. Section 3.2 describes how to classify a collection of watermarked results. Section 3.3 and 3.4 describes refinements to the selection and classification criteria that mitigate quality degradation. Following a comparison to related work in Section 4, we present experimental results for several languages in Section 5. 3.1 Watermarking: CN → CN0 We define a random hashing operation h that is applied to result r. It consists of two components: • A hash function applied to a structured re- sAul ht r hto f generate a lbieitd sequence cotfu a dfix reedlength. • An optional mapping that maps a single cAannd oidptaitoen raels umlta r ntog a hsaett mofa spusb -are ssiunlgtsle. Each sub-result is then hashed to generate a concatenated bit sequence for r. A good hash function produces outputs whose bits are independent. This implies that we can treat the bits for any input structured results 1365 as having been generated by a binomial distribution with equal probability of generating 1s vs 0s. This condition also holds when accumulating the bit sequences over a collection of results as long as its elements are selected uniformly from the space of possible results. Therefore, the bits generated from a collection of unwatermarked results will follow a binomial distribution with parameter p = 0.5. This result provides a null hypothesis for a statistical test on a given bit sequence, testing whether it is likely to have been generated from a binomial distribution binomial(n, p) where p = 0.5 and n is the length of the bit sequence. For a collection CN = r1 · · · rN, we can define a Fwaorte arm coalrlekc ranking funct·i·o·nr w to systematically select alternatives ri0 ∈ Dk (q) , such that the resulting CN0 is unlikely ∈to D produce bit sequences ltthinagt f Collow the p = 0.5 binomial distribution. A straightforward biasing criteria would be to select the candidate whose bit sequence exhibits the highest ratio of 1s. w can be defined as: (2) w(r,Dk(q),h) =#(|h1,(rh)(|r)) where h(r) returns the randomized bit sequence for result r, and #(x, y) counts the number of occurrences of x in sequence Selecting alternatives results to exhibit this bias will result in watermarked collections that exhibit this same bias. y. 3.2 Detecting the Watermark To classify a collection CN as watermarked or non-watermarked, we apply the hashing operation h on each element in CN and concatenate ttihoen sequences. 
eTlhemis sequence is tested against the null hypothesis that it was generated by a binomial distribution with parameter p = 0.5. We can apply a Fisherian test of statistical significance to determine whether the observed distribution of bits is unlikely to have occurred by chance under the null hypothesis (binomial with p = 0.5) . We consider a collection of results that rejects the null hypothesis to be watermarked results generated by our own algorithms. The p-value under the null hypothesis is efficiently computed by: p − value = Pn (X ≥ x) = Xi=nx?ni?pi(1 − p)n−i (3) (4) where x is the number of 1s observed in the collection, and n is the total number of bits in the sequence. Comparing this p-value against a desired significance level α, we reject the null hypothesis for collections that have Pn(X ≥ x) < α, thus deciding that such collections( were gen- erated by our own system. This classification criteria has a fixed false positive rate. Setting α = 0.05, we know that 5% of non-watermarked bit sequences will be falsely labeled as watermarked. This parameter α can be controlled on an application specific basis. By biasing the selection of candidate results to produce more 1s than 0s, we have defined a watermarking approach that exhibits a fixed false positive rate, a probabilistically bounded detection rate and a task independent hashing and selection criteria. In the next sections, we will deal with the question of robustness to edit operations and quality degradation. 3.3 Robustness and Inherent Bias We would like the ability to identify watermarked collections to be robust to simple edit operations. Even slight modifications to the elements within an item r would yield (by construction of the hash function) , completely different bit sequences that no longer preserve the biases introduced by the watermark selection function. To ensure that the distributional biases introduced by the watermark selector are preserved, we can optionally map individual results into a set of sub-results, each one representing some local structure of r. h is then applied to each subresult and the results concatenated to represent r. This mapping is defined as a component of the h operation. While a particular edit operation might affect a small number of sub-results, the majority of the bits in the concatenated bit sequence for r would remain untouched, thereby limiting the damage to the biases selected during watermark1366 ing. This is of course no defense to edit operations that are applied globally across the result; our expectation is that such edits would either significantly degrade the quality of the result or be straightforward to identify directly. For example, a sequence of words r = z1 · · · zL can be mapped into a set of consecutive n-gram sequences. Operations to edit a word zi in r will only affect events that consider the word zi. To account for the fact that alternatives in Dk (q) might now result in bit sequences of different lengths, we can generalize the biasing criteria to directly reflect the expected contribution to the watermark by defining: w(r, Dk(q), h) = Pn(X ≥ #(1, h(r))) (5) where Pn gives probabilities from binomial(n = |h(r) |,p = 0.5) . (Irn)|h,epr =en 0t. 5c)o.llection level biases: Our null hypothesis is based on the assumption that collections of results draw uniformly from the space of possible results. This assumption might not always hold and depends on the type of the results and collection. 
For example, considering a text document as a collection of sentences, we can expect that some sentences might repeat more frequently than others. This scenario is even more likely when applying a mapping into sub-results. n-gram sequences follow long-tailed or Zipfian distributions, with a small number of n-grams contributing heavily toward the total number of n-grams in a document. A random hash function guarantees that inputs are distributed uniformly at random over the output range. However, the same input will be assigned the same output deterministically. Therefore, if the distribution of inputs is heavily skewed to certain elements of the input space, the output distribution will not be uniformly distributed. The bit sequences resulting from the high frequency sub-results have the potential to generate inherently biased distributions when accumulated at the collection level. We want to choose a mapping that tends towards generating uniformly from the space of sub-results. We can empirically measure the quality of a sub-result mapping for a specific task by computing the false positive rate on non-watermarked collections. For a given significance level α, an ideal mapping would result in false positive rates close to α as well. Figure 1 shows false positive rates from 4 alternative mappings, computed on a large corpus of French documents (see Table 1for statistics) . Classification decisions are made at the collection level (documents) but the contribution to the false positive rate is based on the number of words in the classified document. We consider mappings from a result (sentence) into its 1-grams, 1 − 5-grams and 3 − 5 grams as well as trahem non-mapping case, w 3h −ere 5 tghrea mfusll a sres wuelltl is hashed. Figure 1 shows that the 1-grams and 1 − 5gram generate wsusb t-hraetsul tthse t 1h-agtr rmessu latn idn 1h −eav 5-ily biased false positive rates. The 3 − 5 gram mapping yields pfaolsseit positive r.a Ttesh ecl 3os −e t 5o gthraemir theoretically expected values. 1 Small deviations are expected since documents make different contributions to the false positive rate as a function of the number of words that they represent. For the remainder of this work, we use the 3-5 gram mapping and the full sentence mapping, since the alternatives generate inherently distributions with very high false positive rates. 3.4 Considering Quality The watermarking described in Equation 3 chooses alternative results on a per result basis, with the goal of influencing collection level bit sequences. The selection criteria as described will choose the most biased candidates available in Dk (q) . The parameter k determines the extent to which lesser quality alternatives can be chosen. If all the alternatives in each Dk (q) are of relatively similar quality, we expect minimal degradation due to watermarking. Specific tasks however can be particularly sensitive to choosing alternative results. Discriminative approaches that optimize for arg max selection like (Och, 2003; Liang et al. , 2006; Chiang et al. , 2009) train model parameters such 1In the final version of this paper we will perform sampling to create a more reliable estimate of the false positive rate that is not overly influenced by document length distributions. 1367 that the top-ranked result is well separated from its competing alternatives. 
Different queries also differ in the inherent ambiguity expected from their results; sometimes there really is just one correct result for a query, while for other queries, several alternatives might be equally good. By generalizing the definition of the w function to interpolate the estimated loss in quality and the gain in the watermarking signal, we can trade-off the ability to identify the watermarked collections against quality degradation: w(r,Dk(q),fw)− =(1 λ − ∗ λ g)ai ∗nl( or,s D(rk,(Dq)k,(fqw)) (6) Loss: The loss(r, Dk (q)) function reflects the quality degradation that results from selecting alternative r as opposed to the best ranked candidate in Dk (q)) . We will experiment with two variants: lossrank (r, Dk (q)) = (rank(r) − k)/k losscost(r, Dk(q)) = (cost(r)−cost(r1))/ cost(r1) where: • • • rank(r) : returns the rank of r within Dk (q) . cost(r) : a weighted sum of features (not cnoosrtm(ra)li:ze ad over httheed sse uarmch o space) rine a loglinear model such as those mentioned in (Och, 2003). r1: the highest ranked alternative in Dk (q) . lossrank provides a generally applicable criteria to select alternatives, penalizing selection from deep within Dk (q) . This estimate of the quality degradation does not reflect the generating model’s opinion on relative quality. losscost considers the relative increase in the generating model’s cost assigned to the alternative translation. Gain: The gain(r, Dk (q) , fw) function reflects the gain in the watermarking signal by selecting candidate r. We simply define the gain as the Pn(X ≥ #(1, h(r))) from Equation 5. ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (a) 1-grams mapping ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (c) 3 − 5-grams mapping ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (b) 1− 5-grams mapping p-value threshold (d) Full result hashing Figure 1 Comparison : of expected false positive rates against observed false positive rates for different sub-result mappings. 4 Related Work Using watermarks with the goal of transmitting a hidden message within images, video, audio and monolingual text media is common. For structured text content, linguistic approaches like (Chapman et al. , 2001; Gupta et al., 2006) use language specific linguistic and semantic expansions to introduce hidden watermarks. These expansions provide alternative candidates within which messages can be encoded. Recent publications have extended this idea to machine translation, using multiple systems and expansions to generate alternative translations. (Stutsman et al. , 2006) uses a hashing function to select alternatives that encode the hidden message in the lower order bits of the translation. In each of these approaches, the watermarker has control over the collection of results into which the watermark is to be embedded. These approaches seek to embed a hidden message into a collection of results that is selected by the watermarker. In contrast, we address the condition where the input queries are not in the watermarker’s control. 1368 The goal is therefore to introduce the watermark into all generated results, with the goal of probabilistically identifying such outputs. Our approach is also task independent, avoiding the need for templates to generate additional alternatives. 
By addressing the problem directly within the search space of a dynamic programming algorithm, we have access to high quality alternatives with well defined models of quality loss. Finally, our approach is robust to local word editing. By using a sub-result mapping, we increase the level of editing required to obscure the watermark signal; at high levels of editing, the quality of the results themselves would be significantly degraded. 5 Experiments We evaluate our watermarking approach applied to the outputs of statistical machine translation under the following experimental setup. A repository of parallel (aligned source and target language) web documents is sampled to produce a large corpus on which to evaluate the watermarking classification performance. The corpora represent translations into 4 diverse target languages, using English as the source language. Each document in this corpus can be considered a collection of un-watermarked structured results, where source sentences are queries and each target sentence represents a structured result. Using a state-of-the-art phrase-based statistical machine translation system (Och and Ney, 2004) trained on parallel documents identified by (Uszkoreit et al. , 2010) , we generate a set of 100 alternative translations for each source sentence. We apply the proposed watermarking approach, along with the proposed refinements that address task specific loss (Section 3.4) and robustness to edit operations (Section 3.3) to generate watermarked corpora. Each method is controlled via a single parameter (like k or λ) which is varied to generate alternative watermarked collections. For each parameter value, we evaluate the Recall Rate and Quality Degradation with the goal of finding a setting that yields a high recall rate, minimal quality degradation. False positive rates are evaluated based on a fixed classification significance level of α = 0.05. The false positive and recall rates are evaluated on the word level; a document that is misclassified or correctly identified contributes its length in words towards the error calculation. In this work, we use α = 0.05 during classification corresponding to an expected 5% false positive rate. The false positive rate is a function of h and the significance level α and therefore constant across the parameter values k and λ. We evaluate quality degradation on human translated test corpora that are more typical for machine translation evaluation. Each test corpus consists of 5000 source sentences randomly selected from the web and translated into each respective language. We chose to evaluate quality on test corpora to ensure that degradations are not hidden by imperfectly matched web corpora and are consistent with the kind of results often reported for machine translation systems. As with the classification corpora, we create watermarked versions at each parameter value. For a given pa1369 recall Figure 2: BLEU loss against recall of watermarked content for the baseline approach (max K-best) , rank and cost interpolation. rameter value, we measure false positive and re- call rates on the classification corpora and quality degradation on the evaluation corpora. Table 1 shows corpus statistics for the classification and test corpora and non-watermarked BLEU scores for each target language. All source texts are in English. 
5.1 Loss Interpolated Experiments

Our first set of experiments compares baseline performance using the watermarking criterion in Equation 5 against the refinements suggested in Section 3.4 to mitigate quality degradation. The h function is computed on the full sentence result r with no sub-event mapping. The following methods are evaluated in Figure 2:

• Baseline method (labeled "max K-best"): selects r purely based on the gain in the watermarking signal (Equation 5) and is parameterized by k, the number of alternatives considered for each result.
• Rank interpolation: incorporates rank into w, varying the interpolation parameter λ.
• Cost interpolation: incorporates cost into w, varying the interpolation parameter λ.

The observed false positive rate on the French classification corpora is 1.9%.

Table 1: Corpus statistics for the classification and quality degradation corpora (Arabic, French, Hindi, Turkish); non-watermarked BLEU scores are reported for the quality corpora.

We consider 0.2% BLEU loss as a threshold for acceptable quality degradation. Each method is judged by its ability to achieve high recall below this quality degradation threshold. Applying cost interpolation yields the best results in Figure 2, achieving a recall of 85% at 0.2% BLEU loss, while rank interpolation achieves a recall of 76%. The baseline approach of selecting the highest-gain candidate within a depth of k candidates does not provide sufficient parameterization to yield low quality degradation. At k = 2, this method yields almost 90% recall, but with approximately 0.4% BLEU loss.

5.2 Robustness Experiments

In Section 3.3, we proposed mapping results into sub-events or features. We considered alternative feature mappings in Figure 1, finding that mapping sentence results into a collection of 3-5 grams yields acceptable false positive rates at varied levels of α. Figure 3 presents results that compare moving from result-level hashing to the 3-5 gram sub-result mapping. We show the impact of the mapping on the baseline max K-best method as well as on cost interpolation. There are substantial reductions in recall rate at the 0.2% BLEU loss level when applying sub-result mappings in both cases. The recall of the cost interpolation method drops from 85% to 77% when using the 3-5 gram event mapping. The observed false positive rate of the 3-5 gram mapping is 4.7%. By using the 3-5 gram mapping, we expect to increase robustness against local word edit operations, but we have sacrificed recall rate due to the inherent distributional bias discussed in Section 3.3.

Figure 3: BLEU loss against recall of watermarked content for the baseline and cost interpolation methods using both result-level and 3-5 gram mapped events.

5.3 Multilingual Experiments

The watermarking approach proposed here introduces no language-specific watermarking operations and is thus broadly applicable to translating into all languages. In Figure 4, we report results for the baseline and cost interpolation methods, considering both the result-level and 3-5 gram mappings. We set α = 0.05 and measure recall at 0.2% BLEU degradation for translation from English into Arabic, French, Hindi and Turkish.
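As one possible reading of the sub-result mapping used in Sections 5.2 and 5.3, the sketch below hashes each 3-5 gram of a sentence to a single bit and scores a document with a binomial tail probability, mirroring the Pn(X ≥ #(1, h(r))) notation of Equation 5. The particular hash function and the exact form of the test are assumptions made for illustration, not the paper's specification.

```python
import hashlib
from math import comb

def ngram_bits(sentence: str, n_min: int = 3, n_max: int = 5):
    """Map a sentence result into 3-5 gram sub-events, hashing each to one bit."""
    words = sentence.split()
    bits = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            digest = hashlib.sha1(gram.encode("utf-8")).digest()
            bits.append(digest[0] & 1)  # lowest bit of the hash as the sub-event bit
    return bits

def binomial_p_value(bits, p: float = 0.5) -> float:
    """Upper-tail probability P(X >= #ones) under a fair-coin null hypothesis."""
    n, ones = len(bits), sum(bits)
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(ones, n + 1))

# A document would be classified as watermarked when the bits pooled from all of
# its sentences yield a p-value below the significance level alpha (e.g. 0.05).
```

Because an edit to one word only changes the few n-grams that cover it, most sub-event bits survive local editing, which is the robustness property motivating the mapping.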
The observed false positive rates for full-sentence hashing are: Arabic: 2.4%, French: 1.8%, Hindi: 5.6% and Turkish: 5.5%; for the 3-5 gram mapping, they are: Arabic: 5.8%, French: 7.5%, Hindi: 3.5% and Turkish: 6.2%. Underlying translation quality plays an important role in translation quality degradation when watermarking. Without a sub-result mapping, French (BLEU: 26.45%) achieves recall of 85% at 0.2% BLEU loss, while the other languages achieve over 90% recall at the same BLEU loss threshold. Using a sub-result mapping degrades quality for each language pair, but changes the relative performance. Turkish experiences the highest relative drop in recall, unlike French and Arabic, where results are relatively more robust to using sub-sentence mappings. This is likely a result of differences in n-gram distributions across these languages. The languages considered here all use space-separated words. For languages that do not, like Chinese or Thai, our approach can be applied at the character level.

Figure 4: Loss of recall when using the 3-5 gram mapping vs. sentence-level mapping for Arabic, French, Hindi and Turkish translations.

6 Conclusions

In this work we proposed a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method provides probabilistic bounds on detection ability, analytic control over quality degradation, and robustness to local editing operations. It is applicable to any task where structured outputs are generated with ambiguities or ties in the results. We applied the method to the outputs of statistical machine translation, evaluating each refinement to our approach with false positive and recall rates against BLEU score quality degradation. Our results show that it is possible, across several language pairs, to achieve high recall rates (over 80%) with low false positive rates (between 5 and 8%) at minimal quality degradation (0.2% BLEU), while still allowing for local edit operations on the translated output. In future work we will continue to investigate methods to mitigate quality loss.

References

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311.

Mark Chapman, George Davida, and Marc Rennhard. 2001. A practical and effective approach to large-scale automated linguistic steganography. In Proceedings of the Information Security Conference.

David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval, pages 121–128.

Gaurav Gupta, Josef Pieprzyk, and Hua Xiong Wang. 2006. An attack-localizing watermarking scheme for natural language documents. In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, ASIACCS '06, pages 157–165, New York, NY, USA. ACM.
Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the Joint International Conference on Computational Linguistics and Association of Computational Linguistics (COLING/ACL), pages 761–768.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 2003 Meeting of the Association for Computational Linguistics.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics.

Ryan Stutsman, Mikhail Atallah, Christian Grothoff, and Krista Grothoff. 2006. Lost in just the translation. In Proceedings of the 2006 ACM Symposium on Applied Computing.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 2010 COLING.
3 0.80638051 29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking
Author: Zheng Chen ; Heng Ji
Abstract: In this paper, we present a new ranking scheme, collaborative ranking (CR). In contrast to traditional non-collaborative ranking scheme which solely relies on the strengths of isolated queries and one stand-alone ranking algorithm, the new scheme integrates the strengths from multiple collaborators of a query and the strengths from multiple ranking algorithms. We elaborate three specific forms of collaborative ranking, namely, micro collaborative ranking (MiCR), macro collaborative ranking (MaCR) and micro-macro collaborative ranking (MiMaCR). Experiments on entity linking task show that our proposed scheme is indeed effective and promising.
4 0.39693838 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and Urdu-English translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
5 0.38546526 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information
Author: Richard Farkas
Abstract: Information-oriented document labeling is a special document multi-labeling task where the target labels refer to a specific information instead of the topic of the whole document. These kinds of tasks are usually solved by looking up indicator phrases and analyzing their local context to filter false positive matches. Here, we introduce an approach for machine learning local content shifters which detects irrelevant local contexts using just the original document-level training labels. We handle content shifters in general, instead of learning a particular language phenomenon detector (e.g. negation or hedging), and form a single system for document labeling and content shift detection. Our empirical results achieved 24% error reduction compared to supervised baseline methods on three document labeling tasks.
6 0.37502432 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
7 0.37481418 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement
8 0.36600789 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
9 0.36283183 117 emnlp-2011-Rumor has it: Identifying Misinformation in Microblogs
10 0.36211032 70 emnlp-2011-Identifying Relations for Open Information Extraction
11 0.36121383 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
12 0.36004686 105 emnlp-2011-Predicting Thread Discourse Structure over Technical Web Forums
13 0.35596383 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing
14 0.35522401 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
15 0.35144693 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
16 0.35043275 138 emnlp-2011-Tuning as Ranking
17 0.34925404 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources
18 0.34212169 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
19 0.3351253 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
20 0.33445555 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification