emnlp emnlp2011 emnlp2011-29 knowledge-graph by maker-knowledge-mining

29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking


Source: pdf

Author: Zheng Chen ; Heng Ji

Abstract: In this paper, we present a new ranking scheme, collaborative ranking (CR). In contrast to the traditional non-collaborative ranking scheme, which relies solely on the strengths of isolated queries and one stand-alone ranking algorithm, the new scheme integrates the strengths of multiple collaborators of a query and the strengths of multiple ranking algorithms. We elaborate three specific forms of collaborative ranking, namely micro collaborative ranking (MiCR), macro collaborative ranking (MaCR), and micro-macro collaborative ranking (MiMaCR). Experiments on the entity linking task show that our proposed scheme is indeed effective and promising.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we present a new ranking scheme, collaborative ranking (CR). [sent-3, score-0.771]

2 In contrast to the traditional non-collaborative ranking scheme, which relies solely on the strengths of isolated queries and one stand-alone ranking algorithm, the new scheme integrates the strengths of multiple collaborators of a query and the strengths of multiple ranking algorithms. [sent-4, score-1.803]

3 We elaborate three specific forms of collaborative ranking, namely micro collaborative ranking (MiCR), macro collaborative ranking (MaCR), and micro-macro collaborative ranking (MiMaCR). [sent-5, score-1.603]

4 Experiments on the entity linking task show that our proposed scheme is indeed effective and promising. [sent-6, score-0.248]

5 1 Introduction Many natural language processing tasks can be formalized as a ranking problem, namely to rank a collection of candidate “objects” with respect to a “query”. [sent-7, score-0.27]

6 Previous studies mainly focused on improving the ranking performance using one stand-alone learning algorithm on isolated queries. [sent-12, score-0.304]

7 The single query may not be formulated with the best terms, or the query itself may not contain the comprehensive information required for a high-performance ranking algorithm. [sent-22, score-0.8]

8 Therefore, techniques of query expansion or query reformulation can be introduced, and previous research has shown the effectiveness of those techniques in such applications as information retrieval and question answering (Manning et al. [sent-23, score-0.559]

9 Nevertheless, previous research normally treats a query reformulation as a new query for the ranking system; it would be more desirable to form a larger-scale "collaborative" group for the query and make a unified decision based on the group. [sent-26, score-1.113]

10 Inspired by human collaborative learning, in which two or more people form a group and accomplish work together, we propose a new ranking scheme, collaborative ranking, which aims to imitate human collaborative learning and enhance system ranking performance. [sent-27, score-1.281]

11 The main idea is to seek collaborations for each query at two levels: (1) query-level: search for a group of query collaborators and make a joint decision from the group, together with the query, using a stand-alone ranking algorithm. [sent-28, score-1.161]

12 Figure 1 presents an intuitive illustration of four ranking approaches, including the traditional non-collaborative ranking and three collaborative ranking forms: micro collaborative ranking (MiCR), macro collaborative ranking (MaCR), and micro-macro collaborative ranking (MiMaCR). [sent-33, score-2.71]

13 Figure 1 (b) demonstrates that 6 query collaborators together with the query form a query collaboration group. [sent-35, score-1.27]

14 In this paper, we will show the efficacy of collaborative ranking on the entity linking task defined in the Knowledge Base Population (KBP) track (Ji et al. [sent-39, score-0.717]

15 Each query in the task is associated with a name string and its context document. [sent-41, score-0.297]

16 The intuition why query-level collaboration may work is that it leverages more comprehensive information about the entity mention from multiple "collaborators" (a query is normally expressed by a small-scale data structure, so called micro. [sent-43, score-0.274]

17 Furthermore, previous work on this task mainly focused on comparing one ranking algorithm with the others; however, each ranking algorithm has its own strengths, and therefore ranker-level collaboration can potentially improve the performance. [sent-46, score-0.674]

18 The goal of non-collaborative ranking is to seek a ranking function f such that it computes ranking scores for the candidates in the object set, i. [sent-53, score-0.878]
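
To make the later collaborative variants concrete, here is a minimal, illustrative sketch of this non-collaborative setting: a single ranking function f scores each candidate object for the query, and the candidates are sorted by that score. The names and type signatures are assumptions for exposition, not the authors' code.

```python
from typing import Callable, List, Sequence, Tuple

# f(query, object) -> score; any stand-alone unsupervised or supervised ranker fits this shape.
RankingFunction = Callable[[str, str], float]

def non_collaborative_rank(query: str,
                           objects: Sequence[str],
                           f: RankingFunction) -> List[Tuple[str, float]]:
    """Score every candidate object with a single stand-alone ranker and sort by score."""
    scored = [(obj, f(query, obj)) for obj in objects]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```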

19 Earlier studies on non-collaborative ranking mainly explored unsupervised approaches, e. [sent-59, score-0.27]

20 Recently, supervised approaches (named "learning to rank") that automatically learn ranking functions from training data have become the focus of ranking research. [sent-64, score-0.6]

21 1 Micro Collaborative Ranking(MiCR) Micro collaborative ranking is characterized by integrating joint strengths from multiple query collaborators and the query itself. [sent-116, score-1.477]

22 • Redundancy: Collaborators and the query may share redundant information. [sent-118, score-0.265]

23 • Robustness: Noisy collaborators are allowable, and they should be put under control. [sent-120, score-0.341]

24 Let f be a ranking function which is obtained by either an unsupervised or a supervised approach. [sent-129, score-0.33]

25 There are two important steps that distinguish MiCR from traditional non-collaborative ranking approaches: • Step (1): searching the best k collaborators of q. [sent-130, score-0.669]

26 • Step (2): simulating the interaction of the k collaborators at the ranking step. [sent-131, score-0.611]

27 In our case study presented later, we transform the collaborator searching problem into a clustering problem. [sent-133, score-0.251]

28 Collaborators of a query are then formed by the members (excluding the query) of the cluster which contains the query, and k is the size of the cluster minus one. [sent-134, score-0.584]

29 We transform the problem of step (2) into solving a function g1 such that a ranking score yj(q) can be computed for each object oj(q).

30 One approach to computing g1 is to first compute the ranking scores of the collaborators and the query using the ranking function f, and then combine those ranking scores in some way (Formula 1). [sent-136, score-1.478]

31 The other approach is to learn a supervised ranking function which takes collaborators and query as input (Formula 2). [sent-137, score-0.936]
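
A minimal sketch of the first (Formula 1 style) realization of g1: the base ranking function f scores each object against the query and against each collaborator, and the per-object scores are combined with an average, max, or min. The helper names are illustrative; the supervised alternative (Formula 2) would instead learn g1 from features of the whole collaboration group.

```python
from statistics import mean
from typing import Callable, List, Sequence

RankingFunction = Callable[[str, str], float]  # f(query_or_collaborator, object) -> score

def micr_scores(query: str,
                collaborators: Sequence[str],
                objects: Sequence[str],
                f: RankingFunction,
                combine: str = "average") -> List[float]:
    """MiCR scores: combine the per-object scores of the query and its k collaborators."""
    combiner = {"average": mean, "max": max, "min": min}[combine]
    members = [query] + list(collaborators)
    return [combiner([f(m, obj) for m in members]) for obj in objects]
```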

32 MiCR algorithm. Input: a query q; a set of objects o(q); a function g1. Output: a set of ranking scores y(q). 1: Search the k collaborators of q: cq(q) = {cq1, ..., cqk}; steps 2-6 then compute each ranking score yj(q) via g1. [sent-159, score-0.929]

33 2 Macro Collaborative Ranking(MaCR) Macro collaborative ranking is characterized by integrating joint strengths from multiple rankers. [sent-169, score-0.606]

34 It is based on the following assumptions: • Independence: Each ranker can make its own ranking decisions. [sent-170, score-0.556]

35 • Diversity: Each ranker has its own strengths in making ranking decisions. [sent-171, score-0.636]

36 Let xj(q) = φ(q, oj(q)) be the feature vector formed from the pair consisting of query q and an associated object oj(q). Let F∗ = {f1, . [sent-173, score-0.299]

37 We transform the computation of collaboration among rankers into solving the following composite function g2, which maps the rankers' scores on xj(q) to a ranking score yj(q) for each object oj(q). [sent-178, score-0.498]

38 A special form of ranking problem is that only the best object is required as output. [sent-183, score-0.304]

39 MaCR algorithm. Input: a query q; a set of objects o(q); a set of m ranking functions F∗; a composite function g2. Output: a set of ranking scores y(q). 1: for j = 1; j <= n(q); j++ do 2: Form a feature vector xj(q). [sent-187, score-0.937]
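
For the special case where only the best object is needed, one concrete, illustrative choice of g2 is majority voting over each ranker's top pick, which is the kind of voting function evaluated in the experiments below. The tie-breaking fallback here (defer to the first ranker in the list) is an assumption; the paper's own tie-breaking strategy is discussed later and reported to be not very effective.

```python
from collections import Counter
from typing import Callable, Sequence

Ranker = Callable[[str, str], float]  # fi(query, object) -> score

def macr_vote(query: str, objects: Sequence[str], rankers: Sequence[Ranker]) -> str:
    """Each ranker votes for its top-scoring object; the object with the most votes wins."""
    votes = Counter(max(objects, key=lambda obj: f(query, obj)) for f in rankers)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        # Tie (possible with an even number of rankers): fall back to the first ranker's choice.
        return max(objects, key=lambda obj: rankers[0](query, obj))
    return ranked[0][0]
```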

40 3 Micro-Macro Collaborative Ranking (MiMaCR) The above two ranking approaches can be further integrated into a joint model, which is named Micro-Macro Collaborative Ranking (MiMaCR). [sent-194, score-0.27]

41 In order to compute query-level and ranker-level collaboration jointly, we solve the following composite function g3: yj(q) = g2(g1(·)) for each object oj(q) (Formula 4), in which we first compute m micro-ranking scores using the m ranking functions on the query-level collaborators and then combine them with g2. [sent-195, score-0.545]

42 MiMaCR algorithm. Input: a query q; a set of objects o(q); a set of ranking functions F∗; functions g1, g2. Output: a set of ranking scores y(q). 1: Search the k collaborators of q: cq(q) = {cq1, . [sent-213, score-1.227]
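
A minimal sketch of the joint form, under the same illustrative assumptions as the two sketches above: for each object, g1 first aggregates every ranker's scores over the query and its collaborators (here by averaging), and g2 then combines the resulting m micro scores (again by averaging). The specific choices of g1 and g2 are placeholders, not necessarily the variants the authors found best.

```python
from statistics import mean
from typing import Callable, List, Sequence

Ranker = Callable[[str, str], float]  # f(query_or_collaborator, object) -> score

def mimacr_scores(query: str,
                  collaborators: Sequence[str],
                  objects: Sequence[str],
                  rankers: Sequence[Ranker]) -> List[float]:
    """y_j = g2(g1(.)): g1 aggregates over the collaboration group, g2 over the m rankers."""
    members = [query] + list(collaborators)

    def g1(f: Ranker, obj: str) -> float:
        return mean(f(m, obj) for m in members)

    def g2(micro_scores: Sequence[float]) -> float:
        return mean(micro_scores)

    return [g2([g1(f, obj) for f in rankers]) for obj in objects]
```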

43 6: end for 7: return y(q). 4 A Case Study on Entity Linking To demonstrate the efficacy of our collaborative ranking scheme, we apply it to the entity linking task defined in the TAC-KBP2010 program (Ji et al. [sent-223, score-0.717]

44 , 2010) because there is a large amount of training and evaluation data available and various noncollaborative ranking approaches have been proposed, as summarized in (McNamee and Dang, 2009; Ji et al. [sent-224, score-0.31]

45 1 Task Definition The entity linking task aims to align a textual mention of a named entity (person, organization, or geo-political) to an appropriate entry in a knowledge base (KB), which may or may not contain the entity. [sent-227, score-0.36]

46 Let q denote a query in the task, which is a triple consisting of a query id (q. [sent-231, score-0.564]

47 The entity linking system should return the id of “Michael Jordan (footballer)” as the answer, rather than the id of “Michael Jordan” who is most well known as a basketball player. [sent-249, score-0.284]

48 3 Baseline Rankers We developed 8 baseline rankers, including 4 unsupervised rankers (f1, f2, f3, f4) and 4 supervised rankers(f5 , f6, f7, f8). [sent-259, score-0.317]

49 • Naive (f1): since the answer for each query can either be a KB id or NIL, the naive ranker simply outputs NIL for all queries. [sent-260, score-0.619]

50 • Maxent (f5): a pointwise ranker implemented using the OpenNLP Maxent toolkit, which is based on the maximum entropy model. [sent-273, score-0.315]

51 • SVM (f6): a pointwise ranker implemented using SVMlight (Joachims, 1999). [sent-274, score-0.315]

52 • SVM ranking (f7): a pairwise ranker implemented using SVMrank (Joachims, 2006). [sent-275, score-0.578]

53 • ListNet (f8): a listwise ranker presented in (Cao et al. [sent-276, score-0.326]

54 The four supervised rankers apply exactly the same set of features, except that SVM ranking (f7) needs to double-expand the feature vector. [sent-278, score-0.587]

55 4 MiCR for Entity Linking We convert the collaborator searching problem into a clustering problem, i. [sent-286, score-0.251]

56 , for a given query q in the task, we retrieve at most K = 300 documents from the large corpus C, each of which contains q. [sent-288, score-0.265]

57 string; we then apply a clustering algorithm to generate clusters over the documents, and form query collaborators (excluding q. [sent-289, score-0.71]

58 We first selected f3 as our basic ranking function, and investigated whether the ranker can benefit from query collaborators formed by either agglomerative clustering or graph clustering. [sent-305, score-1.355]
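
A minimal sketch of this collaborator search as clustering, under illustrative assumptions: the retrieved documents that contain q.string are grouped by a simple single-link rule over a pairwise similarity (a stand-in for the paper's graph or agglomerative clustering), and the members of the cluster containing the query's context document, excluding that document, become the collaborators. The similarity function and the default threshold are assumptions.

```python
from typing import Callable, List, Sequence

Similarity = Callable[[str, str], float]

def find_collaborators(query_doc: str,
                       retrieved_docs: Sequence[str],
                       sim: Similarity,
                       threshold: float = 0.5) -> List[str]:
    """Single-link grouping: documents are joined whenever their similarity >= threshold."""
    docs = [query_doc] + list(retrieved_docs)
    parent = list(range(len(docs)))  # union-find forest over document indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if sim(docs[i], docs[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    query_cluster = find(0)
    # Collaborators are the other members of the query document's cluster (k = cluster size - 1).
    return [docs[i] for i in range(1, len(docs)) if find(i) == query_cluster]
```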

59 We implemented three versions of the composite function g1 (max, min, and average), and evaluated their performance using the three unsupervised rankers f2, f3, and f4, respectively. [sent-306, score-0.454]

60 Lastly, we implemented three supervised versions of g1 (Maxent, SVM, and ListNet, respectively) by adding cluster-level features and retraining the models of the three supervised rankers f5, f6, and f8, respectively. [sent-307, score-0.376]

61 Cluster-level features include the maximum, minimum, and average tfidf/entity similarities between the candidate and the query collaboration group. [sent-308, score-0.399]
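
A minimal, illustrative sketch of these cluster-level features; the similarity function stands in for the paper's tfidf/entity similarities and the feature names are assumptions.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

Similarity = Callable[[str, str], float]

def cluster_level_features(candidate: str,
                           collaboration_group: Sequence[str],
                           sim: Similarity) -> Dict[str, float]:
    """Max/min/average similarity between a candidate KB entry and the collaboration group."""
    sims = [sim(candidate, member) for member in collaboration_group]
    return {"max_sim": max(sims), "min_sim": min(sims), "avg_sim": mean(sims)}
```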

62 Furthermore, we investigated how the performance can be affected by incrementally adding more rankers into the ranker set F∗. [sent-311, score-0.571]

63 To do so, we first sorted the 8 rankers according to their performance on the development set from highest to lowest, and, starting with the highest-performing ranker, we added one ranker at a time until we had all 8 rankers. [sent-312, score-0.571]
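
A minimal sketch of this incremental setup, with illustrative names: rankers are ordered by development-set accuracy and the ensemble is grown one ranker at a time; each returned prefix can then be scored with a voting function such as the MaCR sketch above.

```python
from typing import Dict, List, Sequence

def incremental_ranker_sets(rankers: Sequence[str],
                            dev_accuracy: Dict[str, float]) -> List[List[str]]:
    """Sort rankers by dev accuracy (best first) and return the growing ensembles to evaluate."""
    ordered = sorted(rankers, key=lambda r: dev_accuracy[r], reverse=True)
    return [ordered[:i] for i in range(1, len(ordered) + 1)]
```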

64 It is worth noting that, when there is an even number of rankers in the set F∗, "ties" could take place using the voting function. [sent-313, score-0.37]

65 2 Performance of 8 Baseline Rankers Table 3 shows the performance of the 8 baseline rankers in 4 columns: Overall for all queries, PER for person queries, ORG for organization queries, and GPE for geo-political queries. [sent-326, score-0.285]

66 It shows that all the four supervised rankers perform better than the four unsupervised rankers. [sent-328, score-0.317]

67 Naive ranker obtains the lowest overall micro-average accuracy (54. [sent-329, score-0.311]

68 Among the four unsupervised rankers, profile ranker performs the best, which clearly shows that the extracted attributes of entities are effective for disambiguating confusable names. [sent-331, score-0.326]

69 For example, our data analysis shows that the attribute value of “per:alternative-name” from the context document is particularly useful if a person query is only mentioned by its last name. [sent-332, score-0.265]

70 For geo-political queries, if the query is a city name, the attribute "gpe:state" is useful to distinguish cities with the same name but in different states or provinces. [sent-334, score-0.297]

71 Among the four supervised rankers, ListNet outperforms SVM ranking, and SVM ranking in turn outperforms the two pointwise rankers. [sent-335, score-0.601]

72 It may confirm previous research findings that listwise ranking is superior to pairwise ranking and pairwise ranking is superior to pointwise ranking (Cao et al. [sent-336, score-1.193]

73 The best baseline ranker (ListNet) obtains an absolute overall accuracy gain of 26. [sent-339, score-0.342]

74 3 Impact of MiCR To study the impact of MiCR, we first select f3 (tfidf ranker) as our ranking function. [sent-342, score-0.27]
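
The tfidf ranker f3 is, in essence, a similarity between the query's context document and each candidate KB entry under tf-idf weighting; the sketch below is an illustrative reconstruction of such a ranker, in which the tokenization, weighting, and smoothing choices are assumptions rather than the authors' implementation.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

def _tfidf_vectors(texts: Sequence[str]) -> List[Dict[str, float]]:
    """tf-idf vectors over whitespace-tokenized, lowercased texts (smoothed idf)."""
    token_lists = [text.lower().split() for text in texts]
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    n = len(texts)
    return [{t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in Counter(tokens).items()}
            for tokens in token_lists]

def _cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def tfidf_ranker(query_context: str, kb_entry_texts: Sequence[str]) -> List[float]:
    """Score each candidate KB entry against the query's context document."""
    vectors = _tfidf_vectors([query_context] + list(kb_entry_texts))
    query_vec, entry_vecs = vectors[0], vectors[1:]
    return [_cosine(query_vec, entry_vec) for entry_vec in entry_vecs]
```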

75 Figure 3 shows the performance of applying different query collaborator searching strategies (graph or agglomerative clustering) and different versions of g1 (average, max and min respectively). [sent-343, score-0.663]

76 We intentionally adjust the meaning of threshold (x-axis) for both graph clustering and agglomerative clustering, such that at threshold 0, both clustering algorithms generate the largest number of clusters (i. [sent-344, score-0.387]

77 45, which clearly shows that removing noisy collaborators in the query collaboration group can improve the performance. [sent-349, score-0.788]

78 45, the number of query collaborators reduces and the performance significantly drops until it reaches the baseline performance of tfidf ranker (68. [sent-351, score-1.006]

79 It clearly shows that maintaining a controllable number of query collaborators can improve the performance. [sent-353, score-0.606]

80 Figure 3: MiCR: comparison of average, max, and min functions combined with Graph and Agglomerative (Aggr)-based query collaborator searching strategies (tfidf ranker).

81 The max function (Figure 3 (b)) leverages the strengths from the strongest collaborator in the group, which can potentially improve KB accuracy, but meanwhile hurt NIL accuracy. [sent-363, score-0.337]

82 As shown in the figure, as more collaborators join the group, the performance first increases for both graph and agglomerative clustering; however, it starts to deteriorate beyond a certain threshold, and in the end the performance drops even lower than the baseline of the tfidf ranker. [sent-364, score-0.574]

83 The min function (Figure 3 (c)) leverages the strengths from the weakest collaborator in the group, which can potentially improve NIL accuracy, but meanwhile hurt KB accuracy. [sent-365, score-0.355]

84 The min function is a counterexample showing that searching for query collaborators cannot always lead to benefits. [sent-367, score-0.692]

85 To summarize so far, the best strategy for the tfidf ranker in the MiCR approach is graph-ave (applying graph clustering and using the average function), which obtains an overall accuracy gain of 5. [sent-368, score-0.563]

86 We further validate the performance of graph-ave using the f2 and f4 ranking functions; for the entity ranker, we obtain an accuracy gain of 6. [sent-371, score-0.423]

87 We then experiment with the three supervised g1 functions (ListNet, Maxent, and SVM, respectively), using graph clustering as the query collaborator searching strategy. [sent-374, score-0.609]

88 Figure 6 shows that ListNet, Maxent, SVM rankers obtain accuracy gain of 1. [sent-375, score-0.341]

89 We observe that the performance drops when there is an even number of rankers in the ranker set using the voting function, which implies that our tie-breaking strategy is not very effective. [sent-388, score-0.656]

90 We also experimented with the voting function on the top 10 KBP2009 entity linking systems (each system's performance is shown in the table embedded in Figure 5, and the experiment is done similarly to that described in Section 4. [sent-389, score-0.329]

91 The reasons why we achieve relatively smaller gains using our own ranker set are as follows: (1) we use the same candidate object set for all rankers, while different KBP2009 systems may use their own sets of objects. [sent-394, score-0.32]

92 (2) our top 4 supervised rankers apply almost the same set of features, while different KBP2009 systems may apply more diversified features. [sent-395, score-0.317]

93 Figure 5: MaCR: applying the voting function to the top 10 KBP2009 entity linking systems.

94 6 Related Work In the literature of information retrieval, query expansion is a useful technique that involves the process of reformulating a query and, as a consequence, is capable of extending the ability of a query and improving the retrieval performance. [sent-416, score-0.559]

95 Various approaches for query expansion have been proposed, as summarized in (Manning et al. [sent-417, score-0.294]

96 The MiCR presented in this paper is superior to query expansion in two aspects: first, we leverage more information contained in multiple query collaborators; second, we place great emphasis on interactions among members in the query collaboration group. [sent-419, score-0.98]

97 In contrast, ensemble-based ranking has only recently attracted research interests (Hoi and Jin, 2008; Wei et al. [sent-422, score-0.27]

98 However, most of the previous research mainly focused on one or two ranking algorithms on isolated queries. [sent-429, score-0.304]

99 7 Conclusions We presented a new ranking scheme called collaborative ranking, with three specific forms, MiCR, MaCR, and MiMaCR, and demonstrated its effectiveness on the entity linking task. [sent-431, score-1.019]

100 In MiCR, effective searching of query collaborators and active interplay among members in the query collaboration group are two key factors that make MiCR successful. [sent-433, score-1.111]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('collaborators', 0.341), ('ranker', 0.286), ('rankers', 0.285), ('micr', 0.277), ('ranking', 0.27), ('query', 0.265), ('macr', 0.238), ('collaborative', 0.231), ('mimacr', 0.225), ('kb', 0.172), ('collaboration', 0.134), ('xj', 0.12), ('linking', 0.119), ('collaborator', 0.119), ('listnet', 0.119), ('tfidf', 0.114), ('nil', 0.103), ('entity', 0.097), ('yj', 0.096), ('cqk', 0.092), ('jq', 0.092), ('agglomerative', 0.086), ('voting', 0.085), ('strengths', 0.08), ('clustering', 0.074), ('oj', 0.068), ('jcqk', 0.066), ('min', 0.063), ('searching', 0.058), ('qi', 0.057), ('nq', 0.057), ('macro', 0.052), ('composite', 0.051), ('queries', 0.049), ('svm', 0.049), ('group', 0.048), ('micro', 0.048), ('mcnamee', 0.046), ('threshold', 0.045), ('max', 0.045), ('leverages', 0.043), ('maxent', 0.042), ('zheng', 0.04), ('profile', 0.04), ('laborativerankig', 0.04), ('listwise', 0.04), ('noncollaborative', 0.04), ('ji', 0.039), ('qin', 0.036), ('answer', 0.034), ('id', 0.034), ('object', 0.034), ('cq', 0.034), ('isolated', 0.034), ('graph', 0.033), ('jordan', 0.033), ('scheme', 0.032), ('chen', 0.032), ('name', 0.032), ('supervised', 0.032), ('ensemble', 0.032), ('gain', 0.031), ('filling', 0.031), ('clusters', 0.03), ('expansion', 0.029), ('anaphora', 0.029), ('pointwise', 0.029), ('functions', 0.028), ('function', 0.028), ('cluster', 0.027), ('versions', 0.027), ('burges', 0.027), ('dredze', 0.027), ('jqi', 0.026), ('micromacro', 0.026), ('rokach', 0.026), ('ssociated', 0.026), ('tamang', 0.026), ('yoshida', 0.026), ('accuracy', 0.025), ('cao', 0.025), ('population', 0.025), ('objects', 0.025), ('continues', 0.025), ('integrating', 0.025), ('entry', 0.024), ('boosting', 0.023), ('dang', 0.023), ('base', 0.023), ('hoi', 0.023), ('cuny', 0.023), ('footballer', 0.023), ('gpe', 0.023), ('tsai', 0.023), ('contained', 0.022), ('diversity', 0.022), ('slot', 0.022), ('meanwhile', 0.022), ('gradually', 0.022), ('pairwise', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking

Author: Zheng Chen ; Heng Ji

Abstract: In this paper, we present a new ranking scheme, collaborative ranking (CR). In contrast to the traditional non-collaborative ranking scheme, which relies solely on the strengths of isolated queries and one stand-alone ranking algorithm, the new scheme integrates the strengths of multiple collaborators of a query and the strengths of multiple ranking algorithms. We elaborate three specific forms of collaborative ranking, namely micro collaborative ranking (MiCR), macro collaborative ranking (MaCR), and micro-macro collaborative ranking (MiMaCR). Experiments on the entity linking task show that our proposed scheme is indeed effective and promising.

2 0.31151402 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion

Author: Swapna Gottipati ; Jing Jiang

Abstract: In this paper we present a novel approach to entity linking based on a statistical language model-based information retrieval with query expansion. We use both local contexts and global world knowledge to expand query language models. We place a strong emphasis on named entities in the local contexts and explore a positional language model to weigh them differently based on their distances to the query. Our experiments on the TAC-KBP 2010 data show that incorporating such contextual information indeed aids in disambiguating the named entities and consistently improves the entity linking performance. Compared with the official results from KBP 2010 participants, our system shows competitive performance.

3 0.099324271 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

Author: Prodromos Malakasiotis ; Ion Androutsopoulos

Abstract: We present a method that paraphrases a given sentence by first generating candidate paraphrases and then ranking (or classifying) them. The candidates are generated by applying existing paraphrasing rules extracted from parallel corpora. The ranking component considers not only the overall quality of the rules that produced each candidate, but also the extent to which they preserve grammaticality and meaning in the particular context of the input sentence, as well as the degree to which the candidate differs from the input. We experimented with both a Maximum Entropy classifier and an SVR ranker. Experimental results show that incorporating features from an existing paraphrase recognizer in the ranking component improves performance, and that our overall method compares well against a state of the art paraphrase generator, when paraphrasing rules apply to the input sentences. We also propose a new methodology to evaluate the ranking components of generate-and-rank paraphrase generators, which evaluates them across different combinations of weights for grammaticality, meaning preservation, and diversity. The paper is accompanied by a paraphrasing dataset we constructed for evaluations of this kind.

4 0.081793807 109 emnlp-2011-Random Walk Inference and Learning in A Large Scale Knowledge Base

Author: Ni Lao ; Tom Mitchell ; William W. Cohen

Abstract: t om . We consider the problem of performing learning and inference in a large scale knowledge base containing imperfect knowledge with incomplete coverage. We show that a soft inference procedure based on a combination of constrained, weighted, random walks through the knowledge base graph can be used to reliably infer new beliefs for the knowledge base. More specifically, we show that the system can learn to infer different target relations by tuning the weights associated with random walks that follow different paths through the graph, using a version of the Path Ranking Algorithm (Lao and Cohen, 2010b). We apply this approach to a knowledge base of approximately 500,000 beliefs extracted imperfectly from the web by NELL, a never-ending language learner (Carlson et al., 2010). This new system improves significantly over NELL’s earlier Horn-clause learning and inference method: it obtains nearly double the precision at rank 100, and the new learning method is also applicable to many more inference tasks.

5 0.061364673 9 emnlp-2011-A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels

Author: Chao Shen ; Tao Li

Abstract: In active dual supervision, not only informative examples but also features are selected for labeling to build a high quality classifier with low cost. However, how to measure the informativeness for both examples and feature on the same scale has not been well solved. In this paper, we propose a non-negative matrix factorization based approach to address this issue. We first extend the matrix factorization framework to explicitly model the corresponding relationships between feature classes and examples classes. Then by making use of the reconstruction error, we propose a unified scheme to determine which feature or example a classifier is most likely to benefit from having labeled. Empirical results demonstrate the effectiveness of our proposed methods.

6 0.060000468 116 emnlp-2011-Robust Disambiguation of Named Entities in Text

7 0.059629597 128 emnlp-2011-Structured Relation Discovery using Generative Models

8 0.058241688 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

9 0.056763321 104 emnlp-2011-Personalized Recommendation of User Comments via Factor Models

10 0.050923675 138 emnlp-2011-Tuning as Ranking

11 0.050838523 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

12 0.04962305 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model

13 0.048721969 135 emnlp-2011-Timeline Generation through Evolutionary Trans-Temporal Summarization

14 0.047608409 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives

15 0.045394186 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge

16 0.045096174 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues

17 0.04361048 67 emnlp-2011-Hierarchical Verb Clustering Using Graph Factorization

18 0.042919338 17 emnlp-2011-Active Learning with Amazon Mechanical Turk

19 0.042022098 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

20 0.039081462 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.14), (1, -0.091), (2, -0.043), (3, -0.052), (4, -0.027), (5, -0.099), (6, -0.003), (7, -0.099), (8, -0.053), (9, 0.247), (10, 0.038), (11, -0.185), (12, -0.16), (13, 0.125), (14, 0.134), (15, -0.163), (16, 0.32), (17, 0.222), (18, 0.115), (19, -0.068), (20, -0.109), (21, 0.057), (22, -0.017), (23, 0.169), (24, 0.012), (25, -0.129), (26, 0.01), (27, -0.175), (28, 0.058), (29, -0.035), (30, -0.219), (31, -0.12), (32, -0.062), (33, 0.034), (34, 0.021), (35, -0.017), (36, -0.072), (37, 0.005), (38, -0.001), (39, -0.086), (40, -0.088), (41, 0.049), (42, 0.041), (43, 0.005), (44, 0.016), (45, -0.056), (46, -0.05), (47, -0.037), (48, -0.05), (49, -0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97246492 29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking

Author: Zheng Chen ; Heng Ji

Abstract: In this paper, we present a new ranking scheme, collaborative ranking (CR). In contrast to traditional non-collaborative ranking scheme which solely relies on the strengths of isolated queries and one stand-alone ranking algorithm, the new scheme integrates the strengths from multiple collaborators of a query and the strengths from multiple ranking algorithms. We elaborate three specific forms of collaborative ranking, namely, micro collaborative ranking (MiCR), macro collaborative ranking (MaCR) and micro-macro collab- orative ranking (MiMaCR). Experiments on entity linking task show that our proposed scheme is indeed effective and promising.

2 0.92064863 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion

Author: Swapna Gottipati ; Jing Jiang

Abstract: In this paper we present a novel approach to entity linking based on a statistical language model-based information retrieval with query expansion. We use both local contexts and global world knowledge to expand query language models. We place a strong emphasis on named entities in the local contexts and explore a positional language model to weigh them differently based on their distances to the query. Our experiments on the TAC-KBP 2010 data show that incorporating such contextual information indeed aids in disambiguating the named entities and consistently improves the entity linking performance. Compared with the official results from KBP 2010 participants, our system shows competitive performance.

3 0.35567674 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

Author: Fabien Cromieres ; Sadao Kurohashi

Abstract: We propose an algorithm allowing to efficiently retrieve example treelets in a parsed tree database in order to allow on-the-fly extraction of syntactic translation rules. We also propose improvements of this algorithm allowing several kinds of flexible matchings.

4 0.31921351 109 emnlp-2011-Random Walk Inference and Learning in A Large Scale Knowledge Base

Author: Ni Lao ; Tom Mitchell ; William W. Cohen

Abstract: t om . We consider the problem of performing learning and inference in a large scale knowledge base containing imperfect knowledge with incomplete coverage. We show that a soft inference procedure based on a combination of constrained, weighted, random walks through the knowledge base graph can be used to reliably infer new beliefs for the knowledge base. More specifically, we show that the system can learn to infer different target relations by tuning the weights associated with random walks that follow different paths through the graph, using a version of the Path Ranking Algorithm (Lao and Cohen, 2010b). We apply this approach to a knowledge base of approximately 500,000 beliefs extracted imperfectly from the web by NELL, a never-ending language learner (Carlson et al., 2010). This new system improves significantly over NELL’s earlier Horn-clause learning and inference method: it obtains nearly double the precision at rank 100, and the new learning method is also applicable to many more inference tasks.

5 0.25247365 9 emnlp-2011-A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels

Author: Chao Shen ; Tao Li

Abstract: In active dual supervision, not only informative examples but also features are selected for labeling to build a high quality classifier with low cost. However, how to measure the informativeness for both examples and feature on the same scale has not been well solved. In this paper, we propose a non-negative matrix factorization based approach to address this issue. We first extend the matrix factorization framework to explicitly model the corresponding relationships between feature classes and examples classes. Then by making use of the reconstruction error, we propose a unified scheme to determine which feature or example a classifier is most likely to benefit from having labeled. Empirical results demonstrate the effectiveness of our proposed methods.

6 0.24343859 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge

7 0.23652466 116 emnlp-2011-Robust Disambiguation of Named Entities in Text

8 0.22703516 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

9 0.18900695 138 emnlp-2011-Tuning as Ranking

10 0.18071979 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

11 0.17841668 104 emnlp-2011-Personalized Recommendation of User Comments via Factor Models

12 0.17822109 135 emnlp-2011-Timeline Generation through Evolutionary Trans-Temporal Summarization

13 0.16386642 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model

14 0.160198 144 emnlp-2011-Unsupervised Learning of Selectional Restrictions and Detection of Argument Coercions

15 0.15707044 110 emnlp-2011-Ranking Human and Machine Summarization Systems

16 0.1567269 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues

17 0.15575448 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP

18 0.15192352 11 emnlp-2011-A Simple Word Trigger Method for Social Tag Suggestion

19 0.1504067 67 emnlp-2011-Hierarchical Verb Clustering Using Graph Factorization

20 0.14656706 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.073), (36, 0.013), (37, 0.026), (45, 0.117), (53, 0.012), (54, 0.03), (57, 0.016), (62, 0.015), (64, 0.011), (66, 0.027), (69, 0.011), (79, 0.036), (82, 0.018), (96, 0.028), (98, 0.456)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88766056 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition

Author: Lukas Michelbacher ; Alok Kothari ; Martin Forst ; Christina Lioma ; Hinrich Schutze

Abstract: Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because unlike other work on MWUs – tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system. – 1

2 0.86363685 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

Author: Ashish Venugopal ; Jakob Uszkoreit ; David Talbot ; Franz Och ; Juri Ganitkevitch

Abstract: We propose a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method is robust to local editing operations and provides well defined trade-offs between the ability to identify algorithm outputs and the quality of the watermarked output. Unlike previous work in the field, our approach does not rely on controlling the inputs to the algorithm and provides probabilistic guarantees on the ability to identify collections of results from one’s own algorithm. We present an application in statistical machine translation, where machine translated output is watermarked at minimal loss in translation quality and detected with high recall. 1 Motivation Machine learning algorithms provide structured results to input queries by simulating human behavior. Examples include automatic machine translation (Brown et al. , 1993) or automatic text and rich media summarization (Goldstein et al. , 1999) . These algorithms often estimate some portion of their models from publicly available human generated data. As new services that output structured results are made available to the public and the results disseminated on the web, we face a daunting new challenge: Machine generated structured results contaminate the pool of naturally generated human data. For example, machine translated output 1363 2Center for Language and Speech Processing Johns Hopkins University Baltimore, MD 21218, USA juri@cs.jhu.edu and human generated translations are currently both found extensively on the web, with no automatic way of distinguishing between them. Algorithms that mine data from the web (Uszkoreit et al. , 2010) , with the goal of learning to simulate human behavior, will now learn models from this contaminated and potentially selfgenerated data, reinforcing the errors committed by earlier versions of the algorithm. It is beneficial to be able to identify a set of encountered structured results as having been generated by one’s own algorithm, with the purpose of filtering such results when building new models. Problem Statement: We define a structured result of a query q as r = {z1 · · · zL} where tthuree odr rdeesru latn dof identity qof a sele rm =en {tzs zi are important to the quality of the result r. The structural aspect of the result implies the existence of alternative results (across both the order of elements and the elements themselves) that might vary in their quality. Given a collection of N results, CN = r1 · · · rN, where each result ri has k rankedC alterna·t·iv·ers Dk (qi) of relatively similar quality and queries q1 · · · qN are arbitrary and not controlled by the watermarking algorithm, we define the watermarking task as: Task. Replace ri with ri0 ∈ Dk (qi) for some subset of results in CN to produce a watermarked sceoltle ocfti orens CN0 slleucchti otnha Ct: • CN0 is probabilistically identifiable as having bCeen generated by one’s own algorithm. Proce dEindgisnb oufr tgh e, 2 S0c1o1tl Canodn,f eUrKen,c Jeuol yn 2 E7m–3p1ir,ic 2a0l1 M1.e ?tc ho2d0s1 in A Nsasotucira tlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinag uesis 1ti3c6s3–1372, • • • 2 the degradation in quality from CN to the wthaete dremgarrakdeadt CN0 isnho quulda bitye analytically controllable, trading quality for detection performance. CN0 should not be detectable as watermCarked content without access to the generating algorithms. the detection of CN0 should be robust to simple eddeitte operations performed on individual results r ∈ CN0. 
Impact on Statistical Machine Translation Recent work(Resnik and Smith, 2003; Munteanu and Marcu, 2005; Uszkoreit et al. , 2010) has shown that multilingual parallel documents can be efficiently identified on the web and used as training data to improve the quality of statistical machine translation. The availability of free translation services (Google Translate, Bing Translate) and tools (Moses, Joshua) , increase the risk that the content found by parallel data mining is in fact generated by a machine, rather than by humans. In this work, we focus on statistical machine translation as an application for watermarking, with the goal of discarding documents from training if they have been generated by one’s own algorithms. To estimate the magnitude of the problem, we used parallel document mining (Uszkoreit et al. , 2010) to generate a collection of bilingual document pairs across several languages. For each document, we inspected the page content for source code that indicates the use of translation modules/plug-ins that translate and publish the translated content. We computed the proportion of the content within our corpus that uses these modules. We find that a significant proportion of the mined parallel data for some language pairs is generated via one of these translation modules. The top 3 languages pairs, each with parallel translations into English, are Tagalog (50.6%) , Hindi (44.5%) and Galician (41.9%) . While these proportions do not reflect impact on each language’s monolingual web, they are certainly high 1364 enough to affect machine translations systems that train on mined parallel data. In this work, we develop a general approach to watermark structured outputs and apply it to the outputs of a statistical machine translation system with the goal of identifying these same outputs on the web. In the context of the watermarking task defined above, we output selecting alternative translations for input source sentences. These translations often undergo simple edit and formatting operations such as case changes, sentence and word deletion or post editing, prior to publishing on the web. We want to ensure that we can still detect watermarked translations despite these edit operations. Given the rapid pace of development within machine translation, it is also important that the watermark be robust to improvements in underlying translation quality. Results from several iterations of the system within a single collection of documents should be identifiable under probabilistic bounds. While we present evaluation results for statistical machine translation, our proposed approach and associated requirements are applicable to any algorithm that produces structured results with several plausible alternatives. The alternative results can arise as a result of inherent task ambiguity (for example, there are multiple correct translations for a given input source sentence) or modeling uncertainty (for example, a model assigning equal probability to two competing results) . 3 Watermark Structured Results Selecting an alternative r0 from the space of alternatives Dk (q) can be stated as: r0= arr∈gDmk(aqx)w(r,Dk(q),h) (1) where w ranks r ∈ Dk (q) based on r’s presentwahtieorne owf a watermarking signal computed by a hashing operation h. In this approach, w and its component operation h are the only secrets held by the watermarker. 
This selection criterion is applied to all system outputs, ensuring that watermarked and non-watermarked version of a collection will never be available for comparison. A specific implementation of w within our watermarking approach can be evaluated by the following metrics: • • • False Positive Rate: how often nonFwaaltseermarked collections are falsely identified as watermarked. Recall Rate: how often watermarked collRecectiaolnls R are correctly inde wntaitfeierdm as wdat ceorl-marked. Quality Degradation: how significantly dQoueasl CN0 d Dieffegrr fdraotmio CN when evaluated by tdaoseks specific quality Cmetrics. While identification is performed at the collection level, we can scale these metrics based on the size of each collection to provide more task sensitive metrics. For example, in machine translation, we count the number of words in the collection towards the false positive and recall rates. In Section 3.1, we define a random hashing operation h and a task independent implementation of the selector function w. Section 3.2 describes how to classify a collection of watermarked results. Section 3.3 and 3.4 describes refinements to the selection and classification criteria that mitigate quality degradation. Following a comparison to related work in Section 4, we present experimental results for several languages in Section 5. 3.1 Watermarking: CN → CN0 We define a random hashing operation h that is applied to result r. It consists of two components: • A hash function applied to a structured re- sAul ht r hto f generate a lbieitd sequence cotfu a dfix reedlength. • An optional mapping that maps a single cAannd oidptaitoen raels umlta r ntog a hsaett mofa spusb -are ssiunlgtsle. Each sub-result is then hashed to generate a concatenated bit sequence for r. A good hash function produces outputs whose bits are independent. This implies that we can treat the bits for any input structured results 1365 as having been generated by a binomial distribution with equal probability of generating 1s vs 0s. This condition also holds when accumulating the bit sequences over a collection of results as long as its elements are selected uniformly from the space of possible results. Therefore, the bits generated from a collection of unwatermarked results will follow a binomial distribution with parameter p = 0.5. This result provides a null hypothesis for a statistical test on a given bit sequence, testing whether it is likely to have been generated from a binomial distribution binomial(n, p) where p = 0.5 and n is the length of the bit sequence. For a collection CN = r1 · · · rN, we can define a Fwaorte arm coalrlekc ranking funct·i·o·nr w to systematically select alternatives ri0 ∈ Dk (q) , such that the resulting CN0 is unlikely ∈to D produce bit sequences ltthinagt f Collow the p = 0.5 binomial distribution. A straightforward biasing criteria would be to select the candidate whose bit sequence exhibits the highest ratio of 1s. w can be defined as: (2) w(r,Dk(q),h) =#(|h1,(rh)(|r)) where h(r) returns the randomized bit sequence for result r, and #(x, y) counts the number of occurrences of x in sequence Selecting alternatives results to exhibit this bias will result in watermarked collections that exhibit this same bias. y. 3.2 Detecting the Watermark To classify a collection CN as watermarked or non-watermarked, we apply the hashing operation h on each element in CN and concatenate ttihoen sequences. 
eTlhemis sequence is tested against the null hypothesis that it was generated by a binomial distribution with parameter p = 0.5. We can apply a Fisherian test of statistical significance to determine whether the observed distribution of bits is unlikely to have occurred by chance under the null hypothesis (binomial with p = 0.5) . We consider a collection of results that rejects the null hypothesis to be watermarked results generated by our own algorithms. The p-value under the null hypothesis is efficiently computed by: p − value = Pn (X ≥ x) = Xi=nx?ni?pi(1 − p)n−i (3) (4) where x is the number of 1s observed in the collection, and n is the total number of bits in the sequence. Comparing this p-value against a desired significance level α, we reject the null hypothesis for collections that have Pn(X ≥ x) < α, thus deciding that such collections( were gen- erated by our own system. This classification criteria has a fixed false positive rate. Setting α = 0.05, we know that 5% of non-watermarked bit sequences will be falsely labeled as watermarked. This parameter α can be controlled on an application specific basis. By biasing the selection of candidate results to produce more 1s than 0s, we have defined a watermarking approach that exhibits a fixed false positive rate, a probabilistically bounded detection rate and a task independent hashing and selection criteria. In the next sections, we will deal with the question of robustness to edit operations and quality degradation. 3.3 Robustness and Inherent Bias We would like the ability to identify watermarked collections to be robust to simple edit operations. Even slight modifications to the elements within an item r would yield (by construction of the hash function) , completely different bit sequences that no longer preserve the biases introduced by the watermark selection function. To ensure that the distributional biases introduced by the watermark selector are preserved, we can optionally map individual results into a set of sub-results, each one representing some local structure of r. h is then applied to each subresult and the results concatenated to represent r. This mapping is defined as a component of the h operation. While a particular edit operation might affect a small number of sub-results, the majority of the bits in the concatenated bit sequence for r would remain untouched, thereby limiting the damage to the biases selected during watermark1366 ing. This is of course no defense to edit operations that are applied globally across the result; our expectation is that such edits would either significantly degrade the quality of the result or be straightforward to identify directly. For example, a sequence of words r = z1 · · · zL can be mapped into a set of consecutive n-gram sequences. Operations to edit a word zi in r will only affect events that consider the word zi. To account for the fact that alternatives in Dk (q) might now result in bit sequences of different lengths, we can generalize the biasing criteria to directly reflect the expected contribution to the watermark by defining: w(r, Dk(q), h) = Pn(X ≥ #(1, h(r))) (5) where Pn gives probabilities from binomial(n = |h(r) |,p = 0.5) . (Irn)|h,epr =en 0t. 5c)o.llection level biases: Our null hypothesis is based on the assumption that collections of results draw uniformly from the space of possible results. This assumption might not always hold and depends on the type of the results and collection. 
For example, considering a text document as a collection of sentences, we can expect that some sentences might repeat more frequently than others. This scenario is even more likely when applying a mapping into sub-results. n-gram sequences follow long-tailed or Zipfian distributions, with a small number of n-grams contributing heavily toward the total number of n-grams in a document. A random hash function guarantees that inputs are distributed uniformly at random over the output range. However, the same input will be assigned the same output deterministically. Therefore, if the distribution of inputs is heavily skewed to certain elements of the input space, the output distribution will not be uniformly distributed. The bit sequences resulting from the high frequency sub-results have the potential to generate inherently biased distributions when accumulated at the collection level. We want to choose a mapping that tends towards generating uniformly from the space of sub-results. We can empirically measure the quality of a sub-result mapping for a specific task by computing the false positive rate on non-watermarked collections. For a given significance level α, an ideal mapping would result in false positive rates close to α as well. Figure 1 shows false positive rates from 4 alternative mappings, computed on a large corpus of French documents (see Table 1for statistics) . Classification decisions are made at the collection level (documents) but the contribution to the false positive rate is based on the number of words in the classified document. We consider mappings from a result (sentence) into its 1-grams, 1 − 5-grams and 3 − 5 grams as well as trahem non-mapping case, w 3h −ere 5 tghrea mfusll a sres wuelltl is hashed. Figure 1 shows that the 1-grams and 1 − 5gram generate wsusb t-hraetsul tthse t 1h-agtr rmessu latn idn 1h −eav 5-ily biased false positive rates. The 3 − 5 gram mapping yields pfaolsseit positive r.a Ttesh ecl 3os −e t 5o gthraemir theoretically expected values. 1 Small deviations are expected since documents make different contributions to the false positive rate as a function of the number of words that they represent. For the remainder of this work, we use the 3-5 gram mapping and the full sentence mapping, since the alternatives generate inherently distributions with very high false positive rates. 3.4 Considering Quality The watermarking described in Equation 3 chooses alternative results on a per result basis, with the goal of influencing collection level bit sequences. The selection criteria as described will choose the most biased candidates available in Dk (q) . The parameter k determines the extent to which lesser quality alternatives can be chosen. If all the alternatives in each Dk (q) are of relatively similar quality, we expect minimal degradation due to watermarking. Specific tasks however can be particularly sensitive to choosing alternative results. Discriminative approaches that optimize for arg max selection like (Och, 2003; Liang et al. , 2006; Chiang et al. , 2009) train model parameters such 1In the final version of this paper we will perform sampling to create a more reliable estimate of the false positive rate that is not overly influenced by document length distributions. 1367 that the top-ranked result is well separated from its competing alternatives. 
Different queries also differ in the inherent ambiguity expected from their results; sometimes there really is just one correct result for a query, while for other queries, several alternatives might be equally good. By generalizing the definition of the w function to interpolate the estimated loss in quality and the gain in the watermarking signal, we can trade-off the ability to identify the watermarked collections against quality degradation: w(r,Dk(q),fw)− =(1 λ − ∗ λ g)ai ∗nl( or,s D(rk,(Dq)k,(fqw)) (6) Loss: The loss(r, Dk (q)) function reflects the quality degradation that results from selecting alternative r as opposed to the best ranked candidate in Dk (q)) . We will experiment with two variants: lossrank (r, Dk (q)) = (rank(r) − k)/k losscost(r, Dk(q)) = (cost(r)−cost(r1))/ cost(r1) where: • • • rank(r) : returns the rank of r within Dk (q) . cost(r) : a weighted sum of features (not cnoosrtm(ra)li:ze ad over httheed sse uarmch o space) rine a loglinear model such as those mentioned in (Och, 2003). r1: the highest ranked alternative in Dk (q) . lossrank provides a generally applicable criteria to select alternatives, penalizing selection from deep within Dk (q) . This estimate of the quality degradation does not reflect the generating model’s opinion on relative quality. losscost considers the relative increase in the generating model’s cost assigned to the alternative translation. Gain: The gain(r, Dk (q) , fw) function reflects the gain in the watermarking signal by selecting candidate r. We simply define the gain as the Pn(X ≥ #(1, h(r))) from Equation 5. ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (a) 1-grams mapping ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (c) 3 − 5-grams mapping ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (b) 1− 5-grams mapping p-value threshold (d) Full result hashing Figure 1 Comparison : of expected false positive rates against observed false positive rates for different sub-result mappings. 4 Related Work Using watermarks with the goal of transmitting a hidden message within images, video, audio and monolingual text media is common. For structured text content, linguistic approaches like (Chapman et al. , 2001; Gupta et al., 2006) use language specific linguistic and semantic expansions to introduce hidden watermarks. These expansions provide alternative candidates within which messages can be encoded. Recent publications have extended this idea to machine translation, using multiple systems and expansions to generate alternative translations. (Stutsman et al. , 2006) uses a hashing function to select alternatives that encode the hidden message in the lower order bits of the translation. In each of these approaches, the watermarker has control over the collection of results into which the watermark is to be embedded. These approaches seek to embed a hidden message into a collection of results that is selected by the watermarker. In contrast, we address the condition where the input queries are not in the watermarker’s control. 1368 The goal is therefore to introduce the watermark into all generated results, with the goal of probabilistically identifying such outputs. Our approach is also task independent, avoiding the need for templates to generate additional alternatives. 
By addressing the problem directly within the search space of a dynamic programming algorithm, we have access to high quality alternatives with well defined models of quality loss. Finally, our approach is robust to local word editing. By using a sub-result mapping, we increase the level of editing required to obscure the watermark signal; at high levels of editing, the quality of the results themselves would be significantly degraded. 5 Experiments We evaluate our watermarking approach applied to the outputs of statistical machine translation under the following experimental setup. A repository of parallel (aligned source and target language) web documents is sampled to produce a large corpus on which to evaluate the watermarking classification performance. The corpora represent translations into 4 diverse target languages, using English as the source language. Each document in this corpus can be considered a collection of un-watermarked structured results, where source sentences are queries and each target sentence represents a structured result. Using a state-of-the-art phrase-based statistical machine translation system (Och and Ney, 2004) trained on parallel documents identified by (Uszkoreit et al. , 2010) , we generate a set of 100 alternative translations for each source sentence. We apply the proposed watermarking approach, along with the proposed refinements that address task specific loss (Section 3.4) and robustness to edit operations (Section 3.3) to generate watermarked corpora. Each method is controlled via a single parameter (like k or λ) which is varied to generate alternative watermarked collections. For each parameter value, we evaluate the Recall Rate and Quality Degradation with the goal of finding a setting that yields a high recall rate, minimal quality degradation. False positive rates are evaluated based on a fixed classification significance level of α = 0.05. The false positive and recall rates are evaluated on the word level; a document that is misclassified or correctly identified contributes its length in words towards the error calculation. In this work, we use α = 0.05 during classification corresponding to an expected 5% false positive rate. The false positive rate is a function of h and the significance level α and therefore constant across the parameter values k and λ. We evaluate quality degradation on human translated test corpora that are more typical for machine translation evaluation. Each test corpus consists of 5000 source sentences randomly selected from the web and translated into each respective language. We chose to evaluate quality on test corpora to ensure that degradations are not hidden by imperfectly matched web corpora and are consistent with the kind of results often reported for machine translation systems. As with the classification corpora, we create watermarked versions at each parameter value. For a given pa1369 recall Figure 2: BLEU loss against recall of watermarked content for the baseline approach (max K-best) , rank and cost interpolation. rameter value, we measure false positive and re- call rates on the classification corpora and quality degradation on the evaluation corpora. Table 1 shows corpus statistics for the classification and test corpora and non-watermarked BLEU scores for each target language. All source texts are in English. 
5.1 Loss Interpolated Experiments

Our first set of experiments compares baseline performance using the watermarking criterion in Equation 5 against the refinements suggested in Section 3.4 to mitigate quality degradation. The h function is computed on the full sentence result r with no sub-event mapping. The following methods are evaluated in Figure 2:

• Baseline method (labeled "max K-best"): selects r0 purely based on gain in the watermarking signal (Equation 5) and is parameterized by k, the number of alternatives considered for each result.
• Rank interpolation: incorporates rank into w, varying the interpolation parameter λ.
• Cost interpolation: incorporates cost into w, varying the interpolation parameter λ.

The observed false positive rate on the French classification corpora is 1.9%.

Table 1: Content word statistics for the classification and quality degradation corpora (Arabic, French, Hindi and Turkish). Non-watermarked BLEU scores are reported for the quality corpora.

We consider 0.2% BLEU loss as a threshold for acceptable quality degradation. Each method is judged by its ability to achieve high recall below this quality degradation threshold. Applying cost interpolation yields the best results in Figure 2, achieving a recall of 85% at 0.2% BLEU loss, while rank interpolation achieves a recall of 76%. The baseline approach of selecting the highest gain candidate within a depth of k candidates does not provide sufficient parameterization to yield low quality degradation. At k = 2, this method yields almost 90% recall, but with approximately 0.4% BLEU loss.

5.2 Robustness Experiments

In Section 3.3, we proposed mapping results into sub-events or features. We considered alternative feature mappings in Figure 1, finding that mapping sentence results into a collection of 3–5 grams yields acceptable false positive rates at varied levels of α. Figure 3 presents results that compare moving from result-level hashing to the 3–5 gram sub-result mapping. We show the impact of the mapping on the baseline max K-best method as well as on cost interpolation. There are substantial reductions in recall rate at the 0.2% BLEU loss level when applying sub-result mappings in both cases. The cost interpolation method's recall drops from 85% to 77% when using the 3–5 gram event mapping. The observed false positive rate of the 3–5 gram mapping is 4.7%. By using the 3–5 gram mapping, we expect to increase robustness against local word edit operations, but we have sacrificed recall rate due to the inherent distributional bias discussed in Section 3.3.

Figure 3: BLEU loss against recall of watermarked content for the baseline and cost interpolation methods using both result-level and 3–5 gram mapped events.

5.3 Multilingual Experiments

The watermarking approach proposed here introduces no language-specific watermarking operations and is thus broadly applicable to translation into all languages. In Figure 4, we report results for the baseline and cost interpolation methods, considering both the result-level and 3–5 gram mappings. We set α = 0.05 and measure recall at 0.2% BLEU degradation for translation from English into Arabic, French, Hindi and Turkish.
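For reference, the 3–5 gram sub-result mapping compared in these experiments can be sketched as follows. This is a minimal sketch under stated assumptions: the document only says that results (or their n-gram sub-events) are hashed and the ones in the hash are counted, so the particular keyed hash, the bit extraction, and the key name below are illustrative choices, not the paper's.

```python
import hashlib
from typing import List, Sequence, Tuple

def ngram_sub_events(words: Sequence[str], n_min: int = 3, n_max: int = 5) -> List[str]:
    """Map a sentence result to its 3-5 gram sub-events."""
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]

def hash_bit(event: str, key: str = "watermark-key") -> int:
    """One bit of a keyed hash per sub-event (an illustrative hash choice)."""
    return hashlib.sha256((key + event).encode("utf-8")).digest()[0] & 1

def watermark_signal(words: Sequence[str], key: str = "watermark-key") -> Tuple[int, int]:
    """Count of 1-bits and total sub-events for a result; a detector can then
    apply a binomial test over these counts at significance level alpha."""
    bits = [hash_bit(e, key) for e in ngram_sub_events(words)]
    return sum(bits), len(bits)
```

Because local edits change only a bounded number of overlapping 3–5 grams, the aggregate bit counts, and hence the detection decision, are less sensitive to word-level editing than full-sentence hashing.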
The observed false positive rates for full sentence hashing are: Arabic: 2.4%, French: 1.8%, Hindi: 5.6% and Turkish: 5.5%, while for the 3–5 gram mapping they are: Arabic: 5.8%, French: 7.5%, Hindi: 3.5% and Turkish: 6.2%. Underlying translation quality plays an important role in quality degradation when watermarking. Without a sub-result mapping, French (BLEU: 26.45%) achieves recall of 85% at 0.2% BLEU loss, while the other languages achieve over 90% recall at the same BLEU loss threshold. Using a sub-result mapping degrades quality for each language pair, but changes the relative performance. Turkish experiences the highest relative drop in recall, unlike French and Arabic, where results are relatively more robust to using sub-sentence mappings. This is likely a result of differences in n-gram distributions across these languages. The languages considered here all use space-separated words; for languages that do not, like Chinese or Thai, our approach can be applied at the character level.

Figure 4: Loss of recall when using the 3–5 gram mapping vs. sentence-level mapping for Arabic, French, Hindi and Turkish translations.

6 Conclusions

In this work we proposed a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method provides probabilistic bounds on detection ability, analytic control of quality degradation, and robustness to local editing operations. It is applicable to any task where structured outputs are generated with ambiguities or ties in the results. We applied this method to the outputs of statistical machine translation, evaluating each refinement to our approach with false positive and recall rates against BLEU score quality degradation. Our results show that it is possible, across several language pairs, to achieve high recall rates (over 80%) with low false positive rates (between 5 and 8%) at minimal quality degradation (0.2% BLEU), while still allowing for local edit operations on the translated output. In future work we will continue to investigate methods to mitigate quality loss.

References

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311.

Mark Chapman, George Davida, and Marc Rennhard. 2001. A practical and effective approach to large-scale automated linguistic steganography. In Proceedings of the Information Security Conference.

David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval, pages 121–128.

Gaurav Gupta, Josef Pieprzyk, and Hua Xiong Wang. 2006. An attack-localizing watermarking scheme for natural language documents. In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, ASIACCS '06, pages 157–165, New York, NY, USA. ACM.
Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the Joint International Conference on Computational Linguistics and Association for Computational Linguistics (COLING/ACL), pages 761–768.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 2003 Meeting of the Association for Computational Linguistics.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics.

Ryan Stutsman, Mikhail Atallah, Christian Grothoff, and Krista Grothoff. 2006. Lost in just the translation. In Proceedings of the 2006 ACM Symposium on Applied Computing.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 2010 COLING.

same-paper 3 0.83801782 29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking

Author: Zheng Chen ; Heng Ji

Abstract: In this paper, we present a new ranking scheme, collaborative ranking (CR). In contrast to traditional non-collaborative ranking scheme which solely relies on the strengths of isolated queries and one stand-alone ranking algorithm, the new scheme integrates the strengths from multiple collaborators of a query and the strengths from multiple ranking algorithms. We elaborate three specific forms of collaborative ranking, namely, micro collaborative ranking (MiCR), macro collaborative ranking (MaCR) and micro-macro collab- orative ranking (MiMaCR). Experiments on entity linking task show that our proposed scheme is indeed effective and promising.

4 0.4229311 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

Author: Kevin Gimpel ; Noah A. Smith

Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.

5 0.41884154 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning

Author: Edward Grefenstette ; Mehrnoosh Sadrzadeh

Abstract: Modelling compositional meaning for sentences using empirical distributional methods has been a challenge for computational linguists. We implement the abstract categorical model of Coecke et al. (2010) using data from the BNC and evaluate it. The implementation is based on unsupervised learning of matrices for relational words and applying them to the vectors of their arguments. The evaluation is based on the word disambiguation task developed by Mitchell and Lapata (2008) for intransitive sentences, and on a similar new experiment designed for transitive sentences. Our model matches the results of its competitors in the first experiment, and betters them in the second. The general improvement in results with increase in syntactic complexity showcases the compositional power of our model.

6 0.41361862 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

7 0.40530953 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement

8 0.4049516 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases

9 0.40451652 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter

10 0.39832771 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

11 0.39629766 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion

12 0.39251167 70 emnlp-2011-Identifying Relations for Open Information Extraction

13 0.39060241 117 emnlp-2011-Rumor has it: Identifying Misinformation in Microblogs

14 0.38766482 116 emnlp-2011-Robust Disambiguation of Named Entities in Text

15 0.3875168 105 emnlp-2011-Predicting Thread Discourse Structure over Technical Web Forums

16 0.38558948 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs

17 0.38474572 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus

18 0.3810581 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

19 0.38058698 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation

20 0.37442353 63 emnlp-2011-Harnessing WordNet Senses for Supervised Sentiment Classification