acl acl2013 acl2013-277 knowledge-graph by maker-knowledge-mining

277 acl-2013-Part-of-speech tagging with antagonistic adversaries


Source: pdf

Author: Anders Søgaard

Abstract: Supervised NLP tools and on-line services are often used on data that is very different from the manually annotated data used during development. The performance loss observed in such cross-domain applications is often attributed to covariate shifts, with out-of-vocabulary effects as an important subclass. Many discriminative learning algorithms are sensitive to such shifts because highly indicative features may swamp other indicative features. Regularized and adversarial learning algorithms have been proposed to be more robust against covariate shifts. We present a new perceptron learning algorithm using antagonistic adversaries and compare it to previous proposals on 12 multilingual cross-domain part-of-speech tagging datasets. While previous approaches do not improve on our supervised baseline, our approach is better across the board with an average 4% error reduction.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Part-of-speech tagging with antagonistic adversaries Anders Søgaard Center for Language Technology University of Copenhagen DK-2300 Copenhagen S soegaard@hum. [sent-1, score-0.78]

2 The performance loss observed in such cross-domain applications is often attributed to covariate shifts, with out-of-vocabulary effects as an important subclass. [sent-4, score-0.176]

3 Many discriminative learning algorithms are sensitive to such shifts because highly indicative features may swamp other indicative features. [sent-5, score-0.304]

4 Regularized and adversarial learning algorithms have been proposed to be more robust against covariate shifts. [sent-6, score-0.518]

5 We present a new perceptron learning algorithm using antagonistic adversaries and compare it to previous proposals on 12 multilingual cross-domain part-of-speech tagging datasets. [sent-7, score-0.983]

6 While previous approaches do not improve on our supervised baseline, our approach is better across the board with an average 4% error reduction. [sent-8, score-0.131]

7 Significance is usually tested across data points in standard NLP test sets. [sent-10, score-0.124]

8 Such datasets typically contain running text rather than independently sampled sentences, thereby violating the assumption that data points are independently distributed and sampled at random. [sent-11, score-0.417]

9 More importantly, significance across data points only says something about the likelihood of observing the same effect on more data sampled the same way, but says nothing about likely performance on sentences sampled from different sources or different domains. [sent-12, score-0.499]

10 We assume that we are interested in performance across data sets or domains rather than just performance across data points, but that we do not know the target domain in advance. [sent-19, score-0.137]

11 We will do cross-domain experiments using several target domains in order to compute significance across domains, enabling us to say something about likely performance on new domains. [sent-21, score-0.216]
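
The paper does not spell out the statistical test here; one concrete way to treat domains, rather than data points, as the unit of analysis is a paired test over per-domain scores. The sketch below is purely illustrative: the domain names, the accuracies, and the choice of the Wilcoxon signed-rank test are our assumptions, not taken from the paper.

```python
# Hypothetical sketch: significance across domains rather than data points.
# Domain names and accuracies below are made up for illustration.
from scipy.stats import wilcoxon

baseline = {"emails": 94.1, "weblogs": 93.2, "reviews": 92.8,
            "answers": 91.9, "newsgroups": 93.5, "bio": 90.4}
system = {"emails": 94.5, "weblogs": 93.6, "reviews": 93.3,
          "answers": 92.1, "newsgroups": 93.9, "bio": 90.9}

domains = sorted(baseline)
# One paired observation per target domain, not per sentence or token.
stat, p = wilcoxon([system[d] for d in domains],
                   [baseline[d] for d in domains])
print(f"Wilcoxon over {len(domains)} domains: W={stat}, p={p:.3f}")
```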

12 Several authors have noted how POS tagging performance is sensitive to cross-domain shifts (Blitzer et al. [sent-22, score-0.201]

13 , 2006; Daumé III, 2007; Jiang and Zhai, 2007), and while most authors have assumed known target distributions and pool unlabeled target data in order to automatically correct cross-domain bias (Jiang and Zhai, 2007; Foster et al. [sent-23, score-0.038]

14 , 2010), methods such as feature bagging (Sutton et al. [sent-24, score-0.097]

15 , 2006), learning with random adversaries (Globerson and Roweis, 2006) and L∞-regularization (Dekel and Shamir, 2008) have been proposed to improve performance on unknown target distributions. [sent-25, score-0.417]

16 These methods explicitly or implicitly try to minimize average or worst-case expected error across a set of possible test distributions in various ways. [sent-26, score-0.205]

17 These algorithms are related because of the intimate relationship between adversarial corruption and regularization (Ghaoui and Lebret, 1997; Xu et al. [sent-27, score-0.437]

18 This paper presents a new method based on learning with antagonistic adversaries. [sent-32, score-0.362]

19 Section 2 introduces previous work on robust perceptron learning, as well as the methods discussed in the paper. [sent-34, score-0.172]

20 Section 3 motivates and introduces learning with antagonistic adversaries. [sent-35, score-0.43]

21 Section 4 presents experiments on POS tagging and discusses how to evaluate cross-domain performance. [sent-36, score-0.065]

22 Learning with antagonistic adversaries is superior to the other approaches across 10/12 datasets with an average error reduction of 4% over a supervised baseline. [sent-37, score-0.421]

23 The problem with out-of-vocabulary effects can be illustrated using a small labeled data set: {x1 = ⟨1, ⟨0, 1, 0⟩⟩, x2 = ⟨1, ⟨0, 1, 1⟩⟩, x3 = ⟨0, ⟨0, 0, 0⟩⟩, x4 = ⟨1, ⟨0, 0, 1⟩⟩}, where each data point pairs a class label with a feature vector. [sent-39, score-0.034]

24 Most discriminative learning algorithms only update parameters when training examples are misclassified. [sent-42, score-0.168]

25 In this example, a model initialized with zero weights would misclassify x1, update the parameter associated with the second feature at a fixed rate α, and the returned model would then classify all data points correctly. [sent-43, score-0.254]

26 Hence the parameter associated with the third feature would never be updated, although this feature is also correlated with the class. [sent-44, score-0.104]

27 If the second feature is missing in our test data (out-of-vocabulary), we end up classifying all data points as negative. [sent-45, score-0.127]
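
To make the failure mode concrete, here is a minimal sketch. The decision rule (predict positive only if the score is strictly positive) and this particular toy data are our assumptions, chosen so that the walkthrough above holds exactly; they may differ in detail from the paper's example.

```python
# Sketch of the out-of-vocabulary effect: a mistake-driven perceptron
# puts all its weight on one indicative feature and ignores another.
ALPHA = 1.0  # fixed update rate

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train(data, passes=5):
    w = [0.0, 0.0, 0.0]  # zero-initialized weights
    for _ in range(passes):
        for y, x in data:
            pred = 1 if score(w, x) > 0 else 0
            if pred != y:  # update only on misclassification
                sign = 1 if y == 1 else -1
                w = [wi + sign * ALPHA * xi for wi, xi in zip(w, x)]
    return w

# Feature 2 fires on every positive example; feature 3 only on some of them.
data = [(1, (0, 1, 0)), (1, (0, 1, 1)), (0, (0, 0, 0)), (1, (0, 1, 1))]
w = train(data)
print(w)  # feature 3 is never updated: [0.0, 1.0, 0.0]

# Out-of-vocabulary test data: feature 2 never fires.
oov = [(y, (x[0], 0, x[2])) for y, x in data]
print([1 if score(w, x) > 0 else 0 for _, x in oov])  # all negative: [0, 0, 0, 0]
```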

28 2 Robust perceptron learning Our framework will be averaged perceptron learning (Freund and Schapire, 1999; Collins, 2002). [sent-47, score-0.248]

29 We use an additive update algorithm and average parameters to prevent over-fitting. [sent-48, score-0.18]
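
A minimal sketch of this setup for binary labels follows; the paper applies the structured (tagging) variant, so this is a simplification of the same idea: additive, mistake-driven updates, with the returned weights averaged over the whole run.

```python
# Averaged perceptron in the spirit of Freund and Schapire (1999) and
# Collins (2002): additive updates, parameters averaged to prevent
# over-fitting. Labels y are in {-1, +1}.
def averaged_perceptron(data, passes=10):
    n = len(data[0][1])
    w = [0.0] * n       # current weights
    total = [0.0] * n   # running sum of weight vectors
    steps = 0
    for _ in range(passes):
        for y, x in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]  # additive update
            total = [ti + wi for ti, wi in zip(total, w)]
            steps += 1
    return [ti / steps for ti in total]  # averaged parameters
```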

30 In adversarial learning, adversaries corrupt the data by applying transformations to individual data points. [sent-49, score-0.929]

31 Antagonistic adversaries choose transformations informed by the current model parameters w, whereas random adversaries randomly select transformations from a predefined set of possible transformations. [sent-50, score-1.138]
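
The contrast can be made concrete with two corruption functions. The random adversary below samples from a predefined set of single-feature deletions without looking at the model; the antagonistic rule shown (delete the active features the model currently weights most heavily) is one plausible instantiation and our assumption, not necessarily the paper's exact choice.

```python
import random

# Both adversaries return a corrupted copy of feature vector x;
# a "deleting transformation" zeroes out the chosen features.

def random_adversary(x, k=1):
    # Blind to the model: drops k active features chosen at random.
    active = [i for i, xi in enumerate(x) if xi != 0]
    drop = set(random.sample(active, min(k, len(active))))
    return [0 if i in drop else xi for i, xi in enumerate(x)]

def antagonistic_adversary(x, w, k=1):
    # Informed by the current parameters w: drops the k active features
    # with the largest absolute weights (an assumed instantiation).
    active = sorted((i for i, xi in enumerate(x) if xi != 0),
                    key=lambda i: abs(w[i]), reverse=True)
    drop = set(active[:k])
    return [0 if i in drop else xi for i, xi in enumerate(x)]
```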

32 , 2006), the data is represented by different bags of features or different views, and the models learned using different feature bags are combined by averaging. [sent-55, score-0.25]

33 We can reformulate feature bagging as an adversarial learning problem. [sent-56, score-0.567]

34 For each pass, the adversary chooses a deleting transformation corresponding to one of the feature bags. [sent-57, score-0.689]

35 (2006), the feature bags simply divide the features into two or more representations. [sent-59, score-0.151]
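
A sketch of one training pass under this reformulation follows. Reading the "deleting transformation corresponding to a bag" as keeping that bag's features and deleting the rest (mirroring one model per view) is our interpretation.

```python
import random

def feature_bagging_pass(data, bags, w, alpha=1.0):
    """One pass: the adversary picks a bag; training sees only its features."""
    keep = set(random.choice(bags))  # the adversary's move for this pass
    for y, x in data:                # y in {-1, +1}
        x_bag = [xi if i in keep else 0.0 for i, xi in enumerate(x)]
        if y * sum(wi * xi for wi, xi in zip(w, x_bag)) <= 0:
            w = [wi + alpha * y * xi for wi, xi in zip(w, x_bag)]
    return w

# e.g. bags = [[0, 1], [2, 3, 4]] divides five features into two views
```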

36 Globerson and Roweis (2006) let an adversary corrupt labeled data during training to learn better models of test data with missing features. [sent-62, score-0.583]

37 They assume that missing features are randomly distributed and show that the optimization problem is a second-order cone program. [sent-63, score-0.126]

38 LRA is an adversarial game in which each player is unaware of the other player's current move; in particular, the adversary does not see the model parameters and only randomly corrupts the data points. [sent-64, score-0.917]

39 Globerson and Roweis (2006) formulate LRA as a batch learning problem of minimizing worst-case loss under deleting transformations that remove at most k features. [sent-65, score-0.644]

40 This is related to regularization in the following way: If model parameters are chosen to minimize expected error in the absence of any k features, we explicitly prevent under-weighting more than n − k features, i.e. [sent-66, score-0.275]

41 the model must be able to classify data well in the absence of any k features. [sent-68, score-0.034]

42 The sparsest possible model would thus assign weights to k + 1 parameters. [sent-71, score-0.04]

43 L∞-regularization hedges its bets even more than adversarial learning by minimizing expected error subject to ||w||∞ < C, i.e. max_i |w_i| < C. [sent-72, score-0.537]

44 In the online setting, this corresponds to playing against an adversary that clips any weight above a certain threshold C, whether positive or negative (Dekel and Shamir, 2008). [sent-73, score-0.518]

45 In geometric terms, the weights are projected back onto the hypercube [−C, C]^n. [sent-74, score-0.04]
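
A sketch of the clipping step; the value of C and its placement right after each update are illustrative.

```python
# Project the weight vector back onto the hypercube [-C, C]^n by clipping
# each coordinate, as in online L-infinity regularization
# (Dekel and Shamir, 2008).
def clip_weights(w, C):
    return [max(-C, min(C, wi)) for wi in w]

# e.g. inside the perceptron loop, right after an additive update:
#     w = clip_weights(w, C=1.0)
```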

46 A related approach, which is not explored in the experiments below, is to regularize linear models toward weights with low variance (Bergsma et al. [sent-75, score-0.076]

47 Note that the batch version of feature bagging is an instance of group L1-regularization (Jacob et al. [sent-77, score-0.266]

48 Often group regularization is about finding sparser models rather than robust models. [sent-80, score-0.151]

49 Sparse models can be obtained by grouping correlated features; non-sparse models can be obtained by using independent, exhaustive views. [sent-81, score-0.04]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('adversary', 0.44), ('adversaries', 0.385), ('antagonistic', 0.33), ('adversarial', 0.292), ('roweis', 0.195), ('globerson', 0.18), ('transformations', 0.162), ('deleting', 0.154), ('bagging', 0.154), ('lra', 0.146), ('sutton', 0.111), ('covariate', 0.11), ('shamir', 0.11), ('bags', 0.099), ('shifts', 0.096), ('perceptron', 0.092), ('corrupt', 0.09), ('sampled', 0.085), ('points', 0.074), ('copenhagen', 0.07), ('dekel', 0.068), ('tagging', 0.065), ('regularization', 0.065), ('batch', 0.06), ('says', 0.059), ('game', 0.057), ('zhai', 0.055), ('update', 0.054), ('missing', 0.053), ('feature', 0.052), ('independently', 0.052), ('across', 0.05), ('minimizing', 0.05), ('indicative', 0.049), ('prevent', 0.049), ('inb', 0.049), ('bets', 0.049), ('aeb', 0.049), ('oegaard', 0.049), ('robust', 0.046), ('hum', 0.045), ('ging', 0.045), ('unaware', 0.045), ('parameters', 0.044), ('significance', 0.044), ('transformation', 0.043), ('minimize', 0.043), ('something', 0.043), ('kr', 0.042), ('crossdomain', 0.042), ('clips', 0.042), ('corruption', 0.042), ('pos', 0.042), ('error', 0.041), ('weights', 0.04), ('sparser', 0.04), ('hedges', 0.04), ('board', 0.04), ('lated', 0.04), ('cone', 0.04), ('proposals', 0.04), ('schmidt', 0.04), ('sensitive', 0.04), ('jiang', 0.039), ('gual', 0.039), ('gaard', 0.039), ('players', 0.039), ('player', 0.039), ('algorithms', 0.038), ('distributions', 0.038), ('ku', 0.037), ('mf', 0.037), ('freund', 0.037), ('reformulate', 0.037), ('domains', 0.037), ('playing', 0.036), ('efo', 0.036), ('governed', 0.036), ('violating', 0.036), ('regularize', 0.036), ('daume', 0.036), ('murphy', 0.035), ('identically', 0.035), ('introduces', 0.034), ('motivates', 0.034), ('anders', 0.034), ('yth', 0.034), ('effects', 0.034), ('classify', 0.034), ('distributed', 0.033), ('expected', 0.033), ('hinton', 0.033), ('additive', 0.033), ('moves', 0.033), ('schapire', 0.033), ('motivating', 0.032), ('passes', 0.032), ('learning', 0.032), ('loss', 0.032), ('modelled', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

Author: Anders Søgaard

Abstract: Supervised NLP tools and on-line services are often used on data that is very different from the manually annotated data used during development. The performance loss observed in such cross-domain applications is often attributed to covariate shifts, with out-of-vocabulary effects as an important subclass. Many discriminative learning algorithms are sensitive to such shifts because highly indicative features may swamp other indicative features. Regularized and adversarial learning algorithms have been proposed to be more robust against covariate shifts. We present a new perceptron learning algorithm using antagonistic adversaries and compare it to previous proposals on 12 multilingual cross-domain part-of-speech tagging datasets. While previous approaches do not improve on our supervised baseline, our approach is better across the board with an average 4% error reduction.

2 0.051503539 227 acl-2013-Learning to lemmatise Polish noun phrases

Author: Adam Radziszewski

Abstract: We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem. The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformation of singular words (tokens) that make up each phrase. We perform evaluation, which shows results similar to those obtained earlier by a rule-based system, while our approach allows to separate chunking from lemmatisation.

3 0.051080372 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

Author: Xiaodong Zeng ; Derek F. Wong ; Lidia S. Chao ; Isabel Trancoso

Abstract: This paper introduces a graph-based semisupervised joint model of Chinese word segmentation and part-of-speech tagging. The proposed approach is based on a graph-based label propagation technique. One constructs a nearest-neighbor similarity graph over all trigrams of labeled and unlabeled data for propagating syntactic information, i.e., label distributions. The derived label distributions are regarded as virtual evidences to regularize the learning of linear conditional random fields (CRFs) on unlabeled data. An inductive character-based joint model is obtained eventually. Empirical results on Chinese tree bank (CTB-7) and Microsoft Research corpora (MSR) reveal that the proposed model can yield better results than the supervised baselines and other competitive semi-supervised CRFs in this task.

4 0.04865234 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation.

5 0.04775868 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

Author: Jun Suzuki ; Masaaki Nagata

Abstract: This paper proposes a framework of supervised model learning that realizes feature grouping to obtain lower complexity models. The main idea of our method is to integrate a discrete constraint into model learning with the help of the dual decomposition technique. Experiments on two well-studied NLP tasks, dependency parsing and NER, demonstrate that our method can provide state-of-the-art performance even if the degrees of freedom in trained models are surprisingly small, i.e., 8 or even 2. This significant benefit enables us to provide compact model representation, which is especially useful in actual use.

6 0.045372125 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

7 0.042743295 328 acl-2013-Stacking for Statistical Machine Translation

8 0.042264119 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

9 0.041570891 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

10 0.040482778 112 acl-2013-Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data

11 0.03880436 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

12 0.037306167 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

13 0.036084812 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

14 0.036003731 80 acl-2013-Chinese Parsing Exploiting Characters

15 0.0352537 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction

16 0.034772016 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

17 0.034705017 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

18 0.034555379 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

19 0.033956278 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction

20 0.032574609 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.095), (1, -0.014), (2, -0.029), (3, 0.001), (4, 0.012), (5, -0.025), (6, 0.018), (7, -0.003), (8, -0.024), (9, 0.011), (10, -0.006), (11, 0.002), (12, -0.019), (13, -0.028), (14, -0.039), (15, 0.033), (16, -0.052), (17, 0.04), (18, 0.02), (19, -0.046), (20, 0.061), (21, 0.034), (22, 0.032), (23, 0.005), (24, 0.025), (25, 0.018), (26, 0.015), (27, -0.047), (28, -0.044), (29, -0.001), (30, 0.051), (31, -0.013), (32, -0.003), (33, 0.038), (34, -0.001), (35, 0.001), (36, 0.015), (37, 0.004), (38, 0.019), (39, 0.022), (40, 0.043), (41, -0.04), (42, 0.021), (43, -0.015), (44, 0.051), (45, 0.001), (46, 0.034), (47, -0.002), (48, 0.042), (49, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89079654 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

Author: Anders Søgaard

Abstract: Supervised NLP tools and on-line services are often used on data that is very different from the manually annotated data used during development. The performance loss observed in such cross-domain applications is often attributed to covariate shifts, with out-of-vocabulary effects as an important subclass. Many discriminative learning algorithms are sensitive to such shifts because highly indicative features may swamp other indicative features. Regularized and adversarial learning algorithms have been proposed to be more robust against covariate shifts. We present a new perceptron learning algorithm using antagonistic adversaries and compare it to previous proposals on 12 multilingual cross-domain part-of-speech tagging datasets. While previous approaches do not improve on our supervised baseline, our approach is better across the board with an average 4% error reduction.

2 0.68455762 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik

Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread of the projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.

3 0.66839784 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

Author: Jun Suzuki ; Masaaki Nagata

Abstract: This paper proposes a framework of supervised model learning that realizes feature grouping to obtain lower complexity models. The main idea of our method is to integrate a discrete constraint into model learning with the help of the dual decomposition technique. Experiments on two well-studied NLP tasks, dependency parsing and NER, demonstrate that our method can provide state-of-the-art performance even if the degrees of freedom in trained models are surprisingly small, i.e., 8 or even 2. This significant benefit enables us to provide compact model representation, which is especially useful in actual use.

4 0.64430308 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

Author: Vladimir Eidelman ; Ke Wu ; Ferhan Ture ; Philip Resnik ; Jimmy Lin

Abstract: We present an open-source framework for large-scale online structured learning. Developed with the flexibility to handle cost-augmented inference problems such as statistical machine translation (SMT), our large-margin learner can be used with any decoder. Integration with MapReduce using Hadoop streaming allows efficient scaling with increasing size of training data. Although designed with a focus on SMT, the decoder-agnostic design of our learner allows easy future extension to other structured learning problems such as sequence labeling and parsing.

5 0.62920207 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation.

6 0.58134246 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms

7 0.57163543 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

8 0.56244296 309 acl-2013-Scaling Semi-supervised Naive Bayes with Feature Marginals

9 0.54462379 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

10 0.53997844 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction

11 0.53944057 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression

12 0.5356735 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

13 0.52221644 328 acl-2013-Stacking for Statistical Machine Translation

14 0.52170449 14 acl-2013-A Novel Classifier Based on Quantum Computation

15 0.51575297 24 acl-2013-A Tale about PRO and Monsters

16 0.50405854 390 acl-2013-Word surprisal predicts N400 amplitude during reading

17 0.49464914 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

18 0.49414808 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

19 0.49041167 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

20 0.48502386 63 acl-2013-Automatic detection of deception in child-produced speech using syntactic complexity features


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.599), (6, 0.022), (11, 0.029), (24, 0.016), (26, 0.035), (35, 0.045), (42, 0.027), (48, 0.042), (70, 0.023), (88, 0.023), (90, 0.012), (95, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96860957 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity

Author: Daniel Bar ; Torsten Zesch ; Iryna Gurevych

Abstract: We present DKPro Similarity, an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. DKPro Similarity comprises a wide variety of measures ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and phonetic measures. In order to promote the reproducibility of experimental results and to provide reliable, permanent experimental conditions for future studies, DKPro Similarity additionally comes with a set of full-featured experimental setups which can be run out-of-the-box and be used for future systems to built upon.

2 0.95254159 269 acl-2013-PLIS: a Probabilistic Lexical Inference System

Author: Eyal Shnarch ; Erel Segal-haLevi ; Jacob Goldberger ; Ido Dagan

Abstract: This paper presents PLIS, an open source Probabilistic Lexical Inference System which combines two functionalities: (i) a tool for integrating lexical inference knowledge from diverse resources, and (ii) a framework for scoring textual inferences based on the integrated knowledge. We provide PLIS with two probabilistic implementation of this framework. PLIS is available for download and developers of text processing applications can use it as an off-the-shelf component for injecting lexical knowledge into their applications. PLIS is easily configurable, components can be extended or replaced with user generated ones to enable system customization and further research. PLIS includes an online interactive viewer, which is a powerful tool for investigating lexical inference processes.

3 0.95005941 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures

Author: Sean Szumlanski ; Fernando Gomez ; Valerie K. Sims

Abstract: We have elicited human quantitative judgments of semantic relatedness for 122 pairs of nouns and compiled them into a new set of relatedness norms that we call Rel-122. Judgments from individual subjects in our study exhibit high average correlation to the resulting relatedness means (r = 0.77, σ = 0.09, N = 73), although not as high as Resnik’s (1995) upper bound for expected average human correlation to similarity means (r = 0.90). This suggests that human perceptions of relatedness are less strictly constrained than perceptions of similarity and establishes a clearer expectation for what constitutes human-like performance by a computational measure of semantic relatedness. We compare the results of several WordNet-based similarity and relatedness measures to our Rel-122 norms and demonstrate the limitations of WordNet for discovering general indications of semantic relatedness. We also offer a critique of the field’s reliance upon similarity norms to evaluate relatedness measures.

4 0.94917774 150 acl-2013-Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications

Author: Georgios Kontonatsios ; Paul Thompson ; Riza Theresa Batista-Navarro ; Claudiu Mihaila ; Ioannis Korkontzelos ; Sophia Ananiadou

Abstract: U-Compare is a UIMA-based workflow construction platform for building natural language processing (NLP) applications from heterogeneous language resources (LRs), without the need for programming skills. U-Compare has been adopted within the context of the METANET Network of Excellence, and over 40 LRs that process 15 European languages have been added to the U-Compare component library. In line with METANET’s aims of increasing communication between citizens of different European countries, U-Compare has been extended to facilitate the development of a wider range of applications, including both multilingual and multimodal workflows. The enhancements exploit the UIMA Subject of Analysis (Sofa) mechanism, that allows different facets of the input data to be represented. We demonstrate how our customised extensions to U-Compare allow the construction and testing of NLP applications that transform the input data in different ways, e.g., machine translation, automatic summarisation and text-to-speech.

5 0.94114172 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers

Author: Andre Martins ; Miguel Almeida ; Noah A. Smith

Abstract: We present fast, accurate, direct non-projective dependency parsers with third-order features. Our approach uses AD3, an accelerated dual decomposition algorithm which we extend to handle specialized head automata and sequential head bigram models. Experiments in fourteen languages yield parsing speeds competitive to projective parsers, with state-of-the-art accuracies for the largest datasets (English, Czech, and German).

same-paper 6 0.94046623 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

7 0.87719452 284 acl-2013-Probabilistic Sense Sentiment Similarity through Hidden Emotions

8 0.81569427 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

9 0.77137536 118 acl-2013-Development and Analysis of NLP Pipelines in Argo

10 0.71983409 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation

11 0.70358199 51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP

12 0.63713503 239 acl-2013-Meet EDGAR, a tutoring agent at MONSERRATE

13 0.63673276 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

14 0.63104039 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit

15 0.61686623 297 acl-2013-Recognizing Partial Textual Entailment

16 0.59628296 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations

17 0.59419346 237 acl-2013-Margin-based Decomposed Amortized Inference

18 0.5899694 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments

19 0.58886778 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

20 0.57952672 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages