emnlp emnlp2013 emnlp2013-96 knowledge-graph by maker-knowledge-mining

96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora


Source: pdf

Author: Karl Pichotta ; John DeNero

Abstract: We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 pichotta@cs.utexas.edu Abstract: We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. [sent-3, score-1.007]

2 Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. [sent-4, score-0.25]

3 Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set. [sent-5, score-0.788]

4 1 Introduction A multiword expression (MWE), or noncompositional compound, is a sequence of words whose meaning cannot be composed directly from the meanings of its constituent words. [sent-6, score-0.406]

5 We focus on a particular subset of MWEs, English phrasal verbs. [sent-20, score-0.57]

6 A phrasal verb consists of a head verb followed by one or more particles, such that the meaning of the phrase cannot be determined by combining the simplex meanings of its constituent words (Baldwin and Villavicencio, 2002; Dixon, 1982; Bannard et al. [sent-21, score-1.019]

7 Examples of phrasal verbs include count on [rely], look after [tend], or take off [remove], the meanings of which do not involve counting, looking, or taking. [sent-23, score-0.878]

8 In contrast, there are verbs followed by particles that are not phrasal verbs, because their meaning is compositional, such as walk towards, sit behind, or paint on. [sent-24, score-0.854]

9 We identify phrasal verbs by using frequency statistics calculated from parallel corpora, consisting of bilingual pairs of documents such that one is a translation of the other, with one document in English. [sent-25, score-1.116]

10 We leverage the observation that a verb will translate in an atypical way when occurring as the head of a phrasal verb. [sent-26, score-0.739]

11 For example, the word look in the context of look after will tend to translate differently from how look translates generally. [sent-27, score-0.302]

12 We expect that idiomatic phrasal verbs will tend to have unexpected translation of their head verbs, measured by the Kullback-Leibler divergence between those distributions. [sent-29, score-1.042]

13 Our polyglot ranking approach is motivated by the hypothesis that using many parallel corpora of different languages will help determine the degree of semantic idiomaticity of a phrase. [sent-30, score-0.244]

14 (Footnote 1) Nomenclature varies: the term verb-particle construction is also used to denote what we call phrasal verbs; further, the term phrasal verb is sometimes used to denote a broader class of constructions. [sent-31, score-1.309]

15 In order to combine evidence from multiple languages, we develop a novel boosting algorithm tailored to the task of ranking multiword expressions by their degree of idiomaticity. [sent-34, score-0.555]

16 We train and evaluate on disjoint subsets of the phrasal verbs in English Wiktionary. [sent-35, score-0.788]

17 In our experiments, the set of phrasal verbs identified automatically by our method achieves held-out recall that nears the performance of the phrasal verbs in WordNet 3. [sent-36, score-1.576]

18 Our approach strongly outperforms a monolingual system, and continues to improve when incrementally adding translation statistics for 50 different languages. [sent-38, score-0.26]

19 2 Identifying Phrasal Verbs The task of identifying phrasal verbs using corpus information raises several issues of experimental design. [sent-39, score-0.826]

20 Many phrasal verbs admit compositional senses in addition to idiomatic ones—contrast idiomatic “look down on him for his politics” with compositional “look down on him from the balcony.” [sent-44, score-1.23]

21 In this paper, we focus on the task of determining whether a phrase type is a phrasal verb, meaning that it frequently expresses an idiomatic meaning across its many token usages in a corpus. [sent-45, score-0.801]

22 Identifying phrasal verbs involves relative, rather than categorical, judgments: some phrasal verbs are more compositional than others, but retain a degree of noncompositionality (McCarthy et al. [sent-49, score-1.678]

23 Moreover, a polysemous phrasal verb may express an idiosyncratic sense more or less often than a compositional sense in a particular corpus. [sent-51, score-0.806]

24 Therefore, we should expect a corpus-driven system not to classify phrases as strictly idiomatic or compositional, but instead assign a ranking or relative scoring to a set of candidates. [sent-52, score-0.27]

25 We distinguish between the task of identifying candidate multiword expressions ... (footnote 2: http://en.) [sent-54, score-0.506]

26 With English phrasal verbs, it is straightforward to enumerate all desired verbs followed by one or more particles, and rank the entire set. [sent-58, score-0.788]

27 To the best of our knowledge, ours is the first approach to use translational distributions to leverage the observation that a verb typically translates differently when it heads a phrasal verb. [sent-63, score-0.771]

28 3 The Polyglot Ranking Approach Our approach uses bilingual and monolingual statistics as features, computed over unlabeled corpora. [sent-64, score-0.349]

29 Each statistic characterizes the degree of idiosyncrasy of a candidate phrasal verb, using a single monolingual or bilingual corpus. [sent-65, score-1.052]

30 We combine features for many language pairs using a boosting algorithm that optimizes a ranking objective using a supervised training set of English phrasal verbs. [sent-66, score-0.726]

31 We capture this effect by measuring the divergence between how a verb translates generally and how it translates when heading a candidate phrasal verb. [sent-70, score-0.924]

32 A phrase e aligns to another phrase f if some word in e aligns to some word in f and no word in e or f aligns outside of f or e, respectively. [sent-73, score-0.271]
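
The alignment-consistency condition in the sentence above is simple to check mechanically. A minimal Python sketch, assuming word alignments for a sentence pair are given as a set of (i, j) token-index links and phrases as half-open index spans (the function name and data layout are illustrative, not taken from the paper):

    def phrases_align(e_span, f_span, alignments):
        """Return True if English span e and foreign span f align consistently:
        some word in e links to some word in f, and no link crosses the
        boundary of either span."""
        e_start, e_end = e_span          # half-open range of English token indices
        f_start, f_end = f_span          # half-open range of foreign token indices
        linked = False
        for i, j in alignments:
            inside_e = e_start <= i < e_end
            inside_f = f_start <= j < f_end
            if inside_e and inside_f:
                linked = True            # a link internal to both spans
            elif inside_e != inside_f:
                return False             # a link leaves one span but not the other
        return linked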

33 Given an English phrase e, define F(e) to be the set of all foreign phrases observed aligned to e in a parallel corpus. [sent-75, score-0.312]

34 This probability is estimated as the relative frequency of observing f and e as an aligned phrase pair, conditioned on observing e aligned to any phrase in the corpus: P(f|e) = N(e, f) / Σ_f′ N(e, f′), with N(e, f) the number of times e and f are observed occurring as an aligned phrase pair. [sent-77, score-0.456]
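
The relative-frequency estimate P(f|e) can be computed directly from aligned phrase-pair observations. A minimal sketch (the input format is an assumption for illustration, not the paper's pipeline):

    from collections import Counter, defaultdict

    def phrase_translation_probs(aligned_phrase_pairs):
        """Estimate P(f|e) from an iterable of observed (e, f) aligned phrase pairs."""
        counts = defaultdict(Counter)
        for e, f in aligned_phrase_pairs:
            counts[e][f] += 1                       # N(e, f)
        probs = {}
        for e, f_counts in counts.items():
            total = sum(f_counts.values())          # sum over f' of N(e, f')
            probs[e] = {f: n / total for f, n in f_counts.items()}
        return probs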

35 Next, we assign statistics to individual verbs within phrases. [sent-78, score-0.278]

36 The first word of a candidate phrasal verb e is a verb. [sent-79, score-0.808]

37 For a candidate phrasal verb e and a foreign phrase f, let π1 (e, f) be the subphrase of f that is most commonly word-aligned to the first word of e. [sent-80, score-0.986]

38 For a phrase e and its set F(e) of aligned translations, we define the constituent translation probability of a foreign subphrase x as: Pe(x) = Σ_{f ∈ F(e)} P(f|e) · δ(π1(e, f), x)   (1), where δ is the Kronecker delta function, taking value 1 if its arguments are equal and 0 otherwise. [sent-84, score-0.335]

39 Intuitively, Pe assigns the probability mass for every f to its subphrase most commonly aligned to the verb in e. [sent-85, score-0.313]
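
Equation (1) simply reallocates each aligned phrase's probability mass to the subphrase most often word-aligned to the verb. A sketch of that computation, assuming π1 has already been tabulated as a lookup from each aligned foreign phrase f to that subphrase (a hypothetical helper, not the authors' code):

    from collections import defaultdict

    def constituent_translation_probs(p_f_given_e, pi1):
        """Compute P_e(x) = sum over f in F(e) of P(f|e) * [pi1(e, f) == x].

        p_f_given_e: dict mapping foreign phrase f -> P(f|e) for a fixed English phrase e.
        pi1: dict mapping f -> the subphrase of f most often word-aligned to
             the first word (the verb) of e.
        """
        p_e = defaultdict(float)
        for f, prob in p_f_given_e.items():
            p_e[pi1[f]] += prob    # all of f's mass goes to its verb-aligned subphrase
        return dict(p_e)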

40 It expresses how this verb is translated in the context of a phrasal verb construction. [sent-86, score-0.908]

41 We also assign statistics to verbs as they are translated outside of the context of a phrase. [sent-88, score-0.278]

42 Let v(e) be the verb of a phrasal verb candidate e, which is always its first word. [sent-89, score-0.977]

43 For a single-word verb phrase v(e), we can compute the constituent translation probability Pv(e) (x), again using Equation (1). [sent-90, score-0.328]

44 Finally, for a phrase e and its verb v(e), we calculate the Kullback-Leibler (KL) divergence between the translation distributions of e and v(e), DKL(Pe ‖ Pv(e)). [sent-92, score-0.346]

45 (Fragment of the smoothing definition: … 10/|D| if x ∉ D.) (Footnote 3) To extend this statistic to other types of multiword expressions, one could compute a similar distribution for other content words in the phrase. [sent-101, score-0.419]

46 For a candidate English phrasal verb e (for example, look up), let E denote the set of inflections of that phrasal verb (for look up, this will be [look|looks|looked|looking] up). [sent-125, score-1.727]

47 The final score computed from a phrase-aligned parallel corpus translating English sentences into a language L is the average KL divergence of smoothed constituent translation distributions over the inflected forms ei ∈ E: ϕL(e) = (1/|E|) Σ_{ei ∈ E} DKL(Pei ‖ Pv(ei)). [sent-127, score-0.261]
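
Concretely, the per-language score averages a KL divergence over the inflected forms of the candidate. A sketch, using a small probability floor as a stand-in for the paper's smoothing (whose exact constants are not recoverable from the fragments above); the dictionaries map each inflected form to its constituent translation distribution and to the distribution of its verb alone:

    import math

    def kl_divergence(p, q, floor=1e-10):
        """D_KL(p || q) over the support of p, flooring q(x) as a crude smoothing stand-in."""
        total = 0.0
        for x, px in p.items():
            if px > 0.0:
                total += px * math.log(px / max(q.get(x, 0.0), floor))
        return total

    def phi_L(inflections, phrase_dists, verb_dists):
        """Average KL divergence between each inflected phrasal verb's constituent
        translation distribution and that of its verb alone."""
        scores = [kl_divergence(phrase_dists[ei], verb_dists[ei]) for ei in inflections]
        return sum(scores) / len(scores)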

48 2 Monolingual Statistics We also collect a number of monolingual statistics for each phrasal verb candidate, motivated by the considerable body of previous work on the topic (Church and Hanks, 1990; Lin, 1999; McCarthy et al. [sent-130, score-0.951]

49 The monolingual statistics are designed to identify frequent collocations in a language. [sent-132, score-0.253]

50 This set of monolingual features is not comprehensive, as we focus our attention primarily on bilingual features in this paper. [sent-133, score-0.289]

51 This statistic characterizes the degree of association between a verb and its phrasal extension. [sent-140, score-0.863]

52 Finally, define µ3(e) to be the relative frequency of the phrasal verb e augmented by an accusative pronoun, conditioned on the verb. [sent-142, score-0.777]

53 For e = look up, A = {look up, look X up, look up X, looks up, looks X up, looks up X, ...}. [sent-144, score-0.27]

54 This statistic is designed to exploit the intuition that phrasal verbs frequently have accusative pronouns either inserted into the middle (e.g. [sent-152, score-0.124]
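
The µ3 statistic described above can be sketched as a simple ratio of corpus counts. A hedged illustration, assuming counts of the pronoun-augmented variants and of the bare verb have already been collected (the pronoun list and count format are illustrative; the paper's exact variant set and normalization may differ):

    ACCUSATIVE_PRONOUNS = {"me", "you", "him", "her", "it", "us", "them"}

    def mu3(verb_count, variant_counts):
        """Relative frequency of the phrasal verb augmented by an accusative
        pronoun (e.g. ('look', 'it', 'up') or ('look', 'up', 'it')),
        conditioned on the count of the verb itself."""
        augmented = sum(
            count for tokens, count in variant_counts.items()
            if any(tok in ACCUSATIVE_PRONOUNS for tok in tokens)
        )
        return augmented / verb_count if verb_count else 0.0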

55 1, a function ϕL (e) assigning real values to candidate phrasal verbs e, which we hypothesize is higher on average for more idiomatic compounds. [sent-160, score-1.011]

56 For each ϕL and µi, we compute a ranked list of candidate phrasal verbs, ordered from highest to lowest value. [sent-165, score-0.639]

57 To simplify learning, we consider only the top 5000 candidate phrasal verbs according to µ1, µ2, and µ3. [sent-166, score-0.857]

58 This pruning procedure excludes candidates that do not appear in our monolingual corpus. [sent-167, score-0.246]

59 We optimize the ranker using an unranked, incomplete training set of phrasal verbs. [sent-168, score-0.676]

60 (Pseudocode fragment from the boosting algorithm listing:) for i = 1 : |X| do; if xi ∈ ht then w[i] ← (1/Z) w[i] exp(−αt), else w[i] ← (1/Z) w[i] exp(αt); end if; end for; end for ... to this gold-standard training set. [sent-176, score-0.263]

61 The algorithm maintains a weight vector w (summing to 1) which contains a positive real number for each gold standard phrasal verb in the training set X. [sent-186, score-0.739]
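
The weight update recovered in the pseudocode fragment above is an AdaBoost-style reweighting of the gold items: phrasal verbs that the current weak ranker retrieves are down-weighted, the rest are up-weighted, and the vector is renormalized to sum to 1. A reconstruction sketch (variable names follow the fragment; this is not the authors' code):

    import math

    def update_weights(w, gold_items, retrieved, alpha):
        """One boosting round: w[i] <- (1/Z) * w[i] * exp(-alpha) if gold item i
        is in the weak ranker's retrieved set h_t, else (1/Z) * w[i] * exp(alpha)."""
        new_w = []
        for x, weight in zip(gold_items, w):
            if x in retrieved:
                new_w.append(weight * math.exp(-alpha))
            else:
                new_w.append(weight * math.exp(alpha))
        z = sum(new_w)                       # normalizer Z
        return [weight / z for weight in new_w]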

62 1 Training and Test Set In order to train and evaluate our system, we construct a gold-standard list of phrasal verbs from the freely available English Wiktionary. [sent-220, score-0.788]

63 We gather phrasal verbs from three sources within Wiktionary: 1. [sent-221, score-0.788]

64 Many of the idioms and derived terms are not phrasal verbs (e.g. [sent-232, score-0.788]

65 We omit prepositions that do not productively form English phrasal verbs, such as amid and as. [sent-236, score-0.602]

66 This process also omits some compounds that are sometimes called phrasal verbs, such as light verb constructions, e.g. [sent-237, score-0.775]

67 There are a number of extant phrasal verb corpora. [sent-242, score-0.739]

68 (2003) present graded human compositionality judgments for 116 phrasal verbs, and Baldwin (2008) presents a large set of candidates produced by an automated system, with false positives manually removed. [sent-244, score-0.664]

69 2 Filtering and Data Partition The merged list of phrasal verbs extracted from Wiktionary included some common collocations that have compositional semantics (e.g. [sent-247, score-0.326]

70 We sorted all Wiktionary phrasal verbs according to this value. [sent-255, score-0.788]

71 Our parallel corpora from English to each of 50 languages also consist of documents collected from the web via distributed data mining of parallel documents based on the text content of web pages (Uszkoreit et al. [sent-267, score-0.256]

72 Monolingual and bilingual statistics are calculated using the corpora described in section 4. [sent-285, score-0.247]

73 3, with candidate phrasal verbs being drawn from the set described in section 4. [sent-286, score-0.857]

74 We evaluate our method of identifying phrasal verbs by computing recall-at-N. [sent-288, score-0.826]

75 This statistic is the fraction of the Wiktionary test set that appears among the top N phrasal verbs proposed by the method, where N is an arbitrary number of top-ranked candidates held constant when comparing different approaches (we use N = 1220). [sent-289, score-1.006]
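
Recall-at-N itself is straightforward to compute from a ranked candidate list and the held-out test set. A minimal sketch:

    def recall_at_n(ranked_candidates, test_set, n=1220):
        """Fraction of the test set found among the top-n ranked candidates."""
        top_n = set(ranked_candidates[:n])
        return len(top_n & set(test_set)) / len(test_set)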

76 We do not compute precision, because the test set to which we compare is not an exhaustive list of phrasal verbs, due to the development/test split, frequency filtering, and omissions in the original lexical resource. [sent-290, score-0.57]

77 Proposing a phrasal verb not in the test set is not necessarily an error, but identifying many phrasal verbs from the test set is an indication of an effective method. [sent-291, score-1.565]

78 As a competitive baseline, we evaluated the set of phrasal verbs in WordNet 3. [sent-295, score-0.788]

79 Such divergence is typical: lexical resources often disagree about what multiword expressions to include (Lin, 1999). [sent-303, score-0.451]

80 The three final lines in Table 3 evaluate our ... (Figure 2, x-axis: number of languages k; caption: The solid line shows recall-at-1220 when combining the k best-performing bilingual statistics and three monolingual statistics.) [sent-304, score-0.389]

81 Automatically detecting phrasal verbs using monolingual features alone strongly outperformed the frequency-based lower bound, but underperformed the WordNet baseline. [sent-307, score-0.94]

82 6 Feature Analysis The solid line in Figure 2 shows the recall-at-1220 for a boosted ranker using all monolingual statistics and k bilingual statistics, for increasing k. [sent-311, score-0.455]

83 That is, the point at k = 0 uses only µ1, µ2, and µ3, the point at k = 1 adds the best individually-performing bilingual statistic (Spanish) as a weak ranker, the next point adds the second-best bilingual statistic (German), etc. [sent-313, score-0.586]

84 they are useful in addition to the 50 bilingual statistics combined, and no single statistic provides maximal performance. [sent-316, score-0.321]

85 Table 4 shows the effect of adding different subsets of the monolingual statistics to the set of all 50 bilingual statistics. [sent-322, score-0.349]

86 7 Error Analysis Table 5 shows the 100 highest ranked phrasal verb candidates by our system that do not appear in either the development or test sets. [sent-330, score-0.833]

87 sets during filtering, and the remainder are in fact not phrasal verbs (true precision errors). [sent-333, score-0.788]

88 Other candidates are not phrasal verbs, but instead phrases that tend to have a different syntactic role, such as suit against, instance through, fit for, and lie over (conjugated as lay over). [sent-337, score-0.754]

89 5 Related Work The idea of using word-aligned parallel corpora to identify idiomatic expressions has been pursued in a number of different ways. [sent-339, score-0.391]

90 Tsvetkov and Wintner (2012) generate candidate MWEs by finding one-to-one alignments in parallel corpora which are not in a bilingual dictionary, and ranking them based on monolingual statistics. [sent-344, score-0.562]

91 A large body of work has investigated the identification of noncompositional compounds from monolingual sources (Lin, 1999; Schone and Jurafsky, 2001 ; Fazly and Stevenson, 2006; McCarthy et al. [sent-347, score-0.296]

92 Many of these monolingual statistics could be viewed as weak rankers and fruitfully incorporated into our framework. [sent-350, score-0.336]

93 6 Conclusion We have presented the polyglot ranking approach to phrasal verb identification, using parallel corpora from many languages to identify phrasal verbs. [sent-354, score-1.622]

94 We developed a recall-oriented learning method that integrates multiple weak ranking signals, and demonstrated experimentally that combining statistical evidence from a large number of bilingual corpora, as well as from monolingual corpora, produces the most effective system overall. [sent-356, score-0.424]

95 Identification and treatment of multiword expressions applied to information retrieval. [sent-362, score-0.399]

96 Automatically constructing a lexicon of verb phrase idiomatic combinations. [sent-429, score-0.4]

97 Detecting a continuum of compositionality in phrasal verbs. [sent-474, score-0.57]

98 Improving statistical machine translation using domain bilingual multiword expressions. [sent-487, score-0.48]

99 Predicting the compositionality of multiword expressions using translations in multiple languages. [sent-497, score-0.44]

100 Is knowledge-free induction of multiword unit dictionary headwords a solved problem? [sent-501, score-0.295]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('phrasal', 0.57), ('multiword', 0.295), ('verbs', 0.218), ('verb', 0.169), ('idiomatic', 0.154), ('monolingual', 0.152), ('wiktionary', 0.145), ('bilingual', 0.137), ('mwes', 0.136), ('pv', 0.136), ('statistic', 0.124), ('ht', 0.114), ('ranker', 0.106), ('expressions', 0.104), ('candidates', 0.094), ('look', 0.09), ('boosting', 0.085), ('parallel', 0.083), ('baldwin', 0.079), ('noncompositional', 0.077), ('phrase', 0.077), ('aligned', 0.075), ('ranking', 0.071), ('candidate', 0.069), ('aline', 0.069), ('polyglot', 0.069), ('subphrase', 0.069), ('mwe', 0.069), ('compositional', 0.067), ('particles', 0.066), ('weak', 0.064), ('ud', 0.06), ('lg', 0.06), ('rankers', 0.06), ('statistics', 0.06), ('wordnet', 0.057), ('sag', 0.055), ('shoot', 0.055), ('pe', 0.053), ('mccarthy', 0.053), ('caseli', 0.052), ('moir', 0.052), ('salehi', 0.052), ('tsvetkov', 0.052), ('villada', 0.052), ('villavicencio', 0.052), ('divergence', 0.052), ('corpora', 0.05), ('timothy', 0.05), ('translation', 0.048), ('freund', 0.046), ('melamed', 0.045), ('suit', 0.045), ('phrases', 0.045), ('catch', 0.044), ('english', 0.044), ('collocations', 0.041), ('ionary', 0.041), ('adaboost', 0.041), ('translations', 0.041), ('languages', 0.04), ('aligns', 0.039), ('accusative', 0.038), ('dkl', 0.038), ('identifying', 0.038), ('compounds', 0.036), ('bannard', 0.036), ('round', 0.035), ('acosta', 0.035), ('buzz', 0.035), ('endw', 0.035), ('engl', 0.035), ('evii', 0.035), ('exi', 0.035), ('haul', 0.035), ('inquire', 0.035), ('lgn', 0.035), ('lgx', 0.035), ('lnppevii', 0.035), ('noncompositionality', 0.035), ('slap', 0.035), ('unranked', 0.035), ('constituent', 0.034), ('translates', 0.032), ('foreign', 0.032), ('prepositions', 0.032), ('ei', 0.032), ('inflected', 0.032), ('constructions', 0.031), ('identification', 0.031), ('sums', 0.031), ('wintner', 0.03), ('birke', 0.03), ('venkatapathy', 0.03), ('finlayson', 0.03), ('entries', 0.029), ('get', 0.029), ('proceedings', 0.028), ('smoothed', 0.028), ('church', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000017 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora

Author: Karl Pichotta ; John DeNero

Abstract: We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.

2 0.25793517 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment

Author: Xuchen Yao ; Benjamin Van Durme ; Chris Callison-Burch ; Peter Clark

Abstract: We introduce a novel discriminative model for phrase-based monolingual alignment using a semi-Markov CRF. Our model achieves state-of-the-art alignment accuracy on two phrase-based alignment datasets (RTE and paraphrase), while doing significantly better than other strong baselines in both non-identical alignment and phrase-only alignment. Additional experiments highlight the potential benefit of our alignment model to RTE, paraphrase identification and question answering, where even a naive application of our model’s alignment score approaches the state of the art.

3 0.23978293 32 emnlp-2013-Automatic Idiom Identification in Wiktionary

Author: Grace Muzny ; Luke Zettlemoyer

Abstract: Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, by simply using known word sense disambiguation algorithms to match phrases to their definitions. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.

4 0.23769718 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

Author: Fandong Meng ; Jun Xie ; Linfeng Song ; Yajuan Lu ; Qun Liu

Abstract: We present a novel translation model, which simultaneously exploits the constituency and dependency trees on the source side, to combine the advantages of two types of trees. We take head-dependents relations of dependency trees as backbone and incorporate phrasal nodes of constituency trees as the source side of our translation rules, and the target side as strings. Our rules hold the property of long distance reorderings and the compatibility with phrases. Large-scale experimental results show that our model achieves significant improvements over the constituency-to-string (+2.45 BLEU on average) and dependency-to-string (+0.91 BLEU on average) models, which only employ a single type of trees, and significantly outperforms the state-of-the-art hierarchical phrase-based model (+1.12 BLEU on average), on three Chinese-English NIST test sets.

5 0.18232693 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models

Author: Douwe Kiela ; Stephen Clark

Abstract: We present a novel unsupervised approach to detecting the compositionality of multi-word expressions. We compute the compositionality of a phrase through substituting the constituent words with their “neighbours” in a semantic vector space and averaging over the distance between the original phrase and the substituted neighbour phrases. Several methods of obtaining neighbours are presented. The results are compared to existing supervised results and achieve state-of-the-art performance on a verb-object dataset of human compositionality ratings.

6 0.12182074 151 emnlp-2013-Paraphrasing 4 Microblog Normalization

7 0.11067296 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

8 0.10127131 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

9 0.092521518 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)

10 0.090933882 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation

11 0.087233901 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning

12 0.080690831 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings

13 0.08043737 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

14 0.078970432 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge

15 0.077696905 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors

16 0.077456124 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

17 0.075225756 84 emnlp-2013-Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation

18 0.073522165 123 emnlp-2013-Learning to Rank Lexical Substitutions

19 0.071363837 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models

20 0.07080394 57 emnlp-2013-Dependency-Based Decipherment for Resource-Limited Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.237), (1, -0.107), (2, 0.0), (3, -0.013), (4, 0.047), (5, 0.092), (6, -0.07), (7, -0.017), (8, -0.075), (9, -0.163), (10, 0.134), (11, 0.169), (12, 0.22), (13, 0.282), (14, 0.065), (15, -0.03), (16, -0.053), (17, 0.033), (18, 0.153), (19, -0.323), (20, 0.016), (21, -0.022), (22, -0.113), (23, 0.104), (24, 0.083), (25, 0.07), (26, 0.2), (27, -0.004), (28, -0.085), (29, 0.048), (30, 0.082), (31, -0.074), (32, -0.029), (33, -0.163), (34, -0.038), (35, 0.01), (36, -0.08), (37, 0.032), (38, 0.034), (39, -0.06), (40, -0.024), (41, 0.039), (42, 0.033), (43, 0.07), (44, -0.034), (45, 0.002), (46, -0.024), (47, 0.049), (48, -0.084), (49, 0.07)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96482599 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora

Author: Karl Pichotta ; John DeNero

Abstract: We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.

2 0.7682597 32 emnlp-2013-Automatic Idiom Identification in Wiktionary

Author: Grace Muzny ; Luke Zettlemoyer

Abstract: Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, by simply using known word sense disambiguation algorithms to match phrases to their definitions. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.

3 0.62351203 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models

Author: Douwe Kiela ; Stephen Clark

Abstract: We present a novel unsupervised approach to detecting the compositionality of multi-word expressions. We compute the compositionality of a phrase through substituting the constituent words with their “neighbours” in a semantic vector space and averaging over the distance between the original phrase and the substituted neighbour phrases. Several methods of obtaining neighbours are presented. The results are compared to existing supervised results and achieve state-of-the-art performance on a verb-object dataset of human compositionality ratings.

4 0.58973247 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment

Author: Xuchen Yao ; Benjamin Van Durme ; Chris Callison-Burch ; Peter Clark

Abstract: We introduce a novel discriminative model for phrase-based monolingual alignment using a semi-Markov CRF. Our model achieves state-of-the-art alignment accuracy on two phrase-based alignment datasets (RTE and paraphrase), while doing significantly better than other strong baselines in both non-identical alignment and phrase-only alignment. Additional experiments highlight the potential benefit of our alignment model to RTE, paraphrase identification and question answering, where even a naive application of our model’s alignment score approaches the state of the art.

5 0.56151432 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

Author: Fandong Meng ; Jun Xie ; Linfeng Song ; Yajuan Lu ; Qun Liu

Abstract: We present a novel translation model, which simultaneously exploits the constituency and dependency trees on the source side, to combine the advantages of two types of trees. We take head-dependents relations of dependency trees as backbone and incorporate phrasal nodes of constituency trees as the source side of our translation rules, and the target side as strings. Our rules hold the property of long distance reorderings and the compatibility with phrases. Large-scale experimental results show that our model achieves significant improvements over the constituency-to-string (+2.45 BLEU on average) and dependency-to-string (+0.91 BLEU on average) models, which only employ a single type of trees, and significantly outperforms the state-of-the-art hierarchical phrase-based model (+1.12 BLEU on average), on three Chinese-English NIST test sets.

6 0.39163947 151 emnlp-2013-Paraphrasing 4 Microblog Normalization

7 0.37403411 123 emnlp-2013-Learning to Rank Lexical Substitutions

8 0.35820195 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

9 0.34685859 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk

10 0.32775253 23 emnlp-2013-Animacy Detection with Voting Models

11 0.3256028 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)

12 0.32415983 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

13 0.31738979 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation

14 0.30530414 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge

15 0.29762211 183 emnlp-2013-The VerbCorner Project: Toward an Empirically-Based Semantic Decomposition of Verbs

16 0.2893185 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

17 0.28910697 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings

18 0.28633255 139 emnlp-2013-Noise-Aware Character Alignment for Bootstrapping Statistical Machine Transliteration from Bilingual Corpora

19 0.27793041 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

20 0.27582896 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.019), (18, 0.022), (22, 0.041), (30, 0.061), (45, 0.01), (50, 0.012), (51, 0.65), (54, 0.016), (66, 0.024), (71, 0.026), (75, 0.012), (77, 0.023), (96, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99820626 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora

Author: Karl Pichotta ; John DeNero

Abstract: We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.

2 0.99794108 24 emnlp-2013-Application of Localized Similarity for Web Documents

Author: Peter Rebersek ; Mateja Verlic

Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside the original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to a baseline consisting of originally inserted anchor texts. Additionally we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.

3 0.99791646 32 emnlp-2013-Automatic Idiom Identification in Wiktionary

Author: Grace Muzny ; Luke Zettlemoyer

Abstract: Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, by simply using known word sense disambiguation algorithms to match phrases to their definitions. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.

4 0.99786556 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi

Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.

5 0.99725425 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching

Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger

Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.

6 0.99700838 91 emnlp-2013-Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game

7 0.99511987 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs

8 0.99492061 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

9 0.95384014 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models

10 0.94617295 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

11 0.94419819 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution

12 0.94378352 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology

13 0.93788224 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

14 0.9375661 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

15 0.93704504 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries

16 0.93436784 27 emnlp-2013-Authorship Attribution of Micro-Messages

17 0.93183917 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

18 0.93127251 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

19 0.92966801 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

20 0.92856443 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation