emnlp emnlp2012 emnlp2012-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sze-Meng Jojo Wong ; Mark Dras ; Mark Johnson
Abstract: The task of inferring the native language of an author based on texts written in a second language has generally been tackled as a classification problem, typically using as features a mix of n-grams over characters and part of speech tags (for small and fixed n) and unigram function words. To capture arbitrarily long n-grams that syntax-based approaches have suggested are useful, adaptor grammars have some promise. In this work we investigate their extension to identifying n-gram collocations of arbitrary length over a mix of PoS tags and words, using both maxent and induced syntactic language model approaches to classification. After presenting a new, simple baseline, we show that learned collocations used as features in a maxent model perform better still, but that the story is more mixed for the syntactic language model.
Reference: text
sentIndex sentText sentNum sentScore
1 To capture arbitrarily long n-grams that syntax-based approaches have suggested are useful, adaptor grammars have some promise. [sent-7, score-0.797]
2 In this work we investigate their extension to identifying n-gram collocations of arbitrary length over a mix of PoS tags and words, using both maxent and induced syntactic language model approaches to classification. [sent-8, score-0.496]
3 After presenting a new, simple baseline, we show that learned collocations used as features in a maxent model perform better still, but that the story is more mixed for the syntactic language model. [sent-9, score-0.41]
4 1 Introduction The task of inferring the native language of an author based on texts written in a second language (native language identification, NLI) has, since the seminal work of Koppel et al. [sent-10, score-0.6]
5 The recent work of Wong and Dras (2011), motivated by ideas from Second Language Acquisition (SLA), has shown that syntactic features (potentially capturing syntactic errors characteristic of a particular native language) improve performance over purely lexical ones. [sent-15, score-0.319]
6 For the purpose of NLI, small n-gram sizes like bigram or trigram might not suffice to capture sequences that are characteristic of a particular native language. [sent-18, score-0.319]
7 Adaptor grammars (Johnson, 2010), a hierarchical non-parametric extension of PCFGs (and also interpretable as an extension of LDA-based topic models), hold out some promise here. [sent-21, score-0.291]
8 In that initial work, Johnson’s model learnt collocations of arbitrary length such as gradient descent and cost function, under a topic associated with machine learning. [sent-22, score-0.451]
9 (2010) applied this idea to perspective classification, learning collocations such as palestinian violence and palestinian freedom, the use of which as features was demonstrated to help the classification of texts from the Bitter Lemons corpus as either Palestinian or Israeli perspective. [sent-24, score-0.478]
10 We investigate whether the power of adaptor grammars to discover collocations (specifically, ones of arbitrary length that are useful for classification) extends to features beyond the purely lexical. [sent-33, score-1.218]
11 We first utilise adaptor grammars for discovery of high performing ‘quasi-syntactic collocations’ of arbitrary length as mentioned above and use them as classification features in a conventional maximum entropy (maxent) model for identifying the author’s native language. [sent-35, score-1.162]
12 The grammar learned can then be used to infer the most probable native language that a given text written in a second language is associated with. [sent-37, score-0.425]
13 (2010) using adaptor grammars for perspective modeling, which inspired our general approach. [sent-39, score-0.822]
14 In Section 2, we review the existing work on NLI as well as the mechanics of adaptor grammars along with their applications to classification. [sent-43, score-0.825]
15 Section 3 details the supervised maxent classification of NLI with collocation (n-gram) features discovered by adaptor grammars. [sent-44, score-0.76]
16 1 Native Language Identification Most of the existing research treats the task of native language identification as a form of text classification deploying supervised machine learning approaches. [sent-48, score-0.333]
17 Their experiments were conducted on English essays written by authors whose native language was one of Bulgarian, Czech, French, Russian, or Spanish. [sent-51, score-0.274]
18 (2005), Tsur and Rappoport (2007) replicated their work and hypothesised that word choices in second language writing are highly influenced by the frequency of native language syllables. [sent-62, score-0.274]
19 (2007) tackled the broader task of developing profiles of authors, including native language and various other demographic and psychometric author traits, across a smaller set of languages (English, Spanish and Arabic). [sent-65, score-0.327]
20 , 2003; Griffiths and Steyvers, 2004) as a form of feature dimensionality reduction technique to discover coherent latent factors (‘topics’) that might capture predictive features for individual native languages. [sent-75, score-0.301]
21 The work of the present paper differs in that it uses Bayesian techniques to discover collocations of arbitrary length for use in classification, over a mix of both PoS and function words, rather than for use as feature dimensionality reduction. [sent-78, score-0.441]
22 Adaptor Grammars can be viewed as extending PCFGs by permitting the grammar to contain an unbounded number of productions; they are nonparametric in the sense that the particular productions used to analyse a corpus depend on the corpus itself. [sent-83, score-0.215]
23 Instead, the productions used in an adaptor grammar are specified indirectly using a base grammar: the subtrees of the base grammar’s “adapted nonterminals” serve as the possible productions of the adaptor grammar (Johnson et al. [sent-85, score-1.614]
24 In a PCFG, productions are generated independently conditioned on the parent nonterminal, while in an Adaptor Grammar the probability of generating a subtree rooted in an adapted ... Footnote 1: For computational efficiency reasons, Adaptor Grammars require the subtrees to completely expand to terminals. [sent-88, score-0.168]
25 1 Mechanics of adaptor grammars Adaptor Grammars are specified by a PCFG G, plus a subset of G’s non-terminals that are called the adapted non-terminals, as well as a discount parameter a_A, where 0 ≤ a_A < 1, and a concentration parameter b_A, where b_A > −a_A, for each adapted non-terminal A. [sent-94, score-0.901]
26 An adaptor grammar defines a two-parameter Poisson-Dirichlet Process for each adapted non-terminal A governed by the parameters a_A and b_A. [sent-95, score-0.769]
27 Each adapted non-terminal A is associated with its own Chinese Restaurant, where the tables are labelled with subtrees generated by the grammar rooted in A. [sent-99, score-0.255]
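The Chinese Restaurant view of the two-parameter Poisson-Dirichlet Process can be illustrated with a small simulation. This is a minimal sketch under the usual textbook seating rule, not the inference code used in the paper: the function names `sample_table` and `seat_customers` are hypothetical, `a` stands for the discount and `b` for the concentration of one adapted non-terminal, and each table would in practice be labelled with a subtree, which is omitted here.

```python
import random


def sample_table(counts, a, b, rng):
    """Pick a table for the next customer; len(counts) means 'open a new table'.

    counts[k] is the number of customers already at table k, a is the
    discount (0 <= a < 1) and b is the concentration (b > -a).
    """
    n = sum(counts)
    k = len(counts)
    if n == 0:
        return 0  # the first customer always opens the first table
    # Unnormalised weights: (n_k - a) per occupied table, (b + a*K) for a new one.
    weights = [c - a for c in counts] + [b + a * k]
    # These weights sum to n + b, so scaling a uniform draw by n + b suffices.
    r = rng.random() * (n + b)
    acc = 0.0
    for idx, w in enumerate(weights):
        acc += w
        if r < acc:
            return idx
    return k  # guard against floating-point round-off


def seat_customers(num_customers, a, b, seed=0):
    """Seat customers one by one and return the final table sizes."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_customers):
        idx = sample_table(counts, a, b, rng)
        if idx == len(counts):
            counts.append(1)   # new table (a newly generated subtree)
        else:
            counts[idx] += 1   # reuse of an existing table (a cached subtree)
    return counts
```

Popular tables attract more customers, which is what lets frequently reused subtrees behave like single productions.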
28 2 Adaptor grammars as LDA extension With the ability to rewrite non-terminals to entire subtrees, adaptor grammars have been used to extend unigram-based LDA topic models (Johnson, 2010). [sent-109, score-1.088]
29 It has also been shown that it is crucial to go beyond the bag-of-words assumption as topical collocations capture more meaningful information and represent more interpretable topics (Wang et al. [sent-111, score-0.405]
30 Taking the PCFG formulation for the LDA topic models, it can be modified such that each topic Topici generates sequences of words by adapting each of the Topici non-terminals (usually indicated with an underline in an adaptor grammar). [sent-113, score-0.77]
31 The overall schema for capturing topical collocations with an adaptor grammar is as follows: Sentence → Docj j ∈ 1, . [sent-114, score-1.055]
32 , t Topici → Words; Words → Word; Words → Words Word; Word → w, w ∈ V. There is a non-grammar-based approach to finding topical collocations as demonstrated by Wang et al. [sent-129, score-0.338]
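The rule schema can be made concrete by expanding it for a given number of documents and topics. The sketch below is an illustration only: the function name `topical_collocation_rules` and the plain-string rule format are hypothetical, the Words rules are assumed to be the usual left-recursive pair (Words → Word, Words → Words Word), and the rule that terminates the Docj recursion is left out because it is elided in the text above, so the output is not a complete grammar.

```python
def topical_collocation_rules(num_docs, num_topics, vocab):
    """Expand the topical-collocation schema into a flat list of rule strings."""
    rules = []
    for j in range(1, num_docs + 1):
        rules.append(f"Sentence -> Doc{j}")
        for i in range(1, num_topics + 1):
            # Each document can emit any topic, one after another.
            rules.append(f"Doc{j} -> Doc{j} Topic{i}")
    for i in range(1, num_topics + 1):
        # Topic_i is the adapted non-terminal: whole word sequences are cached.
        rules.append(f"Topic{i} -> Words")
    rules.append("Words -> Word")
    rules.append("Words -> Words Word")
    for w in vocab:
        rules.append(f"Word -> {w}")
    return rules
```

For example, `topical_collocation_rules(2, 3, ["NN", "the"])` yields two Sentence rules, six Doc rules, three Topic rules, the two Words rules and two Word rules.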
33 Both of these approaches learned useful collocations: for instance, as mentioned in Section 1, Johnson (2010) found collocations such as gradient descent and cost function associated with the topic of machine learning; Wang et al. [sent-131, score-0.444]
34 (2007) found that the topic of the human receptive system comprises collocations such as visual cortex and motion detector. [sent-132, score-0.391]
35 Adaptor grammars have also been deployed as a form of feature selection in discovering useful collocations for perspective classification. [sent-133, score-0.531]
36 We are adopting a similar approach in this paper for classifying texts with respect to the author’s native language; but the key difference with Hardisty et al. [sent-136, score-0.301]
37 (2010)’s approach is that our focus is on collocations that mix PoS and lexical elements, rather than being purely lexical. [sent-137, score-0.357]
38 3 Maxent Classification In this section, we first explain the procedures taken to set up the conventional supervised classification task for NLI through the deployment of adaptor grammars for discovery of ‘quasi-syntactic collocations’ of arbitrary length. [sent-138, score-0.856]
39 2 Following our earlier NLI work in Wong and Dras (2011), our data set consists of 490 texts written in English by authors of seven different native language groups: Bulgarian, Czech, French, Russian, Spanish, Chinese, and Japanese. [sent-147, score-0.333]
40 Each native language contributes 70 out of the 490 texts. [sent-148, score-0.274]
41 2 Adaptor grammars for supervised classification We derive two adaptor grammars for the maxent classification setting, where each is associated with a different set of vocabulary (i. [sent-152, score-1.253]
42 Rules of the form Docj → Docj Topici that encode the possible topics that are associated with a document j are given similar α priors as used in LDA (α = 5/t where t = 25 in our experiments). [sent-164, score-0.161]
43 Likewise, similar β priors from LDA are placed on the adapted rules expanding from Topici → Words, representing the possible sequences of words that each topic comprises (β = 0. [sent-165, score-0.209]
44 The inference algorithm for the adaptor grammars is based on the Markov Chain Monte Carlo technique made available online by Johnson (2010). [sent-167, score-0.797]
45 3 Classification models with n-gram features Based on the two adaptor grammars inferred, the resulting collocations (n-grams) are extracted as features for the classification task of identifying authors’ native language. [sent-170, score-1.433]
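To show how extracted collocations can act as binary features in a maxent classifier, here is a minimal sketch of multinomial logistic regression trained by plain stochastic gradient ascent. It is a stand-in for illustration, not the MegaM setup used in the experiments: there is no regularisation, and the function names, learning rate, epoch count and the toy feature and class names in the usage note are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def train_maxent(docs, labels, classes, epochs=200, lr=0.5):
    """docs: list of sets of collocation strings; labels: parallel class names."""
    weights = {c: {} for c in classes}
    for _ in range(epochs):
        for feats, gold in zip(docs, labels):
            scores = [sum(weights[c].get(f, 0.0) for f in feats) for c in classes]
            probs = softmax(scores)
            for c, p in zip(classes, probs):
                # Log-likelihood gradient for a binary feature: observed - expected.
                grad = (1.0 if c == gold else 0.0) - p
                for f in feats:
                    weights[c][f] = weights[c].get(f, 0.0) + lr * grad
    return weights


def predict(weights, classes, feats):
    """Return the highest-scoring class (first one wins on ties)."""
    scores = [sum(weights[c].get(f, 0.0) for f in feats) for c in classes]
    return classes[scores.index(max(scores))]
```

With toy documents such as `[{"x", "s"}, {"y", "s"}, {"x"}, {"y"}]` labelled `["A", "B", "A", "B"]`, the shared feature `s` ends up with little weight while `x` and `y` separate the two classes.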
46 These n-grams found by the adaptor grammars are only a (not necessarily proper) subset of those n-grams that are strongly characteristic of a particular native language. [sent-171, score-1.116]
47 The use of adaptor grammars here can be viewed as a form of feature selection, as in Hardisty et al. [sent-173, score-0.797]
48 The first motivation for this feature set is that, in a sense, this should give a rough upper bound for the adaptor grammar’s PoS-alone n-grams, as these latter should most often be a subset of the former. [sent-187, score-0.624]
49 We look at the top n-grams up to length 5 selected by IG: the top 2,800 and the top 6,500 (for comparability with adaptor grammar feature sets, below), as well as the top 10,000 and the top 20,000 (to study the effect of larger feature space). [sent-192, score-0.749]
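The enumerate-then-select baseline can be sketched as follows: enumerate every n-gram up to length 5 in each document, score each n-gram by the information gain (IG) of its presence or absence with respect to the class label, and keep the top k. This is a minimal sketch under assumptions: binary presence features, IG measured in bits, n-grams of length 1 to 5 inclusive, and hypothetical function names; the exact IG variant used in the experiments is not stated in the text above.

```python
import math
from collections import Counter, defaultdict


def ngrams_up_to(tokens, max_n=5):
    """Return the set of all n-grams of length 1..max_n as space-joined strings."""
    out = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.add(" ".join(tokens[i:i + n]))
    return out


def entropy(counts):
    """Shannon entropy in bits of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """docs: list of n-gram sets. Returns {ngram: IG of its presence/absence}."""
    n_docs = len(docs)
    class_counts = Counter(labels)
    base = entropy(class_counts.values())
    present = defaultdict(Counter)
    for feats, y in zip(docs, labels):
        for f in feats:
            present[f][y] += 1
    gains = {}
    for f, with_f in present.items():
        n_with = sum(with_f.values())
        n_without = n_docs - n_with
        without_f = [class_counts[c] - with_f[c] for c in class_counts]
        cond = (n_with / n_docs) * entropy(with_f.values())
        if n_without:
            cond += (n_without / n_docs) * entropy(without_f)
        gains[f] = base - cond
    return gains


def top_k(gains, k):
    """Highest-IG n-grams first; ties are broken alphabetically."""
    return sorted(gains, key=lambda f: (-gains[f], f))[:k]
```

An n-gram that occurs in every document, or equally often in every class, scores zero, while one that occurs in exactly the documents of one class scores the full class entropy.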
50 Adaptor grammar n-gram models The classification features are the two sets of selected collocations inferred by the adaptor grammars which are the main interest of this paper. [sent-193, score-1.33]
51 AG-POS This first set of the adaptor grammar-inferred features comprises pure PoS n-grams (i. [sent-194, score-0.649]
52 The largest length of n-gram found is 17, but about 97% of the collocations are of length between 2 and 5. [sent-197, score-0.367]
53 AG-POS+FW This second set of the adaptor grammar-inferred features consists of mixtures of PoS and function words (i. [sent-199, score-0.669]
54 The largest length of n-gram found for this set is 19 and the total number of different collocations found is much higher. [sent-202, score-0.335]
55 On the whole, both sets of the collocations inferred by the adaptor grammars perform better than the two baselines. [sent-210, score-1.148]
56 We make the following observations: • Regarding ENUM-POS as a (rough) upper bound, the adaptor grammar AG-POS with a comparable number of features performs almost as well. [sent-211, score-0.717]
57 • Collocations with a mix of PoS and function words do in fact lead to higher accuracy as compared to those of pure PoS (except for the top 200 n-grams); for instance, compare the 2,800 n-grams up to length 5 from the two corresponding sets (71. [sent-213, score-0.166]
58 • Furthermore, the adaptor grammar-inferred collocations with mixtures of PoS and function ... Footnote 5: MegaM software is available on http://www. [sent-217, score-0.669]
59 The best performing of all the models is achieved by combining the mixed PoS and function word collocations with the adaptor grammar-inferred PoS, producing the best accuracy thus far of 75. [sent-235, score-0.328]
60 This demonstrates that features inferred by adaptor grammars do capture some useful information and function words are playing a role. [sent-237, score-0.87]
61 As seen in Table 2, method a works better for the combination of the two adaptor grammar feature sets; whereas method b works better for combining adaptor grammar features with enumerated n-gram features. [sent-239, score-1.47]
62 Using adaptor grammar collocations also outperforms the alternative baseline of adding in function words as unigrams. [sent-240, score-1.045]
63 This demonstrates that our more general PoS plus function word collocations derived from adaptor grammars are indeed useful, and supports the argument of Wang et al. [sent-246, score-1.125]
64 4 Language Model-based Classification In this section, we take a language modeling approach to native language identification; the idea here is to adopt grammatical inference to learn a grammar-based language model to represent the texts written by non-English native users. [sent-248, score-0.575]
65 The grammar learned is then used to predict the most probable native language that a document (a sentence) is associated with. [sent-249, score-0.453]
66 In a sense, we are using a parser-based language model to rank the documents with respect to native language. [sent-250, score-0.274]
67 We take a similar approach to developing a grammatical induction technique, although where they used a standard LDA topic model-based PCFG, we use an adaptor grammar. [sent-256, score-0.682]
68 However, the approach is of interest for a few reasons: because, whereas the adaptor grammar plays an ancillary, feature selection role in Section 3, here the feature selection is an organic part of the approach as per the actual implementation of Hardisty et al. [sent-258, score-0.744]
69 (2010); because adaptor grammars can potentially be extended in a natural way with unlabelled data; and because, for the purposes of this paper, it constitutes a second, quite different way to evaluate the use of n-gram collocations. [sent-259, score-0.797]
70 1 Language Models We derive two adaptor grammar-based language models. [sent-261, score-0.594]
71 The assumption that we make is that each document (each sentence) is a mixture of two sets of topics: one is the native language-specific topic (i. [sent-263, score-0.431]
72 characteristic of the native language) and the other is the generic topic (i. [sent-265, score-0.407]
73 In other words, there are eight topics, representing seven native language groups that are of interest (Bulgarian, Czech, French, Russian, Spanish, Chinese, and Japanese) and the second language English itself. [sent-269, score-0.306]
74 Words → Words Word WWoorrddss → WWoorrdd It should be noted that the two grammars above can in theory be applied to an entire document or on individual sentences. [sent-274, score-0.231]
75 For this present work, we work on the sentence level, as the run-time of the current implementation of the adaptor grammars grows in proportion to the cube of the sentence length. [sent-275, score-0.797]
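A quick back-of-the-envelope calculation shows why cubic growth favours sentence-level processing. The numbers are purely illustrative assumptions (a 500-token document split into 20 sentences of 25 tokens each), and constant factors are ignored, so only the ratio is meaningful.

```python
def relative_parse_cost(lengths):
    """Sum of n**3 over the given sequence lengths (unit-free, constants ignored)."""
    return sum(n ** 3 for n in lengths)


# Hypothetical document: 500 tokens, either as one string or as 20 sentences.
doc_as_one = relative_parse_cost([500])
doc_as_sentences = relative_parse_cost([25] * 20)
ratio = doc_as_one / doc_as_sentences
```

Under these assumptions, treating the document as a single string costs 400 times as much as handling its sentences one at a time.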
76 The reason is that we are now having a form of supervised topic models where the learning process is guided by the native languages. [sent-286, score-0.362]
77 To evaluate the grammars learned, as in Börschinger et al. [sent-288, score-0.203]
78 (2011) we need to slightly modify the grammars above by removing the language identifiers (lang) from the Root rules and then parse the unlabeled sentences using a publicly available CKY parser. [sent-289, score-0.203]
79 7 The predicted native language is inferred from the parse output by reading off the langTopics that the Root is rewritten to. [sent-290, score-0.322]
80 We take that as the most probable native language for a particular test sentence. [sent-291, score-0.274]
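Reading the prediction off a parse amounts to inspecting the children of the Root node. The sketch below assumes a hypothetical representation in which a parse is a nested tuple `(label, child, ...)` and the language-specific non-terminal is named by prefixing the language to `Topics` (for example `JapaneseTopics`); neither the tuple format nor the label spelling comes from the parser used in the paper.

```python
def predicted_language(parse, suffix="Topics"):
    """Return the language named by the first Root child ending in `suffix`.

    parse is a nested tuple (label, child, ...); leaves are plain strings.
    Returns None if no child of Root carries the suffix.
    """
    label, *children = parse
    if label != "Root":
        raise ValueError("expected a parse rooted in Root")
    for child in children:
        child_label = child[0] if isinstance(child, tuple) else child
        if child_label.endswith(suffix):
            return child_label[: -len(suffix)]
    return None
```

For instance, `predicted_language(("Root", ("JapaneseTopics", ("Words", "NN"))))` returns `"Japanese"`.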
81 Nonetheless, the language models over the mixture of PoS and function words appear to be a more suitable representative of our learner corpus as compared to those over purely PoS, confirming the usefulness of integrated function words for the NLI classification task. [sent-319, score-0.178]
82 5 Discussion Here we take a closer look at how well each approach does in identifying the individual native languages. [sent-327, score-0.274]
83 In fact, there is a subtle difference in the experimental setting of the models derived from the two approaches with respect to the adaptor grammar: the number of topics. [sent-337, score-0.594]
84 Under the maxent setting, the number of topics t was set to 25, while we restricted the models with the language modeling approach to only eight topics (seven for the individual native languages and one for the common second language, English). [sent-338, score-0.542]
85 Looking more deeply into the topics themselves reveals that there appear to be at least two out of the 25 topics (from the supervised models) associated with n-grams that are indicative of the native languages, taking Chinese and Japanese as examples (see the associated topics in Table 7). [sent-339, score-0.531]
86 8 Perhaps associating each native language with only one generalised topic is not sufficient. [sent-340, score-0.362]
87 subtrees of collocations derived from the adaptor grammars) are quite different between the two approaches although the total num- ... Footnote 8: Taking the examples from Wong et al. [sent-343, score-0.949]
88 the 25 topics representative of Japanese and Chinese (under maxent setting). [sent-345, score-0.174]
89 For the language modeling ones, a high number of n-grams were associated with the generic topic nullTopic and each language-specific topic langTopic has a lower number of n-grams relative to bi-grams (Table 8) associated with it. [sent-348, score-0.232]
90 For the maxent models, in contrast, the majority of the topics were associated with a higher number of n-grams (Table 9). [sent-349, score-0.202]
91 Nonetheless, the language models inferred discover relevant n-grams that are representative of individual native languages. [sent-351, score-0.349]
92 006: this phenomenon as char- ... Footnote 9: This is quite plausible as there should be quite a number of structures that are representative of native English speakers that are shared by non-native speakers. [sent-359, score-0.274]
93 (a) subcolumns are for n-grams of pure PoS and (b) subcolumns are for n-grams of mixtures of PoS and function words. [sent-361, score-0.214]
94 of pure PoS and (b) subcolumns are for n-grams of mixtures of PoS and function words. [sent-362, score-0.172]
95 (Note that this collocation as well as its pure PoS counterpart PP S S VB are amongst the top n-grams discovered under the maxent setting as seen in Table 7. [sent-365, score-0.162]
96 To investigate further the issue associated with the number of topics under the language modeling setting, we attempted to extend the adaptor grammar with three additional topics that represent the language family of the seven native languages of interest: Slavic, Romance, and Oriental. [sent-367, score-1.212]
97 More specifically, when added to a new baseline presented in this paper, the combined feature set of both types of adaptor grammar inferred collocations produces the best result in the context of using n-grams for NLI. [sent-371, score-1.068]
98 The usefulness of the collocations does vary, however, with the technique used for classification. [sent-372, score-0.303]
99 Future work will involve a broader exploration of the parameter space of the adaptor grammars, in particular the number of topics and the value of α; a look at other non-parametric extensions of PCFGs, such as infinite PCFGs (Liang et al. [sent-373, score-0.661]
100 Using classifier features for studying the effect of native language on the choice of written second language words. [sent-454, score-0.274]
wordName wordTfidf (topN-words)
[('adaptor', 0.594), ('collocations', 0.303), ('native', 0.274), ('langtopics', 0.25), ('grammars', 0.203), ('nli', 0.195), ('vpos', 0.139), ('koppel', 0.13), ('grammar', 0.123), ('langtopic', 0.111), ('pos', 0.108), ('maxent', 0.107), ('wong', 0.104), ('fw', 0.093), ('topic', 0.088), ('dras', 0.084), ('docj', 0.083), ('nulltopic', 0.083), ('hardisty', 0.076), ('topics', 0.067), ('japanese', 0.067), ('ig', 0.066), ('productions', 0.064), ('pcfg', 0.06), ('icle', 0.06), ('classification', 0.059), ('russian', 0.056), ('bulgarian', 0.056), ('topici', 0.056), ('pure', 0.055), ('johnson', 0.054), ('mix', 0.054), ('adapted', 0.052), ('subtrees', 0.052), ('mixtures', 0.05), ('orschinger', 0.05), ('inferred', 0.048), ('wwoorrdd', 0.048), ('pcfgs', 0.046), ('spanish', 0.045), ('characteristic', 0.045), ('chinese', 0.043), ('bigrams', 0.043), ('estival', 0.042), ('subcolumns', 0.042), ('tsur', 0.042), ('wwoorrddss', 0.042), ('mixture', 0.041), ('lda', 0.041), ('lang', 0.04), ('priors', 0.038), ('french', 0.038), ('jojo', 0.036), ('enumerated', 0.036), ('topical', 0.035), ('authorship', 0.032), ('palestinian', 0.032), ('seven', 0.032), ('enumerate', 0.032), ('length', 0.032), ('expanding', 0.031), ('rough', 0.03), ('mark', 0.029), ('aa', 0.029), ('tackled', 0.028), ('learner', 0.028), ('bitter', 0.028), ('familytopic', 0.028), ('granger', 0.028), ('lemons', 0.028), ('mechanics', 0.028), ('misclassifications', 0.028), ('moshe', 0.028), ('mraesm', 0.028), ('oriental', 0.028), ('permitting', 0.028), ('pfsnoel', 0.028), ('topi', 0.028), ('wrdosr', 0.028), ('wwoorrdds', 0.028), ('associated', 0.028), ('document', 0.028), ('restaurant', 0.028), ('lengths', 0.027), ('texts', 0.027), ('discover', 0.027), ('ture', 0.027), ('languages', 0.027), ('perspective', 0.025), ('function', 0.025), ('grounded', 0.025), ('author', 0.025), ('nn', 0.025), ('czech', 0.024), ('expands', 0.024), ('ishikawa', 0.024), ('cience', 0.024), ('ohns', 0.024), ('rappoport', 0.024), ('soccer', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification
2 0.12328219 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes
Author: Robert Lindsey ; William Headden ; Michael Stipicevic
Abstract: Topic models traditionally rely on the bag-of-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.
3 0.09506043 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model
Author: Lan Du ; Wray Buntine ; Huidong Jin
Abstract: Topic models are increasingly being used for text analysis tasks, often times replacing earlier semantic techniques such as latent semantic analysis. In this paper, we develop a novel adaptive topic model with the ability to adapt topics from both the previous segment and the parent document. For this proposed model, a Gibbs sampler is developed for doing posterior inference. Experimental results show that with topic adaptation, our model significantly improves over existing approaches in terms of perplexity, and is able to uncover clear sequential structure on, for example, Herman Melville’s book “Moby Dick”.
4 0.090285257 126 emnlp-2012-Training Factored PCFGs with Expectation Propagation
Author: David Hall ; Dan Klein
Abstract: PCFGs can grow exponentially as additional annotations are added to an initially simple base grammar. We present an approach where multiple annotations coexist, but in a factored manner that avoids this combinatorial explosion. Our method works with linguistically motivated annotations, induced latent structure, lexicalization, or any mix of the three. We use a structured expectation propagation algorithm that makes use of the factored structure in two ways. First, by partitioning the factors, it speeds up parsing exponentially over the unfactored approach. Second, it minimizes the redundancy of the factors during training, improving accuracy over an independent approach. Using purely latent variable annotations, we can efficiently train and parse with up to 8 latent bits per symbol, achieving F1 scores up to 88.4 on the Penn Treebank while using two orders of magnitude fewer parameters compared to the naïve approach. Combining latent, lexicalized, and unlexicalized annotations, our best parser gets 89.4 F1 on all sentences from section 23 of the Penn Treebank.
5 0.077822395 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure
Author: Song Feng ; Ritwik Banerjee ; Yejin Choi
Abstract: Much of the writing styles recognized in rhetorical and composition theories involve deep syntactic elements. However, most previous research for computational stylometric analysis has relied on shallow lexico-syntactic patterns. Some very recent work has shown that PCFG models can detect distributional difference in syntactic styles, but without offering much insights into exactly what constitute salient stylistic elements in sentence structure characterizing each authorship. In this paper, we present a comprehensive exploration of syntactic elements in writing styles, with particular emphasis on interpretable characterization of stylistic elements. We present analytic insights with respect to the authorship attribution task in two different domains.
6 0.074533984 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules
7 0.071099736 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics
8 0.069875665 106 emnlp-2012-Part-of-Speech Tagging for Chinese-English Mixed Texts with Dynamic Features
9 0.069845498 130 emnlp-2012-Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars
10 0.069141395 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing
11 0.068443671 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
12 0.065392807 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
13 0.063941643 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model
14 0.051042244 28 emnlp-2012-Collocation Polarity Disambiguation Using Web-based Pseudo Contexts
15 0.05093107 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling
16 0.048630614 81 emnlp-2012-Learning to Map into a Universal POS Tagset
17 0.048366081 133 emnlp-2012-Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision
18 0.046497811 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories
19 0.045154538 19 emnlp-2012-An Entity-Topic Model for Entity Linking
20 0.044921849 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming
topicId topicWeight
[(0, 0.169), (1, -0.026), (2, 0.095), (3, 0.08), (4, -0.118), (5, 0.104), (6, -0.026), (7, -0.079), (8, -0.076), (9, -0.019), (10, -0.02), (11, 0.039), (12, -0.009), (13, 0.11), (14, -0.048), (15, -0.03), (16, -0.036), (17, 0.077), (18, -0.069), (19, -0.113), (20, -0.014), (21, 0.116), (22, 0.056), (23, 0.087), (24, -0.182), (25, 0.08), (26, 0.109), (27, 0.212), (28, -0.021), (29, -0.002), (30, -0.058), (31, -0.094), (32, 0.061), (33, 0.015), (34, -0.121), (35, -0.147), (36, -0.058), (37, -0.205), (38, -0.045), (39, -0.047), (40, -0.041), (41, -0.02), (42, -0.1), (43, 0.031), (44, -0.061), (45, 0.045), (46, -0.168), (47, -0.047), (48, 0.013), (49, 0.147)]
simIndex simValue paperId paperTitle
same-paper 1 0.94131923 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification
2 0.5260216 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure
Author: Song Feng ; Ritwik Banerjee ; Yejin Choi
Abstract: Much of the writing styles recognized in rhetorical and composition theories involve deep syntactic elements. However, most previous research for computational stylometric analysis has relied on shallow lexico-syntactic patterns. Some very recent work has shown that PCFG models can detect distributional difference in syntactic styles, but without offering much insights into exactly what constitute salient stylistic elements in sentence structure characterizing each authorship. In this paper, we present a comprehensive exploration of syntactic elements in writing styles, with particular emphasis on interpretable characterization of stylistic elements. We present analytic insights with respect to the authorship attribution task in two different domains.
3 0.47992584 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes
Author: Robert Lindsey ; William Headden ; Michael Stipicevic
Abstract: Topic models traditionally rely on the bag-of-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.
4 0.47422144 130 emnlp-2012-Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars
Author: Kewei Tu ; Vasant Honavar
Abstract: We introduce a novel approach named unambiguity regularization for unsupervised learning of probabilistic natural language grammars. The approach is based on the observation that natural language is remarkably unambiguous in the sense that only a tiny portion of the large number of possible parses of a natural language sentence are syntactically valid. We incorporate an inductive bias into grammar learning in favor of grammars that lead to unambiguous parses on natural language sentences. The resulting family of algorithms includes the expectation-maximization algorithm (EM) and its variant, Viterbi EM, as well as a so-called softmax-EM algorithm. The softmax-EM algorithm can be implemented with a simple and computationally efficient extension to standard EM. In our experiments of unsupervised dependency grammar learning, we show that unambiguity regularization is beneficial to learning, and in combination with annealing (of the regularization strength) and sparsity priors it leads to improvement over the current state of the art.
5 0.42434418 133 emnlp-2012-Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision
Author: Joohyun Kim ; Raymond Mooney
Abstract: “Grounded” language learning employs training data in the form of sentences paired with relevant but ambiguous perceptual contexts. Börschinger et al. (2011) introduced an approach to grounded language learning based on unsupervised PCFG induction. Their approach works well when each sentence potentially refers to one of a small set of possible meanings, such as in the sportscasting task. However, it does not scale to problems with a large set of potential meanings for each sentence, such as the navigation instruction following task studied by Chen and Mooney (2011). This paper presents an enhancement of the PCFG approach that scales to such problems with highly-ambiguous supervision. Experimental results on the navigation task demonstrate the effectiveness of our approach.
6 0.41554227 126 emnlp-2012-Training Factored PCFGs with Expectation Propagation
7 0.36179331 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
8 0.3604081 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model
9 0.35614881 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
10 0.33858794 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model
11 0.30498564 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules
12 0.27706277 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics
13 0.25723267 106 emnlp-2012-Part-of-Speech Tagging for Chinese-English Mixed Texts with Dynamic Features
14 0.25317407 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing
15 0.24512853 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media
16 0.23063971 131 emnlp-2012-Unified Dependency Parsing of Chinese Morphological and Syntactic Structures
17 0.20879275 118 emnlp-2012-Source Language Adaptation for Resource-Poor Machine Translation
18 0.19257045 61 emnlp-2012-Grounded Models of Semantic Representation
19 0.18936212 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage
20 0.18926707 46 emnlp-2012-Exploiting Reducibility in Unsupervised Dependency Parsing
topicId topicWeight
[(2, 0.014), (16, 0.033), (25, 0.013), (29, 0.01), (34, 0.051), (45, 0.016), (60, 0.446), (63, 0.065), (64, 0.018), (65, 0.013), (70, 0.017), (73, 0.017), (74, 0.073), (76, 0.06), (79, 0.011), (80, 0.01), (86, 0.02), (95, 0.023)]
simIndex simValue paperId paperTitle
1 0.98822969 58 emnlp-2012-Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs
Author: Aurelien Max ; Houda Bouamor ; Anne Vilnat
Abstract: This paper describes a study on the impact of the original signal (text, speech, visual scene, event) of a text pair on the task of both manual and automatic sub-sentential paraphrase acquisition. A corpus of 2,500 annotated sentences in English and French is described, and performance on this corpus is reported for an efficient system combination exploiting a large set of features for paraphrase recognition. A detailed quantified typology of subsentential paraphrases found in our corpus types is given.
2 0.98705429 84 emnlp-2012-Linking Named Entities to Any Database
Author: Avirup Sil ; Ernest Cronin ; Penghai Nie ; Yinfei Yang ; Ana-Maria Popescu ; Alexander Yates
Abstract: Existing techniques for disambiguating named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. This paper introduces a new task, called Open-Database Named-Entity Disambiguation (Open-DB NED), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. We introduce two techniques for Open-DB NED, one based on distant supervision and the other based on domain adaptation. In experiments on two domains, one with poor coverage by Wikipedia and the other with near-perfect coverage, our Open-DB NED strategies outperform a state-of-the-art Wikipedia NED system by over 25% in accuracy.
3 0.98517507 61 emnlp-2012-Grounded Models of Semantic Representation
Author: Carina Silberer ; Mirella Lapata
Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.
4 0.98473608 41 emnlp-2012-Entity based QA Retrieval
Author: Amit Singh
Abstract: Bridging the lexical gap between the user’s question and the question-answer pairs in the Q&A archives has been a major challenge for Q&A retrieval. State-of-the-art approaches address this issue by implicitly expanding the queries with additional words using statistical translation models. While useful, the effectiveness of these models is highly dependent on the availability of quality corpus in the absence of which they are troubled by noise issues. Moreover these models perform word based expansion in a context agnostic manner resulting in translation that might be mixed and fairly general. This results in degraded retrieval performance. In this work we address the above issues by extending the lexical word based translation model to incorporate semantic concepts (entities). We explore strategies to learn the translation probabilities between words and the concepts using the Q&A archives and a popular entity catalog. Experiments conducted on a large scale real data show that the proposed techniques are promising.
Author: Wenbin Jiang ; Fandong Meng ; Qun Liu ; Yajuan Lu
Abstract: In this paper we first describe the technology of automatic annotation transformation, which is based on the annotation adaptation algorithm (Jiang et al., 2009). It can automatically transform a human-annotated corpus from one annotation guideline to another. We then propose two optimization strategies, iterative training and predict-self reestimation, to further improve the accuracy of annotation guideline transformation. Experiments on Chinese word segmentation show that the iterative training strategy together with predict-self reestimation brings significant improvement over the simple annotation transformation baseline, and leads to classifiers with significantly higher accuracy and several times faster processing than annotation adaptation does. On the Penn Chinese Treebank 5.0, it achieves an F-measure of 98.43%, significantly outperforming previous works although using a single classifier with only local features.
same-paper 6 0.97671771 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification
7 0.91418868 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities
8 0.89935261 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation
9 0.88072568 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing
10 0.8777104 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
11 0.87642819 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes
12 0.86964023 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?
13 0.86555797 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction
14 0.86299658 19 emnlp-2012-An Entity-Topic Model for Entity Linking
15 0.86136729 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction
16 0.84883595 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
17 0.84324175 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
18 0.83867669 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
19 0.83148372 72 emnlp-2012-Joint Inference for Event Timeline Construction
20 0.83090448 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP
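The listings above pair each paper with sparse (topicId, topicWeight) vectors and rank other papers by a simValue. This page does not say how simValue is computed. As a minimal sketch, assuming cosine similarity over those sparse vectors (a common choice, but an assumption here), a ranking of this kind could be produced as follows. The function names `cosine_similarity` and `rank_similar`, and the `other_paper` vector, are hypothetical and used only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as (topicId, weight) pairs."""
    wa, wb = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * wb[t] for t, w in wa.items() if t in wb)
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_similar(query, corpus, top_n=20):
    """Return the top_n (score, paper_id) pairs, highest similarity first."""
    scored = [(cosine_similarity(query, vec), pid) for pid, vec in corpus.items()]
    scored.sort(reverse=True)
    return scored[:top_n]


# A few of the topic weights listed above for paper 48 (truncated for brevity).
paper_48 = [(2, 0.014), (16, 0.033), (34, 0.051), (60, 0.446), (74, 0.073)]
# Hypothetical vector for another paper; these numbers are made up.
other_paper = [(16, 0.020), (60, 0.400), (76, 0.050)]

if __name__ == "__main__":
    print(rank_similar(paper_48, {"other": other_paper, "self": paper_48}))
```

Storing each vector as a dictionary keeps the computation proportional to the number of non-zero topics, which suits topic distributions where most weights are omitted from the listing.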