acl acl2011 acl2011-267 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel de Kok ; Barbara Plank ; Gertjan van Noord
Abstract: An attractive property of attribute-value grammars is their reversibility. Attribute-value grammars are usually coupled with separate statistical components for parse selection and fluency ranking. We propose reversible stochastic attribute-value grammars, in which a single statistical model is employed both for parse selection and fluency ranking.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract An attractive property of attribute-value grammars is their reversibility. [sent-6, score-0.051]
2 Attribute-value grammars are usually coupled with separate statistical components for parse selection and fluency ranking. [sent-7, score-0.381]
3 We propose reversible stochastic attribute-value grammars, in which a single statistical model is employed both for parse selection and fluency ranking. [sent-8, score-1.127]
4 1 Introduction Reversible grammars were introduced as early as 1975 by Martin Kay (1975). [sent-9, score-0.051]
5 In the eighties, the popularity of attribute-value grammars (AVG) was in part motivated by their inherent reversible nature. [sent-10, score-0.708]
6 Later, AVG were enriched with a statistical component (Abney, 1997): stochastic AVG (SAVG). [sent-11, score-0.142]
7 Training a SAVG is feasible if a stochastic model is assumed which is conditioned on the input sentences (Johnson et al. [sent-12, score-0.199]
8 , 2002; van Noord and Malouf, 2005; Miyao and Tsujii, 2005; Clark and Curran, 2004; Forst, 2007). [sent-16, score-0.132]
9 SAVG can be applied for generation to select the most fluent realization from the set of possible realizations (Velldal et al. [sent-17, score-0.201]
10 In this case, the stochastic model is conditioned on the input logical forms. [sent-19, score-0.276]
11 If an AVG is applied both to parsing and generation, two distinct stochastic components are required, one for parsing, and one for generation. [sent-22, score-0.199]
12 For instance, features that represent aspects of the surface word order are important for generation, but irrelevant for parsing. [sent-31, score-0.083]
13 Similarly, features which describe aspects of the logical form are important for parsing, but irrelevant for generation. [sent-32, score-0.16]
14 For instance, for Dutch, a very effective feature signals a direct object NP in fronted position in main clauses. [sent-34, score-0.065]
15 If a main clause is parsed which starts with an NP, the disambiguation component will favor a subject reading of that NP. [sent-35, score-0.168]
16 In generation, the fluency component will favor subject fronting over object fronting. [sent-36, score-0.405]
17 In this paper we propose reversible SAVG, in which a single stochastic component is applied both in parsing and generation. [sent-38, score-0.886]
18 We provide experimental evidence that such reversible SAVG achieve performance similar to their directional counterparts. [sent-39, score-0.856]
19 A single, reversible model is to be preferred over two distinct models because it explains why preferences in a disambiguation component and a fluency component, such as the preference for subject fronting over object fronting, are shared. [sent-40, score-1.027]
20 A single, reversible model is furthermore of practical interest for its simplicity, compactness, and maintainability. [sent-41, score-0.685]
21 As an important additional advantage, reversible models are applicable for tasks which combine aspects of parsing and generation, such as word-graph parsing and paraphrasing. [sent-42, score-0.863]
22 In situations where only a small amount of training data is available for parsing or generation, cross-pollination improves the performance. [sent-43, score-0.109]
23 If preferences are shared between parsing and generation, it follows that a generator could benefit from parsing data and vice versa. [sent-46, score-0.275]
24 We present experimental results indicating that in such a bootstrap scenario a reversible model achieves better performance. [sent-47, score-0.685]
25 2 Reversible SAVG As Abney (1997) shows, we cannot use relatively simple techniques such as relative frequencies to obtain a model for estimating derivation probabilities in attribute-value grammars. [sent-48, score-0.094]
26 As an alternative, he proposes a maximum entropy model, where the probability of a derivation d is defined as: p(d) = (1/Z) exp(Σ_i λ_i f_i(d)) (1), where f_i(d) is the frequency of feature f_i in derivation d. [sent-49, score-0.287]
27 In (1), Z is a normalizer, defined as follows, where Ω is the set of derivations defined by the grammar: Z = Σ_{d′∈Ω} exp(Σ_i λ_i f_i(d′)) (2). Training this model requires access to all derivations Ω allowed by the grammar, which makes it hard to implement the model in practice. [sent-51, score-0.467]
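Equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative toy, not Abney's actual estimator: derivations are hypothetical feature-count dicts, and the normalizer Z is computed by brute-force enumeration, which is exactly what makes the unconditional model impractical for a real grammar.

```python
import math

def maxent_prob(deriv, all_derivs, weights):
    """p(d) = exp(sum_i w_i * f_i(d)) / Z, with Z summing over every
    derivation the grammar allows (eqs. 1-2). Each derivation is a
    feature-count dict {feature_name: count}."""
    def score(d):
        return math.exp(sum(weights.get(f, 0.0) * n for f, n in d.items()))
    z = sum(score(d) for d in all_derivs)  # intractable for a full grammar
    return score(deriv) / z

# toy example: two derivations, one weighted (hypothetical) feature
weights = {"subj_fronted": 1.0}
d1, d2 = {"subj_fronted": 1}, {"subj_fronted": 0}
p1 = maxent_prob(d1, [d1, d2], weights)  # e / (e + 1)
```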
Johnson et al. (1999) alleviate this problem by proposing a model which conditions on the input sentence s: p(d|s). [sent-53, score-0.058]
29 Since the number of derivations for a given sentence s is usually finite, the calculation [sent-54, score-0.186]
30 of the normalizer is much more practical. [sent-55, score-0.039]
31 Conversely, in generation the model is conditioned on the input logical form l,p(d|l) (Velldal et al. [sent-56, score-0.246]
32 If X is the set of inputs (for parsing, all sentences in the treebank; for generation, all logical forms), then we have: E_p(f_i) − E_p̃(f_i) = 0, i.e. Σ_{x∈X} Σ_{d∈Ω(x)} [p̃(x) p(d|x) f_i(x,d) − p̃(x,d) f_i(x,d)] = 0 (5). Here we assume a uniform distribution for p̃(x). [sent-60, score-0.099]
33 Let j(d) be a function which returns 0 if the derivation d is inconsistent with the treebank, and 1 in case the derivation is correct. [sent-61, score-0.132]
34 Since parsing and generation both create derivations that are in agreement with the constraints implied by the input, a single model can accompany the attribute-value grammar. [sent-63, score-0.409]
35 Such a model estimates the probability of a derivation d given a set of constraints c, p(d|c). [sent-64, score-0.12]
36 We estimate p(d|c) as: p(d|c) = (1/Z(c)) exp(Σ_i λ_i f_i(c,d)) (7), with Z(c) = Σ_{d′∈Ω(c)} exp(Σ_i λ_i f_i(c,d′)) (8). We derive a reversible model by training on data for parse disambiguation and fluency ranking simultaneously. [sent-66, score-1.235]
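Equations (7) and (8) amount to a softmax over the derivations Ω(c) licensed by the constraints c. A minimal sketch in Python, with hypothetical feature names; the point is that one and the same weight vector scores Ω(sentence) when parsing and Ω(logical form) when generating.

```python
import math

def conditional_prob(deriv, omega_c, weights):
    """p(d|c) = exp(sum_i w_i * f_i(c, d)) / Z(c), where Z(c) normalizes
    only over Omega(c), the derivations compatible with the constraints c
    (a sentence when parsing, a logical form when generating)."""
    def dot(d):
        return sum(weights.get(f, 0.0) * n for f, n in d.items())
    # log-sum-exp for numerical stability
    m = max(dot(d) for d in omega_c)
    log_z = m + math.log(sum(math.exp(dot(d) - m) for d in omega_c))
    return math.exp(dot(deriv) - log_z)

# the same (hypothetical) weights rank parses and realizations alike
weights = {"subj_fronted": 0.5, "obj_fronted": -0.5}
parses = [{"subj_fronted": 1}, {"obj_fronted": 1}]        # Omega(sentence)
realizations = [{"subj_fronted": 1}, {"obj_fronted": 1}]  # Omega(logical form)
```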
37 In contrast to directional models, we impose the two constraints per feature given in figure 1: one on the feature value with respect to the sentences S in the parse disambiguation treebank and the other on the feature value with respect to logical forms L in the fluency ranking treebank. [sent-67, score-0.994]
38 As a result of the constraints on training defined in figure 1, the feature weights in the reversible model distinguish, at the same time, good parses from bad parses as well as good realizations from bad realizations. [sent-68, score-0.863]
39 3 Experimental setup and evaluation To evaluate reversible SAVG, we conduct experiments in the context of the Alpino system for Dutch. [sent-69, score-0.657]
40 Σ_{s∈S} Σ_{d∈Ω(s)} [p̃(s) p(d|c=s) f_i(s,d) − p̃(c=s,d) f_i(s,d)] = 0 and Σ_{l∈L} Σ_{d∈Ω(l)} [p̃(l) p(d|c=l) f_i(l,d) − p̃(c=l,d) f_i(l,d)] = 0. Figure 1: Constraints imposed on feature values for training reversible models p(d|c). [sent-70, score-0.715]
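The two constraints of Figure 1 are satisfied when, on each treebank, the empirical feature counts equal the model's expected counts. A hedged sketch of the resulting gradient for a single weight vector over both data sets (toy data structures with hypothetical feature names; real training would add a prior and iterate with an optimizer such as L-BFGS):

```python
import math

def _probs(omega_c, weights):
    # p(d|c) under the log-linear model, normalized over Omega(c) only
    scores = [sum(weights.get(f, 0.0) * n for f, n in d.items()) for d in omega_c]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def reversible_gradient(weights, parse_data, gen_data):
    """Empirical-minus-expected feature counts, accumulated over BOTH the
    parse-disambiguation treebank and the fluency-ranking treebank, so a
    single weight vector is pushed toward satisfying both constraints of
    Figure 1. Each data item is (Omega(c), gold_feature_counts)."""
    grad = {}
    for omega_c, gold in list(parse_data) + list(gen_data):
        for d, p in zip(omega_c, _probs(omega_c, weights)):
            for f, n in d.items():
                grad[f] = grad.get(f, 0.0) - p * n  # minus expected counts
        for f, n in gold.items():
            grad[f] = grad.get(f, 0.0) + n          # plus empirical counts
    return grad
```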
41 Recently, a sentence realizer has been added that uses the same grammar and lexicon (de Kok and van Noord, 2010). [sent-72, score-0.162]
42 In the experiments, the cdbl part of the Alpino Treebank (van der Beek et al. [sent-73, score-0.072]
43 1 Features The features that we use in the experiment are the same features which are available in the Alpino parser and generator. [sent-78, score-0.056]
44 Two word adjacency features are used as auxiliary distributions (Johnson and Riezler, 2000). [sent-81, score-0.069]
45 The first feature is the probability of the sentence according to a word trigram model. [sent-82, score-0.068]
46 The second feature is the probability of the sentence according to a tag trigram model that uses the part-of-speech tags assigned by the Alpino system. [sent-83, score-0.096]
47 In conventional parsing tasks, the value of the word trigram model is the same for all derivations of a given input sentence. [sent-86, score-0.363]
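An auxiliary-distribution feature of this kind can be sketched as a single real-valued feature whose value is the log-probability of the whole sentence under an external n-gram model. The `trigram_logprob` lookup below is a hypothetical stand-in for the actual trigram model:

```python
def trigram_feature(tokens, trigram_logprob):
    """Auxiliary-distribution feature: its value is the log-probability of
    the sentence under an external trigram model. It is constant across
    parses of one sentence, but discriminates between competing
    realizations in generation."""
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    return sum(trigram_logprob(tuple(padded[i - 2 : i + 1]))
               for i in range(2, len(padded)))
```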
48 Lexical analysis is applied during parsing to find all possible subcategorization frames for the tokens in the input sentence. [sent-88, score-0.154]
49 Since some frames occur more frequently in good parses than others, we use feature templates that record the frames that were used in a parse. [sent-89, score-0.2]
50 We also use an auxiliary distribution of word and frame combinations that was trained on a large corpus of automatically annotated sentences (436 million words). [sent-91, score-0.073]
51 The values of lexical frame features are constant for all derivations in sentence realization, unless the frame is not specified in the logical form. [sent-92, score-0.355]
52 There are also feature templates which describe aspects of the dependency structure. [sent-94, score-0.147]
53 For each dependency, three types of dependency features are extracted. [sent-95, score-0.085]
54 Examples of such features are ”a pronoun is used as the subject of a verb”, ”the pronoun ’she’ is used as the subject of a verb”, ”the noun ’beer’ is used as the object of the verb ’drink’”. [sent-96, score-0.117]
55 In addition, features are used which implement auxiliary distributions for selectional preferences, as described in Van Noord (2007). [sent-97, score-0.069]
56 In conventional realization tasks, the values of these features are constant for all derivations for a given input representation. [sent-98, score-0.306]
57 Syntactic features include features which record the application of each grammar rule, as well as features which record the application of a rule in the context of another rule. [sent-100, score-0.188]
58 An example of the latter is ’rule 167 is used to construct the second daughter of a derivation constructed by rule 233’ . [sent-101, score-0.066]
59 In addition, there are features describing more complex syntactic patterns such as: fronting of subjects and other noun phrases, orderings in the middle field, long-distance dependencies, and parallelism of conjuncts in coordination. [sent-102, score-0.112]
60 2 Parse disambiguation Earlier we assumed that a treebank is a set of correct derivations. [sent-104, score-0.164]
61 In practice, however, a treebank only contains an abstraction of such derivations (in our case sentences with corresponding dependency structures), thus abstracting away from syntactic details needed in a parse disambiguation model. [sent-105, score-0.505]
62 As in Osborne (2000), the derivations for the parse disambiguation model are created by parsing the training corpus. [sent-106, score-0.421]
63 In the current setting, at most 3000 derivations are created for every sentence. [sent-107, score-0.186]
64 These derivations are then compared to the gold standard dependency structure to judge the quality of the parses. [sent-108, score-0.243]
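A simplified stand-in for this comparison scores each derivation by the F-score of its named dependency triples against the gold standard (the exact concept-accuracy metric used in the experiments differs in details):

```python
def dependency_fscore(candidate, gold):
    """Harmonic mean of precision and recall over named dependency triples
    (head, relation, dependent): a simplified sketch of the score used to
    judge the quality of the up-to-3000 derivations per sentence."""
    cand, ref = set(candidate), set(gold)
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(cand), overlap / len(ref)
    return 2 * prec * rec / (prec + rec)
```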
65 3 Fluency ranking For fluency ranking we also need access to full derivations. [sent-111, score-0.412]
66 To ensure that the system is able to generate from the dependency structures in the treebank, we parse the corresponding sentence, and select the parse with the dependency structure that corresponds most closely to the dependency structure in the treebank. [sent-112, score-0.394]
67 The resulting dependency structures are fed into the Alpino chart generator to construct derivations for each dependency structure. [sent-113, score-0.367]
68 The derivations for which the corresponding sentences are closest to the original sentence in the treebank are marked correct. [sent-114, score-0.242]
69 Due to a limit on generation time, some longer sentences and corresponding dependency structures were excluded from the data. [sent-115, score-0.166]
70 To compare a realization to the correct sentence, we use the General Text Matcher (GTM) method (Melamed et al. [sent-118, score-0.062]
71 A feature f partitions Ω(c) if there are derivations d and d′ in Ω(c) such that f(c,d) ≠ f(c,d′). [sent-124, score-0.246]
72 A feature is used if it partitions the informative sample of Ω(c) for at least two c. [sent-125, score-0.06]
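The partitioning criterion can be sketched as a filter over feature-count dicts; `samples` and the feature names below are hypothetical:

```python
from collections import defaultdict

def informative_features(samples):
    """Keep a feature only if it partitions Omega(c) for at least two
    inputs c, i.e. it takes two or more distinct values among the
    derivations of at least two different inputs. `samples` maps each
    input to its list of feature-count dicts."""
    partitions = defaultdict(int)
    for derivs in samples.values():
        feats = set().union(*(d.keys() for d in derivs))
        for f in feats:
            if len({d.get(f, 0) for d in derivs}) > 1:
                partitions[f] += 1
    return {f for f, n in partitions.items() if n >= 2}
```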
73 1 Parse disambiguation Table 2 shows the results for parse disambiguation. [sent-129, score-0.206]
74 The table also provides lower and upper bounds: the baseline model selects an arbitrary parse per sentence; the oracle chooses the best available parse. [sent-130, score-0.155]
75 Figure 2 shows the learning curves for the directional parsing model and the reversible model. [sent-131, score-1.003]
Table 2: Concept Accuracy scores and f-scores in terms of named dependency relations for the parsing-specific model versus the reversible model. [sent-140, score-0.742]
77 The results show that the general, reversible model comes very close to the accuracy obtained by the dedicated, parsing-specific model. [sent-141, score-0.115]
78 2 Fluency ranking Table 3 compares the reversible model with a directional fluency ranking model. [sent-145, score-1.296]
79 Figure 3 shows the learning curves for the directional generation model and the reversible model. [sent-146, score-0.998]
80 The reversible model achieves performance similar to the directional model (the difference is not significant). [sent-147, score-0.912]
81 To show that a reversible model can actually profit from mutually shared features, we report on an experiment where only a small amount of generation training data is available.1 [sent-148, score-0.767]
1 http://github.com/danieldk/tinyest
Figure 2: Learning curve for directional and reversible models for parsing. [sent-149, score-0.976]
83 The reversible model uses all training data for generation. [sent-150, score-0.707]
Table 3: General Text Matcher scores for fluency ranking using various models. [sent-155, score-0.322]
85 In this experiment, we manually annotated 234 dependency structures from the cdbl part of the Alpino Treebank, by adding correct realizations. [sent-157, score-0.132]
86 We then used this data to train a directional fluency ranking model and a reversible model. [sent-159, score-1.206]
87 Since the reversible model outperforms the directional model, we conclude that fluency ranking indeed benefits from parse disambiguation data. [sent-161, score-1.44]
Table 4: Fluency ranking using a small amount of annotated fluency ranking training data (difference is significant at p < 0. [sent-164, score-0.434]
Figure 3: Learning curves for directional and reversible models for generation. [sent-166, score-0.992]
90 The reversible model uses all training data for parsing. [sent-167, score-0.679]
91 5 Conclusion We proposed reversible SAVG as an alternative to directional SAVG, based on the observation that syntactic preferences are shared between parse disambiguation and fluency ranking. [sent-168, score-1.247]
92 This framework is not purely of theoretical interest, since the experiments show that reversible models achieve accuracies that are similar to those of directional models. [sent-169, score-0.856]
93 Moreover, we showed that a fluency ranking model trained on a small data set can be improved by complementing it with parse disambiguation data. [sent-170, score-0.556]
94 The integration of knowledge from parse disambiguation and fluency ranking could be beneficial for tasks which combine aspects of parsing and generation, such as word-graph parsing or paraphrasing. [sent-171, score-0.734]
95 Stochastic realisation ranking for a free word order language. [sent-178, score-0.09]
96 Filling statistics with linguistics: property design for the disambiguation of german lfg parses. [sent-200, score-0.108]
97 Probabilistic models for disambiguation of an hpsg-based chart generator. [sent-225, score-0.108]
98 Estimation of stochastic attributevalue grammars using an informative sample. [sent-234, score-0.163]
99 Learning to parse natural language with maximum entropy models. [sent-238, score-0.125]
100 Leonoor van der Beek, Gosse Bouma, Robert Malouf, and Gertjan van Noord. [sent-253, score-0.288]
wordName wordTfidf (topN-words)
[('reversible', 0.657), ('savg', 0.238), ('fluency', 0.232), ('directional', 0.199), ('derivations', 0.186), ('noord', 0.182), ('alpino', 0.168), ('gertjan', 0.135), ('van', 0.132), ('ifi', 0.116), ('stochastic', 0.112), ('disambiguation', 0.108), ('parse', 0.098), ('velldal', 0.097), ('fi', 0.092), ('ranking', 0.09), ('parsing', 0.087), ('fronting', 0.084), ('generation', 0.082), ('logical', 0.077), ('kok', 0.073), ('derivation', 0.066), ('expx', 0.063), ('realization', 0.062), ('preferences', 0.061), ('avg', 0.059), ('gtm', 0.058), ('rug', 0.058), ('groningen', 0.058), ('dependency', 0.057), ('treebank', 0.056), ('cahill', 0.054), ('grammars', 0.051), ('riezler', 0.049), ('morristown', 0.049), ('beek', 0.048), ('cdbl', 0.048), ('lassy', 0.048), ('matcher', 0.048), ('nakanishi', 0.048), ('trouw', 0.048), ('miyao', 0.048), ('johnson', 0.045), ('nj', 0.044), ('expxi', 0.042), ('forst', 0.042), ('auxiliary', 0.041), ('generator', 0.04), ('dani', 0.039), ('aoife', 0.039), ('normalizer', 0.039), ('clin', 0.039), ('frames', 0.037), ('record', 0.037), ('gosse', 0.036), ('malouf', 0.036), ('feature', 0.036), ('martin', 0.035), ('tlt', 0.034), ('oepen', 0.034), ('stefan', 0.034), ('treebanks', 0.033), ('trigram', 0.032), ('aspects', 0.032), ('curves', 0.032), ('realizations', 0.032), ('melamed', 0.032), ('frame', 0.032), ('parses', 0.031), ('ronald', 0.031), ('hpsg', 0.031), ('subject', 0.03), ('input', 0.03), ('stephan', 0.03), ('component', 0.03), ('grammar', 0.03), ('conditioned', 0.029), ('object', 0.029), ('oracle', 0.029), ('abney', 0.029), ('netherlands', 0.028), ('erik', 0.028), ('features', 0.028), ('model', 0.028), ('entropy', 0.027), ('iwpt', 0.027), ('yusuke', 0.027), ('structures', 0.027), ('constraints', 0.026), ('fluent', 0.025), ('partitions', 0.024), ('der', 0.024), ('osborne', 0.024), ('stuart', 0.024), ('theories', 0.023), ('xi', 0.023), ('irrelevant', 0.023), ('training', 0.022), ('templates', 0.022), ('inputs', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 267 acl-2011-Reversible Stochastic Attribute-Value Grammars
Author: Daniel de Kok ; Barbara Plank ; Gertjan van Noord
Abstract: An attractive property of attribute-value grammars is their reversibility. Attribute-value grammars are usually coupled with separate statistical components for parse selection and fluency ranking. We propose reversible stochastic attribute-value grammars, in which a single statistical model is employed both for parse selection and fluency ranking.
2 0.12611899 317 acl-2011-Underspecifying and Predicting Voice for Surface Realisation Ranking
Author: Sina Zarriess ; Aoife Cahill ; Jonas Kuhn
Abstract: This paper addresses a data-driven surface realisation model based on a large-scale reversible grammar of German. We investigate the relationship between the surface realisation performance and the character of the input to generation, i.e. its degree of underspecification. We extend a syntactic surface realisation system, which can be trained to choose among word order variants, such that the candidate set includes active and passive variants. This allows us to study the interaction of voice and word order alternations in realistic German corpus data. We show that with an appropriately underspecified input, a linguistically informed realisation model trained to regenerate strings from the underlying semantic representation achieves 91.5% accuracy (over a baseline of 82.5%) in the prediction of the original voice. 1
3 0.1149434 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition
Author: John DeNero ; Klaus Macherey
Abstract: Unsupervised word alignment is most often modeled as a Markov process that generates a sentence f conditioned on its translation e. A similar model generating e from f will make different alignment predictions. Statistical machine translation systems combine the predictions of two directional models, typically using heuristic combination procedures like grow-diag-final. This paper presents a graphical model that embeds two directional aligners into a single model. Inference can be performed via dual decomposition, which reuses the efficient inference algorithms of the directional models. Our bidirectional model enforces a one-to-one phrase constraint while accounting for the uncertainty in the underlying directional models. The resulting alignments improve upon baseline combination heuristics in word-level and phrase-level evaluations.
4 0.092667133 109 acl-2011-Effective Measures of Domain Similarity for Parsing
Author: Barbara Plank ; Gertjan van Noord
Abstract: It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data to parse a given domain is data that matches the domain (Sekine, 1997; Gildea, 2001). Hence, an important task is to select appropriate domains. However, most previous work on domain adaptation relied on the implicit assumption that domains are somehow given. As more and more data becomes available, automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. This paper evaluates various ways to automatically acquire related training data for a given test set. The results show that an unsupervised technique based on topic models is effective: it outperforms random data selection on both languages examined, English and Dutch. Moreover, the technique works better than manually assigned labels gathered from meta-data that is available for English. 1 Introduction and Motivation Previous research on domain adaptation has focused on the task of adapting a system trained on one domain, say newspaper text, to a particular new domain, say biomedical data. Usually, some amount of (labeled or unlabeled) data from the new domain was given which has been determined by a human. However, with the growth of the web, more and more data is becoming available, where each document “is potentially its own domain” (McClosky et al., 2010). It is not straightforward to determine which data or model (in case we have several source domain models) will perform best on a new (unknown) target domain. Therefore, an important issue that arises is how to measure domain similarity, i.e. whether we can find a simple yet effective method to determine which model or data is most beneficial for an arbitrary piece of new text.
Moreover, if we had such a measure, a related question is whether it can tell us something more about what is actually meant by “domain”. So far, it was mostly arbitrarily used to refer to some kind of coherent unit (related to topic, style or genre), e.g.: newspaper text, biomedical abstracts, questions, fiction. Most previous work on domain adaptation, for instance Hara et al. (2005), McClosky et al. (2006), Blitzer et al. (2006), Daumé III (2007), sidestepped this problem of automatic domain selection and adaptation. For parsing, to our knowledge only one recent study has started to examine this issue (McClosky et al., 2010); we will discuss their approach in Section 2. Rather, an implicit assumption of all of these studies is that domains are given, i.e. that they are represented by the respective corpora. Thus, a corpus has been considered a homogeneous unit. As more data is becoming available, it is unlikely that domains will be ‘given’. Moreover, a given corpus might not always be as homogeneous as originally thought (Webber, 2009; Lippincott et al., 2010). For instance, recent work has shown that the well-known Penn Treebank (PT) Wall Street Journal (WSJ) actually contains a variety of genres, including letters, wit and short verse (Webber, 2009). In this study we take a different approach. Rather than viewing a given corpus as a monolithic entity, we break it down to the article level and disregard corpora boundaries.
Given the resulting set of documents (articles), we evaluate various ways to automatically acquire related training data for a given test set, to find answers to the following questions: • Given a pool of data (a collection of articles from unknown domains) and a test article, is there a way to automatically select data that is relevant for the new domain? If so: • Which similarity measure is good for parsing? • How does it compare to human-annotated data? • Is the measure also useful for other languages and/or tasks? To this end, we evaluate measures of domain similarity and feature representations and their impact on dependency parsing accuracy. Given a collection of annotated articles, and a new article that we want to parse, we want to select the most similar articles to train the best parser for that new article. In the following, we will first compare automatic measures to human-annotated labels by examining parsing performance within subdomains of the Penn Treebank WSJ. Then, we extend the experiments to the domain adaptation scenario. Experiments were performed on two languages: English and Dutch. The empirical results show that a simple measure based on topic distributions is effective for both languages and works well also for Part-of-Speech tagging. As the approach is based on plain surface-level information (words) and it finds related data in a completely unsupervised fashion, it can be easily applied to other tasks or languages for which annotated (or automatically annotated) data is available.
Inspired by work on parsing accuracy prediction (Ravi et al., 2008), they train a linear regression model to predict the best (linear interpolation) of source domain models. Similar to us, McClosky et al. (2010) regard a target domain as a mixture of source domains, but they focus on phrase-structure parsing. Furthermore, our approach differs from theirs in two respects: we do not treat source corpora as one entity and try to mix models, but rather consider articles as base units and try to find subsets of related articles (the most similar articles); moreover, instead of creating a supervised model (in their case to predict parsing accuracy), our approach is ‘simplistic’: we apply measures of domain similarity directly (in an unsupervised fashion), without the necessity to train a supervised model. Two other related studies are (Lippincott et al., 2010; Van Asch and Daelemans, 2010). Van Asch and Daelemans (2010) explore a measure of domain difference (Renyi divergence) between pairs of domains and its correlation to Part-of-Speech tagging accuracy. Their empirical results show a linear correlation between the measure and the performance loss. Their goal is different, but related: rather than finding related data for a new domain, they want to estimate the loss in accuracy of a PoS tagger when applied to a new domain. We will briefly discuss results obtained with the Renyi divergence in Section 5.1. Lippincott et al. (2010) examine subdomain variation in biomedicine corpora and propose awareness of NLP tools to such variation. However, they did not yet evaluate the effect on a practical task, thus our study is somewhat complementary to theirs. The issue of data selection has recently been examined for Language Modeling (Moore and Lewis, 2010). A subset of the available data is automatically selected as training data for a Language Model based on a scoring mechanism that compares cross-entropy scores.
Their approach considerably outperformed random selection and two previously proposed approaches, both based on perplexity scoring.1 (1: We tested data selection by perplexity scoring, but found the Language Models too small to be useful in our setting.) 3 Measures of Domain Similarity 3.1 Measuring Similarity Automatically Feature Representations A similarity function may be defined over any set of events that are considered to be relevant for the task at hand. For parsing, these might be words, characters, n-grams (of words or characters), Part-of-Speech (PoS) tags, bilexical dependencies, syntactic rules, etc. However, to obtain more abstract types such as PoS tags or dependency relations, one would first need to gather respective labels. The necessary tools for this are again trained on particular corpora, and will suffer from domain shifts, rendering labels noisy. Therefore, we want to gauge the effect of the simplest representation possible: plain surface characteristics (unlabeled text). This has the advantage that we do not need to rely on additional supervised tools; moreover, it is interesting to know how far we can get with this level of information only. We examine the following feature representations: relative frequencies of words, relative frequencies of character tetragrams, and topic models. Our motivation was as follows. Relative frequencies of words are a simple and effective representation used e.g. in text classification (Manning and Schütze, 1999), while character n-grams have proven successful in genre classification (Wu et al., 2010). Topic models (Blei et al., 2003; Steyvers and Griffiths, 2007) can be considered an advanced model over word distributions: every article is represented by a topic distribution, which in turn is a distribution over words. Similarity between documents can be measured by comparing topic distributions. Similarity Functions There are many possible similarity (or distance) functions.
They fall broadly into two categories: probabilistically-motivated and geometrically-motivated functions. The similarity functions examined in this study will be described in the following. The Kullback-Leibler (KL) divergence D(q||r) is a classic measure of ‘distance’2 between two probability distributions, and is defined as: D(q||r) = Σ_y q(y) log(q(y)/r(y)). It is a non-negative, additive, asymmetric measure, and 0 iff the two distributions are identical. However, the KL-divergence is undefined if there exists an event y such that q(y) > 0 but r(y) = 0, which is a property that “makes it unsuitable for distributions derived via maximum-likelihood estimates” (Lee, 2001). (2: It is not a proper distance metric since it is asymmetric.) One option to overcome this limitation is to apply smoothing techniques to gather non-zero estimates for all y. The alternative, examined in this paper, is to consider an approximation to the KL divergence, such as the Jensen-Shannon (JS) divergence (Lin, 1991) and the skew divergence (Lee, 2001). The Jensen-Shannon divergence, which is symmetric, computes the KL-divergence between q, r, and the average between the two. We use the JS divergence as defined in Lee (2001): JS(q, r) = ½[D(q||avg(q, r)) + D(r||avg(q, r))]. The asymmetric skew divergence s_α, proposed by Lee (2001), mixes one distribution with the other by a degree defined by α ∈ [0, 1): s_α(q, r) = D(q||αr + (1−α)q). As α approaches 1, the skew divergence approximates the KL-divergence. An alternative way to measure similarity is to consider the distributions as vectors and apply geometrically-motivated distance functions. This family of similarity functions includes the cosine cos(q, r) = Σ_y q(y)·r(y) / (||q|| ||r||), the euclidean distance euc(q, r) = sqrt(Σ_y (q(y) − r(y))²), and the variational (also known as L1 or Manhattan) distance function, defined as var(q, r) = Σ_y |q(y) − r(y)|.
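The probabilistic divergences above are straightforward to sketch in Python; distributions are plain dicts mapping events to probabilities, and the JS and skew variants avoid the zero-denominator problem of raw KL by mixing in the first argument:

```python
import math

def kl(q, r):
    """D(q||r) = sum_y q(y) * log(q(y)/r(y)); undefined when q(y) > 0
    but r(y) = 0, which motivates the smoothed variants below."""
    return sum(qy * math.log(qy / r[y]) for y, qy in q.items() if qy > 0)

def js(q, r):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    keys = set(q) | set(r)
    avg = {y: 0.5 * (q.get(y, 0.0) + r.get(y, 0.0)) for y in keys}
    return 0.5 * (kl(q, avg) + kl(r, avg))

def skew(q, r, alpha=0.99):
    """Skew divergence s_alpha(q, r) = D(q || alpha*r + (1-alpha)*q)."""
    keys = set(q) | set(r)
    mix = {y: alpha * r.get(y, 0.0) + (1 - alpha) * q.get(y, 0.0) for y in keys}
    return kl(q, mix)
```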
3.2 Human-annotated Data

In contrast to the automatic measures devised in the previous section, we might have access to human-annotated data; that is, we can use label information such as topic or genre to define the set of similar articles.

Genre For the Penn Treebank (PT) Wall Street Journal (WSJ) section, more specifically the subset available in the Penn Discourse Treebank, there exists a partition of the data by genre (Webber, 2009). Every article is assigned one of the following genre labels: news, letters, highlights, essays, errata, wit and short verse, quarterly progress reports, notable and quotable. This classification was made on the basis of meta-data (Webber, 2009). It is well-known that there is no meta-data directly associated with the individual WSJ files in the Penn Treebank. However, meta-data can be obtained by looking at the articles in the ACL/DCI corpus (LDC99T42) and a mapping file that aligns document numbers of the DCI (DOCNO) to WSJ keys (Webber, 2009). An example document is given in Figure 1. The meta-data field HL contains headlines, SO contains source information, and the IN field includes topic markers.
5 0.089896731 282 acl-2011-Shift-Reduce CCG Parsing
Author: Yue Zhang ; Stephen Clark
Abstract: CCGs are directly compatible with binary-branching bottom-up parsing algorithms, in particular CKY and shift-reduce algorithms. While the chart-based approach has been the dominant approach for CCG, the shift-reduce method has been little explored. In this paper, we develop a shift-reduce CCG parser using a discriminative model and beam search, and compare its strengths and weaknesses with the chart-based C&C parser. We study the different errors made by the two parsers, and show that the shift-reduce parser gives competitive accuracies compared to C&C. Considering our use of a small beam, and given the high ambiguity levels in an automatically-extracted grammar and the amount of information in the CCG lexical categories which form the shift actions, this is a surprising result.
6 0.086237818 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars
7 0.080586761 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
8 0.076378599 166 acl-2011-Improving Decoding Generalization for Tree-to-String Translation
9 0.07590916 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations
10 0.071244985 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
11 0.066666 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment
12 0.063845299 167 acl-2011-Improving Dependency Parsing with Semantic Classes
13 0.063201122 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
14 0.06199814 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
15 0.061287969 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing
16 0.060312167 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing
17 0.05948361 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing
18 0.058871761 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
19 0.05803775 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
20 0.058034826 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation
topicId topicWeight
[(0, 0.15), (1, -0.054), (2, -0.026), (3, -0.11), (4, -0.031), (5, -0.013), (6, -0.015), (7, 0.04), (8, -0.004), (9, -0.015), (10, 0.011), (11, 0.01), (12, 0.022), (13, -0.001), (14, -0.023), (15, 0.027), (16, 0.018), (17, 0.013), (18, -0.019), (19, -0.031), (20, 0.035), (21, 0.005), (22, -0.02), (23, -0.035), (24, -0.032), (25, 0.04), (26, -0.035), (27, -0.025), (28, 0.011), (29, -0.068), (30, -0.043), (31, 0.067), (32, -0.017), (33, 0.039), (34, 0.036), (35, 0.087), (36, -0.071), (37, -0.066), (38, 0.006), (39, -0.075), (40, 0.003), (41, -0.031), (42, -0.176), (43, 0.027), (44, -0.048), (45, -0.025), (46, 0.025), (47, 0.054), (48, 0.101), (49, 0.084)]
simIndex simValue paperId paperTitle
same-paper 1 0.87175077 267 acl-2011-Reversible Stochastic Attribute-Value Grammars
Author: Daniel de Kok ; Barbara Plank ; Gertjan van Noord
Abstract: An attractive property of attribute-value grammars is their reversibility. Attribute-value grammars are usually coupled with separate statistical components for parse selection and fluency ranking. We propose reversible stochastic attribute-value grammars, in which a single statistical model is employed both for parse selection and fluency ranking.
2 0.79834461 317 acl-2011-Underspecifying and Predicting Voice for Surface Realisation Ranking
Author: Sina Zarriess ; Aoife Cahill ; Jonas Kuhn
Abstract: This paper addresses a data-driven surface realisation model based on a large-scale reversible grammar of German. We investigate the relationship between the surface realisation performance and the character of the input to generation, i.e. its degree of underspecification. We extend a syntactic surface realisation system, which can be trained to choose among word order variants, such that the candidate set includes active and passive variants. This allows us to study the interaction of voice and word order alternations in realistic German corpus data. We show that with an appropriately underspecified input, a linguistically informed realisation model trained to regenerate strings from the underlying semantic representation achieves 91.5% accuracy (over a baseline of 82.5%) in the prediction of the original voice.
3 0.69615585 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars
Author: Antske Fokkens
Abstract: When designing grammars of natural language, typically, more than one formal analysis can account for a given phenomenon. Moreover, because analyses interact, the choices made by the engineer influence the possibilities available in further grammar development. The order in which phenomena are treated may therefore have a major impact on the resulting grammar. This paper proposes to tackle this problem by using metagrammar development as a methodology for grammar engineering. I argue that metagrammar engineering as an approach facilitates the systematic exploration of grammars through comparison of competing analyses. The idea is illustrated through a comparative study of auxiliary structures in HPSG-based grammars for German and Dutch. Auxiliaries form a central phenomenon of German and Dutch and are likely to influence many components of the grammar. This study shows that a special auxiliary+verb construction significantly improves efficiency compared to the standard argument-composition analysis for both parsing and generation.
Author: Nina Dethlefs ; Heriberto Cuayahuitl
Abstract: Surface realisation decisions in language generation can be sensitive to a language model, but also to decisions of content selection. We therefore propose the joint optimisation of content selection and surface realisation using Hierarchical Reinforcement Learning (HRL). To this end, we suggest a novel reward function that is induced from human data and is especially suited for surface realisation. It is based on a generation space in the form of a Hidden Markov Model (HMM). Results in terms of task success and human-likeness suggest that our unified approach performs better than greedy or random baselines.
5 0.6321069 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing
Author: Gisle Ytrestl
Abstract: This paper describes a backtracking strategy for an incremental deterministic transition-based parser for HPSG. The method could theoretically be implemented on any other transition-based parser with some adjustments. In this paper, the algorithm is evaluated on CuteForce, an efficient deterministic shift-reduce HPSG parser. The backtracking strategy may serve to improve existing parsers, or to assess whether a deterministic parser would benefit from backtracking as a strategy to improve parsing.
7 0.60440445 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing
8 0.55093443 239 acl-2011-P11-5002 k2opt.pdf
9 0.54367501 173 acl-2011-Insertion Operator for Bayesian Tree Substitution Grammars
10 0.53675067 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations
11 0.53095722 298 acl-2011-The ACL Anthology Searchbench
12 0.52518302 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
13 0.5197506 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation
14 0.517349 330 acl-2011-Using Derivation Trees for Treebank Error Detection
15 0.46494731 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
16 0.46155909 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment
17 0.45995462 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
18 0.45295227 282 acl-2011-Shift-Reduce CCG Parsing
19 0.45147535 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach
20 0.44586268 106 acl-2011-Dual Decomposition for Natural Language Processing
topicId topicWeight
[(5, 0.038), (17, 0.059), (26, 0.02), (28, 0.271), (31, 0.032), (37, 0.096), (39, 0.044), (41, 0.059), (55, 0.036), (59, 0.054), (72, 0.044), (91, 0.031), (96, 0.124), (97, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.75889421 267 acl-2011-Reversible Stochastic Attribute-Value Grammars
Author: Daniel de Kok ; Barbara Plank ; Gertjan van Noord
Abstract: An attractive property of attribute-value grammars is their reversibility. Attribute-value grammars are usually coupled with separate statistical components for parse selection and fluency ranking. We propose reversible stochastic attribute-value grammars, in which a single statistical model is employed both for parse selection and fluency ranking.
2 0.72818065 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations
Author: Matt Post
Abstract: In this paper, we show that local features computed from the derivations of tree substitution grammars, such as the identity of particular fragments and counts of large and small fragments, are useful in binary grammatical classification tasks. Such features outperform n-gram features and various model scores by a wide margin. Although they fall short of the performance of the hand-crafted feature set of Charniak and Johnson (2005) developed for parse tree reranking, they do so with an order of magnitude fewer features. Furthermore, since the TSGs employed are learned in a Bayesian setting, the use of their derivations can be viewed as the automatic discovery of tree patterns useful for classification. On the BLLIP dataset, we achieve an accuracy of 89.9% in discriminating between grammatical text and samples from an n-gram language model.
3 0.71860558 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
Author: Kapil Thadani ; Kathleen McKeown
Abstract: The task of aligning corresponding phrases across two related sentences is an important component of approaches for natural language problems such as textual inference, paraphrase detection and text-to-text generation. In this work, we examine a state-of-the-art structured prediction model for the alignment task which uses a phrase-based representation and is forced to decode alignments using an approximate search approach. We propose instead a straightforward exact decoding technique based on integer linear programming that yields order-of-magnitude improvements in decoding speed. This ILP-based decoding strategy permits us to consider syntacticallyinformed constraints on alignments which significantly increase the precision of the model.
4 0.67709535 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
Author: Yue Zhang ; Joakim Nivre
Abstract: Transition-based dependency parsers generally use heuristic decoding algorithms but can accommodate arbitrarily rich feature representations. In this paper, we show that we can improve the accuracy of such parsers by considering even richer feature sets than those employed in previous systems. In the standard Penn Treebank setup, our novel features improve attachment score from 91.4% to 92.9%, giving the best results so far for transition-based parsing and rivaling the best results overall. For the Chinese Treebank, they give a significant improvement over the state of the art. An open source release of our parser is freely available.
5 0.59804332 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing
Author: Mohit Bansal ; Dan Klein
Abstract: We investigate full-scale shortest-derivation parsing (SDP), wherein the parser selects an analysis built from the fewest number of training fragments. Shortest derivation parsing exhibits an unusual range of behaviors. At one extreme, in the fully unpruned case, it is neither fast nor accurate. At the other extreme, when pruned with a coarse unlexicalized PCFG, the shortest derivation criterion becomes both fast and surprisingly effective, rivaling more complex weighted-fragment approaches. Our analysis includes an investigation of tie-breaking and associated dynamic programs. At its best, our parser achieves an accuracy of 87% F1 on the English WSJ task with minimal annotation, and 90% F1 with richer annotation.
6 0.58145809 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
7 0.58035338 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
8 0.57963127 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
9 0.57671422 311 acl-2011-Translationese and Its Dialects
10 0.57582486 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
11 0.57441008 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
12 0.57414061 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
13 0.57408959 282 acl-2011-Shift-Reduce CCG Parsing
14 0.57328695 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
15 0.57306385 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
16 0.57254994 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment
17 0.57179272 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
18 0.5717355 44 acl-2011-An exponential translation model for target language morphology
19 0.57123184 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
20 0.57040769 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates