acl acl2010 acl2010-214 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jennifer Gillenwater ; Kuzman Ganchev ; Joao Graca ; Fernando Pereira ; Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
Reference: text
sentIndex sentText sentNum sentScore
1 A strong inductive bias is essential in unsupervised grammar induction. [sent-5, score-0.155]
2 We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. [sent-6, score-0.551]
3 Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. [sent-7, score-0.438]
4 In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. [sent-9, score-0.157]
5 Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. [sent-11, score-0.042]
6 1 Introduction We investigate an unsupervised learning method for dependency parsing models that imposes sparsity biases on the dependency types. [sent-14, score-0.465]
7 We assume a corpus annotated with POS tags, where the task is to induce a dependency model from the tags for corpus sentences. [sent-15, score-0.211]
8 In this setting, the type of a dependency is defined as a pair: tag of the dependent (also known as the child), and tag of the head (also known as the parent). [sent-16, score-0.448]
9 Given that POS tags are designed to convey information about grammatical relations, it is reasonable to assume that only some of the possible dependency types will be realized. [sent-17, score-0.211]
10 For instance, in English it is ungrammatical for nouns to dominate verbs, adjectives to dominate adverbs, and determiners to dominate almost any part of speech. [sent-23, score-0.241]
11 Thus, the realized dependency types should be a sparse subset of all possible types. [sent-24, score-0.155]
12 Previous work in unsupervised grammar induction has tried to achieve sparsity through priors. [sent-25, score-0.239]
13 (2008) experimented with a discounting Dirichlet prior, which encourages a standard dependency parsing model (see Section 2) to limit the number of dependent types for each head type. [sent-31, score-0.406]
14 Our experiments show that a more effective sparsity pattern is one that limits the total number of unique head-dependent tag pairs. [sent-32, score-0.194]
15 This kind of sparsity bias avoids inducing competition between dependent types for each head type. [sent-33, score-0.309]
16 We can achieve the desired bias with a constraint on model posteriors during learning, using the posterior regularization (PR) framework (Graça et al. [sent-34, score-0.437]
17 Specifically, to implement PR we augment the maximum marginal likelihood objective of the dependency model with a term that penalizes head-dependent tag distributions that are too permissive. [sent-36, score-0.323]
18 (2008) and Cohen and Smith (2009) investigated logistic normal priors, and Headden III et al. [sent-39, score-0.185]
19 2 Parsing Model The models we use are based on the generative dependency model with valence (DMV) (Klein and Manning, 2004). [sent-50, score-0.155]
20 For a sentence with tags x, the root POS r(x) is generated first. [sent-51, score-0.107]
21 Then the model decides whether to generate a right dependent conditioned on the POS of the root and whether other right dependents have already been generated for this head. [sent-52, score-0.29]
22 Upon deciding to generate a right dependent, the POS of the dependent is selected by conditioning on the head POS and the directionality. [sent-53, score-0.19]
23 After stopping on the right, the root generates left dependents using the mirror reversal of this process. [sent-54, score-0.237]
24 Once the root has generated all its dependents, the dependents generate their own dependents in the same manner. [sent-55, score-0.347]
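The generative story above can be sketched as a function that scores one dependency tree. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dmv_log_prob`, the head-index encoding (`-1` marks the root), and the three callables `p_root`, `p_stop`, and `p_child` are hypothetical stand-ins for the model's parameter tables, and only the basic "any dependent yet?" stop conditioning is modeled.

```python
import math


def dmv_log_prob(tags, heads, p_root, p_stop, p_child):
    """Log-probability of one dependency tree under a basic DMV-style model.

    tags:  list of POS tags, one per token.
    heads: heads[i] is the index of token i's head, or -1 for the root.
    p_root(tag), p_stop(head_tag, direction, adjacent) and
    p_child(dep_tag, head_tag, direction) are hypothetical parameter
    lookups that return probabilities in (0, 1].
    """
    root = heads.index(-1)
    log_prob = math.log(p_root(tags[root]))  # the root POS is generated first

    for h, head_tag in enumerate(tags):
        deps = [d for d, parent in enumerate(heads) if parent == h]
        sides = {
            # Dependents are generated outward from the head on each side.
            "right": sorted(d for d in deps if d > h),
            "left": sorted((d for d in deps if d < h), reverse=True),
        }
        for direction, side_deps in sides.items():
            adjacent = True  # no dependent generated yet in this direction
            for d in side_deps:
                # Decide to continue, then choose the dependent's POS.
                log_prob += math.log(1.0 - p_stop(head_tag, direction, adjacent))
                log_prob += math.log(p_child(tags[d], head_tag, direction))
                adjacent = False
            # Finally decide to stop in this direction.
            log_prob += math.log(p_stop(head_tag, direction, adjacent))
    return log_prob
```

The sketch does not check that `heads` encodes a valid projective tree; it only multiplies the root, stop/continue, and child decisions described above.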
25 The first extension alters the stopping probability by conditioning it not only on whether there are any dependents in a particular direction already, but also on how many such dependents there are. [sent-59, score-0.441]
26 When we talk about models with maximum stop valency Vs = S, this means it distinguishes S different cases: 0, 1, . [sent-60, score-0.042]
27 The second model extension we implement is analogous to the first, but applies to dependent tag probabilities instead of stop probabilities. [sent-68, score-0.268]
28 Again, we expand the conditioning such that the model considers how many other dependents were already generated in the same direction. [sent-69, score-0.209]
29 When we talk about a model with maximum child valency Vc = C, this means we distinguish C different cases. [sent-70, score-0.148]
30 Since this extension to the dependent probabilities dramatically increases model complexity, the third model extension we implement is to add a backoff for the dependent probabilities that does not condition on the identity of the parent POS (see Equation 2). [sent-72, score-0.592]
31 For the third model extension, the backoff to a probability not dependent on parent POS can be formally expressed as: $\lambda\, p_{\mathrm{child}}(y_c \mid y_p, y_d, y_{v_c}) + (1 - \lambda)\, p_{\mathrm{child}}(y_c \mid y_d, y_{v_c})$ (2) for $\lambda \in [0, 1]$. [sent-74, score-0.311]
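Equation 2 is a linear interpolation between a parent-conditioned child probability and a parent-independent one. A minimal sketch follows; the function and argument names are hypothetical, and the two probabilities are assumed to have been looked up already.

```python
def backoff_child_prob(lam, p_with_parent, p_without_parent):
    """Interpolate the two child distributions as in Equation 2.

    lam:              mixing weight lambda in [0, 1].
    p_with_parent:    p_child(y_c | y_p, y_d, y_vc), conditioned on parent POS.
    p_without_parent: p_child(y_c | y_d, y_vc), the backoff without parent POS.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    return lam * p_with_parent + (1.0 - lam) * p_without_parent
```

With `lam = 1` the backoff is ignored; with `lam = 0` the parent POS plays no role in choosing the dependent tag.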
32 3 Previous Learning Approaches In our experiments, we compare PR learning to standard expectation maximization (EM) and to Bayesian learning with a sparsity-inducing prior. [sent-77, score-0.157]
33 The EM algorithm optimizes marginal likelihood $L(\theta) = \log \sum_{Y} p_\theta(X, Y)$, where X = {x1, . [sent-78, score-0.086]
34 The PR method we present modifies the E-step by adding constraints. [sent-88, score-0.041]
35 Besides EM, we also compare to learning with several Bayesian priors that have been applied to the DMV. [sent-89, score-0.091]
36 One such prior is the Dirichlet, whose hyperparameter we will denote by α. [sent-90, score-0.042]
37 In this paper we will refer to our own implementation of the Dirichlet prior as the “discounting Dirichlet” (DD) method. [sent-96, score-0.042]
38 In addition to the Dirichlet, other types of priors have been applied, in particular logistic normal priors (LN) and shared logistic normal priors (SLN) (Cohen et al. [sent-97, score-0.643]
39 Essentially, this has a similar goal to sparsity-inducing methods in that it posits a more concise explanation for the grammar of a language. [sent-100, score-0.044]
40 (2009) also implement a sort of parameter tying for the E-DMV through learning a backoff distribution on child probabilities. [sent-102, score-0.246]
41 4 Learning with Sparse Posteriors We would like to penalize models that predict a large number of distinct dependency types. [sent-104, score-0.155]
42 To enforce this penalty, we use the posterior regularization (PR) framework (Graça et al. [sent-105, score-0.243]
43 PR is closely related to generalized expectation constraints (Mann and McCallum, 2007; Mann and McCallum, 2008; Bellare et al. [sent-107, score-0.149]
44 , 2009), and is also indirectly related to a Bayesian view of learning with constraints on posteriors (Liang et al. [sent-108, score-0.169]
45 The PR framework uses constraints on posterior expectations to guide parameter estimation. [sent-110, score-0.156]
46 Here, PR allows a natural and tractable representation of sparsity constraints based on edge type counts that cannot easily be encoded in model parameters. [sent-111, score-0.155]
47 We use a version of PR where the desired bias is a penalty on the log likelihood (see Ganchev et al. [sent-112, score-0.243]
48 For a fixed set of model parameters θ the full PR penalty term is: $\min_q \mathrm{KL}(q(Y) \,\|\, p_\theta(Y \mid X)) + \sigma\, \lVert E_q[\phi(X, Y)] \rVert_\beta$ (6) where σ is the strength of the regularization. [sent-115, score-0.134]
49 PR seeks to maximize L(θ) minus this penalty term. [sent-116, score-0.134]
50 1 ℓ1/ℓ∞ Regularization We now define precisely how to count dependency types. [sent-121, score-0.155]
51 For each child tag c, let i range over an enumeration of all occurrences of c in the corpus, and let p be another tag. [sent-122, score-0.188]
52 Let the indicator φcpi(X, Y) have value 1 if p is the parent tag of the ith occurrence of c, and value 0 otherwise. [sent-123, score-0.207]
53 The number of unique dependency types is then: $\sum_{cp} \max_i \phi_{cpi}(X, Y)$ (7) Note there is an asymmetry in this count: occurrences of child type c are enumerated with i, but all occurrences of parent type p are or-ed in φcpi. [sent-124, score-0.386]
54 That is, φcpi = 1 if any occurrence of p is the parent of the ith occurrence of c. [sent-125, score-0.125]
55 Instead of counting pairs of a child token and a parent type, we can alternatively count pairs of a child token and a parent token by letting p range over all tokens rather than types. [sent-127, score-0.462]
56 Then each potential dependency corresponds to a different indicator φcpij, and the penalty is symmetric with respect to parents and children. [sent-128, score-0.289]
57 Equation 7 can be viewed as a mixed-norm penalty on the features φcpi or φcpij: the sum corresponds to an ℓ1 norm and the max to an ℓ∞ norm. [sent-131, score-0.134]
58 Thus, the quantity we want to minimize fits precisely into the PR penalty framework. [sent-132, score-0.134]
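The mixed-norm count can be sketched for the child-token / parent-type variant (the φcpi features). This is an illustration under assumptions, not the authors' code: the function name and input layout are hypothetical, `post[i][j]` is taken to be the posterior probability that token j is the parent of token i, and root attachments are ignored.

```python
from collections import defaultdict


def l1_linf_penalty(sentences):
    """Sum over (child tag, parent tag) pairs of the max over child tokens.

    sentences: iterable of (tags, post) pairs, where tags is a list of POS
    tags and post[i][j] is the (assumed) posterior probability that token j
    is the parent of token i.
    """
    max_expectation = defaultdict(float)
    for tags, post in sentences:
        for i, child_tag in enumerate(tags):
            # E_q[phi_cpi]: posterior mass on parents of token i carrying tag p.
            mass = defaultdict(float)
            for j, parent_tag in enumerate(tags):
                if j != i:
                    mass[parent_tag] += post[i][j]
            for parent_tag, value in mass.items():
                key = (child_tag, parent_tag)
                # The l-infinity part: keep the largest expectation per pair.
                max_expectation[key] = max(max_expectation[key], value)
    # The l1 part: sum the per-pair maxima.
    return sum(max_expectation.values())
```

When every posterior is 0 or 1 (a single hard parse per sentence), the returned value equals the number of distinct child-parent tag pairs in the corpus, which is the count in Equation 7.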
59 ξcp ≤ Eq [φ(X, Y)] where ξcp corresponds to the maximum expectation of φ over all instances of c and p. [sent-134, score-0.106]
60 Column 3: Best PR result for each model, which is chosen by applying each of the two types of constraints (PR-S and PR-AS) and trying σ ∈ {80, 100, 120, 140, 160, 180}. [sent-145, score-0.043]
61 Before evaluating, we smooth the resulting models by adding to each learned parameter, merely to remove the chance of zero probabilities for unseen events. [sent-151, score-0.049]
62 We also note that development likelihood and the best setting for σ are not well-correlated, which unfortunately makes it hard to pick these parameters without some supervision. [sent-175, score-0.041]
63 2 Comparison with Previous Work In this section we compare to previously published unsupervised dependency parsing results for English. [sent-177, score-0.198]
64 It might be argued that the comparison is unfair since we do supervised selection of model [sent-178, score-0.059]
65 However, we feel the comparison is not so unfair as we perform only a very limited search of the model-σ space. [sent-191, score-0.059]
66 The second two entries are logistic normal and shared logistic normal parameter tying results (Cohen et al. [sent-195, score-0.415]
67 For the bottom two entries in the table, which are for the E-DMV, the last entry is best, corresponding to using a DD prior with α = 1 (non-sparsifying), but with a special “random pools” initialization and a learned weight λ for the child backoff probability. [sent-198, score-0.243]
68 The result for PR-AS is well within the variance range of this last entry, and thus we conjecture that combining PR-AS with random pools initialization and learned λ would likely produce the best-performing model of all. [sent-199, score-0.059]
69 Una/d papelera/nc es/vs un/d objeto/nc civilizado/aq [sent-250, score-0.162]
70 Figure 1: Posterior edge probabilities for an example sentence from the Spanish test corpus. [sent-263, score-0.049]
71 The numbers on the edges are the values of the posterior probabilities. [sent-266, score-0.113]
72 6 Analysis One common EM error that PR fixes in many languages is the directionality of the noun-determiner relation. [sent-274, score-0.05]
73 Then it does not have to pay the cost of assigning a parent with a new tag to cover each noun that doesn’t come with a determiner. [sent-277, score-0.207]
74 7 Conclusion In this paper we presented a new method for unsupervised learning of dependency parsers. [sent-278, score-0.198]
75 Our approach consistently outperforms the standard EM algorithm and a discounting Dirichlet prior. [sent-280, score-0.061]
76 We have several ideas for further improving our constraints, such as: taking into account the directionality of the edges, using different regularization strengths for the root probabilities than for the child probabilities, and working directly on word types rather than on POS tags. [sent-281, score-0.386]
77 In the future, we would also like to try applying similar constraints to the more complex task of joint induction of POS tags and dependency parses. [sent-282, score-0.294]
78 Logistic normal priors for unsupervised probabilistic grammar induction. [sent-349, score-0.255]
79 Maximum likelihood from incomplete data via the EM algorithm. [sent-360, score-0.041]
80 Improving unsupervised dependency parsing with richer contexts and smoothing. [sent-412, score-0.198]
81 Corpus-based induction of syntactic structure: Models of dependency and constituency. [sent-434, score-0.195]
82 Generalized expectation criteria for semi-supervised learning of conditional random fields. [sent-477, score-0.106]
83 MAMBA meets TIGER: Reconstructing a Swedish treebank from antiquity. [sent-501, score-0.042]
wordName wordTfidf (topN-words)
[('pr', 0.291), ('gra', 0.201), ('dmv', 0.188), ('cohen', 0.187), ('em', 0.184), ('ganchev', 0.159), ('dependency', 0.155), ('dependents', 0.148), ('dd', 0.148), ('headden', 0.134), ('penalty', 0.134), ('cpi', 0.134), ('regularization', 0.13), ('posteriors', 0.126), ('gillenwater', 0.126), ('parent', 0.125), ('posterior', 0.113), ('sparsity', 0.112), ('logistic', 0.108), ('dirichlet', 0.108), ('simov', 0.107), ('child', 0.106), ('expectation', 0.106), ('yc', 0.101), ('pchild', 0.1), ('yvc', 0.1), ('backoff', 0.095), ('priors', 0.091), ('dependent', 0.091), ('mann', 0.091), ('vs', 0.084), ('cp', 0.084), ('tag', 0.082), ('pos', 0.078), ('normal', 0.077), ('bias', 0.068), ('smith', 0.068), ('argqminkl', 0.067), ('bohomov', 0.067), ('civit', 0.067), ('cpij', 0.067), ('edmv', 0.067), ('eroski', 0.067), ('fct', 0.067), ('kawata', 0.067), ('pstop', 0.067), ('sln', 0.067), ('xvl', 0.067), ('xvr', 0.067), ('vc', 0.065), ('yd', 0.065), ('dominate', 0.065), ('discounting', 0.061), ('conditioning', 0.061), ('encourages', 0.061), ('spanish', 0.059), ('unfair', 0.059), ('bellare', 0.059), ('pools', 0.059), ('tiger', 0.059), ('slovene', 0.059), ('bulgarian', 0.059), ('iii', 0.057), ('tags', 0.056), ('afonso', 0.054), ('beek', 0.054), ('kromann', 0.054), ('root', 0.051), ('maximization', 0.051), ('directionality', 0.05), ('probabilities', 0.049), ('oflazer', 0.048), ('eq', 0.048), ('determiners', 0.046), ('jordan', 0.046), ('bayesian', 0.046), ('extension', 0.046), ('treebanks', 0.046), ('qt', 0.045), ('tying', 0.045), ('liang', 0.045), ('marginal', 0.045), ('grammar', 0.044), ('unsupervised', 0.043), ('constraints', 0.043), ('mart', 0.042), ('danish', 0.042), ('valency', 0.042), ('dempster', 0.042), ('taskar', 0.042), ('prior', 0.042), ('treebank', 0.042), ('likelihood', 0.041), ('dm', 0.041), ('induction', 0.04), ('nc', 0.039), ('head', 0.038), ('nilsson', 0.038), ('stopping', 0.038), ('yp', 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 214 acl-2010-Sparsity in Dependency Grammar Induction
Author: Jennifer Gillenwater ; Kuzman Ganchev ; Joao Graca ; Fernando Pereira ; Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
2 0.21688302 195 acl-2010-Phylogenetic Grammar Induction
Author: Taylor Berg-Kirkpatrick ; Dan Klein
Abstract: We present an approach to multilingual grammar induction that exploits a phylogeny-structured model of parameter drift. Our method does not require any translated texts or token-level alignments. Instead, the phylogenetic prior couples languages at a parameter level. Joint induction in the multilingual model substantially outperforms independent learning, with larger gains both from more articulated phylogenies and from increasing numbers of languages. Across eight languages, the multilingual approach gives error reductions over the standard monolingual DMV averaging 21.1% and reaching as high as 39%.
3 0.15707259 96 acl-2010-Efficient Optimization of an MDL-Inspired Objective Function for Unsupervised Part-Of-Speech Tagging
Author: Ashish Vaswani ; Adam Pauls ; David Chiang
Abstract: The Minimum Description Length (MDL) principle is a method for model selection that trades off between the explanation of the data by the model and the complexity of the model itself. Inspired by the MDL principle, we develop an objective function for generative models that captures the description of the data by the model (log-likelihood) and the description of the model (model size). We also develop an efficient general search algorithm based on the MAP-EM framework to optimize this function. Since recent work has shown that minimizing the model size in a Hidden Markov Model for part-of-speech (POS) tagging leads to higher accuracies, we test our approach by applying it to this problem. The search algorithm involves a simple change to EM and achieves high POS tagging accuracies on both English and Italian data sets.
4 0.1372142 84 acl-2010-Detecting Errors in Automatically-Parsed Dependency Relations
Author: Markus Dickinson
Abstract: We outline different methods to detect errors in automatically-parsed dependency corpora, by comparing so-called dependency rules to their representation in the training data and flagging anomalous ones. By comparing each new rule to every relevant rule from training, we can identify parts of parse trees which are likely erroneous. Even the relatively simple methods of comparison we propose show promise for speeding up the annotation process. 1 Introduction and Motivation Given the need for high-quality dependency parses in applications such as statistical machine translation (Xu et al., 2009), natural language generation (Wan et al., 2009), and text summarization evaluation (Owczarzak, 2009), there is a corresponding need for high-quality dependency annotation, for the training and evaluation of dependency parsers (Buchholz and Marsi, 2006). Furthermore, parsing accuracy degrades unless sufficient amounts of labeled training data from the same domain are available (e.g., Gildea, 2001; Sekine, 1997), and thus we need larger and more varied annotated treebanks, covering a wide range of domains. However, there is a bottleneck in obtaining annotation, due to the need for manual intervention in annotating a treebank. One approach is to develop automatically-parsed corpora (van Noord and Bouma, 2009), but a natural disadvantage with such data is that it contains parsing errors. Identifying the most problematic parses for human post-processing could combine the benefits of automatic and manual annotation, by allowing a human annotator to efficiently correct automatic errors. We thus set out in this paper to detect errors in automatically-parsed data. If annotated corpora are to grow in scale and retain a high quality, annotation errors which arise from automatic processing must be minimized, as errors have a negative impact on training and evaluation of NLP technology (see discussion and references in Boyd et al., 2008, sec. 1).
There is work on detecting errors in dependency corpus annotation (Boyd et al., 2008), but this is based on finding inconsistencies in annotation for identical recurring strings. This emphasis on identical strings can result in high precision, but many strings do not recur, negatively impacting the recall of error detection. Furthermore, since the same strings often receive the same automatic parse, the types of inconsistencies detected are likely to have resulted from manual annotation. While we can build from the insight that simple methods can provide reliable annotation checks, we need an approach which relies on more general properties of the dependency structures, in order to develop techniques which work for automatically-parsed corpora. Developing techniques to detect errors in parses in a way which is independent of corpus and parser has fairly broad implications. By using only the information available in a training corpus, the methods we explore are applicable to annotation error detection for either hand-annotated or automatically-parsed corpora and can also provide insights for parse reranking (e.g., Hall and Novák, 2005) or parse revision (Attardi and Ciaramita, 2007). Although we focus only on detecting errors in automatically-parsed data, similar techniques have been applied for hand-annotated data (Dickinson, 2008; Dickinson and Foster, 2009). Our general approach is based on extracting a grammar from an annotated corpus and comparing dependency rules in a new (automatically-annotated) corpus to the grammar. Roughly speaking, if a dependency rule, which represents all the dependents of a head together (see section 3.1), does not fit well with the grammar, it is flagged as potentially erroneous.
The methods do not have to be retrained for a given parser’s output (e.g., Campbell and Johnson, 2002), but work by comparing any tree to what is in the training grammar (cf. also approaches stacking hand-written rules on top of other parsers (Bick, 2007)). We propose to flag erroneous parse rules, using information which reflects different grammatical properties: POS lookup, bigram information, and full rule comparisons. We build on a method to detect so-called ad hoc rules, as described in section 2, and then turn to the main approaches in section 3. After a discussion of a simple way to flag POS anomalies in section 4, we evaluate the different methods in section 5, using the outputs from two different parsers. The methodology proposed in this paper is easy to implement and independent of corpus, language, or parser. 2 Approach We take as a starting point two methods for detecting ad hoc rules in constituency annotation (Dickinson, 2008). Ad hoc rules are CFG productions extracted from a treebank which are “used for specific constructions and unlikely to be used again,” indicating annotation errors and rules for ungrammaticalities (see also Dickinson and Foster, 2009). Each method compares a given CFG rule to all the rules in a treebank grammar. Based on the number of similar rules, a score is assigned, and rules with the lowest scores are flagged as potentially ad hoc. This procedure is applicable whether the rules in question are from a new data set (as in this paper, where parses are compared to a training data grammar) or drawn from the treebank grammar itself (i.e., an internal consistency check). The two methods differ in how the comparisons are done. First, the bigram method abstracts a rule to its bigrams.
Thus, a rule such as NP → JJ NN provides support for NP → DT JJ NN, in that it shares the JJ NN sequence. By contrast, in the other method, which we call the whole rule method, a rule is compared in its totality to the grammar rules, using Levenshtein distance. (Footnote 1: This is referred to as whole daughters in Dickinson (2008), but the meaning of “daughters” is less clear for dependencies.) There is no abstraction, meaning all elements are present; e.g., NP → DT JJ JJ NN is very similar to NP → DT JJ NN because the sequences differ by only one category. While previously used for constituencies, what is at issue is simply the valency of a rule, where by valency we refer to a head and its entire set of arguments and adjuncts (cf. Przepiórkowski, 2006), that is, a head and all its dependents. The methods work because we expect there to be regularities in valency structure in a treebank grammar; non-conformity to such regularities indicates a potential problem. 3 Ad hoc rule detection 3.1 An appropriate representation To capture valency, consider the dependency tree from the Talbanken05 corpus (Nilsson and Hall, 2005) in figure 1, for the Swedish sentence in (1), which has four dependency pairs (footnote 2). (1) Det går bara inte ihop . [gloss: it goes just not together] ‘It just doesn’t add up.’ [Figure 1: Dependency graph example. Tokens and POS tags: Det/PO går/VV bara/AB inte/AB ihop/AB; dependency labels on the four dependents: SS, MA, NA, PL.] On a par with constituency rules, we define a grammar rule as a dependency relation rewriting as a head with its sequence of POS/dependent pairs (cf. Kuhlmann and Satta, 2009), as in figure 2. This representation supports the detection of idiosyncrasies in valency (footnote 3). 1. TOP → root ROOT:VV 2. ROOT → SS:PO VV MA:AB NA:AB PL:AB 3. SS → PO 4. MA → AB 5. NA → AB 6. PL → AB Figure 2: Rule representation for (1) For example, for the ROOT category, the head is a verb (VV), and it has 4 dependents.
The extent to which this rule is odd depends upon whether comparable rules, i.e., other ROOT rules or other VV rules (see section 3.2), have a similar set of dependents. While many of the other rules seem rather spare, they provide useful information, showing categories which have no dependents. With a TOP rule, we have a rule for every head, including the virtual root. (Footnote 2: Category definitions are in appendix A. Footnote 3: Valency is difficult to define for coordination and is specific to an annotation scheme. We leave this for the future.) Thus, we can find anomalous rules such as TOP → root ROOT:AV ROOT:NN, where multiple categories have been parsed as ROOT. 3.2 Making appropriate comparisons In comparing rules, we are trying to find evidence that a particular (parsed) rule is valid by examining the evidence from the (training) grammar. Units of comparison To determine similarity, one can compare dependency relations, POS tags, or both. Valency refers to both properties, e.g., verbs which allow verbal (POS) subjects (dependency). Thus, we use the pairs of dependency relations and POS tags as the units of comparison. Flagging individual elements Previous work scored only entire rules, but some dependencies are problematic and others are not. Thus, our methods score individual elements of a rule. Comparable rules We do not want to compare a rule to all grammar rules, only to those which should have the same valents. Comparability could be defined in terms of a rule’s dependency relation (LHS) or in terms of its head. Consider the four different object (OO) rules in (2). These vary a great deal, and much of the variability comes from the fact that they are headed by different POS categories, which tend to have different selectional properties. The head POS thus seems to be predictive of a rule’s valency. (2) a. OO → PO b. OO → DT:EN AT:AJ NN ET:VV c. OO → SS:PO QV VG:VV d.
OO → DT:PO AT:AJ VN But we might lose information by ignoring rules with the same left-hand side (LHS). Our approach is thus to take the greater value of scores when comparing to rules either with the same dependency relation or with the same head. A rule has multiple chances to prove its value, and low scores will only be for rules without any type of support. Taking these points together, for a given rule of interest r, we assign a score (S) to each element ei in r, where r = e1...em by taking the maximum of scores for rules with the same head (h) or same LHS (lhs), as in (3). For the first element in (2b), for example, S(DT:EN) = max{s(DT:EN, NN), s(DT:EN, OO)}. The question is now how we define s(ei, c) for the comparable element c. (3) $S(e_i) = \max\{s(e_i, h), s(e_i, lhs)\}$ 3.3 Whole rule anomalies 3.3.1 Motivation The whole rule method compares a list of a rule’s dependents to rules in a database, and then flags rule elements without much support. By using all dependents as a basis for comparison, this method detects improper dependencies (e.g., an adverb modifying a noun), dependencies in the wrong overall location of a rule (e.g., an adverb before an object), and rules with unnecessarily long argument structures. For example, in (4), we have an improper relation between skall (‘shall’) and sambeskattas (‘be taxed together’), as in figure 3. It is parsed as an adverb (AA), whereas it should be a verb group (VG). The rule for this part of the tree is +F → ++:++ SV AA:VV, and the AA:VV position will be low-scoring because the ++:++ SV context does not support it. (4) Makars övriga inkomster är B-inkomster och skall som tidigare sambeskattas . [gloss: spouses’ other incomes are B-incomes and shall as previously be taxed together .]
‘The other incomes of spouses are B-incomes and shall, as previously, be taxed together.’ [Figure 3: Wrong label (top=gold, bottom=parsed). Tokens and POS tags: och/++ skall/SV som/UK tidigare/AJ sambeskattas/VV; gold labels: ++ +F UK KA VG; parsed labels: ++ +F UK SS AA.] 3.3.2 Implementation The method we use to determine similarity arises from considering what a rule is like without a problematic element. Consider +F → ++:++ SV AA:VV from figure 3, where AA should be a different category (VG). The rule without this error, +F → ++:++ SV, starts several rules in the training data, including some with VG:VV as the next item. The subrule ++:++ SV seems to be reliable, whereas the subrules containing AA:VV (++:++ AA:VV and SV AA:VV) are less reliable. We thus determine reliability by seeing how often each subsequence occurs in the training rule set. Throughout this paper, we use the term subrule to refer to a rule subsequence which is exactly one element shorter than the rule it is a component of. We examine subrules, counting their frequency as subrules, not as complete rules. For example, TOP rules with more than one dependent are problematic, e.g., TOP → root ROOT:AV ROOT:NN. Correspondingly, there are no rules with three elements containing the subrule root ROOT:AV. We formalize this by setting the score s(ei, c) equal to the summation of the frequencies of all comparable subrules containing ei from the training data, as in (5), where B is the set of subrules of r with length one less. (5) $s(e_i, c) = \sum_{sub \in B : e_i \in sub} C(sub, c)$ For example, with c = +F, the frequency of +F → ++:++ SV as a subrule is added to the scores for ++:++ and SV. In this case, +F → ++:++ SV VG:BV, +F → ++:++ SV VG:AV, and +F → ++:++ SV VG:VV all add support for +F → ++:++ SV being a legitimate subrule. Thus, ++:++ and SV are less likely to be the sources of any problems.
Since +F → SV AA:VV and +F → ++:++ AA:VV have very little support in the training data, AA:VV receives a low score. Note that the subrule count C(sub, c) is different from counting the number of rules containing a subrule, as can be seen with identical elements. For example, for SS → VN ET:PR ET:PR, C(VN ET:PR, SS) = 2, in keeping with the fact that there are 2 pieces of evidence for its legitimacy.
3.4 Bigram anomalies
3.4.1 Motivation
The bigram method examines relationships between adjacent sisters, complementing the whole rule method by focusing on local properties. For (6), for example, we find the gold and parsed trees in figure 4. For the long parsed rule TA → PR HD:ID HD:ID IR:IR AN:RO JR:IR, all elements get low whole rule scores, i.e., are flagged as potentially erroneous. But only the final elements have anomalous bigrams: HD:ID IR:IR, IR:IR AN:RO, and AN:RO JR:IR all never occur.
(6) När det gäller inkomståret 1971 ( taxeringsåret 1972 ) skall barnet ...
    when it concerns the-income-year 1971 ( assessment-year 1972 ) shall the-child ...
    ‘Concerning the income year of 1971 (assessment year 1972), the child ...’
3.4.2 Implementation
To obtain a bigram score for an element, we simply add together the counts of the bigrams which contain the element in question, as in (7).
(7) s(e_i, c) = C(e_{i-1} e_i, c) + C(e_i e_{i+1}, c)
Consider the rule from figure 4. With c = TA, the bigram HD:ID IR:IR never occurs, so both HD:ID and IR:IR get 0 added to their score. HD:ID HD:ID, however, is a frequent bigram, so it adds weight to HD:ID, i.e., positive evidence comes from the bigram on the left. If we look at IR:IR, on the other hand, IR:IR AN:RO occurs 0 times, and so IR:IR gets a total score of 0. Both scoring methods treat each element independently. Every single element could be given a low score, even though once one is corrected, another would have a higher score.
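As a rough illustration of the bigram score in (7), the following Python sketch counts adjacent element pairs per context and sums the left and right bigram counts for each element. The data layout and function names are assumptions made here, as is the treatment of rule edges: the text does not say whether rules are padded with start/end symbols, so this sketch simply lets a missing neighbor contribute 0. The training counts are invented for the example.

```python
from collections import Counter


def count_bigrams(training_rules):
    """Count C(e_left e_right, c) over adjacent elements of training rules."""
    counts = Counter()
    for context, rhs in training_rules:
        for left, right in zip(rhs, rhs[1:]):
            counts[(context, left, right)] += 1
    return counts


def bigram_scores(context, rhs, counts):
    """Score each element as in (7): left-bigram count plus right-bigram count."""
    scores = []
    for i, element in enumerate(rhs):
        # No padding at rule edges (assumption): a missing neighbor adds 0.
        left = counts[(context, rhs[i - 1], element)] if i > 0 else 0
        right = counts[(context, element, rhs[i + 1])] if i + 1 < len(rhs) else 0
        scores.append(left + right)
    return scores


# Invented training data: five TA rules in which HD:ID HD:ID is attested.
training = [("TA", ["PR", "HD:ID", "HD:ID"])] * 5
counts = count_bigrams(training)

parsed_rule = ["PR", "HD:ID", "HD:ID", "IR:IR", "AN:RO", "JR:IR"]
print(bigram_scores("TA", parsed_rule, counts))  # [5, 10, 5, 0, 0, 0]
```

The second HD:ID still gets a positive score from its left bigram even though HD:ID IR:IR is unattested, whereas IR:IR, AN:RO, and JR:IR end up with 0, matching the pattern described for figure 4.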
Future work can examine factoring in all elements at once.
4 Additional information
The methods presented so far have limited definitions of comparability. As using complementary information has been useful in, e.g., POS error detection (Loftsson, 2009), we explore other simple comparable properties of a dependency grammar. Namely, we include: a) frequency information of an overall dependency rule and b) information on how likely each dependent is to be in a relation with its head, described next.
4.1 Including POS information
Consider PA → SS:NN XX:XX HV OO:VN, as illustrated in figure 5 for the sentence in (8). This rule is entirely correct, yet the XX:XX position has low whole rule and bigram scores.
(8) Uppgift om vilka orter som har utkörning finner Ni också i ...
    information of which neighborhood who has delivery find you also in ...
    ‘You can also find information about which neighborhoods have delivery services in ...’
Figure 4: A rule with extra dependents (top=gold, bottom=parsed)
Figure 5: Overflagging (gold=parsed)
One method which does not have this problem of overflagging uses a “lexicon” of POS tag pairs, examining relations between POS, irrespective of position. We extract POS pairs, note their dependency relation, and add an L/R to the label to indicate which is the head (Boyd et al., 2008). Additionally, we note how often two POS categories occur as a non-dependency, using the label NIL, to help determine whether there should be any attachment. We generate NILs by enumerating all POS pairs in a sentence.
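A minimal Python sketch of such a POS-pair lexicon is given below. It is an illustration under stated assumptions rather than the method as implemented in the paper: the token representation (POS tag, 0-based head index or None, relation label), the `-L`/`-R` label suffix format, and the function names are hypothetical. Every ordered token pair in a sentence is enumerated; pairs without a direct dependency get the label NIL, and the per-pair counts are normalized into relative frequencies.

```python
from collections import Counter, defaultdict


def pos_pair_labels(sentence):
    """Yield ((left_pos, right_pos), label) for every token pair in a sentence.

    sentence is a list of (pos, head_index, relation) triples, where
    head_index is a 0-based position or None (hypothetical representation).
    """
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence)):
            pos_i, head_i, rel_i = sentence[i]
            pos_j, head_j, rel_j = sentence[j]
            if head_j == i:
                label = rel_j + "-L"  # right token depends on the left one
            elif head_i == j:
                label = rel_i + "-R"  # left token depends on the right one
            else:
                label = "NIL"  # no direct dependency between the two
            yield (pos_i, pos_j), label


def pos_pair_probabilities(treebank):
    """Relative frequency of each label for each POS pair."""
    counts = defaultdict(Counter)
    for sentence in treebank:
        for pair, label in pos_pair_labels(sentence):
            counts[pair][label] += 1
    return {
        pair: {label: n / sum(c.values()) for label, n in c.items()}
        for pair, c in counts.items()
    }


# Invented mini-treebank: XX HV occurs 4 times, twice with XX attached to HV.
attached = [("XX", 1, "XX"), ("HV", None, "ROOT")]
unattached = [("XX", None, "ROOT"), ("HV", None, "ROOT")]
probs = pos_pair_probabilities([attached, attached, unattached, unattached])
print(probs[("XX", "HV")])  # {'XX-R': 0.5, 'NIL': 0.5}
```

A parsed dependency can then be looked up by its POS pair and label, with low relative frequency taken as a sign that the attachment deserves a second look.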
For example, from figure 5, the parsed POS pairs include NN PR → ET-L, NN PO → NIL, etc. We convert the frequencies to probabilities. For example, of 4 total occurrences of XX HV in the training data, 2 are XX-R (cf. figure 5). A probability of 0.5 is quite high, given that NILs are often the most frequent label for POS pairs.
5 Evaluation
In evaluating the methods, our main question is: how accurate are the dependencies, in terms of both attachment and labeling? We therefore currently examine the scores for elements functioning as dependents in a rule. In figure 5, for example, for har (‘has’), we look at its score within ET → PR PA:HV and not when it functions as a head, as in PA → SS:NN XX:XX HV OO:VN. Relatedly, for each method, we are interested in whether elements with scores below a threshold have worse attachment accuracy than scores above, as we predict they do. We can measure this by scoring each testing data position below the threshold as a 1 if it has the correct head and dependency relation and a 0 otherwise. These are simply labeled attachment scores (LAS). Scoring separately for positions above and below a threshold views the task as one of sorting parser output into two bins, those more or less likely to be correctly parsed. For development, we also report unlabeled attachment scores (UAS). Since the goal is to speed up the post-editing of corpus data by flagging erroneous rules, we also report the precision and recall for error detection. We count either attachment or labeling errors as an error, and precision and recall are measured with respect to how many errors are found below the threshold. For development, we use two F-scores to provide a measure of the settings to examine across language, corpus, and parser conditions: the balanced F1 measure and the F0.5 measure, weighing precision twice as much.
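The evaluation measures described here can be sketched as follows. This is a hedged illustration only: the input format (a list of score/correctness pairs), the function names, and the use of a strict "score < threshold" comparison are assumptions made for the example, and the data are invented. The F-measure uses the standard F_beta definition, with beta = 0.5 weighting precision more heavily than recall.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F_beta measure; beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean(values):
    """Average of a list of 0/1 values, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def evaluate(positions, threshold):
    """Split positions at a threshold and report LAS and error detection.

    positions is a list of (score, correct) pairs, where correct is True if
    the position has the right head and label (hypothetical representation).
    """
    below = [ok for score, ok in positions if score < threshold]
    above = [ok for score, ok in positions if score >= threshold]
    errors_below = sum(1 for ok in below if not ok)
    errors_total = sum(1 for _, ok in positions if not ok)
    precision = errors_below / len(below) if below else 0.0
    recall = errors_below / errors_total if errors_total else 0.0
    return {
        "LAS_b": mean(below),
        "LAS_a": mean(above),
        "P": precision,
        "R": recall,
        "F1": f_beta(precision, recall, 1.0),
        "F0.5": f_beta(precision, recall, 0.5),
    }


# Invented scores: three positions fall below the threshold, two of them errors.
positions = [(0, False), (1, False), (2, True), (50, True), (60, True), (70, False)]
result = evaluate(positions, threshold=39)
print(result)
```

Note that, with these definitions, error detection precision is simply one minus the labeled attachment score of the positions below the threshold.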
Precision is likely more important in this context, so as to prevent annotators from sorting through too many false positives. In practice, one way to use these methods is to start with the lowest thresholds and work upwards until there are too many non-errors. To establish a basis for comparison, we compare method performance to a parser on its own (see footnote 4). By examining the parser output without any automatic assistance, how often does a correction need to be made?
5.1 The data
All our data comes from the CoNLL-X Shared Task (Buchholz and Marsi, 2006), specifically the 4 data sets freely available online. We use the Swedish Talbanken data (Nilsson and Hall, 2005) and the transition-based dependency parser MaltParser (Nivre et al., 2007), with the default settings, for developing the method. To test across languages and corpora, we use MaltParser on the other 3 corpora: the Danish DDT (Kromann, 2003), Dutch Alpino (van der Beek et al., 2002), and Portuguese Bosque data (Afonso et al., 2002). Then, we present results using the graph-based parser MSTParser (McDonald and Pereira, 2006), again with default settings, to test the methods across parsers. We use the gold standard POS tags for all experiments.
5.2 Development data
In the first line of table 1, we report the baseline MaltParser accuracies on the Swedish test data, including baseline error detection precision (= 1 − LASb), recall, and (the best) F-scores. In the rest of table 1, we report the best-performing results for each of the methods (see footnote 5), providing the number of rules below and above a particular threshold, along with corresponding UAS and LAS values. To get the raw number of identified rules, multiply the number of corpus positions below a threshold (b) by the error detection precision (P). For example, the bigram method with a threshold of 39 leads to finding 283 errors (455 × .622). Dependency elements with frequency below the lowest threshold have lower attachment scores (66.6% vs. 90.
1% LAS), showing that simply using a complete rule helps sort dependencies. However, frequency thresholds have fairly low precision, i.e., 33.4% at their best. The whole rule and bigram methods reveal greater precision in identifying problematic dependencies, isolating elements with lower UAS and LAS scores than with frequency, along with corresponding greater precision and F-scores. The bigram method is more fine-grained, identifying small numbers of rule elements at each threshold, resulting in high error detection precision. With a threshold of 39, for example, we find over a quarter of the parser errors with 62% precision, from this one piece of information. For POS information, we flag 23.6% of the cases with over 60% precision (at 81.6). Taking all these results together, we can begin to sort more reliable from less reliable dependency tree elements, using very simple information. Additionally, these methods naturally group cases together by linguistic properties (e.g., adverbial-verb dependencies within a particular context), allowing a human to uncover the principle behind parse failure and adjudicate similar cases at the same time (cf. Wallis, 2003).
Footnote 4: One may also use parser confidence or parser revision methods as a basis of comparison, but we are aware of no systematic evaluation of these approaches for detecting errors.
Footnote 5: Freq = rule frequency, WR = whole rule, Bi = bigram, POS = POS-based (POS scores multiplied by 10,000).
5.3 Discussion
Examining some of the output from the Talbanken test data by hand, we find that a prominent cause of false positives, i.e., correctly-parsed cases with low scores, stems from low-frequency dependency-POS label pairs. If the dependency rarely occurs in the training data with the particular POS, then it receives a low score, regardless of its context.
For example, the parsed rule TA → IG:IG RO has a correct dependency relation (IG) between the POS tags IG and its head RO, yet is assigned a whole rule score of 2 and a bigram score of 20. It turns out that IG:IG only occurs 144 times in the training data, and in 11 of those cases (7.6%) it appears immediately before RO. One might consider normalizing the scores based on overall frequency or adjusting the scores to account for other dependency rules in the sentence: in this case, there may be no better attachment. Other false positives are correctly-parsed elements that are a part of erroneous rules. For instance, in AA → UK:UK SS:PO TA:AJ AV SP:AJ OA:PR +F:HV +F:HV, the first +F:HV is correct, yet given a low score (0 whole rule, 1 bigram). The following and erroneous +F:HV is similarly given a low score. As above, such cases might be handled by looking for attachments in other rules (cf. Attardi and Ciaramita, 2007), but these cases should be relatively unproblematic for hand-correction, given the neighboring error. We also examined false negatives, i.e., errors with high scores. There are many examples of PR PA:NN rules, for instance, with the NN improperly attached, but there are also many correct instances of PR PA:NN. To sort out the errors, one needs to look at lexical knowledge and/or other dependencies in the tree. With so little context, frequent rules with only one dependent are not prime candidates for our methods of error detection.
5.4 Other corpora
We now turn to the parsed data from three other corpora. The Alpino and Bosque corpora are approximately the same size as Talbanken, so we use the same thresholds for them. The DDT data is approximately half the size; to adjust, we simply halve the scores. In tables 2, 3, and 4, we present the results, using the best F0.5 and F1 settings from development.
At a glance, we observe that the best method differs by corpus and by whether precision or recall is emphasized, with the bigram method generally having high precision. For Alpino, error detection is better with frequency than, for example, bigram scores. This is likely due to the fact that Alpino has the smallest label set of any of the corpora, with only 24 dependency labels and 12 POS tags (cf. 64 and 41 in Talbanken, respectively). With a smaller label set, there are fewer possible bigrams that could be anomalous, but more reliable statistics about a whole rule. Likewise, with fewer possible POS tag pairs, Alpino has lower precision for the low-threshold POS scores than the other corpora. For the whole rule scores, the DDT data is worse (compare its 46.1% precision with Bosque’s 45.6%, with vastly different recall values), which could be due to the smaller training data. One might also consider the qualitative differences in the dependency inventory of DDT compared to the others, e.g., appositions, distinctions in names, and more types of modifiers.
5.5 MSTParser
Turning to the results of running the methods on the output of MSTParser, we find similar but slightly worse values for the whole rule and bigram methods, as shown in tables 5-8. What is most striking are the differences in the POS-based method for Bosque and DDT (tables 7 and 8), where a large percentage of the test corpus is underneath the threshold. MSTParser is apparently positing fewer distinct head-dependent pairs, as most of them fall under the given thresholds. With the exception of the POS-based method for DDT (where LASb is actually higher than LASa), the different methods seem to be accurate enough to be used as part of corpus post-editing.
6 Summary and Outlook
We have proposed different methods for flagging the errors in automatically-parsed corpora, by treating the problem as one of looking for anomalous rules with respect to a treebank grammar.
The different methods incorporate differing types and amounts of information, notably comparisons among dependency rules and bigrams within such rules. Using these methods, we demonstrated success in sorting well-formed output from erroneous output across languages, corpora, and parsers. Given that the rule representations and comparison methods use both POS and dependency information, a next step in evaluating and improving the methods is to examine automatically POS-tagged data. Our methods should be able to find POS errors in addition to dependency errors. Furthermore, although we have indicated that differences in accuracy can be linked to differences in the granularity and particular distinctions of the annotation scheme, it is still an open question as to which methods work best for which schemes and for which constructions (e.g., coordination).
Acknowledgments
Thanks to Sandra Kübler and Amber Smith for comments on an earlier draft; Yvonne Samuelsson for help with the Swedish translations; the IU Computational Linguistics discussion group for feedback; and Julia Hockenmaier, Chris Brew, and Rebecca Hwa for discussion on the general topic.
A Some Talbanken05 categories
Dependencies
References
Afonso, Susana, Eckhard Bick, Renato Haber and Diana Santos (2002). Floresta Sintá(c)tica: a treebank for Portuguese. In Proceedings of LREC 2002. Las Palmas, pp. 1698–1703.
Attardi, Giuseppe and Massimiliano Ciaramita (2007). Tree Revision Learning for Dependency Parsing. In Proceedings of NAACL-HLT-07. Rochester, NY, pp. 388–395.
Bick, Eckhard (2007). Hybrid Ways to Improve Domain Independence in an ML Dependency Parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007. Prague, Czech Republic, pp. 1119–1123.
Boyd, Adriane, Markus Dickinson and Detmar Meurers (2008). On Detecting Errors in Dependency Treebanks. Research on Language and Computation 6(2), 113–137.
Buchholz, Sabine and Erwin Marsi (2006).
CoNLL-X Shared Task on Multilingual Dependency Parsing. In Proceedings of CoNLL-X. New York City, pp. 149–164.
Campbell, David and Stephen Johnson (2002). A transformational-based learner for dependency grammars in discharge summaries. In Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain. Philadelphia, pp. 37–44.
Dickinson, Markus (2008). Ad Hoc Treebank Structures. In Proceedings of ACL-08. Columbus, OH.
Dickinson, Markus and Jennifer Foster (2009). Similarity Rules! Exploring Methods for Ad-Hoc Rule Detection. In Proceedings of TLT-7. Groningen, The Netherlands.
Gildea, Daniel (2001). Corpus Variation and Parser Performance. In Proceedings of EMNLP-01. Pittsburgh, PA.
Hall, Keith and Václav Novák (2005). Corrective Modeling for Non-Projective Dependency Parsing. In Proceedings of IWPT-05. Vancouver, pp. 42–52.
Kromann, Matthias Trautner (2003). The Danish Dependency Treebank and the underlying linguistic theory. In Proceedings of TLT-03.
Kuhlmann, Marco and Giorgio Satta (2009). Treebank Grammar Techniques for Non-Projective Dependency Parsing. In Proceedings of EACL-09. Athens, Greece, pp. 478–486.
Loftsson, Hrafn (2009). Correcting a POS-Tagged Corpus Using Three Complementary Methods. In Proceedings of EACL-09. Athens, Greece, pp. 523–531.
McDonald, Ryan and Fernando Pereira (2006). Online learning of approximate dependency parsing algorithms. In Proceedings of EACL-06. Trento.
Nilsson, Jens and Johan Hall (2005). Reconstruction of the Swedish Treebank Talbanken. MSI report 05067, Växjö University: School of Mathematics and Systems Engineering.
Nivre, Joakim, Johan Hall, Jens Nilsson, Atanas Chanev, Gulsen Eryigit, Sandra Kübler, Svetoslav Marinov and Erwin Marsi (2007). MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering 13(2), 95–135.
Owczarzak, Karolina (2009). DEPEVAL(summ): Dependency-based Evaluation for Automatic Summaries.
In Proceedings of ACL-AFNLP-09. Suntec, Singapore, pp. 190–198.
Przepiórkowski, Adam (2006). What to acquire from corpora in automatic valence acquisition. In Violetta Koseska-Toszewa and Roman Roszko (eds.), Semantyka a konfrontacja jezykowa, tom 3, Warsaw: Slawistyczny Ośrodek Wydawniczy PAN, pp. 25–41.
Sekine, Satoshi (1997). The Domain Dependence of Parsing. In Proceedings of ANLP-96. Washington, DC.
van der Beek, Leonoor, Gosse Bouma, Robert Malouf and Gertjan van Noord (2002). The Alpino Dependency Treebank. In Proceedings of CLIN 2001. Rodopi.
van Noord, Gertjan and Gosse Bouma (2009). Parsed Corpora for Linguistics. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?. Athens, pp. 33–39.
Wallis, Sean (2003). Completing Parsed Corpora. In Anne Abeillé (ed.), Treebanks: Building and using syntactically annotated corpora, Dordrecht: Kluwer Academic Publishers, pp. 61–71.
Wan, Stephen, Mark Dras, Robert Dale and Cécile Paris (2009). Improving Grammaticality in Statistical Sentence Generation: Introducing a Dependency Spanning Tree Algorithm with an Argument Satisfaction Model. In Proceedings of EACL-09. Athens, Greece, pp. 852–860.
Xu, Peng, Jaeho Kang, Michael Ringgaard and Franz Och (2009). Using a Dependency Parser to Improve SMT for Subject-Object-Verb Languages. In Proceedings of NAACL-HLT-09. Boulder, Colorado, pp. 245–253.
5 0.1333901 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing
Author: Valentin I. Spitkovsky ; Daniel Jurafsky ; Hiyan Alshawi
Abstract: We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning’s Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the-art by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. Our model, trained on a single blog, generalizes to 53.3% accuracy out-of-domain, against the Brown corpus, nearly 10% higher than the previous published best. The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP.
6 0.12825641 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification
7 0.12226711 205 acl-2010-SVD and Clustering for Unsupervised POS Tagging
8 0.11536831 162 acl-2010-Learning Common Grammar from Multilingual Corpus
9 0.1094358 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization
10 0.10784394 197 acl-2010-Practical Very Large Scale CRFs
11 0.10736671 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
12 0.094945766 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery
13 0.088523887 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules
14 0.088441439 99 acl-2010-Efficient Third-Order Dependency Parsers
15 0.085351519 46 acl-2010-Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
16 0.08273191 20 acl-2010-A Transition-Based Parser for 2-Planar Dependency Structures
17 0.08058694 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features
18 0.080210008 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names
19 0.079855546 130 acl-2010-Hard Constraints for Grammatical Function Labelling
20 0.07781931 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar
topicId topicWeight
[(0, -0.217), (1, -0.035), (2, 0.057), (3, -0.008), (4, -0.04), (5, -0.083), (6, 0.13), (7, -0.016), (8, 0.167), (9, 0.091), (10, -0.135), (11, 0.069), (12, 0.072), (13, -0.042), (14, 0.086), (15, -0.176), (16, -0.081), (17, 0.022), (18, 0.083), (19, -0.072), (20, -0.068), (21, 0.067), (22, 0.008), (23, -0.015), (24, -0.008), (25, 0.026), (26, 0.079), (27, -0.014), (28, 0.036), (29, 0.007), (30, 0.009), (31, -0.017), (32, -0.084), (33, -0.03), (34, 0.144), (35, -0.131), (36, 0.092), (37, -0.039), (38, -0.007), (39, -0.136), (40, 0.002), (41, -0.11), (42, -0.003), (43, 0.173), (44, -0.108), (45, -0.071), (46, -0.008), (47, -0.128), (48, 0.035), (49, -0.078)]
simIndex simValue paperId paperTitle
same-paper 1 0.958942 214 acl-2010-Sparsity in Dependency Grammar Induction
Author: Jennifer Gillenwater ; Kuzman Ganchev ; Joao Graca ; Fernando Pereira ; Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
2 0.77240604 195 acl-2010-Phylogenetic Grammar Induction
Author: Taylor Berg-Kirkpatrick ; Dan Klein
Abstract: We present an approach to multilingual grammar induction that exploits a phylogeny-structured model of parameter drift. Our method does not require any translated texts or token-level alignments. Instead, the phylogenetic prior couples languages at a parameter level. Joint induction in the multilingual model substantially outperforms independent learning, with larger gains from both more articulated phylogenies and increasing numbers of languages. Across eight languages, the multilingual approach gives error reductions over the standard monolingual DMV averaging 21.1% and reaching as high as 39%.
3 0.70640075 96 acl-2010-Efficient Optimization of an MDL-Inspired Objective Function for Unsupervised Part-Of-Speech Tagging
Author: Ashish Vaswani ; Adam Pauls ; David Chiang
Abstract: The Minimum Description Length (MDL) principle is a method for model selection that trades off between the explanation of the data by the model and the complexity of the model itself. Inspired by the MDL principle, we develop an objective function for generative models that captures the description of the data by the model (log-likelihood) and the description of the model (model size). We also develop an efficient general search algorithm based on the MAP-EM framework to optimize this function. Since recent work has shown that minimizing the model size in a Hidden Markov Model for part-of-speech (POS) tagging leads to higher accuracies, we test our approach by applying it to this problem. The search algorithm involves a simple change to EM and achieves high POS tagging accuracies on both English and Italian data sets.
4 0.64116722 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization
Author: Shay Cohen ; Noah A Smith
Abstract: We consider the search for a maximum likelihood assignment of hidden derivations and grammar weights for a probabilistic context-free grammar, the problem approximately solved by “Viterbi training.” We show that solving and even approximating Viterbi training for PCFGs is NP-hard. We motivate the use of uniform-at-random initialization for Viterbi EM as an optimal initializer in absence of further information about the correct model parameters, providing an approximate bound on the log-likelihood.
5 0.57099277 162 acl-2010-Learning Common Grammar from Multilingual Corpus
Author: Tomoharu Iwata ; Daichi Mochihashi ; Hiroshi Sawada
Abstract: We propose a corpus-based probabilistic framework to extract hidden common syntax across languages from non-parallel multilingual corpora in an unsupervised fashion. For this purpose, we assume a generative model for multilingual corpora, where each sentence is generated from a language dependent probabilistic context-free grammar (PCFG), and these PCFGs are generated from a prior grammar that is common across languages. We also develop a variational method for efficient inference. Experiments on a non-parallel multilingual corpus of eleven languages demonstrate the feasibility of the proposed method.
6 0.55181098 205 acl-2010-SVD and Clustering for Unsupervised POS Tagging
7 0.55121082 84 acl-2010-Detecting Errors in Automatically-Parsed Dependency Relations
8 0.49428236 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
9 0.4920156 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery
10 0.48218653 143 acl-2010-Importance of Linguistic Constraints in Statistical Dependency Parsing
11 0.47448304 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing
12 0.46191162 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification
13 0.44978231 197 acl-2010-Practical Very Large Scale CRFs
14 0.43390542 12 acl-2010-A Probabilistic Generative Model for an Intermediate Constituency-Dependency Representation
15 0.41811532 116 acl-2010-Finding Cognate Groups Using Phylogenies
16 0.41767976 130 acl-2010-Hard Constraints for Grammatical Function Labelling
17 0.39520609 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities
18 0.39508262 99 acl-2010-Efficient Third-Order Dependency Parsers
20 0.37819108 98 acl-2010-Efficient Staggered Decoding for Sequence Labeling
topicId topicWeight
[(14, 0.088), (18, 0.025), (25, 0.051), (26, 0.018), (39, 0.017), (42, 0.242), (59, 0.103), (69, 0.012), (71, 0.021), (73, 0.04), (76, 0.016), (78, 0.026), (83, 0.089), (84, 0.026), (98, 0.13)]
simIndex simValue paperId paperTitle
1 0.95122665 178 acl-2010-Non-Cooperation in Dialogue
Author: Brian Pluss
Abstract: This paper presents ongoing research on computational models for non-cooperative dialogue. We start by analysing different levels of cooperation in conversation. Then, inspired by findings from an empirical study, we propose a technique for measuring non-cooperation in political interviews. Finally, we describe a research programme towards obtaining a suitable model and discuss previous accounts for conflictive dialogue, identifying the differences with our work.
2 0.93784446 149 acl-2010-Incorporating Extra-Linguistic Information into Reference Resolution in Collaborative Task Dialogue
Author: Ryu Iida ; Syumpei Kobayashi ; Takenobu Tokunaga
Abstract: This paper proposes an approach to reference resolution in situated dialogues by exploiting extra-linguistic information. Recently, investigations of referential behaviours involved in situations in the real world have received increasing attention by researchers (Di Eugenio et al., 2000; Byron, 2005; van Deemter, 2007; Spanger et al., 2009). In order to create an accurate reference resolution model, we need to handle extra-linguistic information as well as textual information examined by existing approaches (Soon et al., 2001; Ng and Cardie, 2002, etc.). In this paper, we incorporate extra-linguistic information into an existing corpus-based reference resolution model, and investigate its effects on reference resolution problems within a corpus of Japanese dialogues. The results demonstrate that our proposed model achieves an accuracy of 79.0% for this task.
3 0.91629452 123 acl-2010-Generating Focused Topic-Specific Sentiment Lexicons
Author: Valentin Jijkoun ; Maarten de Rijke ; Wouter Weerkamp
Abstract: We present a method for automatically generating focused and accurate topic-specific subjectivity lexicons from a general purpose polarity lexicon that allow users to pin-point subjective on-topic information in a set of relevant documents. We motivate the need for such lexicons in the field of media analysis, describe a bootstrapping method for generating a topic-specific lexicon from a general purpose polarity lexicon, and evaluate the quality of the generated lexicons both manually and using a TREC Blog track test set for opinionated blog post retrieval. Although the generated lexicons can be an order of magnitude more selective than the general purpose lexicon, they maintain, or even improve, the performance of an opinion retrieval system.
4 0.87881649 150 acl-2010-Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing
Author: Ruihong Huang ; Ellen Riloff
Abstract: This research explores the idea of inducing domain-specific semantic class taggers using only a domain-specific text collection and seed words. The learning process begins by inducing a classifier that only has access to contextual features, forcing it to generalize beyond the seeds. The contextual classifier then labels new instances, to expand and diversify the training set. Next, a cross-category bootstrapping process simultaneously trains a suite of classifiers for multiple semantic classes. The positive instances for one class are used as negative instances for the others in an iterative bootstrapping cycle. We also explore a one-semantic-class-per-discourse heuristic, and use the classifiers to dynamically create semantic features. We evaluate our approach by inducing six semantic taggers from a collection of veterinary medicine message board posts.
same-paper 5 0.869313 214 acl-2010-Sparsity in Dependency Grammar Induction
Author: Jennifer Gillenwater ; Kuzman Ganchev ; Joao Graca ; Fernando Pereira ; Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
6 0.80081749 231 acl-2010-The Prevalence of Descriptive Referring Expressions in News and Narrative
7 0.79933906 22 acl-2010-A Unified Graph Model for Sentence-Based Opinion Retrieval
8 0.79376471 251 acl-2010-Using Anaphora Resolution to Improve Opinion Target Identification in Movie Reviews
9 0.77239686 208 acl-2010-Sentence and Expression Level Annotation of Opinions in User-Generated Discourse
10 0.76063037 167 acl-2010-Learning to Adapt to Unknown Users: Referring Expression Generation in Spoken Dialogue Systems
11 0.75152218 134 acl-2010-Hierarchical Sequential Learning for Extracting Opinions and Their Attributes
12 0.71466541 42 acl-2010-Automatically Generating Annotator Rationales to Improve Sentiment Classification
13 0.70781362 142 acl-2010-Importance-Driven Turn-Bidding for Spoken Dialogue Systems
14 0.70650822 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
15 0.69246566 62 acl-2010-Combining Orthogonal Monolingual and Multilingual Sources of Evidence for All Words WSD
16 0.69081855 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures
17 0.68903804 35 acl-2010-Automated Planning for Situated Natural Language Generation
18 0.68604469 194 acl-2010-Phrase-Based Statistical Language Generation Using Graphical Models and Active Learning
19 0.68460548 219 acl-2010-Supervised Noun Phrase Coreference Research: The First Fifteen Years
20 0.68448567 112 acl-2010-Extracting Social Networks from Literary Fiction