acl acl2013 acl2013-348 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shay B. Cohen ; Mark Johnson
Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i. [sent-4, score-0.459]
2 , the sum of the “probabilities” of the trees the grammar generates can be less than one). [sent-6, score-0.313]
3 We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. [sent-8, score-0.489]
4 We conclude with a discussion of the impact of tightness empirically. [sent-11, score-0.252]
5 Cognitively it is implausible that children can perceive the parse trees of the language they are learning, but it is more reasonable to assume that they can obtain the terminal strings or yield of these trees. [sent-16, score-0.342]
6 Unsupervised methods for learning a grammar from terminal strings alone are also interesting from an engineering perspective because such training data is cheap and [sent-17, score-0.276]
7 Cohen and Smith (2012) show that inferring PCFG rule probabilities from strings alone is computationally intractable, so we should not expect to find an efficient, general-purpose algorithm for the unsupervised problem. [sent-21, score-0.252]
8 Bayesian estimators for PCFG rule probabilities have also been attracting attention because they provide a theoretically-principled way of incorporating prior information. [sent-24, score-0.337]
9 (2007) proposed MCMC samplers for the posterior distribution over rule probabilities and the parse trees of the training data strings. [sent-26, score-0.705]
10 An obvious question is then: how does tightness affect the inference of PCFGs? [sent-33, score-0.296]
11 estimates are always tight for both the supervised case (where the input consists of parse trees) and the unsupervised case (where the input consists of yields or terminal strings). [sent-36, score-0.476]
12 We show that for the special case of linear PCFGs (which include HMMs) with nondegenerate priors the posterior puts zero mass on non-tight PCFGs, so tightness is not an issue with Bayesian estimation of such grammars. [sent-39, score-0.685]
13 However, because all of the commonly used priors (such as the Dirichlet or the logistic normal) assign non-zero probability across the whole probability simplex, in general the posterior may assign non-zero probability to non-tight PCFGs. [sent-40, score-0.459]
14 the only-tight approach, where we modify the prior so it only assigns non-zero probability to tight PCFGs, 2. [sent-42, score-0.544]
15 the renormalization approach, where we renormalize non-tight PCFGs so they define a probability distribution over trees, and 3. [sent-43, score-0.4]
16 the sink-element approach, where we reinterpret non-tight PCFGs as assigning non-zero probability to a “sink element”, so both tight and non-tight PCFGs are properly normalized. [sent-44, score-0.409]
17 (2007) so it produces samples from the posterior distributions defined by the only-tight and renormalization approaches. [sent-46, score-0.544]
18 actually produces samples from the posterior distributions defined by the sink-element approach. [sent-48, score-0.292]
19 We conclude by studying the effect of requiring tightness on the estimation of some simple PCFGs. [sent-49, score-0.295]
20 Because the Bayesian posterior converges around the (tight) ML estimate as the size of the data grows, requiring tightness only seems to make a difference with highly biased priors or with very small training corpora. [sent-50, score-0.476]
21 A Probabilistic Context-Free Grammar (G, Θ) is a pair consisting of a context-free grammar G and a real-valued vector Θ of length |R| indexed by productions, where θA→β is the production probability associated with the production A → β ∈ R. [sent-53, score-0.368]
22 The partition function, or measure of all possible trees, is Z(Θ) = Σ_{t ∈ T} Π_{r ∈ R} θ_r^{f_r(t)}, where T is the set of all (finite) trees generated by G. [sent-57, score-0.404]
23 A PCFG is tight iff the partition function Z(Θ) = 1. [sent-58, score-0.445]
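Z(Θ) generally has no closed form, but the termination probability of each nonterminal, and hence the partition function for the start symbol, can be approximated by fixed-point iteration. A minimal sketch, not from the paper, using an illustrative (lhs, rhs, prob) rule representation:

```python
# Illustrative sketch (not from the paper): approximate Z(Theta) by iterating
# the termination-probability equations for each nonterminal.  Rules are
# (lhs, rhs, prob) triples; nonterminals are the symbols appearing as a lhs.

def partition_function(rules, start="S", iters=1000):
    nonterminals = {lhs for lhs, _, _ in rules}
    z = {A: 0.0 for A in nonterminals}        # current estimate of P(A derives a finite tree)
    for _ in range(iters):
        new_z = {A: 0.0 for A in nonterminals}
        for lhs, rhs, p in rules:
            prod = p
            for sym in rhs:                   # terminal symbols contribute a factor of 1
                if sym in nonterminals:
                    prod *= z[sym]
            new_z[lhs] += prod
        z = new_z
    return z[start]

# Example: S -> S S (0.7) | a (0.3) converges to roughly 0.43, so this PCFG is not tight.
rules = [("S", ("S", "S"), 0.7), ("S", ("a",), 0.3)]
print(partition_function(rules))
```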
24 3 Bayesian inference for PCFGs The goal of Bayesian inference for PCFGs is to infer a posterior distribution over the rule probability vectors Θ given observed data D. [sent-70, score-0.437]
25 This posterior distribution is obtained by combining the likelihood P(D | Θ) with a prior distribution P(Θ) over Θ using Bayes' rule. [sent-71, score-0.415]
26 That is, we insist that the prior assign zero mass to non-tight rule probability vectors, so Z = 1. [sent-74, score-0.393]
27 µ_Θ(t) if t ∈ T, and 1 − Z(Θ) if t = ⊥. This definition of a distribution over trees can be induced by a tight PCFG with a special ⊥ symbol in its vocabulary. [sent-77, score-0.575]
28 Given G, the first step is to create a tight grammar G0. [sent-78, score-0.285]
29 In the supervised setting the data D consists of a corpus of parse trees D = (t1, . [sent-87, score-0.233]
30 , tn) where each tree ti is generated by the PCFG G, so P(D | Θ) = Π_{i=1}^{n} P(ti | Θ). In the unsupervised setting the data D consists of a corpus of strings D = (w1, . [sent-90, score-0.244]
31 In this setting P(D | Θ) = Π_{i=1}^{n} P(wi | Θ), where P(w | Θ) = Σ_{t ∈ T : yield(t)=w} P(t | Θ). 4 The special case of linear PCFGs One way to handle the issue of tightness is to identify a family of CFGs for which practically any parameter setting will yield a tight PCFG. [sent-94, score-0.801]
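Computing P(w | Θ) by enumerating T is not feasible, but for a grammar in Chomsky normal form the sum over parses whose yield is w is the standard inside (CKY-style) dynamic program. A minimal sketch, assuming hypothetical unary and binary rule tables rather than anything from the paper:

```python
from collections import defaultdict

def inside_prob(words, unary, binary, start="S"):
    """P(w | Theta) for a CNF PCFG: unary maps (A, terminal) -> prob,
    binary maps (A, B, C) -> prob for a rule A -> B C."""
    n = len(words)
    inside = defaultdict(float)              # inside[(i, j, A)] = P(A =>* words[i:j])
    for i, w in enumerate(words):
        for (A, term), p in unary.items():
            if term == w:
                inside[(i, i + 1, A)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    inside[(i, j, A)] += p * inside[(i, k, B)] * inside[(k, j, C)]
    return inside[(0, n, start)]
```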
32 We cannot expect that a CFG will yield a tight PCFG for any assignment to the rule probabilities (i. [sent-97, score-0.528]
33 To put it more formally, we say that a prior P(Θ) is “tight almost everywhere for G” if P(Θ⊥) = ∫_{Θ⊥} P(Θ) dΘ = 0. [sent-105, score-0.244]
34 We now provide a sufficient condition (linearity) for CFGs under which they are tight almost everywhere with any continuous prior. [sent-106, score-0.487]
35 Definition 2 A nonterminal A ∈ N in a probabilistic context-free grammar G with parameters Θ is nonterminating if PG(A ⇒+ x) ... Here P(A ⇒+ x) . [sent-119, score-0.384]
36 M_{AB} is the expected number of B nonterminals generated from an A nonterminal in a single derivational step, so [M^k]_{AB} is the expected number of B nonterminals generated from an A nonterminal in a k-step derivation (Wetherell, 1980). [sent-141, score-0.359]
37 Then, in turn, a PCFG is tight if it is subcritical. [sent-144, score-0.341]
38 The case of λ = 1 is a borderline case that does not give sufficient information to know whether the PCFG is tight or not. [sent-146, score-0.372]
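The subcriticality test can be checked directly: build the mean matrix M from the rule probabilities and test whether its largest eigenvalue magnitude (the spectral radius, the λ referred to here) is below one. A sketch in the same illustrative rule representation as above:

```python
import numpy as np

def is_subcritical(rules, nonterminals):
    """M[A, B] = expected number of B's produced from A in one rewrite step;
    the PCFG is subcritical (hence tight) if the spectral radius of M is < 1."""
    idx = {A: i for i, A in enumerate(nonterminals)}
    M = np.zeros((len(nonterminals), len(nonterminals)))
    for lhs, rhs, p in rules:
        for sym in rhs:
            if sym in idx:
                M[idx[lhs], idx[sym]] += p
    spectral_radius = max(abs(np.linalg.eigvals(M)))
    return spectral_radius < 1.0

rules = [("S", ("S", "S"), 0.4), ("S", ("a",), 0.6)]
print(is_subcritical(rules, ["S"]))   # True: expected offspring 0.8 < 1
```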
39 In any case, for a continuous prior such as the Dirichlet prior, this borderline case will have measure zero under the prior. [sent-148, score-0.234]
40 Because the grammar has no useless nonterminals, the probability of such a derivation is strictly smaller than 1. [sent-168, score-0.323]
41 Proposition 1Any continuous prior P(Θ) on a linear grammar G is tight almost everywhere for G. [sent-189, score-0.824]
42 With a continuous prior, the probability that G gets parameters from the prior which yield a useless non-terminal is 0: it would require setting at least one rule probability in the grammar to exactly 1. [sent-191, score-0.854]
43 Therefore, with probability 1, the parameters taken from the prior yield a PCFG which is linear and does not have nonterminating nonterminals. [sent-192, score-0.4]
44 We can then intersect the grammar GA with the regular language T∗cAT∗cAT∗ (for each nonterminal A ∈ N). [sent-199, score-0.269]
45 We conclude this section by remarking that many of the models used in computational linguistics are in fact equivalent to linear PCFGs, so continuous Bayesian priors are almost everywhere tight. [sent-203, score-0.269]
46 5 Dirichlet priors The first step in Bayesian inference is to specify a prior on Θ. [sent-206, score-0.263]
47 The prior is parameterized by a positive real valued vector α indexed by productions R, so each production probability θA→β has a corresponding Dirichlet parameter αA→β. [sent-208, score-0.367]
48 Ignoring issues of tightness for the moment and setting P(t | Θ) = µΘ(t), this means that in the supervised setting the posterior distribution P(Θ | t, α) given a set of parse trees t = (t1, . [sent-212, score-0.754]
49 = Π_{r ∈ R} θ_r^{f_r(t)+α_r−1}, which is a product of Dirichlet distributions with parameters f(t) + α, where f(t) is the vector of rule counts in t indexed by r ∈ R. [sent-218, score-0.361]
50 We can thus write: P(Θ | t, α) = P(Θ | f(t) + α). Input: Grammar G, vector of trees t, vector of hyperparameters α, previous parameters Θ0. [sent-219, score-0.341]
51 Result: a vector of parameters Θ. repeat: draw Θ from a product of Dirichlets with hyperparameters α + f(t), until Θ is tight for G; return Θ. Algorithm 1: an algorithm for generating samples from P(Θ | t, α) for the only-tight approach. [sent-220, score-0.603]
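A minimal Python sketch of Algorithm 1. It assumes rules are grouped by left-hand side, that alpha and counts are numpy arrays aligned with each nonterminal's rule list, and that is_tight is some tightness test such as the subcriticality check sketched earlier; none of these helpers come from the paper.

```python
import numpy as np

def sample_only_tight(rules_by_lhs, alpha, counts, is_tight, rng=None):
    """Algorithm 1 (only-tight approach): rejection-sample rule probabilities from
    a product of Dirichlets with hyperparameters alpha + f(t) until the draw is tight."""
    rng = rng or np.random.default_rng()
    while True:
        theta = {A: rng.dirichlet(alpha[A] + counts[A]) for A in rules_by_lhs}
        if is_tight(rules_by_lhs, theta):
            return theta
```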
52 hyperparameters α, previous rule parameters Θ0. Result: a vector of parameters Θ. Draw a proposal Θ∗ from a product of Dirichlets with parameters α + f(t). [sent-222, score-0.417]
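Algorithm 2 is a Metropolis-Hastings step. With the conjugate product of Dirichlets as the proposal, the acceptance ratio for the renormalized likelihood should reduce to a ratio of partition functions raised to the number of observed trees; the sketch below assumes that simplification and a partition-function routine Z, so it is an illustration rather than the paper's exact algorithm.

```python
import numpy as np

def mh_step_renormalized(rules_by_lhs, alpha, counts, n_trees, theta_old, Z, rng=None):
    """One Metropolis-Hastings step for the renormalization approach (cf. Algorithm 2).
    The proposal is the conjugate product of Dirichlets with parameters alpha + f(t);
    under that proposal the acceptance ratio is assumed here to reduce to
    (Z(theta_old) / Z(theta_new)) ** n_trees, with Z the partition function."""
    rng = rng or np.random.default_rng()
    theta_new = {A: rng.dirichlet(alpha[A] + counts[A]) for A in rules_by_lhs}
    accept = min(1.0, (Z(theta_old) / Z(theta_new)) ** n_trees)
    return theta_new if rng.random() < accept else theta_old
```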
53 P(Θ | t, α) = P(Θ | f(t) + α), which makes it clear that the rule counts are directly added to the parameters of the prior to produce the parameters of the posterior. [sent-232, score-0.343]
54 6 Inference in the supervised setting We first discuss Bayesian inference in the supervised setting, since inference in the unsupervised setting builds on it. [sent-233, score-0.273]
55 Our MCMC algorithms for the unsupervised setting build on these samplers for the supervised setting. [sent-238, score-0.28]
56 1 The only-tight approach The “only-tight” approach requires that the prior assign zero mass to non-tight rule probability vectors Θ⊥. [sent-240, score-0.393]
57 One way to define such a distribution is to restrict the domain of an existing prior distribution to the set of tight Θ and renormalize. [sent-241, score-0.574]
58 In more detail, if P(Θ) is a prior over rule probabilities, then its renormalization is the prior P0 defined as P0(Θ) = P(Θ) I(Θ ∉ Θ⊥) / Z. [sent-242, score-0.614]
59 Input: Grammar G, vector of trees t, vector of hyperparameters α, previous parameters Θ0. [sent-244, score-0.341]
60 Result: a vector of parameters Θ. Draw Θ from a product of Dirichlets with hyperparameters α + f(t); return Θ. Algorithm 3: an algorithm for generating samples from P(Θ | t, α) for the sink-state approach. [sent-245, score-0.262]
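Algorithm 3 needs neither a rejection loop nor an acceptance test; in the same illustrative representation it is a single conjugate draw.

```python
import numpy as np

def sample_sink_element(rules_by_lhs, alpha, counts, rng=None):
    """Algorithm 3 (sink-element approach): one draw from the product of Dirichlets
    with hyperparameters alpha + f(t); no tightness check is performed."""
    rng = rng or np.random.default_rng()
    return {A: rng.dirichlet(alpha[A] + counts[A]) for A in rules_by_lhs}
```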
61 Proposition 2 Let P(Θ | α) be a prior with hyperparameters α over the parameters of G such that P is conjugate to the grammar likelihood. [sent-247, score-0.497]
62 Then the renormalized prior P0, defined in Eq. 2, is conjugate to the grammar likelihood as well. [sent-249, score-0.275]
63 Proof: Assume that trees t are observed, and the prior over the grammar parameters is the prior defined in Eq. [sent-250, score-0.641]
64 Sampling from the posterior over the parameters given a set of trees t is therefore quite simple, assuming the base prior being renormalized is a product of Dirichlets. [sent-257, score-0.579]
65 Algorithm 1 repeatedly samples from a product-of-Dirichlets distribution with hyperparameters α + f(t), each time checking the sample and rejecting it until we obtain a tight PCFG. [sent-258, score-0.519]
66 The more mass the Dirichlet distribution with hyperparameters α + f(t) puts on non-tight PCFGs, the more rejections will happen. [sent-259, score-0.254]
67 In general, if the probability mass on non-tight PCFGs is q⊥, then on average 1/(1 − q⊥) samples from this distribution are required in order to obtain a tight PCFG. [sent-260, score-0.546]
68 2 The renormalization approach The renormalization approach modifies the likelihood function instead of the prior. [sent-262, score-0.546]
69 Here we use a product of Dirichlets prior P(Θ | α) on rule probability vectors Θ, but the presence of the partition function Z(Θ) in Eq. [sent-263, score-0.457]
70 Algorithm 2 describes a Metropolis-Hastings sampler for sampling from the posterior in Eq. [sent-267, score-0.24]
71 This means that the conjugacy argument given at the bottom of section 5 holds in this approach, so the posterior P(Θ | t, α) is a product of Dirichlets with parameters f(t) + α. [sent-274, score-0.306]
72 Algorithm 3 gives a sampler for P(Θ | t, α) for the sink-element approach. [sent-275, score-0.252]
73 (2007) provide two Markov chain Monte Carlo algorithms for Bayesian inference for PCFG rule probabilities in the unsupervised setting (i. [sent-277, score-0.282]
74 The algorithms we give here are based on their Gibbs sampler, which in each iteration first samples parse trees t = (t1, . [sent-283, score-0.256]
75 that the conditional distribution P(t | w, Θ) is unaffected in each of our three approaches (the partition functions cancel in the ... Input: Grammar G, vector of hyperparameters α, vector of strings w = (w1, . [sent-288, score-0.38]
76 Result: a vector of parameters Θ. for i ← 1 to n do: draw ti from P(ti | wi, Θ0); end; use Algorithm 2 to sample Θ given G, t, α and Θ0; return Θ. Algorithm 4: One step of the Metropolis-within-Gibbs sampler for the renormalization approach. [sent-292, score-0.441]
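A sketch of one sweep of Algorithm 4. It assumes a sample_tree(w, theta) routine that draws a parse from P(t | w, Θ), for example via inside probabilities as in Johnson et al. (2007), and a rule_counts helper that computes f(t); it reuses the mh_step_renormalized sketch given earlier, and all of these helpers are assumptions of the sketch.

```python
def gibbs_step_renormalized(rules_by_lhs, alpha, strings, theta0,
                            sample_tree, rule_counts, Z, rng=None):
    """One step of the Metropolis-within-Gibbs sampler (cf. Algorithm 4):
    resample a parse for every string given theta0, then resample theta
    with the Metropolis-Hastings step of Algorithm 2."""
    trees = [sample_tree(w, theta0) for w in strings]    # t_i ~ P(t_i | w_i, theta0)
    counts = rule_counts(trees)                          # f(t)
    return mh_step_renormalized(rules_by_lhs, alpha, counts,
                                len(trees), theta0, Z, rng)
```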
77 renormalization approach), so the algorithm for sampling from P(t | w, Θ) given by Johnson et al. [sent-293, score-0.252]
78 ignored tightness and assumed that P(Θ | t, α) is a product of Dirichlets with parameters f(t) + α. [sent-296, score-0.368]
79 In fact, we obtain samplers for the unsupervised setting for each of our approaches by “plugging in” the corresponding sampling algorithm (Eq. [sent-299, score-0.313]
80 The one complication is that because we use a Metropolis-Hastings procedure to generate samples from P(Θ | t, α) in the renormalization approach, we use the Metropolis-within-Gibbs procedure given in Algorithm 4 (Robert and Casella, 2004). [sent-301, score-0.322]
81 Clearly the three approaches differ in terms of the grammars they admit (the only-tight approach requires the prior to only assign non-zero probability to tight PCFGs, while the other two approaches permit the prior to assign non-zero probability to non-tight PCFGs as well). [sent-303, score-0.849]
82 In this section we provide a grammar such that with a uniform prior over rule probabilities, the conditional distribution over trees given a fixed string varies under each of the three different approaches. [sent-307, score-0.589]
83 The grammar we consider has three rules S → S S S | S S | a with probabilities θ1, θ2 and 1 − θ1 − θ2, respectively. [sent-308, score-0.248]
84 Then it is easy to show that P(ti | w) = q(ti) / Σ_{i'=1}^{3} q(ti'), where the q used is based on the approach to tightness desired. [sent-315, score-0.252]
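For a fixed (θ1, θ2) the unnormalized scores of the three parses of "a a a" are easy to compute, and the resulting conditional over trees is the same under all three approaches because the partition functions cancel; the differences reported in the paper arise only once one integrates over the uniform prior, which the sketch below does not attempt.

```python
# Illustrative sketch: unnormalized scores mu_Theta(t) for the three parses of
# "a a a" under the grammar S -> S S S | S S | a with probabilities theta1,
# theta2 and 1 - theta1 - theta2, for one fixed parameter setting.

def tree_scores(theta1, theta2):
    theta3 = 1.0 - theta1 - theta2              # probability of S -> a
    flat = theta1 * theta3 ** 3                 # S -> S S S, each child rewrites to a
    left = theta2 ** 2 * theta3 ** 3            # left-branching use of S -> S S twice
    right = theta2 ** 2 * theta3 ** 3           # right-branching use of S -> S S twice
    return [flat, left, right]

def conditional_over_trees(scores):
    total = sum(scores)
    return [s / total for s in scores]

print(conditional_over_trees(tree_scores(0.3, 0.4)))    # about [0.484, 0.258, 0.258]
```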
85 We confirmed these results to three decimal places using the samplers described above (which required 10^7 samples per approach). [sent-336, score-0.256]
86 While the differences between these conditional probabilities are not great, the conditional probabilities are clearly different, so the three approaches do in fact define different distributions over trees under a uniform prior on rule probabilities. [sent-337, score-0.596]
87 9 Empirical effects of the three approaches in unsupervised grammar induction In this section we present experiments using the three samplers just described in an unsupervised grammar induction problem. [sent-338, score-0.709]
88 Our goal here is not to improve the state-of-the-art in unsupervised grammar induction, but to try to measure empirical differences in the estimates produced by the three different approaches to tightness just described. [sent-339, score-0.495]
89 (s.d., 142) rejected proposals in 1,000 samples, while the renormalization approach results in an average of 232 (s. [sent-355, score-0.252]
90 (It’s not surprising that the only-tight approach results in more rejections as it keeps proposing new Θ until a tight proposal is found, while the renormalization approach simply uses the old Θ). [sent-358, score-0.661]
91 Thus our experiments provided no evidence that the samplers produced different distributions over trees, although it’s reasonable to expect that these distributions do indeed differ. [sent-362, score-0.35]
92 10 Conclusion In this paper we characterized the notion of an almost everywhere tight grammar in the Bayesian setting and showed it holds for linear CFGs. [sent-364, score-0.699]
93 The “only-tight” approach restricts attention to tight PCFGs, and perhaps surprisingly, we showed that conjugacy still obtains when the domain of a product of Dirichlets prior is restricted to the subset of tight grammars. [sent-366, score-0.972]
94 The renormalization approach involves renormalizing the PCFG measure over trees when the grammar is non-tight, which destroys conjugacy with a product of Dirichlets prior. [sent-367, score-0.673]
95 Perhaps most surprisingly of all, the sink-element approach, which assigns the missing mass in non-tight PCFGs to a sink element ⊥, turns out to be equivalent to existing practice, where tightness is ignored. [sent-368, score-0.5]
96 We studied the posterior distributions over trees induced by the three approaches under a uniform prior for a simple grammar and showed that they differ. [sent-369, score-0.703]
97 We leave for future work the important question of whether the classes of distributions over distributions over trees that the three approaches define are the same or different. [sent-370, score-0.314]
98 We described samplers for the supervised and unsupervised settings for each of these approaches, and applied them to an unsupervised grammar induction problem. [sent-371, score-0.478]
99 We could not detect any difference in the posterior distributions over trees produced by these samplers, despite devoting considerable computational resources to the problem. [sent-377, score-0.372]
100 This suggests that for these kinds of problems at least, tightness is not of practical concern for Bayesian inference of PCFGs. [sent-378, score-0.296]
wordName wordTfidf (topN-words)
[('pcfgs', 0.384), ('tight', 0.341), ('pcfg', 0.311), ('renormalization', 0.252), ('tightness', 0.252), ('samplers', 0.186), ('grammar', 0.163), ('trees', 0.15), ('posterior', 0.14), ('prior', 0.135), ('dirichlets', 0.118), ('cfgs', 0.117), ('everywhere', 0.109), ('sink', 0.109), ('nonterminal', 0.106), ('partition', 0.104), ('sampler', 0.1), ('bayesian', 0.096), ('rule', 0.092), ('ti', 0.089), ('johnson', 0.087), ('priors', 0.084), ('distributions', 0.082), ('mathematica', 0.076), ('hyperparameters', 0.071), ('samples', 0.07), ('conjugate', 0.07), ('dirichlet', 0.07), ('probability', 0.068), ('mass', 0.067), ('strings', 0.061), ('productions', 0.06), ('product', 0.058), ('nederhof', 0.058), ('estimators', 0.058), ('parameters', 0.058), ('nonterminating', 0.057), ('rfr', 0.057), ('mcmc', 0.053), ('nonterminals', 0.053), ('terminal', 0.052), ('probabilities', 0.052), ('satta', 0.051), ('useless', 0.051), ('conjugacy', 0.05), ('eigenvalue', 0.05), ('distribution', 0.049), ('perhaps', 0.047), ('setting', 0.047), ('unsupervised', 0.047), ('inference', 0.044), ('mk', 0.044), ('gibbs', 0.044), ('yield', 0.043), ('element', 0.043), ('estimation', 0.043), ('cohen', 0.042), ('likelihood', 0.042), ('chi', 0.042), ('derivation', 0.041), ('indexed', 0.04), ('ss', 0.039), ('proof', 0.039), ('linear', 0.039), ('atherya', 0.038), ('kurihara', 0.038), ('mab', 0.038), ('nontightness', 0.038), ('ohn', 0.038), ('rejections', 0.038), ('renormalized', 0.038), ('sweeps', 0.038), ('wati', 0.038), ('continuous', 0.037), ('ml', 0.037), ('grammars', 0.036), ('parse', 0.036), ('symbol', 0.035), ('iy', 0.035), ('induction', 0.035), ('lari', 0.034), ('production', 0.033), ('approaches', 0.033), ('ry', 0.033), ('aen', 0.033), ('family', 0.032), ('draw', 0.032), ('borderline', 0.031), ('zn', 0.031), ('geman', 0.031), ('renormalize', 0.031), ('zero', 0.031), ('vector', 0.031), ('proposal', 0.03), ('ra', 0.03), ('monte', 0.029), ('booth', 0.029), ('puts', 0.029), ('shay', 0.029), ('surprisingly', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs
Author: Shay B. Cohen ; Mark Johnson
Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.
2 0.15433933 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
Author: Trevor Cohn ; Gholamreza Haffari
Abstract: Modern phrase-based machine translation systems make extensive use of wordbased translation models for inducing alignments from parallel corpora. This is problematic, as the systems are incapable of accurately modelling many translation phenomena that do not decompose into word-for-word translation. This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior. Overall this leads to a model which learns translations of entire sentences, while also learning their decomposition into smaller units (phrase-pairs) recursively, terminating at word translations. Our experiments on Arabic, Urdu and Farsi to English demonstrate improvements over competitive baseline systems.
3 0.14584221 275 acl-2013-Parsing with Compositional Vector Grammars
Author: Richard Socher ; John Bauer ; Christopher D. Manning ; Ng Andrew Y.
Abstract: Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and implemented approximately as an efficient reranker it is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information such as PP attachments.
4 0.11833544 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we performs posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsu- pervised accuracy of 89% across the same set of languages.
5 0.11720657 349 acl-2013-The mathematics of language learning
Author: Andras Kornai ; Gerald Penn ; James Rogers ; Anssi Yli-Jyra
Abstract: unkown-abstract
6 0.11087526 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning
7 0.10249523 4 acl-2013-A Context Free TAG Variant
8 0.088047147 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models
9 0.087644309 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation
10 0.086108506 311 acl-2013-Semantic Neighborhoods as Hypergraphs
11 0.084776163 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
12 0.083280571 357 acl-2013-Transfer Learning for Constituency-Based Grammars
13 0.078163564 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars
14 0.07739152 165 acl-2013-General binarization for parsing and translation
15 0.070512384 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules
16 0.070305265 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation
17 0.069701448 57 acl-2013-Arguments and Modifiers from the Learner's Perspective
18 0.067299157 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions
19 0.064583853 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction
20 0.064507537 382 acl-2013-Variational Inference for Structured NLP Models
topicId topicWeight
[(0, 0.156), (1, -0.046), (2, -0.026), (3, -0.003), (4, -0.087), (5, -0.013), (6, 0.083), (7, -0.022), (8, -0.083), (9, -0.016), (10, 0.037), (11, -0.008), (12, 0.072), (13, -0.112), (14, -0.074), (15, -0.111), (16, 0.063), (17, 0.114), (18, 0.007), (19, -0.056), (20, 0.022), (21, -0.024), (22, 0.083), (23, 0.069), (24, -0.053), (25, -0.058), (26, -0.012), (27, 0.014), (28, 0.03), (29, 0.056), (30, 0.063), (31, -0.022), (32, 0.006), (33, -0.007), (34, 0.069), (35, 0.029), (36, -0.016), (37, -0.047), (38, -0.049), (39, 0.031), (40, -0.027), (41, 0.008), (42, -0.035), (43, -0.073), (44, 0.017), (45, 0.01), (46, -0.096), (47, 0.086), (48, -0.073), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.96800274 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs
Author: Shay B. Cohen ; Mark Johnson
Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.
2 0.69381815 165 acl-2013-General binarization for parsing and translation
Author: Matthias Buchse ; Alexander Koller ; Heiko Vogler
Abstract: Binarization ofgrammars is crucial for improving the complexity and performance of parsing and translation. We present a versatile binarization algorithm that can be tailored to a number of grammar formalisms by simply varying a formal parameter. We apply our algorithm to binarizing tree-to-string transducers used in syntax-based machine translation.
3 0.65655243 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models
Author: Matthew R. Gormley ; Jason Eisner
Abstract: Many models in NLP involve latent variables, such as unknown parses, tags, or alignments. Finding the optimal model parameters is then usually a difficult nonconvex optimization problem. The usual practice is to settle for local optimization methods such as EM or gradient ascent. We explore how one might instead search for a global optimum in parameter space, using branch-and-bound. Our method would eventually find the global maximum (up to a user-specified ?) if run for long enough, but at any point can return a suboptimal solution together with an upper bound on the global maximum. As an illustrative case, we study a generative model for dependency parsing. We search for the maximum-likelihood model parameters and corpus parse, subject to posterior constraints. We show how to formulate this as a mixed integer quadratic programming problem with nonlinear constraints. We use the Reformulation Linearization Technique to produce convex relaxations during branch-and-bound. Although these techniques do not yet provide a practical solution to our instance of this NP-hard problem, they sometimes find better solutions than Viterbi EM with random restarts, in the same time.
4 0.65254402 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars
Author: David Chiang ; Jacob Andreas ; Daniel Bauer ; Karl Moritz Hermann ; Bevan Jones ; Kevin Knight
Abstract: Hyperedge replacement grammar (HRG) is a formalism for generating and transforming graphs that has potential applications in natural language understanding and generation. A recognition algorithm due to Lautemann is known to be polynomial-time for graphs that are connected and of bounded degree. We present a more precise characterization of the algorithm’s complexity, an optimization analogous to binarization of contextfree grammars, and some important implementation details, resulting in an algorithm that is practical for natural-language applications. The algorithm is part of Bolinas, a new software toolkit for HRG processing.
5 0.62021154 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
Author: Trevor Cohn ; Gholamreza Haffari
Abstract: Modern phrase-based machine translation systems make extensive use of wordbased translation models for inducing alignments from parallel corpora. This is problematic, as the systems are incapable of accurately modelling many translation phenomena that do not decompose into word-for-word translation. This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior. Overall this leads to a model which learns translations of entire sentences, while also learning their decomposition into smaller units (phrase-pairs) recursively, terminating at word translations. Our experiments on Arabic, Urdu and Farsi to English demonstrate improvements over competitive baseline systems.
6 0.61280102 311 acl-2013-Semantic Neighborhoods as Hypergraphs
7 0.60127485 349 acl-2013-The mathematics of language learning
8 0.58599979 4 acl-2013-A Context Free TAG Variant
9 0.54806429 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation
10 0.54660851 261 acl-2013-Nonparametric Bayesian Inference and Efficient Parsing for Tree-adjoining Grammars
11 0.52880615 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
12 0.52440691 57 acl-2013-Arguments and Modifiers from the Learner's Perspective
13 0.52337158 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach
14 0.52078384 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning
15 0.51480144 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model
16 0.51279324 237 acl-2013-Margin-based Decomposed Amortized Inference
17 0.51180786 331 acl-2013-Stop-probability estimates computed on a large corpus improve Unsupervised Dependency Parsing
18 0.48779795 275 acl-2013-Parsing with Compositional Vector Grammars
19 0.4743844 325 acl-2013-Smoothed marginal distribution constraints for language modeling
20 0.46327877 161 acl-2013-Fluid Construction Grammar for Historical and Evolutionary Linguistics
topicId topicWeight
[(0, 0.065), (6, 0.023), (11, 0.069), (24, 0.025), (26, 0.035), (28, 0.013), (35, 0.067), (42, 0.028), (48, 0.039), (70, 0.467), (88, 0.022), (90, 0.035), (95, 0.037)]
simIndex simValue paperId paperTitle
1 0.98299199 384 acl-2013-Visual Features for Linguists: Basic image analysis techniques for multimodally-curious NLPers
Author: Elia Bruni ; Marco Baroni
Abstract: unkown-abstract
2 0.95114398 89 acl-2013-Computerized Analysis of a Verbal Fluency Test
Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks
Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.
3 0.95068151 296 acl-2013-Recognizing Identical Events with Graph Kernels
Author: Goran Glavas ; Jan Snajder
Abstract: Identifying news stories that discuss the same real-world events is important for news tracking and retrieval. Most existing approaches rely on the traditional vector space model. We propose an approach for recognizing identical real-world events based on a structured, event-oriented document representation. We structure documents as graphs of event mentions and use graph kernels to measure the similarity between document pairs. Our experiments indicate that the proposed graph-based approach can outperform the traditional vector space model, and is especially suitable for distinguishing between topically similar, yet non-identical events.
same-paper 4 0.94624716 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs
Author: Shay B. Cohen ; Mark Johnson
Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.
5 0.92247713 218 acl-2013-Latent Semantic Tensor Indexing for Community-based Question Answering
Author: Xipeng Qiu ; Le Tian ; Xuanjing Huang
Abstract: Retrieving similar questions is very important in community-based question answering(CQA) . In this paper, we propose a unified question retrieval model based on latent semantic indexing with tensor analysis, which can capture word associations among different parts of CQA triples simultaneously. Thus, our method can reduce lexical chasm of question retrieval with the help of the information of question content and answer parts. The experimental result shows that our method outperforms the traditional methods.
6 0.90868235 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation
7 0.90555263 220 acl-2013-Learning Latent Personas of Film Characters
8 0.83813745 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia
9 0.69688851 153 acl-2013-Extracting Events with Informal Temporal References in Personal Histories in Online Communities
10 0.67097205 249 acl-2013-Models of Semantic Representation with Visual Attributes
11 0.66163659 380 acl-2013-VSEM: An open library for visual semantics representation
13 0.63479131 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars
14 0.60804504 167 acl-2013-Generalizing Image Captions for Image-Text Parallel Corpus
15 0.60319811 80 acl-2013-Chinese Parsing Exploiting Characters
16 0.59165943 169 acl-2013-Generating Synthetic Comparable Questions for News Articles
17 0.58861572 168 acl-2013-Generating Recommendation Dialogs by Extracting Information from User Reviews
18 0.58703148 165 acl-2013-General binarization for parsing and translation
19 0.58494568 339 acl-2013-Temporal Signals Help Label Temporal Relations
20 0.5608899 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing