jmlr jmlr2010 jmlr2010-53 knowledge-graph by maker-knowledge-mining

53 jmlr-2010-Inducing Tree-Substitution Grammars


Source: pdf

Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater

Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. [sent-12, score-0.555]

2 We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. [sent-15, score-0.496]

3 Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process. [sent-17, score-0.611]

4 Inducing a grammar from a corpus of strings is one of the central challenges of computational linguistics, as well as one of its most difficult. [sent-18, score-0.386]

5 Perhaps due to the difficulty of this unsupervised grammar induction problem, a more recent line of work has focused on a somewhat easier problem, where the input consists of a treebank corpus, usually in phrase-structure format. [sent-20, score-0.533]

6 The task is then to induce a grammar from the treebank that yields better parsing performance than the basic maximum-likelihood probabilistic context free grammar (PCFG). [sent-21, score-0.891]

7 The desire for automatic grammar refinement methods highlights one possible reason why unsupervised grammar induction is so difficult. [sent-26, score-0.783]

8 This is because treebanks are typically not annotated with their TSG derivations—how to decompose a tree into elementary tree fragments—instead the derivation needs to be inferred. [sent-38, score-0.395]

9 Tree adjoining grammar induction (Chiang and Bikel, 2002; Xia, 2002) tackles a similar learning problem in the supervised case. [sent-47, score-0.408]

10 We apply our model to the two grammar induction problems discussed above: • Inducing a TSG from a treebank. [sent-53, score-0.38]

11 As in other recent unsupervised parsing work, we adopt a dependency grammar (Mel'čuk, 1988) framework for the unsupervised regime. [sent-61, score-0.664]

12 Our work displays some similarities to previous work on both the grammar refinement and unsupervised grammar induction problems, but also differs in a number of ways. [sent-67, score-0.783]

13 For the unsupervised grammar induction problem we adopt the Dependency Model with Valency (DMV; Klein and Manning, 2004) framework that is currently dominant for grammar induction (Cohen et al. [sent-78, score-0.826]

14 The first grammar induction models to surpass a trivial baseline concentrated on the task of inducing unlabelled bracketings for strings and were evaluated against a treebank bracketing gold standard (Clark, 2001; Klein and Manning, 2002). [sent-86, score-0.467]

15 Adaptor Grammars have been applied successfully to induce labeled bracketings from strings in the domains of morphology and word segmentation (Johnson, 2008a,b; Johnson and Goldwater, 2009) and very recently for dependency grammar induction (Cohen et al. [sent-99, score-0.481]

16 These extensions are also reported in our recent work on dependency grammar induction (Blunsom and Cohn, 2010), although in this paper we present a more thorough exposition of the model and experimental evaluation. [sent-107, score-0.445]

17 Firstly we present a single generative model capable of both supervised and unsupervised learning, to induce tree substitution grammars from either trees or strings. [sent-111, score-0.525]

18 The local sampler is much simpler but is only applicable in the supervised setting, where the trees are observed, whereas the Metropolis-Hastings sampler can be used in both supervised and unsupervised settings and for parsing. [sent-116, score-0.594]

19 Unsupervised dependency grammar induction experiments are described in Section 7, and we conclude in Section 8. [sent-119, score-0.445]

20 A Tree Substitution Grammar (TSG) is a 4-tuple, G = (T, N, S, R), where T is a set of terminal symbols, N is a set of nonterminal symbols, S ∈ N is the distinguished root nonterminal and R is a set of productions (rules). [sent-121, score-0.433]

21 The productions take the form of elementary trees—tree fragments of height ≥ 1—where each internal node is labelled with a nonterminal and each leaf is labelled with either a terminal or a nonterminal. [sent-122, score-0.486]

22 A derivation creates a tree by starting with the root symbol and rewriting (substituting) it with an elementary tree, then continuing to rewrite frontier nonterminals with elementary trees until there are no remaining frontier nonterminals. [sent-126, score-0.803]

23 We can represent derivations as sequences of elementary trees e, where each elementary tree is substituted for the left-most frontier nonterminal of the tree being generated. [sent-127, score-0.872]

24 Unlike Context Free Grammars (CFGs) a syntax tree may not uniquely specify the derivation, as illustrated in Figure 1 which shows two derivations using different elementary trees to produce the same tree. [sent-128, score-0.421]

25 A Probabilistic Tree Substitution Grammar (PTSG), like a PCFG, assigns a probability to each rule in the grammar, denoted P(e|c) where the elementary tree e rewrites nonterminal c. [sent-129, score-0.379]
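
To make the derivation process in sentences 22-25 concrete, here is a minimal sketch (not the authors' implementation): the toy grammar, the nested-tuple tree encoding, and the helper names `derive` and `_fill` are all hypothetical, and substitution proceeds by recursion rather than an explicit left-most frontier queue, which produces the same trees.

```python
import random

# A minimal probabilistic tree-substitution grammar.  An elementary tree is a
# nested tuple (label, child1, ...); a bare string child is a frontier node:
# a nonterminal to be substituted if the grammar has rules for it, otherwise
# a terminal.
PTSG = {
    "S":  [(("S", "NP", ("VP", ("V", "hates"), "NP")), 1.0)],
    "NP": [(("NP", "George"), 0.5), (("NP", "broccoli"), 0.5)],
}

def derive(symbol):
    """Rewrite `symbol` with an elementary tree drawn from P(e | symbol), then
    recursively rewrite its frontier nonterminals until none remain."""
    if symbol not in PTSG:               # terminal: nothing left to rewrite
        return symbol
    trees, probs = zip(*PTSG[symbol])
    label, *children = random.choices(trees, weights=probs)[0]
    return (label, *(_fill(c) for c in children))

def _fill(node):
    """Expand one child of an elementary tree: recurse into internal nodes,
    substitute frontier nonterminals, and copy terminals unchanged."""
    if isinstance(node, str):
        return derive(node)
    label, *children = node
    return (label, *(_fill(c) for c in children))

print(derive("S"))
# e.g. ('S', ('NP', 'George'), ('VP', ('V', 'hates'), ('NP', 'broccoli')))
```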

26 A number of improved algorithms for parsing have been reported, most notably a Monte-Carlo approach for finding the maximum probability tree (Bod, 1995) and a technique for maximising labelled recall using inside-outside inference in a PCFG reduction grammar (Goodman, 1998). [sent-154, score-0.624]

27 We use the Penn treebank (Marcus et al., 1993) as the basis for our supervised grammar induction experiments (grammar refinement). [sent-157, score-0.408]

28 Figure 2: An unlabelled dependency analysis for the example sentence George hates broccoli. [sent-162, score-0.443]

29 Much of the recent work on unsupervised grammar induction has therefore taken a different approach, focusing on inducing dependency grammars (Mel'čuk, 1988). [sent-165, score-0.621]

30 In applying our model to unsupervised grammar induction we follow this trend by inducing a dependency grammar. [sent-166, score-0.511]

31 Dependency grammars are less ambiguous than phrase-structure grammars since the set of possible constituent labels (heads) is directly observed from the words in the sentence, leaving only the induction challenge of determining the tree structure. [sent-171, score-0.375]

32 Most dependency grammar formalisms also include labels on the links between words, denoting, for example, subject, object, adjunct etc. [sent-172, score-0.402]

33 As detailed in Section 7, the model can be used for dependency grammar induction by using a specially designed phrase-structure grammar to represent dependency links. [sent-176, score-0.847]

34 Model In defining our model, we focus on the unsupervised case, where we are given a corpus of text strings w and wish to learn a tree-substitution grammar G that we can use to infer the parses for our strings and to parse new data. [sent-178, score-0.488]

35 ) Rather than inferring a grammar directly, we go through an intermediate step of inferring a distribution over the derivations used to produce w, that is, a distribution over sequences of elementary trees e that compose to form w as their yield. [sent-180, score-0.646]

36 We will then essentially read the grammar off the elementary trees, as described in Section 5. [sent-181, score-0.508]

37 Note that any sequence of elementary trees uniquely specifies a corresponding sequence of words: those words that can be read off the leaves of the elementary trees in sequence. [sent-183, score-0.562]

38 The distribution over elementary trees with root category c is defined as G_c | a_c, b_c, P_E ∼ PYP(a_c, b_c, P_E(·|c)) and e | c, G_c ∼ G_c (5), where P_E(·|c) is a distribution over the infinite space of elementary trees rooted with c, and a_c and b_c are the PYP hyper-parameters for non-terminal c. [sent-243, score-0.731]

39 Recall that in an elementary tree, each internal node is labelled with a non-terminal category symbol and each frontier (leaf) node is labelled with either a non-terminal or a terminal symbol. [sent-260, score-0.403]

40 Given a probabilistic context-free grammar R, we assume that elementary trees are generated (conditioned on the root non-terminal c) using the following generative process. [sent-261, score-0.655]
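
Sentence 40's generative process for the base distribution P_E is not spelled out in this summary; the sketch below gives one plausible reading that is consistent with the per-child substitution-site decisions mentioned in sentence 61 (an s versus 1 − s choice). The PCFG rules and the stop probabilities s_c here are invented for illustration.

```python
import random

# Hypothetical PCFG base: nonterminal -> [(right-hand side, probability)].
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "VP": [(("V", "NP"), 1.0)],
    "NP": [(("George",), 0.5), (("broccoli",), 0.5)],
    "V":  [(("hates",), 1.0)],
}
STOP = {"S": 0.5, "VP": 0.5, "NP": 0.7, "V": 0.5}   # s_c: probability that a
                                                     # nonterminal child stops
                                                     # and becomes a frontier

def sample_elementary_tree(c):
    """Draw an elementary tree rooted at c from the base distribution:
    expand c with a PCFG rule, then for each nonterminal child either stop
    (leave it as a substitution site, probability s_c) or keep expanding."""
    rhs_options, probs = zip(*PCFG[c])
    rhs = random.choices(rhs_options, weights=probs)[0]
    children = []
    for sym in rhs:
        if sym in PCFG and random.random() >= STOP[sym]:
            children.append(sample_elementary_tree(sym))  # internal node
        else:
            children.append(sym)  # terminal, or frontier nonterminal
    return (c, *children)

print(sample_elementary_tree("S"))
# e.g. ('S', 'NP', ('VP', ('V', 'hates'), 'NP'))
```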

41 Note that, just as elementary trees are divided into separate restaurants at the TSG level based on their root categories, CFG rules are divided into separate restaurants at the CFG level based on their left-hand sides. [sent-290, score-0.438]

42 This formulation reflects that we now have multiple tied restaurants, and each time an elementary tree opens a new table in a top-level restaurant all its rules are considered to have entered their own respective PC restaurants (according to their root c). [sent-292, score-0.471]

43 Importantly the blocked sampler is more general, being directly applicable to both supervised and unsupervised settings (and for parsing test sentences, which is equivalent to an unsupervised setting) while the local sampler is only applicable for supervised learning, where the trees are observed. [sent-299, score-0.893]

44 Figure 5: Gibbs sampler state (b) corresponding to the example derivation (a) (reproduced from Figure 1a). [sent-304, score-0.569]

45 The local sampler is designed specifically for the supervised scenario, and samples a TSG derivation for each tree by sampling local updates at each tree node. [sent-307, score-0.461]

46 Each substitution point forms the root of some elementary tree, as well as a frontier nonterminal of an ancestor node’s elementary tree. [sent-312, score-0.646]

47 Collectively the training trees and substitution variables specify the sequence of elementary trees e that is the current state of the sampler. [sent-314, score-0.49]

48 If d is the node associated with x_d, the substitution variable under consideration, then the two possible values of x_d define two options for e: one in which d is internal to some elementary tree e_M, and one in which d is the substitution site connecting two smaller trees, e_A and e_B. [sent-317, score-0.516]

49 To sample a value for x_d, we compute the probabilities of e_M and (e_A, e_B), conditioned on e^-: all other elementary trees in the training set that share at most a root or frontier nonterminal with e_M, e_A, or e_B. [sent-319, score-0.486]
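
Sentences 48-49 describe the core update of the local sampler. A minimal sketch of that binary choice follows; the helper `cond_prob(e, trees)`, which should return the conditional probability of an elementary tree given all the others (for instance via Equation 11, reconstructed further below), and the list-based bookkeeping are placeholders rather than the paper's data structures.

```python
import random

def sample_substitution_variable(e_M, e_A, e_B, others, cond_prob):
    """One local Gibbs update for the substitution variable x_d at node d.
    x_d = 0: node d is internal, so the merged elementary tree e_M is used;
    x_d = 1: node d is a substitution site splitting e_M into e_A (above d)
             and e_B (rooted at d).
    `others` is a list holding all other elementary trees e^-; `cond_prob`
    is an assumed helper returning P(e | conditioning trees)."""
    p_merge = cond_prob(e_M, others)
    # When splitting, e_A is scored first and then added to the conditioning
    # set before scoring e_B, since both are draws from the same process.
    p_split = cond_prob(e_A, others) * cond_prob(e_B, others + [e_A])
    return 0 if random.random() < p_merge / (p_merge + p_split) else 1
```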

50 Moreover it is only applicable to the supervised setting: it cannot be used for unsupervised grammar induction or for parsing test strings. [sent-331, score-0.604]

51 This sampler can make larger moves than the local sampler and is more flexible, in that it can perform inference with both string (unsupervised) or tree (supervised) input. [sent-333, score-0.474]

52 Our approach is similar to theirs, as we also use a Bayesian prior in a model of grammar induction and consequently face similar problems with directly sampling due to the caching effects of the prior. [sent-356, score-0.408]

53 This can be done by creating a rule for each elementary tree which rewrites its root nonterminal as its frontier. [sent-361, score-0.416]

54 For example, if our base distribution licences the CFG production NP → NP PP then our TSG grammar will contain the infinite set of elementary trees NP → NP PP, NP → (NP NP PP) PP, NP → (NP (NP NP PP) PP) PP, and so on. [sent-364, score-0.654]

55 Thankfully it is possible to transform the infinite MAP-TSG into a finite CFG, using a method inspired by Goodman (2003), who developed a grammar transform for efficient parsing with an all-subtrees DOP grammar. [sent-369, score-0.467]

56 We use a similar technique to encode non-zero count rules in our grammar transformation, described below. [sent-374, score-0.393]

57 The resultant grammar allows for efficient inference, both in unsupervised and supervised training and in parsing (see Section 5). [sent-377, score-0.561]

58 P(e_i = e | c, z_{-i}, ℓ(z_{-i})) = (n^-_e − a_c K^-_e) / (n^-_c + b_c) + (a_c K^-_c + b_c) / (n^-_c + b_c) · P_E(e|c), where the first summand is the "count" term and the second the "base" term (Equation 11). Grammar A has productions for every elementary tree e with n^-_e ≥ 1, which are assigned as their probability the count term in Equation 11. [sent-380, score-0.465]
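
Equation 11, as reconstructed above, is simple to evaluate once the token counts n and table counts K of the Chinese restaurant representation are tracked; the function below is a direct transcription with argument names chosen here for readability, and the example numbers are made up.

```python
def pyp_predictive(n_e, K_e, n_c, K_c, a_c, b_c, p_base):
    """P(e_i = e | c, z_{-i}) under a Pitman-Yor process: a 'count' term from
    the cached elementary trees plus a 'base' term that backs off to P_E.
    n_e / K_e: token / table counts for elementary tree e under root c;
    n_c / K_c: total token / table counts for root c;
    a_c, b_c:  PYP discount and concentration; p_base = P_E(e | c)."""
    count_term = (n_e - a_c * K_e) / (n_c + b_c)
    base_term = (a_c * K_c + b_c) / (n_c + b_c) * p_base
    return count_term + base_term

# Example: an elementary tree seen 3 times at 1 table, among 10 trees rooted
# at c spread over 4 tables, with a_c = 0.5, b_c = 1.0 and P_E(e|c) = 0.01:
print(pyp_predictive(3, 1, 10, 4, 0.5, 1.0, 0.01))   # ≈ 0.23
```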

59 To signify the difference between these nonterminal symbols and trees, we use curly braces and hyphens in place of round parentheses and spaces, respectively, for example, the elementary tree (S NP (VP (V hates) NP)) is denoted by the nonterminal symbol {S-NP-{VP-{V-hates}-NP}}. [sent-382, score-0.475]
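
The bracket-to-signature convention in sentence 59 is mechanical; the helper below (the name `sig` is borrowed from the caption of Table 1, everything else is illustrative) reproduces the example exactly.

```python
def sig(tree):
    """Map an elementary tree, encoded as a nested tuple, to the transformed
    grammar's nonterminal signature: parentheses become curly braces and
    spaces become hyphens."""
    if isinstance(tree, str):          # terminal or frontier nonterminal
        return tree
    return "{" + "-".join(sig(child) for child in tree) + "}"

# (S NP (VP (V hates) NP))  ->  {S-NP-{VP-{V-hates}-NP}}
print(sig(("S", "NP", ("VP", ("V", "hates"), "NP"))))
```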

60 The remaining rules in grammar B correspond to every CFG production in the underlying PCFG base distribution, coupled with the binary decision of whether or not nonterminal children should be substitution sites (frontier nonterminals). [sent-386, score-0.624]

61 This choice affects the rule probability by including an s or 1 − s factor, and child substitution sites also function as a bridge back from grammar B to A. [sent-387, score-0.471]

62 There are often two equivalent paths to reach the same chart cell using the same elementary tree—via grammar A using observed TSG productions and via grammar B using PE backoff—which are summed to yield the desired net probability. [sent-388, score-0.939]

63 Using the transformed grammar we can represent the MAP grammar efficiently and draw samples of TSG derivations using the inside algorithm. [sent-390, score-0.702]

64 In an unsupervised setting, that is, given a yield string as input, the grammar transform above can be used directly with the inside algorithm for PCFGs (followed by the reverse transform to map the sampled derivation into TSG elementary trees). [sent-391, score-0.574]

65 Table 1: Grammar transformation rules to map an infinite MAP TSG into an equivalent CFG, separated into three groups for grammar A (top), the bridge between A → B (middle) and grammar B (bottom). [sent-406, score-0.730]

66 Figure 7: Example trees under the grammar transform, which both encode the same TSG derivation from Figure 1a. [sent-413, score-0.835]

67 The left tree encodes that the S → NP (VP (V hates) NP) elementary tree was drawn from the cache, while for the right tree this same elementary tree was drawn from the base distribution (the count and base terms in (11), respectively). [sent-414, score-0.790]

68 It is likely that the blocked sampler will have a slower runtime due to its more complicated implementation, particularly in transforming the grammar and inside inference. [sent-419, score-0.621]

69 In our experiments, we initialised the sampler by setting all substitution variables to 0, thus treating every full tree in the training set as an elementary tree. [sent-493, score-0.563]

70 For the parsing methods that require a Monte Carlo integral (MPD, MPT and MER), we sampled 1000 derivations from the MAP approximation grammar which were then input to the Metropolis-Hastings acceptance step before compiling the relevant statistics. [sent-498, score-0.495]
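
Sentence 70's Monte Carlo parsing step reduces to collecting statistics over the accepted samples; the sketch below shows how a maximum probability derivation (MPD) or maximum probability tree (MPT) could then be read off. The sample representation (hashable tuples) and the `tree_of` helper, which should strip the segmentation from a derivation, are assumptions of this sketch.

```python
from collections import Counter

def max_prob_derivation(samples):
    """MPD: the most frequently sampled derivation (sequence of elementary
    trees), a Monte Carlo estimate of the highest-probability derivation."""
    return Counter(samples).most_common(1)[0][0]

def max_prob_tree(samples, tree_of):
    """MPT: collapse each sampled derivation to the syntax tree it produces
    (discarding the segmentation), then return the most frequent tree."""
    return Counter(tree_of(d) for d in samples).most_common(1)[0][0]
```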

71 Table 3: Development results for models trained on section 2 of the Penn treebank, showing labelled constituent F1 and the grammar size. [sent-536, score-0.382]

72 For the TSG models the grammar size reported is the number of CFG productions in the transformed MAP PCFG approximation. [sent-537, score-0.431]

73 With additional rounds of split-merge training the Berkeley grammar grows exponentially larger (200K rules after six iterations). [sent-633, score-0.393]

74 Our TSG grammar is also far smaller than the full DOP grammar induced from this data set, which extracts every possible TSG rule from the training set with no size limit, and has approximately 700K rules. [sent-634, score-0.674]

75 We initialise the sampler using a converged model from the end of a sampling run on the small data set and run the blocked Metropolis-Hastings sampler for 20,000 iterations. [sent-637, score-0.493]

76 It would also affect the choice of parsing algorithm because the transformed grammar would no longer be binary. [sent-644, score-0.467]

77 The blocked sampler is more robust to starting point than the local sampler and converges faster in terms of iterations and total time in general. [sent-662, score-0.465]

78 The blocked sampler out-performs the local sampler for all initialisation conditions and has lower variance. [sent-671, score-0.524]

79 In contrast, the local sampler performs well only with the maximal initialisation, with which it roughly equals the blocked sampler in terms of F1 and log posterior (see Figure 9). [sent-693, score-0.465]

80 Figure 11 shows the grammar statistics for a TSG model trained on the full treebank training set. [sent-696, score-0.424]

81 Despite having an infinite space of grammar productions, the induced grammars are compact, with the majority of probability mass on a small number of individually small productions. [sent-735, score-0.447]

82 Two differences result from the fact that parse trees are not observed during training: first, we cannot use the local Gibbs sampler presented above, leaving the blocked sampler as our only option. [sent-739, score-0.611]

83 A final difference in our unsupervised experiments is that we now focus on dependency grammar induction rather than phrase-structure induction, as outlined in Section 2.1. [sent-741, score-0.511]

84 Dependency grammars are a more feasible formalism for unsupervised grammar learning. [sent-742, score-0.578]

85 This approach induces a dependency grammar using a simple generative model in which the strings are said to be generated by a top-down process which recursively generates child dependents from each parent head word. [sent-745, score-0.523]
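
Sentence 85 summarises the DMV generative story: each head word generates left and right dependents outward until a stop decision is made, with the first (adjacent) dependent treated differently from later ones. The sketch below is a generic head-outward DMV generator under that standard formulation, not the authors' code; the stop and attachment tables, the depth guard, and the tag set are invented for illustration.

```python
import random

def generate_dependents(head, direction, params, depth=0):
    """Generate the dependents of `head` on one side ('L' or 'R'): repeatedly
    decide whether to stop; if not, choose a child tag, recurse on it in both
    directions, and switch off the adjacency flag."""
    children, adjacent = [], True
    while depth < 4:                                   # guard for the sketch
        if random.random() < params["stop"][head, direction, adjacent]:
            break
        tags, probs = zip(*params["choose"][head, direction].items())
        child = random.choices(tags, weights=probs)[0]
        children.append((child,
                         generate_dependents(child, "L", params, depth + 1),
                         generate_dependents(child, "R", params, depth + 1)))
        adjacent = False
    return children

# Toy parameters over a two-tag vocabulary (V = verb, N = noun).
params = {
    "stop": {(h, d, adj): (0.3 if h == "V" and adj else 0.9)
             for h in "VN" for d in "LR" for adj in (True, False)},
    "choose": {(h, d): {"N": 0.8, "V": 0.2} for h in "VN" for d in "LR"},
}
root = "V"                  # the root head; full DMV draws this from P_root
print((root,
       generate_dependents(root, "L", params),
       generate_dependents(root, "R", params)))
# e.g. ('V', [('N', [], [])], [('N', [], [])])
```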

86 A dependency grammar can be represented in a CFG by labelling constituents with their head word, and encoding each dependency edge between a head and a child word in the grammar productions. [sent-768, score-1.047]

87 Table 8 shows the CFG grammar for the DMV model (CFG-DMV), while Figure 13 shows the derivation in this grammar for the example sentence in Figure 2. [sent-774, score-0.718]

88 The key insight to understanding the nonterminals in this grammar is that the subscripts encode the terminals at the boundaries of the span dominated by that nonterminal. [sent-775, score-0.395]

89 The time complexity of inference in this grammar is only O(|w|^3) because each span in the parse chart bounded by terminals A and B can only contain nonterminal labels from the set {L_B, L*_B, L^1_B, A_R, A_R*, A_R^1, A_M_B*, A*_M_B, S}. [sent-780, score-0.469]

90 Consequently the grammar constant is fixed rather than quadratic in the sentence length. [sent-781, score-0.381]

91 We apply our TSG model to unsupervised dependency grammar induction using the CFG-DMV as the underlying grammar representation. [sent-782, score-0.848]

92 For example, by combining the L_NN → L^1_NN and L^1_NN → L_DT DT_M_NN* rules into an elementary tree, the model can describe that the leftmost child of a noun (NN) is a determiner (DT). [sent-785, score-0.411]

93 Given that we do not observe parse trees in training, we cannot use the local Gibbs sampler as it only allows the sampling of the segmentation of a fixed tree, not the tree itself. [sent-791, score-0.467]

94 In order to parse sentences in the test set we use the Viterbi algorithm to find the maximum probability parse under the MAP grammar (see Section 5). [sent-792, score-0.454]

95 Figure 13: The CFG-DMV derivation for the example sentence George hates broccoli. [sent-828, score-0.938]

96 Figure 14: An example induced tree for the sentence "The above represents a triumph of either apathy or civility", shown as an elementary tree (a) and as a dependency tree (b). [sent-954, score-0.460]

97 In this work we have presented a non-parametric Bayesian model for inducing tree substitution grammars in both supervised and unsupervised settings. [sent-994, score-0.415]

98 Our experimental results indicate that our model holds significant potential for a range of grammar induction tasks. [sent-997, score-0.38]

99 Unsupervised induction of tree substitution grammars for dependency parsing. [sent-1011, score-0.429]

100 Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. [sent-1050, score-0.403]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vp', 0.369), ('np', 0.353), ('grammar', 0.337), ('tsg', 0.276), ('sampler', 0.181), ('elementary', 0.171), ('cfg', 0.156), ('pyp', 0.145), ('hates', 0.14), ('parsing', 0.13), ('tree', 0.112), ('trees', 0.11), ('grammars', 0.11), ('pcfg', 0.104), ('blocked', 0.103), ('lunsom', 0.099), ('oldwater', 0.099), ('substitution', 0.099), ('nonterminal', 0.096), ('nducing', 0.095), ('ubstitution', 0.095), ('productions', 0.094), ('treebank', 0.087), ('head', 0.086), ('nn', 0.085), ('ohn', 0.085), ('dop', 0.081), ('pp', 0.073), ('rammars', 0.073), ('frontier', 0.072), ('rb', 0.067), ('unsupervised', 0.066), ('pe', 0.065), ('dependency', 0.065), ('bod', 0.063), ('restaurant', 0.063), ('dt', 0.063), ('prp', 0.062), ('jj', 0.061), ('initialisation', 0.059), ('sbar', 0.059), ('linguistics', 0.059), ('dmv', 0.058), ('nns', 0.058), ('nonterminals', 0.058), ('rules', 0.056), ('vbd', 0.056), ('broccoli', 0.054), ('adaptor', 0.054), ('eb', 0.052), ('ree', 0.051), ('pc', 0.05), ('mpd', 0.05), ('corpus', 0.049), ('ea', 0.048), ('headden', 0.046), ('labelled', 0.045), ('sentences', 0.045), ('adjp', 0.045), ('mnn', 0.045), ('bc', 0.044), ('sentence', 0.044), ('induction', 0.043), ('fragments', 0.042), ('ke', 0.041), ('ptsg', 0.041), ('north', 0.039), ('language', 0.039), ('dir', 0.038), ('viterbi', 0.038), ('george', 0.037), ('root', 0.037), ('noun', 0.037), ('word', 0.036), ('production', 0.036), ('cc', 0.036), ('cohen', 0.036), ('parse', 0.036), ('child', 0.035), ('parser', 0.035), ('nnp', 0.035), ('node', 0.035), ('klein', 0.033), ('verb', 0.033), ('petrov', 0.033), ('phrases', 0.032), ('ldt', 0.032), ('mer', 0.032), ('restaurants', 0.032), ('sharon', 0.032), ('sig', 0.032), ('vbp', 0.032), ('lexicalized', 0.031), ('vbz', 0.031), ('phrase', 0.031), ('vb', 0.03), ('cb', 0.029), ('derivations', 0.028), ('sampling', 0.028), ('supervised', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999969 53 jmlr-2010-Inducing Tree-Substitution Grammars

Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater

Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process

2 0.33194542 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

Author: Shay B. Cohen, Noah A. Smith

Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference

3 0.20994784 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

Author: James Henderson, Ivan Titov

Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks

4 0.16100487 15 jmlr-2010-Approximate Tree Kernels

Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller

Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels

5 0.13817506 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages

Author: Alexander Clark, Rémi Eyraud, Amaury Habrard

Abstract: We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach. Keywords: grammatical inference, context-free language, positive data only, membership queries

6 0.11579802 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

7 0.075855419 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

8 0.053659152 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

9 0.051920287 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

10 0.0498905 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm

11 0.04806279 109 jmlr-2010-Stochastic Composite Likelihood

12 0.043408524 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

13 0.039857335 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

14 0.038143869 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors

15 0.037406463 69 jmlr-2010-Lp-Nested Symmetric Distributions

16 0.037330367 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference

17 0.036974117 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data

18 0.035553988 27 jmlr-2010-Consistent Nonparametric Tests of Independence

19 0.035239495 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

20 0.034561407 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.205), (1, 0.086), (2, -0.349), (3, 0.111), (4, 0.152), (5, -0.479), (6, -0.04), (7, 0.017), (8, 0.006), (9, -0.041), (10, 0.059), (11, -0.019), (12, -0.195), (13, -0.047), (14, -0.072), (15, -0.092), (16, -0.028), (17, -0.01), (18, -0.087), (19, -0.005), (20, -0.006), (21, -0.056), (22, 0.071), (23, 0.039), (24, 0.051), (25, 0.014), (26, 0.025), (27, -0.041), (28, -0.056), (29, 0.01), (30, -0.064), (31, 0.024), (32, 0.012), (33, 0.041), (34, -0.002), (35, -0.019), (36, 0.038), (37, -0.036), (38, -0.042), (39, 0.013), (40, -0.029), (41, 0.008), (42, 0.047), (43, -0.028), (44, -0.044), (45, -0.074), (46, -0.02), (47, -0.037), (48, 0.017), (49, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96034682 53 jmlr-2010-Inducing Tree-Substitution Grammars

Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater

Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process

2 0.83813745 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

Author: Shay B. Cohen, Noah A. Smith

Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference

3 0.66316295 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages

Author: Alexander Clark, Rémi Eyraud, Amaury Habrard

Abstract: We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach. Keywords: grammatical inference, context-free language, positive data only, membership queries

4 0.59283453 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

Author: James Henderson, Ivan Titov

Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks

5 0.42898485 15 jmlr-2010-Approximate Tree Kernels

Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller

Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels

6 0.33823368 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

7 0.21809952 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm

8 0.2112807 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

9 0.1939944 63 jmlr-2010-Learning Instance-Specific Predictive Models

10 0.18562832 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference

11 0.17755099 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

12 0.16926371 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

13 0.15956233 69 jmlr-2010-Lp-Nested Symmetric Distributions

14 0.15749483 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

15 0.15422799 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

16 0.1422184 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data

17 0.13796923 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation

18 0.13589141 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors

19 0.13138542 109 jmlr-2010-Stochastic Composite Likelihood

20 0.12090112 27 jmlr-2010-Consistent Nonparametric Tests of Independence


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.018), (4, 0.029), (21, 0.043), (22, 0.085), (32, 0.034), (33, 0.014), (36, 0.041), (37, 0.045), (75, 0.078), (81, 0.012), (85, 0.045), (91, 0.4), (96, 0.043)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73498875 53 jmlr-2010-Inducing Tree-Substitution Grammars

Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater

Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process

2 0.34437725 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

Author: Shay B. Cohen, Noah A. Smith

Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference

3 0.33097434 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

Author: Gunnar Carlsson, Facundo Mémoli

Abstract: We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. Keywords: clustering, hierarchical clustering, stability of clustering, Gromov-Hausdorff distance

4 0.32045451 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

Author: James Henderson, Ivan Titov

Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks

5 0.2857064 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar

Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing

6 0.27163786 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

7 0.26656285 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

8 0.2657623 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

9 0.26417872 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

10 0.26298785 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

11 0.26097381 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

12 0.26084304 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

13 0.25967833 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

14 0.2594932 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

15 0.25913146 109 jmlr-2010-Stochastic Composite Likelihood

16 0.2576625 102 jmlr-2010-Semi-Supervised Novelty Detection

17 0.25750864 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

18 0.25647795 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

19 0.25643238 69 jmlr-2010-Lp-Nested Symmetric Distributions

20 0.2564097 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes