jmlr jmlr2010 jmlr2010-52 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
Reference: text
sentIndex sentText sentNum sentScore
1 These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. [sent-6, score-0.517]
2 ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. [sent-7, score-0.34]
3 First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. [sent-9, score-0.318]
4 We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. [sent-11, score-0.368]
5 The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. [sent-12, score-0.55]
6 In addition to limiting the scope of this article to parsing problems, we focus on tasks where the training data specifies the output structure, but the labelling of this structure is not fully annotated. [sent-31, score-0.48]
7 While the unannotated labelling may not be evaluated in the task, by assuming incomplete labelling we allow our models to capture generalisations which are not directly reflected in the labelled output structure given for training. [sent-32, score-0.331]
8 For example, the training data for natural language parsing problems is generally assumed to be a tree, but assuming that all generalisations can be expressed in terms of one-level fragments of the tree leads to poor empirical performance. [sent-33, score-0.446]
9 However, these mechanisms are not sufficient to specify a complete probability model for a parsing problem, because we need to specify the statistical dependencies for the complete space of possible output categories. [sent-38, score-0.416]
10 There are well established methods for specifying probabilistic models of parsing problems, based on grammar formalisms, such as probabilistic context-free grammars (PCFGs). [sent-40, score-0.418]
11 The grammar formalism defines how the complete space of possible pairs of an input sequence with an output structure can be specified as a set of sequences of decisions about the input-output pair. [sent-41, score-0.328]
12 Because the context-free assumption is defined in terms of the output structure, not in terms of the input sequence, which decisions in the history are relevant depends on the output structure specified by the derivation. [sent-61, score-0.302]
13 This is the fundamental distinction between models of parsing problems and models of sequence labelling problems, and it will be central to our discussions in this article. [sent-64, score-0.394]
14 The most common approach to building probability models for parsing problems is to use PCFGs without any latent variables (e. [sent-65, score-0.458]
15 , 2005) extend the node labels of PCFGs with latent annotations, but previous proposals have successfully induced only a small number of latent annotations. [sent-70, score-0.407]
16 In the model of Henderson (2003), vectors of hidden unit values decorate the positions t in the derivation sequence, and are used to encode features of the unbounded derivation history D^1, ..., D^{t-1}. [sent-72, score-0.279]
17 As with LPCFGs, the pattern of interdependencies between layers of hidden units is a function of the output structure, making it appropriate for parsing problems. [sent-76, score-0.382]
18 Each position in the decision sequence has a vector of latent state variables, which are statistically dependent on variables from previous positions via a pattern of edges determined by the previous decisions. [sent-81, score-0.515]
19 This incrementally specified model structure allows ISBNs to capture the generalisations which are only reflected in the output structure, such as the tendency towards correlations which are local in the output structure, which motivates the context-free assumption of PCFGs. [sent-82, score-0.308]
20 Allowing the model structure to depend on the output structure means that the complete model structure is not known until the complete output derivation is known. [sent-83, score-0.43]
21 In the second experiment, we apply both of the approximation models to phrase structure parsing with data from the Wall Street Journal Penn Treebank (Marcus et al. [sent-100, score-0.392]
22 Results of the IMF model are non-significantly worse (less than 1% relative error increase) than the results of one of the best known history-based models of parsing (Charniak, 2000). [sent-103, score-0.318]
23 Incrementally Specifying Model Structure. We want to extend SBNs to make them appropriate for modelling parsing problems. [sent-138, score-0.279]
24 The problem with only allowing edges between variables instantiated at positions which are adjacent (or local) in the decision sequence is that this does not allow the model structure to adequately reflect the correlations found in parsing problems. [sent-155, score-0.665]
25 Dotted lines indicate the top of the parser’s stack at each derivation decision in the model structure. [sent-158, score-0.311]
26 We constrain this edge specification so that a decision D^{t'} can only affect the placement of edges whose destination variable is at a position t > t' after the decision D^{t'}. [sent-164, score-0.368]
27 For this reason, we call our model an “incremental” model, not just a dynamic model; the structure of the Bayesian network is determined incrementally as the decision sequence proceeds. [sent-167, score-0.331]
28 This incremental specification of the model structure is illustrated in Figure 1 (the directed graphs), along with the partial output structures incrementally specified by the derivation (the trees). [sent-168, score-0.327]
29 In Figure 1, dotted lines associate a position’s instantiated template with the node in the output structure which is on top of the parser’s stack when making that position’s decision. [sent-169, score-0.321]
30 For each labelled position (t − c, r) in this set, label r determines which variables instantiated at that position are linked to which variables instantiated at the current position t, and with which parameters. [sent-177, score-0.399]
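To make the labelled edge specification concrete, the short Python sketch below illustrates how a set of related positions (t', r) could be derived from the partial output structure built so far. The history representation, the two relation labels, and the function name are assumptions introduced only for illustration; the model itself uses a richer, linguistically motivated relation set, and each label r selects its own block of parameters.

def related_positions(t, stack_tops):
    # stack_tops[p] is the identifier of the node on top of the parser's stack
    # when decision p was made -- a stand-in for the partial output structure.
    related = []
    if t >= 1:
        related.append((t - 1, "previous-position"))      # local in the decision sequence
    for t_prev in range(t - 1, -1, -1):                    # search backwards through history
        if stack_tops[t_prev] == stack_tops[t]:            # same node on top of the stack
            related.append((t_prev, "stack-context"))      # local in the output structure
            break
    # Every returned position satisfies t_prev < t, so a decision can only place
    # edges whose destination comes later in the derivation.
    return related

# Example: related_positions(4, ["A", "B", "B", "C", "B"])
# returns [(3, "previous-position"), (2, "stack-context")]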
31 This limitation does not give us sufficient power to express the kinds of output-conditioned statistical dependencies we need for parsing problems in general. [sent-193, score-0.313]
32 Incremental sigmoid belief networks allow the model structure to depend on the output structure without overly complicating the inference of the desired conditional probabilities P(Dt |D1 , . [sent-195, score-0.376]
33 Given a top-down left-to-right derivation of a phrase structure tree, the dependencies between LPCFG derivation decisions have the same structure as the phrase structure tree, but with LPCFG rules (one-level subtrees) labelling each node of this derivation tree. [sent-229, score-0.684]
34 If we expressed an LPCFG as a graphical model, the model structures would have the same general form as the derivation trees, so the model structure would also be incrementally specified. [sent-231, score-0.276]
35 As illustrated by the above examples, the argument for the incremental specification of model structure can be applied to any Bayesian network architecture, not just sigmoid belief networks. [sent-235, score-0.322]
36 This previous work has shown that the combination of logistic sigmoid hidden units and a model structure which reflects locality in the output structure results in a powerful form of feature induction. [sent-237, score-0.35]
37 The Probabilistic Model of Structured Prediction. In this section we complete the definition of incremental sigmoid belief networks for grammar learning. [sent-240, score-0.331]
38 For example, a decision to create a new node in a labelled output structure can be divided into two elementary decisions: deciding to create a node and deciding which label to assign to it. [sent-258, score-0.425]
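A minimal illustration of this decomposition (Python; the tuple encoding of decisions is an assumption, not the authors' representation): a structured decision is a short sequence of elementary decisions, and its probability is the product of the chained elementary probabilities P(d_k^t | history, d_1^t, ..., d_{k-1}^t).

# one derivation decision D^t split into elementary decisions d_1^t, d_2^t
decision = [("project",), ("label", "NP")]     # create a node, then choose its label

def decision_probability(elementary_probs):
    # P(D^t | history) as the product of the elementary factors
    p = 1.0
    for p_k in elementary_probs:
        p *= p_k
    return p

# decision_probability([0.6, 0.3]) -> 0.18 (up to floating point)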
39 latent state variable vectors S^{t'} = (s_1^{t'}, ..., s_n^{t'}), representing an intermediate state at position t', and decision variable vectors D^{t'}, representing a decision at position t', where t' ≤ t. [sent-263, score-0.39]
40 Variables whose values are given at the current decision (t, k) are shaded in Figure 2; latent and current decision variables are left unshaded. [sent-264, score-0.393]
41 As illustrated by the edges in Figure 2, the probability of each state variable s_i^t depends on all the variables in a finite set of relevant previous state and decision vectors, but there are no direct dependencies between the different variables in a single state vector. [sent-265, score-0.281]
42 For example, it could select the most recent state where the same output structure node was on the top of the automaton’s stack, and a decision variable representing that node’s label. [sent-274, score-0.276]
43 We can write the dependency of a latent variable s_i^t on the previous latent variable vectors and the decision history as P(s_i^t = 1 | S^1, ..., S^{t-1}, D^1, ..., D^{t-1}) = \sigma\big( \sum_{t': r(t',t)} \big( \sum_j J^r_{ij} s_j^{t'} + \sum_k B^r_{i d_k^{t'}} \big) \big), where \sigma is the logistic sigmoid and the outer sum ranges over the previous positions t' related to the current position t by some relation r. [sent-277, score-0.642]
44 For each relation r, the weight J^r_{ij} determines the influence of the jth variable in the related previous latent vector S^{t'} on the distribution of the ith variable of the considered latent vector S^t. [sent-282, score-0.358]
45 Similarly, B^r_{i d_k^{t'}} defines the influence of the past decision d_k^{t'} on the distribution of the considered latent vector variable s_i^t. [sent-283, score-0.544]
46 As indicated in Figure 2, the probability of each elementary decision d_k^t depends both on the current latent vector S^t and on the previously chosen elementary action d_{k-1}^t from D^t. [sent-289, score-0.572]
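Given the related positions and the per-relation parameters, the conditional distribution in the reconstructed equation above is straightforward to evaluate. The Python sketch below mirrors that computation; the container layout, array shapes and names are assumptions made for illustration, not the authors' implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def latent_means(related, prev_latents, prev_decisions, J, B):
    # related        : list of (t_prev, r) pairs from the edge specification
    # prev_latents   : dict t_prev -> latent vector S^{t_prev} (length n)
    # prev_decisions : dict t_prev -> list of decision indices chosen at t_prev
    # J[r]           : [n, n] latent-to-latent weights J^r_{ij}
    # B[r]           : [n, D] decision-to-latent weights B^r_{id}
    n = next(iter(J.values())).shape[0]
    activation = np.zeros(n)
    for t_prev, r in related:
        activation += J[r] @ prev_latents[t_prev]     # sum_j J^r_{ij} s_j^{t'}
        for d in prev_decisions[t_prev]:
            activation += B[r][:, d]                  # sum_k B^r_{i d_k^{t'}}
    return sigmoid(activation)                        # P(s_i^t = 1 | history) for each i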
47 One difference between LPCFGs and ISBNs is that LPCFGs add latent annotations to the symbols of a grammar, while ISBNs add latent annotations to the states of an automaton. [sent-295, score-0.45]
48 Another distinction between LPCFGs and ISBNs is that LPCFGs use latent annotations to split symbols into multiple atomic symbols, while ISBNs add vectors of latent variables to the existing symbol variables. [sent-298, score-0.404]
49 The structure of the similarities between vectors is much richer than the structure of similarities between split atomic symbols, which gives ISBNs a more structured latent variable space than LPCFGs. [sent-299, score-0.329]
50 To approximate P(d_k^t | h(t, k)) using the value of L_V, we have to include the current decision d_k^t in the set of visible variables, along with the visible variables specified in h(t, k). [sent-324, score-0.291]
51 Thus we need to compute a separate estimate \max_Q L_V(d) for each possible value of d_k^t = d: \max_Q L_V(d) = \max_Q \sum_{H^t} Q(H^t | h(t,k), d_k^t = d) \ln \frac{P(H^t, h(t,k), d_k^t = d)}{Q(H^t | h(t,k), d_k^t = d)}, where H^t = \{S^1, ..., S^t\}. [sent-326, score-0.774]
52 Unfortunately, in general this is not feasible, in particular with labelled output structures where the number of possible alternative decisions d_k^t can be large. [sent-335, score-0.367]
53 In our modifications of the mean field method, we propose to consider the next decision d_k^t as a hidden variable. [sent-338, score-0.33]
54 Then the assumption of full factorisability of Q(H^t, d_k^t | h(t, k)) is stronger than in the standard mean field theory, because the approximate distribution Q is no longer conditioned on the next decision d_k^t. [sent-339, score-0.291]
55 The approximate fully factorisable distribution Q(H|V) can be written as Q(H|V) = q_{tk}(d_k^t) \prod_{t', i} (\mu_i^{t'})^{s_i^{t'}} (1 - \mu_i^{t'})^{1 - s_i^{t'}}. [sent-340, score-0.304]
56 where \mu_i^{t'} is the free parameter which determines the distribution of state variable i at position t', namely its mean, and q_{tk}(d_k^t) is the free parameter which determines the distribution over decisions d_k^t. [sent-341, score-0.388]
57 Importantly, we use q_{tk}(d) to estimate the conditional probability of the next decision: P(d_k^t = d | h(t, k)) ≈ q_{tk}(d). [sent-342, score-0.414]
58 We conducted preliminary experiments with natural language parsing on very small data sets and even in this setup the method appeared to be very slow and, surprisingly, not as accurate as the modification considered further in this section. [sent-343, score-0.368]
59 The same argument applies to decision variables; the approximate distribution of the next decision q_{tk}(d) is given by q_{tk}(d) = \frac{\Phi_{h(t,k)}(d) \exp(\sum_j W_{dj} \mu_j^t)}{\sum_{d'} \Phi_{h(t,k)}(d') \exp(\sum_j W_{d'j} \mu_j^t)}. [sent-376, score-0.317]
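The decision distribution above can be computed directly from the current latent means. In the Python sketch below, Phi_{h(t,k)} is realised as a boolean mask over decisions allowed in the current parser state; that hard 0/1 form and all names are illustrative assumptions.

import numpy as np

def decision_distribution(mu_t, W, allowed_mask):
    # mu_t         : latent means mu^t, length n
    # W            : decision weights, shape [D, n]
    # allowed_mask : boolean array of length D playing the role of Phi_{h(t,k)}
    scores = W @ mu_t                                 # sum_j W_{dj} mu_j^t for every d
    scores = np.where(allowed_mask, scores, -np.inf)  # forbid impossible decisions
    scores = scores - scores[allowed_mask].max()      # numerical stability
    weights = np.exp(scores)                          # disallowed decisions get weight 0
    return weights / weights.sum()                    # q_{tk}(d)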
60 For mean field approximations to ISBNs, it implies the need to update the latent vector means \mu_i^{t'} after observing a decision d_k^t, for t' ≤ t. [sent-382, score-0.331]
61 The use of edges directly from decision variables to subsequent latent vectors is designed to mitigate this limitation, but such edges cannot in general accurately approximate bottom-up reasoning. [sent-383, score-0.418]
62 The decision distribution q_{tk}(d_k^t) maximises L_V when it has the same dependence on the latent vector means \mu_j^t as in the feed-forward approximation, namely expression (10). [sent-388, score-0.516]
63 Optimally, after each new decision d_k^t, we should recompute all the means \mu_i^{t'} for all the latent vectors S^{t'}, t' ≤ t. [sent-390, score-0.47]
64 Instead, after making each decision d_k^t and adding it to the set of visible variables V, we recompute only the means of the current latent vector S^t. [sent-392, score-0.47]
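Putting these pieces together, the incremental mean-field (IMF) decoder can be organised roughly as in the Python sketch below. This is only a schematic outline: the step that folds an observed decision back into the current means simply adds that decision's weight row to the pre-sigmoid input, which is an assumed stand-in for the update the paper derives by maximising L_V, not the authors' exact formula.

import numpy as np

def imf_position(related, prev_means, prev_decisions, J, B, W, allowed_masks, choose):
    # allowed_masks : one boolean mask per elementary decision slot at this position
    # choose        : callback picking a decision index from a distribution q
    n = W.shape[1]
    activation = np.zeros(n)                          # feed-forward input to S^t
    for t_prev, r in related:
        activation += J[r] @ prev_means[t_prev]
        for d in prev_decisions[t_prev]:
            activation += B[r][:, d]
    mu_t = 1.0 / (1.0 + np.exp(-activation))

    chosen = []
    for mask in allowed_masks:
        scores = np.where(mask, W @ mu_t, -np.inf)
        scores = scores - scores[mask].max()
        q = np.exp(scores)
        q = q / q.sum()                               # q_{tk}(d), as in the earlier sketch
        d = choose(q)
        chosen.append(d)
        activation = activation + W[d]                # assumed correction (see lead-in)
        mu_t = 1.0 / (1.0 + np.exp(-activation))      # recompute only the current means
    return chosen, mu_t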
65 However, in our version of the mean field method the approximate distributions of hidden decision variables q_{tk} are used to compute the data likelihood (8) and, thus, maximising this target function will not automatically imply KL divergence minimisation. [sent-418, score-0.376]
66 As we discussed in Section 3, the use of relations based on the partial output structure makes it possible to take into account statistical interdependencies between decisions closely related in the output structure, but separated by arbitrarily many positions in the input structure. [sent-423, score-0.295]
67 Second, we apply the models to a real problem, parsing of natural language, where we compare our approximations with state-of-the-art models. [sent-432, score-0.315]
68 1 Artificial Experiment. In order to have an upper bound for our artificial experiments, we do not consider incremental models, but instead use a dynamic sigmoid belief network, a first-order Markovian model, and consider a sequence labelling task. [sent-434, score-0.391]
69 The following generative story corresponds to the random dynamic SBNs. (Figure 5: an example phrase structure tree for the sentence "Bill sells fresh oranges", with root, S, NP, VP, N, V and J nodes.) [sent-436, score-0.323]
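The extracted text does not spell out the generative story itself, so the Python sketch below shows only one generic way a random first-order dynamic SBN can generate a label sequence; the dimensionalities, the Gaussian parameter prior, and the softmax label distribution are all assumptions made for illustration.

import numpy as np

def sample_dynamic_sbn(T, n_latent, n_labels, seed=0):
    rng = np.random.default_rng(seed)
    J = rng.normal(size=(n_latent, n_latent))     # latent-to-latent weights
    B = rng.normal(size=(n_latent, n_labels))     # previous-label-to-latent weights
    W = rng.normal(size=(n_labels, n_latent))     # latent-to-label weights

    s_prev = np.zeros(n_latent)
    y_prev = 0
    labels = []
    for _ in range(T):
        p_s = 1.0 / (1.0 + np.exp(-(J @ s_prev + B[:, y_prev])))   # sigmoid belief units
        s = (rng.random(n_latent) < p_s).astype(float)             # sample binary latent vector
        scores = W @ s
        p_y = np.exp(scores - scores.max())
        p_y = p_y / p_y.sum()                                      # label distribution
        y = int(rng.choice(n_labels, p=p_y))
        labels.append(y)
        s_prev, y_prev = s, y
    return labels

# sample_dynamic_sbn(10, n_latent=8, n_labels=4) -> a list of 10 label indices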
70 2 Natural Language Parsing. We compare our two approximations on the natural language phrase structure parsing task. [sent-457, score-0.517]
71 An example of such a tree is presented in Figure 5, where the tree specifies that the adjective (J) fresh and the noun (N) oranges form a noun phrase (NP) "fresh oranges", which, when combined with the verb (V) sells, forms the verb phrase (VP) "sells fresh oranges". [sent-479, score-0.316]
72 If this is true, then it suggests that ISBNs are a good abstract model for problems similar to natural language parsing, namely parsing problems which benefit from latent variable induction. [sent-481, score-0.586]
73 The decision ProjectY replaces the current top of the stack X with a new node Y , and specifies that Y is the parent of X in the output structure. [sent-492, score-0.329]
74 The decision Attach removes the current top of the stack X and specifies that element Y under the top of the stack is the parent of X. [sent-494, score-0.325]
75 Though these three types of decisions are sufficient to parse any constituent tree, Henderson (2003) extends the parsing strategy to include a specific treatment of a particular configuration in the parse tree, Chomsky adjunction, using a version of the Attach decision called Modify. [sent-495, score-0.456]
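The three derivation decisions described above can be illustrated with a minimal stack automaton (Python). The Node class, the form of the Shift decision (here it simply pushes a new leaf for the next word), and the derivation hinted at in the final comment are assumptions for illustration; the Modify variant for Chomsky adjunction is omitted.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

class DerivationState:
    def __init__(self):
        self.stack = []

    def shift(self, word_label):
        # push a new leaf node for the next input word
        self.stack.append(Node(word_label))

    def project(self, parent_label):
        # replace the top of the stack X with a new node Y; Y becomes the parent of X
        x = self.stack.pop()
        self.stack.append(Node(parent_label, [x]))

    def attach(self):
        # remove the top of the stack X and make the element under it the parent of X
        x = self.stack.pop()
        self.stack[-1].children.append(x)

# e.g. s = DerivationState(); s.shift("N:Bill"); s.project("NP"); s.project("S")
# further Shift / Project / Attach decisions would then build and attach the
# verb phrase of the tree in Figure 5.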
76 As was defined in expression (5), the probability of each state variable s_j^t in the ISBN depends on all the latent variables and previous relevant decisions in a subset of previous relevant positions t': r(t', t). [sent-496, score-0.345]
77 Stack Context: the last previous position with the same element on top of the stack as at current position t. [sent-499, score-0.285]
78 These relations were motivated by linguistic considerations and many of them have also been found useful in other parsing models (Johnson, 1998; Roark and Johnson, 1999). [sent-506, score-0.279]
79 Increasing the beam size and the branching factor beyond these values did not significantly affect parsing accuracy. [sent-559, score-0.279]
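Decoding with these generative models uses a pruned search over partial derivations; the generic beam-search sketch below (Python) shows where the beam size and the per-hypothesis branching factor enter. The hypothesis representation and the expand / is_complete callbacks are assumptions, not the authors' decoder.

import heapq

def beam_search(initial, expand, is_complete, beam_size=100, branching=5, max_steps=10000):
    # expand(hyp)      -> list of (log_prob_increment, next_hyp) candidates
    # is_complete(hyp) -> True when hyp is a full derivation
    beam = [(0.0, initial)]
    completed = []
    for _ in range(max_steps):
        candidates = []
        for score, hyp in beam:
            if is_complete(hyp):
                completed.append((score, hyp))
                continue
            # keep only the `branching` best extensions of each hypothesis
            best = heapq.nlargest(branching, expand(hyp), key=lambda c: c[0])
            for delta, nxt in best:
                candidates.append((score + delta, nxt))
        if not candidates:
            break
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    completed.extend(c for c in beam if is_complete(c[1]))
    return max(completed, key=lambda c: c[0]) if completed else None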
80 Table 1 lists the results of the NN approximation and the IMF approximation, along with results of different generative and discriminative parsing methods evaluated in the same experimental setup (Bikel, 2004; Taskar et al. [sent-563, score-0.279]
81 Although no longer one of the most accurate parsing models on the standard WSJ parsing benchmark (including sentences of all lengths), the (Charniak, 2000) parser achieves competitive results (89. [sent-570, score-0.637]
82 Approximate training times on a standard desktop PC for the IMF and NN approximations were 140 and 3 hours, respectively, and parsing times were 3 and 0. [sent-580, score-0.315]
83 Note also that the LPCFG decoding algorithm uses a form of Bayes risk minimisation to optimise for the specific scoring function, whereas our model, like most parsing methods in the literature, outputs the highest scoring tree (maximum a-posteriori decoding). [sent-589, score-0.418]
84 Even approximations such as those tested here, with a very strong factorisability assumption, allow us to build quite accurate parsing models. [sent-594, score-0.315]
85 When they were originally proposed, latent variable models for natural language parsing were not particularly successful, demonstrating results significantly below the state-of-the-art models (Kurihara and Sato, 2004; Matsuzaki et al. [sent-600, score-0.547]
86 All these approaches considered extensions of a classic PCFG model, which augment non-terminals of the grammar with latent variables (Latent-annotated PCFGs, LPCFGs). [sent-606, score-0.283]
87 One important difference between LPCFGs and ISBNs is that in LPCFGs the latent annotations are used to expand the set of atomic labels used in a PCFG, whereas ISBNs directly reason with a vector of latent features. [sent-619, score-0.404]
88 For example, ISBNs have been applied to the dependency parsing problem (Titov and Henderson, 2007) and to joint dependency parsing and semantic role labelling (Henderson et al. [sent-626, score-0.639]
89 The application of LPCFG models to even dependency parsing has required sophisticated grammar transformations (Musillo and Merlo, 2008), to which the split-and-merge training approach has not yet been successfully adapted. [sent-629, score-0.383]
90 (2008) also suggest that the latent annotations of syntactic states are not only useful for syntactic parsing itself but also can be helpful for other tasks. [sent-631, score-0.504]
91 5% when latent annotations for syntactic decisions were provided, thereby indicating that the latent annotation of syntactic parsing states helps semantic role labelling. [sent-633, score-0.79]
92 Conclusions. This paper proposes a new class of graphical models for structured prediction problems, incremental sigmoid belief networks, and has applied it to natural language grammar learning. [sent-635, score-0.498]
93 This allows the model to directly express regularities that are local in the output structure but not local in the input structure, making ISBNs appropriate for parsing problems. [sent-637, score-0.438]
94 This ability supports the induction of latent variables which augment the grammatical structures annotated in the training data, thereby solving a limited form of grammar induction. [sent-638, score-0.283]
95 Second, both approximations were applied to the natural language parsing task, where the mean field method demonstrated significantly better results. [sent-645, score-0.404]
96 These results are non-significantly different from the results of another history-based probabilistic model of parsing (Charniak, 2000) which is competitive with the state-of-the-art for single-model parsers. [sent-646, score-0.318]
97 All the terms except for d\mu_i^{t,k}/dx are trivial to compute: \frac{\partial \hat{L}(T)}{\partial W_{dj}} = \sum_{t,k} \mu_j^{t,k} \big( \delta_{d_k^t d} - q_{tk}(d) \big), \quad \frac{\partial \hat{L}(T)}{\partial \mu_i^{t,k}} = (1 - \delta_{k, K_t+1}) \big( W_{d_k^t i} - \sum_d q_{tk}(d) W_{di} \big) (17), where \delta_{ij} is the Kronecker delta. [sent-659, score-0.46]
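The two partial derivatives in (17) translate directly into code; the Python sketch below (with assumed container names) accumulates exactly those expressions over the elementary decisions of one training derivation.

import numpy as np

def gradients(mus, qs, chosen, W, is_padding_slot):
    # mus     : dict (t, k) -> latent means mu^{t,k} (length n)
    # qs      : dict (t, k) -> decision distribution q_{tk} (length D)
    # chosen  : dict (t, k) -> index of the decision actually taken, d_k^t
    # is_padding_slot : dict (t, k) -> True when k = K_t + 1, the slot zeroed
    #                   out by the Kronecker delta in (17)
    dW = np.zeros_like(W)
    dmu = {}
    for (t, k), mu in mus.items():
        q = qs[(t, k)]
        d_star = chosen[(t, k)]
        indicator = np.zeros(len(q))
        indicator[d_star] = 1.0
        # dL/dW_{dj} = sum_{t,k} mu_j^{t,k} (delta_{d_k^t d} - q_{tk}(d))
        dW += np.outer(indicator - q, mu)
        # dL/dmu_i^{t,k} = (1 - delta_{k, K_t+1}) (W_{d_k^t i} - sum_d q_{tk}(d) W_{di})
        gate = 0.0 if is_padding_slot[(t, k)] else 1.0
        dmu[(t, k)] = gate * (W[d_star] - q @ W)
    return dW, dmu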
98 A latent variable model of synchronous syntactic-semantic parsing for multiple languages. [sent-756, score-0.497]
99 A latent variable model of synchronous parsing for syntactic and semantic dependencies. [sent-765, score-0.497]
100 Scalable discriminative learning for natural language parsing and translation. [sent-897, score-0.368]
wordName wordTfidf (topN-words)
[('isbns', 0.405), ('parsing', 0.279), ('qtk', 0.23), ('lv', 0.195), ('dk', 0.184), ('latent', 0.179), ('henderson', 0.176), ('imf', 0.175), ('sbns', 0.157), ('lpcfgs', 0.147), ('enderson', 0.138), ('itov', 0.138), ('lpcfg', 0.138), ('igmoid', 0.12), ('ncremental', 0.12), ('stack', 0.109), ('wd', 0.109), ('decision', 0.107), ('grammar', 0.104), ('elief', 0.102), ('eld', 0.098), ('sigmoid', 0.096), ('ti', 0.095), ('dt', 0.09), ('language', 0.089), ('position', 0.088), ('labelling', 0.081), ('st', 0.077), ('etworks', 0.075), ('sti', 0.074), ('turian', 0.074), ('charniak', 0.072), ('decisions', 0.07), ('linguistics', 0.07), ('edges', 0.066), ('incremental', 0.066), ('nn', 0.065), ('belief', 0.065), ('sbn', 0.064), ('output', 0.064), ('wdi', 0.063), ('derivatives', 0.061), ('petrov', 0.061), ('dz', 0.059), ('phrase', 0.057), ('automaton', 0.057), ('mary', 0.057), ('derivation', 0.056), ('structure', 0.056), ('brk', 0.055), ('stj', 0.055), ('titov', 0.055), ('saul', 0.054), ('np', 0.053), ('elementary', 0.051), ('isbn', 0.051), ('pcfgs', 0.05), ('node', 0.049), ('dynamic', 0.049), ('labelled', 0.049), ('meeting', 0.049), ('history', 0.048), ('ivan', 0.047), ('jirj', 0.046), ('matsuzaki', 0.046), ('oranges', 0.046), ('annotations', 0.046), ('incrementally', 0.046), ('vp', 0.046), ('parser', 0.044), ('par', 0.043), ('instantiated', 0.043), ('positions', 0.041), ('maximisation', 0.04), ('graphical', 0.04), ('hidden', 0.039), ('tree', 0.039), ('generalisations', 0.039), ('fresh', 0.039), ('root', 0.039), ('model', 0.039), ('james', 0.038), ('tj', 0.038), ('ln', 0.038), ('structured', 0.038), ('merlo', 0.037), ('paola', 0.037), ('sells', 0.037), ('wdk', 0.037), ('dan', 0.037), ('finkel', 0.037), ('approximations', 0.036), ('decoding', 0.036), ('sentences', 0.035), ('variational', 0.035), ('grammars', 0.035), ('dbns', 0.035), ('dependencies', 0.034), ('si', 0.034), ('sequence', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
2 0.20994784 53 jmlr-2010-Inducing Tree-Substitution Grammars
Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater
Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process
3 0.15405238 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
Author: Shay B. Cohen, Noah A. Smith
Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference
4 0.11322987 6 jmlr-2010-A Rotation Test to Verify Latent Structure
Author: Patrick O. Perry, Art B. Owen
Abstract: In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors. We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure of non-Gaussianity in the histogram of the principal eigenvector components instead of on the eigenvalue. The method finds and verifies some latent dichotomies in the microarray data from the AGEMAP consortium. Keywords: independent components analysis, Kronecker covariance, latent variables, projection pursuit, transposable data
5 0.11183199 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
6 0.10986774 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
7 0.083644204 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
8 0.073719546 15 jmlr-2010-Approximate Tree Kernels
9 0.073473595 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
10 0.067876503 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
11 0.058701336 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
12 0.053752195 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
13 0.053527042 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
14 0.051763743 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
15 0.050858472 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
16 0.04821099 10 jmlr-2010-An Exponential Model for Infinite Rankings
17 0.047183972 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?
18 0.044966992 39 jmlr-2010-FastInf: An Efficient Approximate Inference Library
19 0.042879611 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
20 0.04085739 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
topicId topicWeight
[(0, -0.235), (1, 0.069), (2, -0.271), (3, -0.005), (4, 0.138), (5, -0.289), (6, 0.002), (7, -0.01), (8, 0.023), (9, 0.026), (10, 0.08), (11, 0.038), (12, -0.09), (13, -0.021), (14, -0.005), (15, -0.079), (16, 0.029), (17, 0.061), (18, -0.057), (19, -0.066), (20, -0.021), (21, 0.051), (22, -0.096), (23, 0.093), (24, 0.028), (25, -0.085), (26, 0.109), (27, 0.081), (28, 0.133), (29, -0.017), (30, -0.064), (31, 0.031), (32, 0.127), (33, -0.093), (34, -0.051), (35, -0.083), (36, 0.083), (37, 0.012), (38, 0.056), (39, 0.085), (40, 0.198), (41, -0.013), (42, 0.074), (43, 0.072), (44, -0.022), (45, -0.068), (46, 0.021), (47, -0.026), (48, -0.173), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.93393159 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
2 0.68489319 53 jmlr-2010-Inducing Tree-Substitution Grammars
Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater
Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process
3 0.56844819 6 jmlr-2010-A Rotation Test to Verify Latent Structure
Author: Patrick O. Perry, Art B. Owen
Abstract: In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors. We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure of non-Gaussianity in the histogram of the principal eigenvector components instead of on the eigenvalue. The method finds and verifies some latent dichotomies in the microarray data from the AGEMAP consortium. Keywords: independent components analysis, Kronecker covariance, latent variables, projection pursuit, transposable data
4 0.55405581 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
Author: Shay B. Cohen, Noah A. Smith
Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference
5 0.40094459 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
6 0.36321408 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
7 0.35908371 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
8 0.34225863 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
9 0.30183336 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
10 0.2626681 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
11 0.26190448 63 jmlr-2010-Learning Instance-Specific Predictive Models
12 0.25887865 10 jmlr-2010-An Exponential Model for Infinite Rankings
13 0.25069723 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
14 0.24392159 40 jmlr-2010-Fast and Scalable Local Kernel Machines
15 0.2414539 15 jmlr-2010-Approximate Tree Kernels
16 0.2352002 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
17 0.22283754 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
18 0.22142194 37 jmlr-2010-Evolving Static Representations for Task Transfer
19 0.21055719 39 jmlr-2010-FastInf: An Efficient Approximate Inference Library
20 0.20267293 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
topicId topicWeight
[(3, 0.013), (4, 0.014), (8, 0.014), (21, 0.412), (22, 0.028), (24, 0.013), (32, 0.049), (36, 0.067), (37, 0.048), (75, 0.11), (81, 0.019), (85, 0.048), (91, 0.035), (96, 0.04)]
simIndex simValue paperId paperTitle
1 0.84449959 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate
Author: Gérard Biau, Frédéric Cérou, Arnaud Guyader
Abstract: Bagging is a simple way to combine estimates in order to improve their performance. This method, suggested by Breiman in 1996, proceeds by resampling from the original data set, constructing a predictor from each subsample, and decide by combining. By bagging an n-sample, the crude nearest neighbor regression estimate is turned into a consistent weighted nearest neighbor regression estimate, which is amenable to statistical analysis. Letting the resampling size kn grows appropriately with n, it is shown that this estimate may achieve optimal rate of convergence, independently from the fact that resampling is done with or without replacement. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, adaptation results by data-splitting are presented. Keywords: bagging, resampling, nearest neighbor estimate, rates of convergence
2 0.78556913 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
Author: Tong Zhang
Abstract: We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multistage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets. Keywords: sparsity, non-convex optimization, convex relaxation, multi-stage convex relaxation
same-paper 3 0.75709176 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
4 0.63566697 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
5 0.44895053 53 jmlr-2010-Inducing Tree-Substitution Grammars
Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater
Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process
6 0.41137457 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
7 0.41034868 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
8 0.40974662 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
9 0.40853778 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
10 0.40716481 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
11 0.39976227 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
12 0.39827138 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
13 0.39756328 109 jmlr-2010-Stochastic Composite Likelihood
14 0.39708865 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
15 0.39639646 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
16 0.39559668 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
17 0.39260948 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
18 0.39218923 25 jmlr-2010-Composite Binary Losses
19 0.39210078 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
20 0.39183447 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models