jmlr jmlr2010 jmlr2010-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shay B. Cohen, Noah A. Smith
Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference
Reference: text
sentIndex sentText sentNum sentScore
1 Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. [sent-13, score-0.632]
2 We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. [sent-15, score-0.7]
3 Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. [sent-16, score-0.658]
4 Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference 1. [sent-17, score-1.021]
5 Introduction One of the motivating applications for grammar induction, or unsupervised grammatical structure discovery, is for the syntactic analysis of text data. [sent-18, score-0.646]
6 If the grammar is a probability model, then learning a grammar means selecting a model from a prespecified model family. [sent-22, score-1.014]
7 In much prior work, the family is defined as the set of probabilistic grammars for a fixed set of grammar rules, so that grammar learning amounts to parameter estimation from incomplete data: sentences in the language are yields of hidden derivations. [sent-23, score-0.7]
8 Probabilistic grammars are widely used to build models in natural language processing from annotated data, thus allowing easy comparison between unsupervised and supervised techniques. [sent-37, score-0.373]
9 NLP applications of probabilistic grammars and their generalizations include parsing (Collins, 2003; Klein and Manning, 2003; Charniak and Johnson, 2005), machine translation (Wu, 1997; Ding and Palmer, 2005; Chiang, 2005), and question answering (Wang et al. [sent-38, score-0.361]
10 Such bias can be especially important in cases where the sample size is not large or when the grammar is highly non-identifiable, two scenarios that hold with grammar induction (see Cohen and Smith, 2010, for a discussion of the sample size required for estimation of probabilistic grammars). [sent-44, score-1.144]
11 Bayesian methods have been applied to probabilistic grammars in various ways: specifying priors over the parameters of a PCFG (Johnson et al. [sent-45, score-0.362]
12 The challenge in Bayesian grammar learning is efficiently approximating probabilistic inference. [sent-52, score-0.577]
13 Much of the Bayesian literature and its application to probabilistic grammars has focused on conjugate priors in the form of Dirichlet distributions. [sent-55, score-0.362]
14 We begin by motivating the modeling of covariance among the probabilities of grammar derivation events, and propose the use of logistic normal distributions (Aitchison, 1986; Blei and Lafferty, 2006) over multinomials to build priors over grammars (Section 3). [sent-58, score-1.327]
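As a rough illustration of what such a prior provides (the dimensions, means, and covariance values below are invented for the example, not taken from the paper), drawing a multinomial from a logistic normal amounts to sampling a Gaussian vector and mapping it onto the simplex with a softmax; the covariance matrix is what lets correlated events rise and fall together across draws, which a Dirichlet cannot express.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_multinomial_from_ln(mu, Sigma, rng):
    """Draw theta from a logistic normal prior: sample a Gaussian vector,
    exponentiate, and normalize so the result lies on the simplex."""
    eta = rng.multivariate_normal(mu, Sigma)
    w = np.exp(eta - eta.max())   # subtract the max for numerical stability
    return w / w.sum()

# A hypothetical 3-event multinomial whose first two events covary positively,
# so their probabilities tend to rise and fall together across draws.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
print(sample_multinomial_from_ln(mu, Sigma, rng))
```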
15 Our motivation relies on the observation that various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. [sent-59, score-0.632]
16 Noting that grammars are built out of a large collection of multinomials, we introduce shared logistic normal distributions to allow arbitrary covariance among any grammar probabilities. [sent-60, score-1.14]
17 We then describe efficient inference techniques to support decoding and learning with (shared) logistic normal priors over grammars (Section 4), facing the challenge of non-conjugacy of the logistic normal prior to the multinomial family. [sent-61, score-1.091]
18 A probabilistic grammar defines a probability distribution over a certain kind of structured object (a derivation of the underlying symbolic grammar) explained step-by-step as a stochastic process. [sent-72, score-0.612]
19 The value of Nk depends on whether the kth multinomial is the starting-state multinomial (in which case Nk = s), a transition multinomial (Nk = s), or an emission multinomial. [sent-93, score-0.346]
20 The field of grammatical inference also includes algorithms and methods for learning the structure of a (formal) language generator or grammar (Angluin, 1988; de la Higuera, 2005; Clark and Thollard, 2004; Clark et al. [sent-98, score-0.698]
21 We focus on grammars which generate dependency structures for derivations. [sent-104, score-0.361]
22 The grammars used in our experiments are extremely permissive, allowing every possible dependency structure for a sentence (see Section 2. [sent-108, score-0.406]
23 In addition, if we place a Dirichlet prior on the grammar parameters θ (Section 3. [sent-126, score-0.563]
24 2 Dependency Model with Valence Dependency grammar (Tesnière, 1959) refers to linguistic theories that posit graphical representations of sentences in which words are vertices and the syntax is a directed tree. [sent-132, score-0.648]
25 Our experiments perform unsupervised induction of probabilistic dependency grammars using a model known as “dependency model with valence” (Klein and Manning, 2004). [sent-135, score-0.532]
26 The language of the grammar is context-free, though our models are permissive and allow the derivation of any string in Γ∗ . [sent-142, score-0.621]
27 Hence the goal of grammar induction is to model the distribution of derivations, not to separate grammatical strings or derivations from ungrammatical ones. [sent-172, score-0.675]
28 Klein and Manning’s (2004) dependency model with valence is widely recognized as an effective probabilistic grammar for dependency grammar induction. [sent-173, score-1.3]
29 Many recent studies on dependency grammar induction use it. [sent-174, score-0.675]
30 , 2010a); it has been used to test the efficacy of multilingual learning through dependency grammar induction (Ganchev et al. [sent-177, score-0.734]
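To make the model concrete, here is a hedged sketch of the kind of head-outward stochastic process the DMV defines, with stop decisions conditioned on direction and adjacency and attachment decisions conditioned on the head tag; the tag inventory, probability values, and simplified valence handling are our own illustration rather than the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)
TAGS = ["NN", "VB", "JJ"]  # toy part-of-speech inventory (illustrative)

# Hypothetical parameters: stop decisions indexed by (head tag, direction, adjacency)
# and attachment distributions indexed by (head tag, direction).
p_stop = {(h, d, adj): 0.6 if adj else 0.4
          for h in TAGS for d in ("L", "R") for adj in (True, False)}
p_child = {(h, d): np.full(len(TAGS), 1.0 / len(TAGS))
           for h in TAGS for d in ("L", "R")}
p_root = np.full(len(TAGS), 1.0 / len(TAGS))

def generate(head, depth=0, max_depth=3):
    """Head-outward generation: in each direction, repeatedly decide whether
    to stop; if not, attach a child tag and recurse on the child."""
    tree = {"head": head, "left": [], "right": []}
    if depth >= max_depth:
        return tree
    for d, side in (("L", "left"), ("R", "right")):
        adjacent = True
        while rng.random() > p_stop[(head, d, adjacent)]:
            child = TAGS[rng.choice(len(TAGS), p=p_child[(head, d)])]
            tree[side].append(generate(child, depth + 1, max_depth))
            adjacent = False
    return tree

root = TAGS[rng.choice(len(TAGS), p=p_root)]
print(generate(root))
```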
31 In the rest of the paper, we assume we have a fixed grammar G for which we estimate the parameters. [sent-185, score-0.507]
32 This contrasts with earlier research that aims to bias the grammar learner with prior information. [sent-204, score-0.563]
33 Bayesian Models over Grammars An attractive way to incorporate prior knowledge about grammars is through a prior distribution over the grammar’s probabilities θ. [sent-208, score-0.365]
34 In this setting, it is α, the distribution over grammar parameters, that encodes knowledge about the grammar, and it will be α that we estimate when we perform learning. [sent-239, score-0.507]
35 In “model II,” the grammar parameters are drawn once for all of the sentences in the corpus. [sent-245, score-0.602]
36 We next give an overview of the Dirichlet prior, which provides analytical tractability for probabilistic grammars, and then present the alternative that addresses the second and third requirements, the logistic normal distribution. [sent-265, score-0.367]
37 , 2007), as it corresponds to eliminating unnecessary grammar rules. [sent-272, score-0.507]
38 (Indeed, learning to exclude rules by setting their probabilities to zero is one way of going about symbolic grammar induction. [sent-273, score-0.507]
39 ) If we use a Dirichlet distribution with a probabilistic grammar, then the hyperparameters for the grammar consist of K vectors with positive elements, the kth of which has length Nk . [sent-274, score-0.635]
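As a concrete picture of that parameterization (the number of multinomials, their lengths, and the hyperparameter values below are invented), the Dirichlet prior factors into K independent draws, one per multinomial:

```python
import numpy as np

rng = np.random.default_rng(2)

# K = 3 multinomials with lengths N_k; each gets its own positive hyperparameter vector.
N = [2, 3, 4]                          # illustrative sizes
alpha = [np.full(n, 0.5) for n in N]   # values below 1 encourage sparse multinomials

theta = [rng.dirichlet(a) for a in alpha]
for k, th in enumerate(theta):
    print(f"multinomial {k}: {th.round(3)} (sums to {th.sum():.3f})")
```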
40 2 Modeling Covariance with Logistic Normal Distributions When we consider probabilistic grammars for natural languages, especially those over words or word classes like parts of speech, we do expect to see covariance structure. [sent-297, score-0.463]
41 From top to bottom: the partition structure S describes the index sets I_j, which tell how to segment a normal expert into parts that are matched to multinomials ("prt." in the figure). [sent-335, score-0.346]
42 (2008), we demonstrated how the LN distribution is an effective alternative to the Dirichlet for probabilistic dependency grammar induction in the Bayesian setting. [sent-341, score-0.745]
43 However, it can be shown (Aitchison, 1986) that given a Dirichlet distribution with very large α, we can find a logistic normal distribution such that the KL-divergence between the Dirichlet distribution and logistic normal distribution is small. [sent-344, score-0.482]
44 We define a shared logistic normal distribution with N “experts” over a collection of K multinomial distributions: Definition 1 Let ηn ∼ Normal(µn , Σn ) be a set of multivariate normal variables for n ∈ 1:N, where the length of ηn is denoted ℓn . [sent-369, score-0.535]
45 We then say θ distributes according to the shared logistic normal distribution with partition structure $S = (\{I_n\}_{n=1}^{N}, \{J_k\}_{k=1}^{K})$ and normal experts $\{(\mu_n, \Sigma_n)\}_{n=1}^{N}$, and denote it by $\theta \sim \mathrm{SLN}(\mu, \Sigma, S)$. [sent-373, score-0.519]
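A toy sketch of the combination step implied by this definition (the partition structure, expert sizes, and parameter values are our own illustration): the segments of the normal experts that map to the same multinomial are averaged, and the average is pushed through a softmax to yield that multinomial's probabilities.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(v):
    w = np.exp(v - v.max())
    return w / w.sum()

# Two normal experts (illustrative means/covariances), each of length 3.
experts = [rng.multivariate_normal(np.zeros(3), np.eye(3)) for _ in range(2)]

# Hypothetical partition structure: multinomial 0 draws on a segment of each
# expert (so their values are averaged), while multinomial 1 uses only expert 1.
J = {0: [(0, slice(0, 3)), (1, slice(0, 3))],   # (expert index, segment) pairs
     1: [(1, slice(0, 3))]}

theta = {}
for k, segments in J.items():
    stacked = np.stack([experts[n][seg] for n, seg in segments])
    eta_bar = stacked.mean(axis=0)          # average the matched segments
    theta[k] = softmax(eta_bar)             # map to the probability simplex
print(theta)
```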
46 4 Local Log-Linear Models over Parameters We give now another interpretation of the shared logistic normal prior using a feature representation, which is related to recent work by Berg-Kirkpatrick et al. [sent-393, score-0.36]
47 A probabilistic grammar with a shared logistic normal prior can be thought of as a probabilistic grammar where the grammar’s parameters are themselves modeled using a local log-linear model with a Gaussian prior over the weights of this log-linear model. [sent-395, score-1.57]
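A minimal sketch of that reading (the feature matrix and its interpretation are hypothetical): each event in a multinomial gets a log-linear score from its features, the scores are normalized locally within the multinomial, and the Gaussian prior sits on the feature weights.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical binary features for the events of one multinomial (rows = events).
F = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]], dtype=float)

w = rng.normal(0.0, 1.0, size=F.shape[1])   # Gaussian prior over the weights

scores = F @ w                              # local log-linear scores per event
theta = np.exp(scores - scores.max())
theta /= theta.sum()                        # normalize within the multinomial
print(theta)
```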
48 (2010) used the idea of local log-linear models for several natural language processing tasks, including dependency grammar induction and part-of-speech tagging. [sent-407, score-0.754]
49 In the Bayesian setting, decoding might be accomplished using the posterior over derivations, marginalizing out the unknown grammar weights. [sent-427, score-0.66]
50 The loss function we use is dependency attachment error, for the task of dependency grammar induction. [sent-443, score-0.79]
51 MBR decoding in this case works as follows: using θ and the inside-outside algorithm, we compute the posterior probability of each dependency attachment (directed edge in the graph) being present in the grammatical derivation for the sentence. [sent-444, score-0.429]
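A hedged sketch of the resulting decision rule; the marginals below are invented, and we substitute a greedy per-word head choice for the paper's tree-constrained maximization of expected attachments, so this is illustrative rather than the actual decoder.

```python
import numpy as np

# Posterior probability that each word attaches to each head (row = head,
# column = dependent); index 0 is the artificial root. Values are invented.
marginals = np.array([
    [0.0, 0.1, 0.7, 0.1],
    [0.0, 0.0, 0.2, 0.3],
    [0.0, 0.8, 0.0, 0.6],
    [0.0, 0.1, 0.1, 0.0],
])

# Greedy MBR-style choice: every dependent takes its most probable head.
# (The actual decoder maximizes expected attachments subject to a tree constraint.)
heads = marginals[:, 1:].argmax(axis=0)
for dep, head in enumerate(heads, start=1):
    print(f"word {dep} <- head {head} (p = {marginals[head, dep]:.2f})")
```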
52 Neither Viterbi nor MBR decoding uses the entire distribution over grammar weights. [sent-446, score-0.626]
53 We suggest “committee decoding,” in which a set of randomly sampled grammar weights is drawn for each sentence to be parsed. [sent-448, score-0.552]
54 The weights are drawn from the learned distribution over grammar weights, parameterized by µ and Σ in the LN case. [sent-449, score-0.507]
55 Note that this decoding mechanism is randomized: we sample a grammar per sentence, and use it to decode. [sent-451, score-0.626]
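A minimal sketch of that randomized step under the LN parameterization; the shapes of µ and Σ and the parser stub are placeholders, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(v):
    w = np.exp(v - v.max())
    return w / w.sum()

def committee_decode(sentence, mu, Sigma, parse_fn, rng):
    """Sample one grammar's worth of weights for this sentence, map each
    multinomial onto the simplex, and decode with the sampled grammar."""
    theta = [softmax(rng.multivariate_normal(m, S)) for m, S in zip(mu, Sigma)]
    return parse_fn(sentence, theta)

# Toy usage with a placeholder parser that just reports the sampled weights.
mu = [np.zeros(3), np.zeros(2)]
Sigma = [np.eye(3), np.eye(2)]
dummy_parse = lambda sent, theta: (sent, [t.round(2) for t in theta])
print(committee_decode(["NN", "VB"], mu, Sigma, dummy_parse, rng))
```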
56 The sampling algorithm they suggest is also rather complicated, while alternatives, such as mean-field variational inference (Wainwright and Jordan, 2008), offer faster convergence and a more intuitive solution to the problem of nonconjugacy of the logistic normal distribution. [sent-463, score-0.406]
57 Variational inference algorithms have been successfully applied to various grammar and syntax learning tasks (Kurihara and Sato, 2006; Liang et al. [sent-464, score-0.553]
58 We give the full technical details of mean-field variational inference for probabilistic grammars with logistic normal priors in Appendix B, and next give a brief overview of the main ideas, under the simplifying assumption that we have a single observation x. [sent-468, score-0.768]
59 For the case of logistic normal priors, an additional approximation will be necessary (a first-order Taylor approximation to the log of the normalization of the logistic normal distribution), because of the lack of conjugacy of the logistic normal priors to the multinomial family (see Appendix B). [sent-476, score-0.858]
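For concreteness, the standard first-order bound used for this purpose (following Blei and Lafferty, 2006) has roughly the following form, introducing an extra variational parameter $\zeta_k$; we write it here in our own notation as a reminder of the idea rather than as the paper's exact expression:

$$
\mathbb{E}_q\Big[\log \sum_{i=1}^{N_k} e^{\eta_{k,i}}\Big]
\;\le\; \zeta_k^{-1} \sum_{i=1}^{N_k} \mathbb{E}_q\big[e^{\eta_{k,i}}\big] + \log \zeta_k - 1
\;=\; \zeta_k^{-1} \sum_{i=1}^{N_k} e^{\tilde{\mu}_{k,i} + \tilde{\sigma}^2_{k,i}/2} + \log \zeta_k - 1 .
$$

The inequality is just the tangent-line bound on the logarithm at $\zeta_k$, and the equality uses the mean of a log-normal under the Gaussian variational factor.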
60 This makes inference applicable through the use of an inside-outside algorithm with a weighted grammar of the same form as the original model. [sent-478, score-0.553]
61 1) Experiments with dependency grammar induction for English text using the logistic normal distribution. [sent-501, score-0.916]
62 3) Experiments with the shared logistic normal distribution for tying parameters which correspond to the same coarse part-of-speech tag (English, Portuguese, and Turkish). [sent-507, score-0.421]
63 4) Experiments with the shared logistic normal distribution in bilingual settings (English, Portuguese, and Turkish). [sent-510, score-0.389]
64 Following standard practice, sentences were stripped of words and punctuation, leaving part-of-speech tags for the unsupervised induction of dependency structure. [sent-581, score-0.391]
65 The reason could be that the logistic normal distribution, even when permitting only diagonal covariance matrices (the identity-covariance initialization is a weaker case, since we only initialize with diagonal matrices), allows the variance to be modeled directly in the parameters. [sent-628, score-0.393]
66 As in the case for English, sentences were stripped of words and punctuation, leaving part-of-speech tags for the unsupervised induction of dependency structure. [sent-674, score-0.391]
67 Note that for Portuguese, the difference is much smaller between the EM baselines and logistic normal variational EM when only short sentences are considered, but there is a wider gap for longer sentences; the LN models appear to generalize better to longer sentences. [sent-678, score-0.487]
68 For Turkish, no method outperforms AttachRight, but there is still a big gap between variational EM with the logistic normal and the other EM baselines. [sent-679, score-0.36]
69 3 SLN with Nouns, Verbs, and Adjectives We now turn to experiments where the partition structure lets parameters across multinomials covary, making use of the expressive power of the shared logistic normal distribution. [sent-688, score-0.515]
70 Sharing information between two models is accomplished by softly tying grammar weights in the two hidden grammars. [sent-890, score-0.583]
71 We then add a normal expert that ties together the parts of speech in the respective partition structures of both grammars. [sent-892, score-0.423]
72 For each sentence in the corpus, the following two steps are conducted as before (model I): the normal experts are sampled from the SLN distribution and combined into multinomials to parameterize the DMV; a grammar derivation is sampled from the resulting DMV. [sent-902, score-0.943]
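A sketch of that per-sentence process in the bilingual setting, with one language-specific expert per language plus one shared expert that realizes the soft tying; the sizes, means, and averaging scheme below are our own illustration of the idea, and step 2 (sampling a DMV derivation) is left as a comment.

```python
import numpy as np

rng = np.random.default_rng(6)

def softmax(v):
    w = np.exp(v - v.max())
    return w / w.sum()

# One language-specific expert per language plus one expert shared by both,
# all over the same 3 coarse part-of-speech events (sizes are invented).
mu = {"en": np.zeros(3), "tr": np.zeros(3), "shared": np.zeros(3)}
Sigma = {k: np.eye(3) for k in mu}

def sample_dmv_params(lang):
    """Model-I style step 1: draw this sentence's experts and average the
    language-specific expert with the shared (tying) expert."""
    own = rng.multivariate_normal(mu[lang], Sigma[lang])
    tied = rng.multivariate_normal(mu["shared"], Sigma["shared"])
    return softmax((own + tied) / 2.0)

for lang, sentence in [("en", ["NN", "VB"]), ("tr", ["NN", "NN", "VB"])]:
    theta = sample_dmv_params(lang)
    # Step 2 would sample a DMV derivation for `sentence` from theta;
    # here we only print the sentence-specific multinomial.
    print(lang, sentence, theta.round(3))
```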
73 English grammar induction shows moderate gains when tied with Portuguese and strong gains with Turkish. [sent-904, score-0.567]
74 The runtime of the variational E-step for a sentence x is still cubic in the length of x, as in EM, so the runtime of the variational E-step in the multilingual case scales the same as it would if we added an equivalent amount of data in the monolingual case. [sent-912, score-0.38]
75 Some of the improvements in dependency grammar induction are achieved because of techniques which are orthogonal to ours, such as improvements in the underlying grammar (instead of DMV; Headden et al. [sent-917, score-1.182]
76 Discussion We have shown that modeling covariance among grammar weights within a probabilistic grammar’s multinomial distributions, across its distributions, and across grammars in two languages can have benefits for learning dependency structure in an unsupervised empirical Bayesian framework. [sent-924, score-1.258]
77 We believe, however, that more remains to be done to incorporate prior linguistic knowledge into unsupervised grammar induction. [sent-927, score-0.65]
78 The empirical Bayesian paradigm explored here, and the use of variational approximations for coping with non-conjugacy, will be important tools in future work that brings together prior knowledge and unannotated data for grammar induction. [sent-929, score-0.682]
79 To model the covariance, we used the logistic normal distribution as a prior over the grammar parameters. [sent-942, score-0.804]
80 In addition, we extended the logistic normal distribution to a new family of distributions, in order to model covariance across the multinomial family in a probabilistic grammar. [sent-943, score-0.483]
81 We proposed a variational inference algorithm for estimating the parameters of the probabilistic grammar, providing a fast, parallelizable, and deterministic alternative to MCMC methods to approximate the posterior over derivations and grammar parameters. [sent-944, score-0.818]
82 We experimented with grammar induction on six different languages, demonstrating the usefulness of our approach. [sent-947, score-0.567]
83 The focus of the experiments was on dependency grammar induction with the dependency model with valence. [sent-950, score-0.783]
84 Our choice of the DMV is motivated by the fact that it is a widespread grammar for dependency grammar induction (Section 2. [sent-951, score-1.182]
85 2), enabling us to tease apart the problem of estimation of the grammar from the problem of deciding on the grammar structure. [sent-952, score-1.014]
86 Our inference algorithm, though, could be applied to any probabilistic grammar that has an efficient procedure, such as the inside-outside algorithm, for computing sufficient statistics in the form of expected counts of rule firing in grammar derivations. [sent-953, score-1.13]
87 Variational Inference with Logistic Normal Priors We give a derivation of a variational inference algorithm for model I, with the shared logistic normal distribution as a prior. [sent-961, score-0.504]
88 where $q_m(\eta_m, y_m) = \prod_{k=1}^{N} \prod_{i=1}^{L_k} q_m(\eta_{m,k,i} \mid \tilde{\mu}_{m,k,i}, \tilde{\sigma}^2_{m,k,i}) \times q_m(y_m)$, and $q_m(\eta_{m,k,i} \mid \tilde{\mu}_{m,k,i}, \tilde{\sigma}^2_{m,k,i})$ is a Gaussian with mean $\tilde{\mu}_{m,k,i}$ and variance $\tilde{\sigma}^2_{m,k,i}$. [sent-979, score-1.005]
89 The factorized form of Equation 11 implies the following identities: $\mathbb{E}_q[\log p(\eta_{m,i} \mid \mu_i, \Sigma_i)] = \mathbb{E}_{q_m}[\log p(\eta_{m,i} \mid \mu_i, \Sigma_i)]$, $\mathbb{E}_q[\log p(x_m, y_m \mid \eta_m, S)] = \mathbb{E}_{q_m}[\log p(x_m, y_m \mid \eta_m, S)]$, and $H(q) = \sum_{m=1}^{N} H(q_m)$. [sent-983, score-0.362]
90 Let $\eta^{C}_{k,i}$ be an intermediate variable, denoting the average of the normal experts which appear in the partition structure and determine the value of the ith event in the kth multinomial of the grammar. [sent-984, score-0.369]
91 We now have the required updates. Algorithm 1 (Variational EM for probabilistic grammars with LN prior): Input: initial parameters $\mu^{(0)}, \Sigma^{(0)}$, training data x, and development data x′. Output: learned parameters $\mu, \Sigma$. Set $t \leftarrow 1$; repeat: call E-Step for each training example $m = 1, \ldots$ [sent-997, score-0.379]
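A schematic of the outer loop of Algorithm 1 as described above; `e_step`, `m_step`, and `score` are placeholders standing in for the LN updates and the bound evaluation derived in Appendix B, so this is a sketch of the control flow rather than the actual implementation.

```python
import numpy as np

def variational_em(x_train, x_dev, e_step, m_step, score, max_iters=50, tol=1e-4):
    """Alternate per-example E-steps with M-step updates of (mu, Sigma) until
    the development-set score stops improving."""
    mu, Sigma = None, None            # would be mu^(0), Sigma^(0) in practice
    prev = -np.inf
    for t in range(1, max_iters + 1):
        stats = [e_step(x_m, mu, Sigma) for x_m in x_train]   # variational E-step
        mu, Sigma = m_step(stats)                             # update the prior
        current = score(x_dev, mu, Sigma)                     # check on dev data
        if current - prev < tol:
            break
        prev = current
    return mu, Sigma

# Dummy usage with trivial stand-ins, just to show the control flow.
dummy_e = lambda x, mu, S: x
dummy_m = lambda stats: (float(np.mean(stats)), 1.0)
dummy_score = lambda dev, mu, S: -abs(float(np.mean(dev)) - (mu if mu is not None else 0.0))
print(variational_em([1.0, 2.0, 3.0], [2.0], dummy_e, dummy_m, dummy_score))
```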
92 This means that the terms we are interested in maximizing from Equation 14 are the following, after plugging in $\tilde{f}_{m,k,i}$ explicitly: $\mathcal{L} = \operatorname{argmax}_{q_m(y_m)} \sum_{y_m} q_m(y_m) \sum_{k=1}^{K} \sum_{i=1}^{N_k} f_{m,k,i}(x_m, y_m)\, \tilde{\psi}_{m,k,i} + H(q_m(y_m)).
93 Then, note that: $\mathcal{L} = \operatorname{argmin}_{q_m(y_m)} D_{\mathrm{KL}}\big(q_m(y_m) \,\|\, r_m(y_m \mid x_m, e^{\tilde{\psi}_m})\big)$, (15) where $D_{\mathrm{KL}}$ denotes the KL divergence. [sent-1022, score-0.494]
94 The above lemma demonstrates that the minimizing $q_m(y_m)$ has the same form as the probabilistic grammar G, only without having sum-to-one constraints on the weights (leading to the required normalization constant $Z_m(\tilde{\psi}_m)$). [sent-1025, score-0.783]
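Concretely, the lemma says the optimal factor takes the weighted-grammar form (written here in our own notation, which may differ slightly from the paper's):

$$
q_m(y_m) \;=\; \frac{1}{Z_m(\tilde{\psi}_m)} \prod_{k=1}^{K} \prod_{i=1}^{N_k} \big(e^{\tilde{\psi}_{m,k,i}}\big)^{f_{m,k,i}(x_m, y_m)} ,
$$

that is, the same derivational form as the grammar G, with unnormalized rule weights $e^{\tilde{\psi}_{m,k,i}}$ standing in for probabilities.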
95 As in classic EM with probabilistic grammars, we never need to represent $q_m(y_m)$ explicitly; we need only $\tilde{\mathbf{f}}_m$, which can be calculated as expected feature values under $r_m(y_m \mid x_m, e^{\tilde{\psi}_m})$ using dynamic programming. [sent-1026, score-0.358]
96 The main difference is that instead of having variational parameters for each qm (ηm ), we have a single distribution q(η), and the sufficient statistics from the inside-outside algorithm are used altogether to update it during variational inference. [sent-1028, score-0.444]
97 Variational EM for Logistic-Normal Probabilistic Grammars The algorithm for variational inference with probabilistic grammars using a logistic normal prior is defined in Algorithms 1–3. [sent-1030, score-0.785]
98 Two experiments on learning probabilistic dependency grammars from corpora. [sent-1144, score-0.431]
99 Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. [sent-1176, score-0.865]
100 Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. [sent-1529, score-0.376]
wordName wordTfidf (topN-words)
[('grammar', 0.507), ('grammars', 0.253), ('qm', 0.206), ('ym', 0.181), ('multinomials', 0.176), ('nk', 0.144), ('normal', 0.135), ('ie', 0.13), ('ohen', 0.13), ('turkish', 0.122), ('variational', 0.119), ('decoding', 0.119), ('dmv', 0.111), ('mith', 0.111), ('nsupervised', 0.111), ('ovariance', 0.111), ('robabilistic', 0.111), ('dependency', 0.108), ('languages', 0.107), ('dirichlet', 0.106), ('logistic', 0.106), ('rammars', 0.1), ('mbr', 0.099), ('sln', 0.098), ('multinomial', 0.096), ('sentences', 0.095), ('viterbi', 0.088), ('portuguese', 0.087), ('tags', 0.087), ('bilingual', 0.085), ('xm', 0.082), ('treebank', 0.081), ('language', 0.079), ('acl', 0.078), ('vbd', 0.077), ('tying', 0.076), ('covariance', 0.076), ('stop', 0.074), ('smith', 0.071), ('probabilistic', 0.07), ('klein', 0.069), ('eqm', 0.069), ('attachment', 0.067), ('grammatical', 0.066), ('em', 0.065), ('english', 0.065), ('word', 0.064), ('shared', 0.063), ('nn', 0.06), ('induction', 0.06), ('multilingual', 0.059), ('kth', 0.058), ('prior', 0.056), ('blei', 0.053), ('rbr', 0.052), ('spitkovsky', 0.052), ('manning', 0.05), ('cohen', 0.049), ('mle', 0.048), ('ln', 0.046), ('inference', 0.046), ('initializer', 0.046), ('raiffa', 0.046), ('conll', 0.046), ('linguistic', 0.046), ('sentence', 0.045), ('experts', 0.045), ('earning', 0.045), ('jk', 0.044), ('derivations', 0.042), ('tag', 0.041), ('chinese', 0.041), ('unsupervised', 0.041), ('eisner', 0.041), ('bayesian', 0.039), ('desiderata', 0.039), ('priors', 0.039), ('aitchison', 0.038), ('alia', 0.038), ('monolingual', 0.038), ('schaifer', 0.038), ('parsing', 0.038), ('czech', 0.038), ('nouns', 0.038), ('johnson', 0.038), ('head', 0.036), ('committee', 0.035), ('gillenwater', 0.035), ('japanese', 0.035), ('verbs', 0.035), ('partition', 0.035), ('derivation', 0.035), ('posterior', 0.034), ('headden', 0.033), ('inter', 0.033), ('mth', 0.033), ('baselines', 0.032), ('syntactic', 0.032), ('attachments', 0.031), ('datum', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
Author: Shay B. Cohen, Noah A. Smith
Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference
2 0.33194542 53 jmlr-2010-Inducing Tree-Substitution Grammars
Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater
Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process
3 0.18945445 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
4 0.18840279 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
Author: Alexander Clark, Rémi Eyraud, Amaury Habrard
Abstract: We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach. Keywords: grammatical inference, context-free language, positive data only, membership queries
5 0.15405238 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
6 0.096311659 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
7 0.092620395 15 jmlr-2010-Approximate Tree Kernels
8 0.074263722 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
9 0.065556176 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
10 0.063453369 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
11 0.062429901 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
12 0.054274932 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
13 0.053437371 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
14 0.052725602 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?
15 0.052715961 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
16 0.050510924 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
17 0.047245201 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
18 0.047085088 35 jmlr-2010-Error-Correcting Output Codes Library
19 0.045893718 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
20 0.043581937 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
topicId topicWeight
[(0, -0.24), (1, 0.049), (2, -0.341), (3, 0.095), (4, 0.117), (5, -0.502), (6, 0.073), (7, 0.046), (8, -0.008), (9, -0.047), (10, 0.021), (11, -0.059), (12, -0.174), (13, -0.064), (14, -0.001), (15, -0.059), (16, -0.09), (17, 0.0), (18, -0.003), (19, -0.004), (20, -0.012), (21, -0.078), (22, 0.074), (23, 0.061), (24, -0.015), (25, -0.003), (26, -0.025), (27, -0.018), (28, -0.044), (29, -0.036), (30, 0.06), (31, 0.045), (32, -0.056), (33, 0.036), (34, 0.023), (35, 0.005), (36, 0.036), (37, 0.039), (38, -0.028), (39, -0.009), (40, -0.093), (41, 0.002), (42, 0.025), (43, -0.078), (44, 0.049), (45, 0.006), (46, 0.013), (47, -0.038), (48, 0.07), (49, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.94882321 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
Author: Shay B. Cohen, Noah A. Smith
Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference
2 0.86704123 53 jmlr-2010-Inducing Tree-Substitution Grammars
Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater
Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process
3 0.7191034 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
Author: Alexander Clark, Rémi Eyraud, Amaury Habrard
Abstract: We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach. Keywords: grammatical inference, context-free language, positive data only, membership queries
4 0.50417531 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
5 0.47801223 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
6 0.29758298 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
7 0.26535705 15 jmlr-2010-Approximate Tree Kernels
8 0.23165742 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
9 0.21485075 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
10 0.19941999 35 jmlr-2010-Error-Correcting Output Codes Library
11 0.19772319 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
12 0.19120045 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
13 0.18400022 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
14 0.18262094 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
15 0.18122236 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
16 0.17820005 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
17 0.17633773 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
18 0.17109881 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
19 0.16898707 61 jmlr-2010-Learning From Crowds
20 0.16770495 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
topicId topicWeight
[(4, 0.018), (8, 0.018), (21, 0.032), (22, 0.439), (24, 0.012), (32, 0.041), (36, 0.046), (37, 0.055), (75, 0.087), (81, 0.011), (85, 0.062), (91, 0.02), (96, 0.064), (97, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.7449677 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
Author: Shay B. Cohen, Noah A. Smith
Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference
2 0.72558701 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
Author: Gunnar Carlsson, Facundo Mémoli
Abstract: We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. Keywords: clustering, hierarchical clustering, stability of clustering, Gromov-Hausdorff distance
3 0.40377 53 jmlr-2010-Inducing Tree-Substitution Grammars
Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater
Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process
4 0.35212976 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
5 0.31030557 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
6 0.30331135 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
7 0.29473913 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
8 0.28801233 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
9 0.28137988 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
10 0.27979735 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
11 0.27891123 109 jmlr-2010-Stochastic Composite Likelihood
12 0.27791932 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
13 0.27742383 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
14 0.27604878 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
16 0.27534112 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
17 0.27445269 63 jmlr-2010-Learning Instance-Specific Predictive Models
18 0.27443412 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
19 0.27427214 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
20 0.27401826 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM