jmlr jmlr2011 jmlr2011-77 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. [sent-12, score-0.313]
2 We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. [sent-13, score-0.334]
3 In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. [sent-15, score-0.167]
4 Introduction. We investigate unsupervised learning methods for dependency parsing models that impose sparsity biases on the types of dependencies. [sent-20, score-0.36]
5 We assume a corpus annotated with part-of-speech (POS) tags, where the task is to induce a dependency model from the tag sequences for corpus sentences. [sent-21, score-0.388]
6 In this setting, the type of a dependency is defined as a simple pair: tag of the dependent (also known as the child), and tag of the head (also known as the parent) for that dependent. [sent-22, score-0.402]
7 Given that POS tags are typically designed to convey information about grammatical relations, it is reasonable to expect that only some of the possible dependency types would be realized for any given language. [sent-23, score-0.218]
8 Previous work in unsupervised grammar induction has mostly focused on achieving sparsity through priors on model parameters. [sent-27, score-0.233]
9 Such priors on parameters encourage a standard generative dependency parsing model (see Section 2) to limit the number of dependent types for each head type. [sent-33, score-0.324]
10 Our experiments (Section 6) show that the more effective sparsity pattern is one that limits the total number of unique head-dependent tag pairs. [sent-38, score-0.196]
11 Section 6 describes the results of dependency parsing experiments across 12 languages and against recent published state-of-the-art results for the English language. [sent-55, score-0.336]
12 For a sentence consisting of POS tags x, the root head POS r(x) is generated first with probability proot (r(x)). [sent-78, score-0.251]
13 After generating the root, the model next generates dependents of the root. [sent-80, score-0.164]
14 In our example, this corresponds to the probability pstop (t | V, r,t), which indicates that the model is done generating right dependents of V . [sent-87, score-0.259]
15 After stopping the generation of right dependents, the model generates left dependents using the mirror image of the right-dependent process. [sent-88, score-0.164]
16 Once the root has generated all of its dependents, the dependents generate their own dependents in the same manner. [sent-89, score-0.328]
17 We follow the convention that the model generates dependents starting with the rightmost one, moving inward (leftward) until all right dependents are added, then it generates the leftmost left dependent and moves inward (rightward) from there. [sent-90, score-0.328]
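To make this head-outward generative story concrete, here is a minimal Python sketch of DMV-style generation. The dictionaries p_root, p_stop, and p_child are hypothetical parameter tables standing in for proot, pstop, and pchild; they are not part of the paper's implementation.

```python
import random

def sample(dist):
    """Draw an outcome from a non-empty dict mapping outcomes to probabilities."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome  # guard against floating-point rounding

def gen_dependents(head_tag, p_stop, p_child):
    """Head-outward generation: right dependents first, then left, each
    child recursively generating its own dependents."""
    node = {"tag": head_tag, "right": [], "left": []}
    for direction in ("right", "left"):
        adjacent = True  # no dependents generated in this direction yet
        # p_stop[...] maps {True, False} to probabilities; True means stop
        while not sample(p_stop[(head_tag, direction, adjacent)]):
            child_tag = sample(p_child[(head_tag, direction)])
            node[direction].append(gen_dependents(child_tag, p_stop, p_child))
            adjacent = False
    return node

def gen_sentence(p_root, p_stop, p_child):
    """Draw the root tag from proot, then expand it head-outward."""
    return gen_dependents(sample(p_root), p_stop, p_child)
```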
18 Parent tags are sorted top-to-bottom in descending order by the number of unique child tags they take. [sent-121, score-0.338]
19 The saturation of a square with parent p and child c is determined by the max value of the posterior probability of type c having parent p observed in the entire English training corpus (Marcus et al., 1993). [sent-123, score-0.471]
20 If we were asked to parse the tag sequence DT ADJ N V, the dependency tree with V as root, N as its child, and the remaining DT and ADJ as N’s children is almost forced on us. [sent-131, score-0.278]
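As a small illustration of that analysis, the tree can be written as a head-index array from which the dependency types can be read off directly; the array encoding below is ours, chosen only for exposition.

```python
tags = ["DT", "ADJ", "N", "V"]      # positions 1..4
heads = [3, 3, 4, 0]                # DT <- N, ADJ <- N, N <- V, V <- root (0)
dep_types = {(tags[i], "ROOT" if h == 0 else tags[h - 1])
             for i, h in enumerate(heads)}
# dep_types == {("DT", "N"), ("ADJ", "N"), ("N", "V"), ("V", "ROOT")}
```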
21 Thus, in this work we attempt to limit grammar ambiguity by inducing a grammar that allows only a sparse set of possible dependency relation types. [sent-134, score-0.334]
22 Notice also that while some parent tags can take many different child tags, some parent tags can take just a few child tags, and some tags cannot be parents; the number of allowed child tags spans a wide range. [sent-137, score-1.022]
23 These empirical properties are not captured by previous attempts to achieve model sparsity with hierarchical Bayesian models, which push each parent tag to allow only a few child tags. [sent-138, score-0.423]
24 We use the posterior regularization (PR) framework (Ganchev et al., 2010) to achieve the desired sparsity in grammar induction. [sent-146, score-0.163]
25 In particular, PR allows a natural representation of the dependency sparsity constraints based on the ambiguity measures described below. [sent-153, score-0.215]
26 1 Measures of Ambiguity We now describe precisely how to count dependency types, which will allow us to specify different kinds of dependency sparsity. [sent-157, score-0.206]
27 For each child tag c, let i range over some arbitrary enumeration of all occurrences of c in the corpus, and let p be another tag. [sent-158, score-0.235]
28 The indicator φcpi (X, Y) has value 1 if p is the tag of the parent of the ith occurrence of c, and value 0 otherwise. [sent-159, score-0.246]
29 The number of unique dependency types is then given by ∑_cp max_i φcpi (X, Y) (Equation 3), where we sum over child-parent types cp, computing the maximum (logical or) over possible occurrences of c ← p dependencies. [sent-160, score-0.166]
30 Note that there is an asymmetry in this way of counting types: occurrences of the child type c are enumerated with i, but all occurrences of the parent type p are or-ed in φcpi , that is, φcpi is 1 if any occurrence of tag p is the parent of the ith occurrence of tag c. [sent-161, score-0.6]
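A direct way to compute this count on hard parse trees is sketched below; the corpus representation (a list of tag sequences paired with 1-based head indices, 0 for the root) is an assumption made for illustration.

```python
def unique_dependency_types(corpus):
    """corpus: list of (tags, heads) pairs, where heads[i] is the 1-based
    index of the parent of token i (0 for the root).
    Returns sum over (c, p) of max over i of phi_cpi, i.e., the number of
    unique child-parent tag pairs realized anywhere in the corpus."""
    seen = set()
    for tags, heads in corpus:
        for i, h in enumerate(heads):
            parent_tag = "ROOT" if h == 0 else tags[h - 1]
            seen.add((tags[i], parent_tag))  # the max/OR over occurrences
    return len(seen)
```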
31 Instead of counting pairs of a child token and a parent type, we could have counted pairs of a child token and a parent token by letting p range over all tokens rather than types. [sent-171, score-0.502]
32 On actual dependency trees, where each child has a unique parent, PR-AS and PR-S always yield the same value. [sent-175, score-0.274]
33 For PR-AS all parent tokens are collapsed, but we could also consider the case where all child tokens are collapsed. [sent-180, score-0.323]
34 For any distribution q over latent variables, we can define a penalty as the β-norm of the feature expectations, ||Eq [φ(X, Y)]||_β, where Y represents an assignment of parse trees for all sentences in the corpus X. [sent-237, score-0.261]
35 More precisely, we will penalize the following quantity: the sum (ℓ1 norm) over c of the maximum (ℓ∞ norm) over occurrences of c of the posterior probability of selecting a parent with tag p for that child. [sent-251, score-0.292]
36 For intermediate values of σ, the constraints work to decrease the confidence of the highest probability parent tags for each child instance. [sent-265, score-0.342]
37 For parent tags that are supported by many high-probability instances, this pressure is distributed among many instances and has little effect. [sent-266, score-0.234]
38 For parent tags that are supported by only a few high-probability instances, however, the probability of these instances is more severely reduced, which can (after several iterations of the algorithm) effectively eliminate that parent tag as a possibility for the given child tag. [sent-267, score-0.588]
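The penalized quantity can be made concrete with a short sketch that computes the ℓ1/ℓ∞ value from posterior edge probabilities; the per-sentence dictionary format for the posteriors is an assumed representation, and the aggregation over parent tokens follows the PR-AS convention described above.

```python
from collections import defaultdict

def l1_linf_penalty(corpus, posteriors):
    """corpus: list of tag sequences; posteriors: parallel list of dicts
    mapping (child_pos, parent_pos) -> q(parent_pos is the parent of
    child_pos), with 1-based positions and parent_pos 0 for the root.
    Returns the sum over (child tag, parent tag) pairs of the maximum,
    over child occurrences, of the posterior mass on that parent tag."""
    max_per_type = defaultdict(float)
    for tags, post in zip(corpus, posteriors):
        # Collapse parent tokens carrying the same tag (PR-AS style):
        # mass[(child_pos, parent_tag)] = sum of q over those tokens.
        mass = defaultdict(float)
        for (c, p), q in post.items():
            parent_tag = "ROOT" if p == 0 else tags[p - 1]
            mass[(c, parent_tag)] += q
        for (c, parent_tag), q in mass.items():
            key = (tags[c - 1], parent_tag)
            max_per_type[key] = max(max_per_type[key], q)
    return sum(max_per_type.values())
```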
39 (2008) use this method for dependency parsing with the DMV and achieve improvements over basic EM. [sent-330, score-0.25]
40 In particular we will show that while it achieves parameter sparsity, this is not the optimal sparsity to aim for in dependency parsing. [sent-343, score-0.172]
41 Intuitively, sparsity of pchild (c | p, d) means requiring that each parent tag has few unique child tags. [sent-344, score-0.553]
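The two notions of sparsity can be contrasted with a few lines of code: one counts unique child tags per parent (what parameter priors push toward), the other counts unique parent-child pairs overall (what the posterior penalty targets). The pair-count input format is assumed for illustration.

```python
from collections import defaultdict

def sparsity_profiles(pair_counts):
    """pair_counts: dict mapping (child_tag, parent_tag) -> corpus count.
    Returns (unique child tags per parent, total unique pairs): the first
    is what per-parent parameter sparsity limits, the second is what the
    posterior penalty limits."""
    children_per_parent = defaultdict(set)
    for (c, p), n in pair_counts.items():
        if n > 0:
            children_per_parent[p].add(c)
    per_parent = {p: len(children) for p, children in children_per_parent.items()}
    total_unique_pairs = sum(per_parent.values())
    return per_parent, total_unique_pairs
```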
42 For example, LN can tie the parameters pchild (c1 | p, d) and pchild (c2 | p, d). [sent-363, score-0.292]
43 In this case, parameters such as pchild (c | p1 , d) and pchild (c | p2 , d) can be tied even though they correspond to PCFG rules with different left-hand sides. [sent-365, score-0.26]
44 The fourth method has not yet been applied to the dependency parsing task we evaluate on in this work, so we defer direct comparison. [sent-371, score-0.25]
45 For example, for dependency parsing Smith and Eisner (2005a) formed neighborhoods by deleting any one word from x(i) , or transposing any two words. [sent-376, score-0.25]
46 One natural way to adapt it to dependency parsing would be to have an integer program that minimizes the number of parent-child tag pairs subject to the constraint that every sentence can still be assigned a complete parse tree. [sent-384, score-0.482]
47 Extending Stop Probabilities: The first extension conditions whether to stop generating dependents in a given direction on a larger set of previous decisions. [sent-395, score-0.164]
48 Specifically, the probability of stopping in a particular direction depends not only on whether there are any dependents in that direction already, but also on how many. [sent-396, score-0.164]
49 In the example of Figure 1, this corresponds to changing pstop ( f | V, r, f ) to pstop ( f | V, r, 0) and similarly for all the other stop probabilities. [sent-397, score-0.19]
50 The 0 in this case indicates that V has no other right dependents when it decides whether to continue generating right dependents. [sent-398, score-0.164]
51 A model with maximum stop valency S thus distinguishes between having 0, 1, . . . , S − 2, and ≥ S − 1 dependents in a given direction. [sent-402, score-0.164]
52 The basic DMV has maximum stop valency 2 because it distinguishes between having zero dependents and at least one dependent in a given direction. [sent-403, score-0.205]
53 A model with maximum stop valency of 3 would distinguish between having 0, 1, or at least 2 dependents in a particular direction. [sent-404, score-0.205]
54 In this case, when a head generates more dependents in a particular direction after its second dependent, the stopping distribution it draws from will always be the same—for head p and direction d this will be pstop (· | p, d, 2). [sent-405, score-0.349]
55 Again, what we condition on is how many other dependents were already generated in the same direction. [sent-410, score-0.164]
56 For the example in Figure 1, this means pchild (N | V, r) becomes pchild (N | V, r, 0) and similarly for all other pchild . [sent-411, score-0.39]
57 In later sections of this paper, when we talk about a model with maximum child valency C, this means we distinguish between having 0, 1, . . . , C − 2, and ≥ C − 1 dependents in a particular direction. [sent-412, score-0.149]
59 The basic DMV has maximum child valency 1 because it does not make these distinctions. [sent-416, score-0.149]
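Both valency extensions reduce to bucketing the number of already-generated dependents before indexing the parameter table; a minimal sketch of that bucketing (with hypothetical context tuples) follows.

```python
def stop_context(head, direction, n_deps_so_far, max_stop_valency):
    """Stop decisions condition on min(count, S - 1); the basic DMV has
    S = 2, i.e., it only distinguishes zero vs. at least one dependent."""
    return (head, direction, min(n_deps_so_far, max_stop_valency - 1))

def child_context(head, direction, n_deps_so_far, max_child_valency):
    """Child decisions condition on min(count, C - 1); the basic DMV has
    C = 1, so the bucket is always 0 (no valency dependence)."""
    return (head, direction, min(n_deps_so_far, max_child_valency - 1))

# For the example of Figure 1 with S = 2:
# stop_context("V", "right", 0, 2) == ("V", "right", 0)
```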
60 Thus, the third and final model extension we implement is to add a backoff for the child probabilities that does not condition on the identity of the parent POS (see Equation 10). [sent-419, score-0.268]
61 With this model extension, the order in which dependents are generated becomes relevant to the probability of an overall parse tree. [sent-420, score-0.212]
62 In cases where the identity of the rightmost and leftmost dependents have a greater influence on the true stop probability than the inner dependents, this ordering will work to the model’s advantage. [sent-422, score-0.164]
63 We do not investigate in this work which languages this holds true for, though changing this ordering might be one additional way to increase parsing accuracy for some languages. [sent-423, score-0.285]
64 To formally define these last four variables, first let Vc denote the model’s maximum child valency and let Vs denote maximum stop valency. [sent-427, score-0.149]
65 Further, let acpd be the number of yp's dependents that are further in direction yd than yc, and let axl (axr) be the total number of dependents of parent x to the left (right). [sent-428, score-0.525]
66 In the third model extension, the backoff for the child probability to a probability that does not depend on the parent POS, pchild(yc | yd, yvc), can formally be expressed as λ pchild(yc | yp, yd, yvc) + (1 − λ) pchild(yc | yd, yvc) (Equation 10), for λ ∈ [0, 1]. [sent-430, score-0.629]
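A hedged sketch of Equation 10, with hypothetical dictionaries p_child_full and p_child_backoff standing in for the two child distributions:

```python
def interpolated_child_prob(c, parent, direction, valency,
                            p_child_full, p_child_backoff, lam):
    """Equation 10: lam * p(c | parent, direction, valency)
                  + (1 - lam) * p(c | direction, valency),
    where the backoff term drops the parent POS and lam is in [0, 1]."""
    full = p_child_full[(parent, direction, valency)].get(c, 0.0)
    backoff = p_child_backoff[(direction, valency)].get(c, 0.0)
    return lam * full + (1.0 - lam) * backoff
```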
67 To capture the intuition that events seen fewer times should be more strongly smoothed, this prior has hyperparameter value K for the standard child probability and value 2K for the backoff probability, where K is the number of PCFG rules with a particular nonterminal on the left-hand side. [sent-438, score-0.149]
68 Table 2: Training corpus statistics for sentences with lengths ≤ 10, after stripping punctuation. [sent-446, score-0.177]
69 We conclude this overview of the experiments with two key points that we feel show PR to be a very useful and robust method for improving unsupervised dependency parsing: • All except one of the 60 PR settings we try for English result in higher accuracy than the best SDP setting. [sent-461, score-0.196]
70 Following Smith and Eisner (2006), we stripped punctuation from the sentences and kept only those sentences of length ≤ 10. [sent-476, score-0.196]
71 Figure 7 shows the accuracies on the English corpus broken down by POS tag category. [sent-515, score-0.206]
72 The plot shows that sharp changes in overall accuracy are in fact caused by even sharper changes in the attachment accuracies of the tag categories. [sent-516, score-0.21]
73 The problem continues to be very underspecified, and without knowing the “true” sparsity pattern of a language, we can ultimately only achieve limited parsing accuracy. [sent-518, score-0.216]
74 Figure 7: The accuracy (overall and for the Noun, Det, Number, Adj, Conj, and Prt tag types) in the English corpus as a function of ℓ1/ℓ∞ as we vary the constraint strength. [sent-588, score-0.31]
75 (In all experiments, we held out the last 100 sentences of each training corpus for development; the numbers in Table 2 correspond to this reduced training set size. [sent-594, score-0.177]
76 Our hypothesis was that the entropy of the distribution over contexts for each constituent should be small when parsing accuracy was high. [sent-608, score-0.199]
77 This might not select the ideal parameters for any particular language, but provides a more realistic test setting: a user has available a labeled corpus in one language, and would like to induce grammars for other languages of interest. [sent-683, score-0.203]
78 In this case, the strength σ_x for a particular language was found by the formula σ_x = σ_en · |tokens_en| / |tokens_x|, where σ_en is the best strength for English, |tokens_en| is the number of tokens in the English corpus, and |tokens_x| is the number of tokens in language x. [sent-687, score-0.149]
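Under the assumption that corpus token counts are available as plain integers, the transfer rule is a one-line helper; the numbers in the usage example are illustrative only.

```python
def transfer_strength(sigma_en, n_tokens_en, n_tokens_x):
    """sigma_x = sigma_en * |tokens_en| / |tokens_x|."""
    return sigma_en * n_tokens_en / n_tokens_x

# Illustrative numbers only: a strength tuned on a 50K-token English corpus,
# transferred to a 25K-token corpus, doubles.
assert transfer_strength(100.0, 50_000, 25_000) == 200.0
```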
79 Figure 13: Comparing DMV grammar ambiguities on the training data by computing the average number of parent tags per child tag (ℓ1 /ℓ∞ divided by number of child tags) and normalizing it by the theoretical maximum for each language. [sent-744, score-0.671]
80 PR tends not to make this error. [sent-754, score-0.189]
82 Figure 14: Posterior edge probabilities for an example sentence from the Spanish test corpus ("Una papelera es un objeto civilizado", tagged d nc vs d nc aq). [sent-769, score-0.246]
83 For example, consider the sentence “Lleva tiempo entenderlos” (translation: “It takes time to understand (them)”) with tags “main-verb common-noun main-verb”. [sent-773, score-0.172]
84 So, if the key is pt /pg → c, then accuracy is: acc = # of pt → c in Viterbi parses . [sent-779, score-0.176]
85 Presented with such scenarios, where there is no TO present to be the parent of VB, PR chooses the first VB as the parent of the second. [sent-970, score-0.238]
86 It maintains this preference for making the first VB a parent of the second when it encounters "VB TO VB" sequences, such as "used to eliminate", because it would have to pay an additional penalty to make TO the parent of the second VB. [sent-971, score-0.274]
87 Thus, if PR chose CD as parent of NN, it would have to pay an additional penalty to select another parent for NN in sentences where no CDs exist. [sent-975, score-0.372]
88 The tag M stands for “numeral” in the Bulgarian corpus, so this correction is similar to the English correction involving the tag CD. [sent-995, score-0.314]
89 The tag C stands for “conjunction” in the Bulgarian corpus, so this correction means the model is realizing verbs should usually be sentence roots rather than children of conjunctions. [sent-997, score-0.214]
90 Following the same reasoning about PR that we used before, we note that sentences with verbs but no conjunctions are very common, so if PR chose C as the parent of V, it would have to pay a penalty to give V a different parent in such sentences. [sent-998, score-0.372]
91 In contrast with previous approaches that impose a sparsity bias on the model parameters using sparsifying Dirichlet distributions, we impose a sparsity bias on the model posteriors. [sent-1007, score-0.185]
92 The regularization strength is strongly dependent on the corpus, both on the number of parent-child pairs being constrained and on the number of tokens for each parent and child. [sent-1023, score-0.22]
93 Headden et al. (2009) achieve significant improvements by conditioning the edge probabilities on the parent word together with the parent POS. [sent-1034, score-0.238]
94 That is, we would like to use POS tags induced in an unsupervised manner, instead of assuming gold POS tags, and see how robust our method is under these conditions. [sent-1037, score-0.156]
95 Further, it would be even more interesting to see how our method performs if we applied it to aid in the more complex task of joint induction of POS tags and dependency parses. [sent-1041, score-0.218]
96 Table 5: Directed attachment accuracy results on the test corpus (for sentences of lengths ≤ 10, no punctuation). [sent-1099, score-0.26]
97 The Baby Steps (BS) method starts by training the model on sentences of length 1; the parameters of this model are then used to initialize a training run over sentences of length 2, and so on. [sent-1125, score-0.196]
98 The second method, Less is More (LsM), uses information from the BS method to pick a sentence length that includes enough sentences to train a model with good predictive power, but leaves out longer sentences that do not add much information. [sent-1126, score-0.253]
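A sketch of the Baby Steps schedule described above; train_em is a placeholder for an EM training routine, and the cumulative length filter reflects one reading of the description rather than code from the cited work.

```python
def baby_steps(corpus, train_em, max_len=10):
    """Curriculum training: start with length-1 sentences, then reuse the
    learned parameters to initialize training on longer sentences."""
    params = None  # train_em is assumed to fall back to a default
                   # initializer when params is None
    for length in range(1, max_len + 1):
        subset = [sentence for sentence in corpus if len(sentence) <= length]
        if subset:
            params = train_em(subset, init_params=params)
    return params
```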
99 The results are similar to the best shared logistic normal prior when tested on sentences of length up to ten, but when tested on longer sentences the PR-trained models perform significantly better than all other approaches. [sent-1150, score-0.196]
100 Improving unsupervised dependency parsing with richer contexts and smoothing. [sent-1584, score-0.291]
wordName wordTfidf (topN-words)
[('dmv', 0.505), ('pr', 0.245), ('em', 0.236), ('sdp', 0.196), ('headden', 0.177), ('dependents', 0.164), ('parsing', 0.147), ('cpi', 0.136), ('pchild', 0.13), ('tag', 0.127), ('pos', 0.125), ('anchev', 0.123), ('ereira', 0.123), ('illenwater', 0.123), ('parent', 0.119), ('ependency', 0.116), ('nsupervised', 0.116), ('english', 0.116), ('tags', 0.115), ('child', 0.108), ('dependency', 0.103), ('sentences', 0.098), ('pstop', 0.095), ('grammar', 0.094), ('gra', 0.089), ('osterior', 0.089), ('languages', 0.086), ('prp', 0.082), ('corpus', 0.079), ('nn', 0.077), ('parsity', 0.076), ('ganchev', 0.073), ('ra', 0.071), ('sparsity', 0.069), ('cohen', 0.068), ('sln', 0.068), ('vs', 0.067), ('taskar', 0.067), ('vb', 0.064), ('qt', 0.063), ('cp', 0.063), ('adj', 0.061), ('bulgarian', 0.061), ('eq', 0.06), ('sentence', 0.057), ('dirichlet', 0.057), ('smith', 0.056), ('dev', 0.055), ('strength', 0.053), ('accuracy', 0.052), ('initialization', 0.049), ('bg', 0.048), ('tokens', 0.048), ('randomp', 0.048), ('parse', 0.048), ('dt', 0.048), ('sparsifying', 0.047), ('posterior', 0.046), ('head', 0.045), ('nc', 0.044), ('cz', 0.043), ('pt', 0.043), ('ambiguity', 0.043), ('se', 0.042), ('yc', 0.042), ('unsupervised', 0.041), ('interjection', 0.041), ('sda', 0.041), ('spitkovsky', 0.041), ('yvc', 0.041), ('valency', 0.041), ('backoff', 0.041), ('iii', 0.04), ('acc', 0.038), ('annealing', 0.038), ('grammars', 0.038), ('posteriors', 0.036), ('penalty', 0.036), ('yd', 0.036), ('viterbi', 0.035), ('cative', 0.035), ('spanish', 0.035), ('eqt', 0.034), ('pools', 0.034), ('proot', 0.034), ('aq', 0.034), ('vm', 0.034), ('particle', 0.034), ('kl', 0.033), ('nl', 0.033), ('tie', 0.032), ('gains', 0.032), ('abbreviation', 0.031), ('attachment', 0.031), ('correction', 0.03), ('priors', 0.029), ('nv', 0.029), ('eisner', 0.029), ('mcclosky', 0.029), ('noun', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
2 0.17092445 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
Author: Dorota Głowacka, John Shawe-Taylor, Alex Clark, Colin de la Higuera, Mark Johnson
Abstract: Grammar induction refers to the process of learning grammars and languages from data; this finds a variety of applications in syntactic pattern recognition, the modeling of natural language acquisition, data mining and machine translation. This special topic contains several papers presenting some of recent developments in the area of grammar induction and language learning, as applied to various problems in Natural Language Processing, including supervised and unsupervised parsing and statistical machine translation. Keywords: machine translation, Bayesian inference, grammar induction, natural language parsing
3 0.14234102 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
4 0.085017465 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
5 0.07626155 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
6 0.049509998 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
7 0.046190076 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
8 0.045299791 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
9 0.044349156 55 jmlr-2011-Learning Multi-modal Similarity
10 0.042144686 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
11 0.039664388 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
12 0.039639901 58 jmlr-2011-Learning from Partial Labels
13 0.038278703 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
14 0.036994763 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
15 0.035306223 6 jmlr-2011-A Simpler Approach to Matrix Completion
16 0.035224609 59 jmlr-2011-Learning with Structured Sparsity
17 0.034158695 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
18 0.034022965 54 jmlr-2011-Learning Latent Tree Graphical Models
19 0.033691306 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
20 0.032740518 61 jmlr-2011-Logistic Stick-Breaking Process
topicId topicWeight
[(0, 0.191), (1, -0.1), (2, -0.035), (3, 0.027), (4, -0.168), (5, 0.052), (6, 0.062), (7, 0.139), (8, 0.054), (9, -0.141), (10, -0.231), (11, -0.303), (12, -0.118), (13, -0.045), (14, 0.109), (15, 0.028), (16, 0.169), (17, 0.015), (18, -0.073), (19, 0.019), (20, -0.01), (21, 0.018), (22, 0.017), (23, -0.099), (24, 0.092), (25, 0.03), (26, 0.013), (27, -0.094), (28, 0.11), (29, 0.095), (30, 0.211), (31, 0.023), (32, -0.035), (33, 0.099), (34, 0.023), (35, -0.013), (36, -0.024), (37, 0.016), (38, -0.062), (39, -0.052), (40, -0.088), (41, -0.059), (42, 0.103), (43, -0.046), (44, 0.052), (45, -0.019), (46, -0.119), (47, 0.047), (48, 0.045), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.95632356 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
2 0.63964075 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
Author: Dorota Głowacka, John Shawe-Taylor, Alex Clark, Colin de la Higuera, Mark Johnson
Abstract: Grammar induction refers to the process of learning grammars and languages from data; this finds a variety of applications in syntactic pattern recognition, the modeling of natural language acquisition, data mining and machine translation. This special topic contains several papers presenting some of recent developments in the area of grammar induction and language learning, as applied to various problems in Natural Language Processing, including supervised and unsupervised parsing and statistical machine translation. Keywords: machine translation, Bayesian inference, grammar induction, natural language parsing
3 0.60886401 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
4 0.39825362 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
5 0.34701815 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
6 0.29802212 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
7 0.28866559 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
8 0.24413124 6 jmlr-2011-A Simpler Approach to Matrix Completion
9 0.22233215 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
10 0.20781094 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
11 0.18727328 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
12 0.186138 58 jmlr-2011-Learning from Partial Labels
13 0.1845586 12 jmlr-2011-Bayesian Co-Training
14 0.18388113 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
15 0.1806684 61 jmlr-2011-Logistic Stick-Breaking Process
16 0.17986132 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
17 0.17929944 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
18 0.17688262 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
19 0.17664376 55 jmlr-2011-Learning Multi-modal Similarity
20 0.17028469 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
topicId topicWeight
[(4, 0.046), (7, 0.083), (9, 0.024), (10, 0.027), (24, 0.056), (31, 0.09), (32, 0.028), (41, 0.025), (48, 0.255), (60, 0.015), (67, 0.01), (71, 0.019), (73, 0.033), (78, 0.056), (82, 0.049), (90, 0.027), (96, 0.016), (99, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.73024142 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
2 0.47380245 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
3 0.46712503 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
4 0.44135743 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
5 0.43891484 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
Author: Dorota Głowacka, John Shawe-Taylor, Alex Clark, Colin de la Higuera, Mark Johnson
Abstract: Grammar induction refers to the process of learning grammars and languages from data; this finds a variety of applications in syntactic pattern recognition, the modeling of natural language acquisition, data mining and machine translation. This special topic contains several papers presenting some of recent developments in the area of grammar induction and language learning, as applied to various problems in Natural Language Processing, including supervised and unsupervised parsing and statistical machine translation. Keywords: machine translation, Bayesian inference, grammar induction, natural language parsing
6 0.42721671 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
7 0.42363116 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
8 0.41985959 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
9 0.41927126 12 jmlr-2011-Bayesian Co-Training
10 0.41901317 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
11 0.41663048 16 jmlr-2011-Clustering Algorithms for Chains
12 0.41465336 91 jmlr-2011-The Sample Complexity of Dictionary Learning
13 0.41409209 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
14 0.41371334 48 jmlr-2011-Kernel Analysis of Deep Networks
15 0.41331899 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
16 0.41292503 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
17 0.41132042 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
18 0.40857083 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
19 0.40771839 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
20 0.40635899 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models