jmlr jmlr2011 jmlr2011-77 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. [sent-12, score-0.313]
2 We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. [sent-13, score-0.334]
3 In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. [sent-15, score-0.167]
4 Introduction. We investigate unsupervised learning methods for dependency parsing models that impose sparsity biases on the types of dependencies. [sent-20, score-0.36]
5 We assume a corpus annotated with part-of-speech (POS) tags, where the task is to induce a dependency model from the tag sequences for corpus sentences. [sent-21, score-0.388]
6 In this setting, the type of a dependency is defined as a simple pair: tag of the dependent (also known as the child), and tag of the head (also known as the parent) for that dependent. [sent-22, score-0.402]
7 Given that POS tags are typically designed to convey information about grammatical relations, it is reasonable to expect that only some of the possible dependency types would be realized for any given language. [sent-23, score-0.218]
8 Previous work in unsupervised grammar induction has mostly focused on achieving sparsity through priors on model parameters. [sent-27, score-0.233]
9 Such priors on parameters encourage a standard generative dependency parsing model (see Section 2) to limit the number of dependent types for each head type. [sent-33, score-0.324]
10 Our experiments (Section 6) show that the more effective sparsity pattern is one that limits the total number of unique head-dependent tag pairs. [sent-38, score-0.196]
11 Section 6 describes the results of dependency parsing experiments across 12 languages and against recent published state-of-the-art results for the English language. [sent-55, score-0.336]
12 For a sentence consisting of POS tags x, the root head POS r(x) is generated first with probability proot (r(x)). [sent-78, score-0.251]
13 After generating the root, the model next generates dependents of the root. [sent-80, score-0.164]
14 In our example, this corresponds to the probability pstop (t | V, r,t), which indicates that the model is done generating right dependents of V . [sent-87, score-0.259]
15 After stopping the generation of right dependents, the model generates left dependents using the mirror image of the right-dependent process. [sent-88, score-0.164]
16 Once the root has generated all of its dependents, the dependents generate their own dependents in the same manner. [sent-89, score-0.328]
17 We follow the convention that the model generates dependents starting with the rightmost one, moving inward (leftward) until all right dependents are added, then it generates the leftmost left dependent and moves inward (rightward) from there. [sent-90, score-0.328]
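To make this head-outward generative story concrete, here is a minimal Python sketch of DMV-style generation. The dictionaries p_root, p_stop, and p_child are hypothetical parameter tables standing in for proot, pstop, and pchild; they are not part of the paper's implementation.

```python
import random

def sample(dist):
    """Draw an outcome from a non-empty dict mapping outcomes to probabilities."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome  # guard against floating-point rounding

def gen_dependents(head_tag, p_stop, p_child):
    """Head-outward generation: right dependents first, then left, each
    child recursively generating its own dependents."""
    node = {"tag": head_tag, "right": [], "left": []}
    for direction in ("right", "left"):
        adjacent = True  # no dependents generated in this direction yet
        # p_stop[...] maps {True, False} to probabilities; True means stop
        while not sample(p_stop[(head_tag, direction, adjacent)]):
            child_tag = sample(p_child[(head_tag, direction)])
            node[direction].append(gen_dependents(child_tag, p_stop, p_child))
            adjacent = False
    return node

def gen_sentence(p_root, p_stop, p_child):
    """Draw the root tag from proot, then expand it head-outward."""
    return gen_dependents(sample(p_root), p_stop, p_child)
```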
18 Parent tags are sorted top-to-bottom in descending order by the number of unique child tags they take. [sent-121, score-0.338]
19 The saturation of a square with parent p and child c is determined by the max value of the posterior probability of type c having parent p observed in the entire English training corpus (Marcus et al., 1993). [sent-123, score-0.471]
20 If we were asked to parse the tag sequence DT ADJ N V, the dependency tree with V as root, N as its child, and the remaining DT and ADJ as N’s children is almost forced on us. [sent-131, score-0.278]
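As a small illustration of that analysis, the tree can be written as a head-index array from which the dependency types can be read off directly; the array encoding below is ours, chosen only for exposition.

```python
tags = ["DT", "ADJ", "N", "V"]      # positions 1..4
heads = [3, 3, 4, 0]                # DT <- N, ADJ <- N, N <- V, V <- root (0)
dep_types = {(tags[i], "ROOT" if h == 0 else tags[h - 1])
             for i, h in enumerate(heads)}
# dep_types == {("DT", "N"), ("ADJ", "N"), ("N", "V"), ("V", "ROOT")}
```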
21 Thus, in this work we attempt to limit grammar ambiguity by inducing a grammar that allows only a sparse set of possible dependency relation types. [sent-134, score-0.334]
22 Notice also that while some parent tags can take many different child tags, some parent tags can take just a few child tags, and some tags cannot be parents; the number of allowed child tags spans a wide range. [sent-137, score-1.022]
23 These empirical properties are not captured by previous attempts to achieve model sparsity with hierarchical Bayesian models, which push each parent tag to allow only a few child tags. [sent-138, score-0.423]
24 We use the posterior regularization (PR) framework (Ganchev et al., 2010) to achieve the desired sparsity in grammar induction. [sent-146, score-0.163]
25 In particular, PR allows a natural representation of the dependency sparsity constraints based on the ambiguity measures described below. [sent-153, score-0.215]
26 1 Measures of Ambiguity We now describe precisely how to count dependency types, which will allow us to specify different kinds of dependency sparsity. [sent-157, score-0.206]
27 For each child tag c, let i range over some arbitrary enumeration of all occurrences of c in the corpus, and let p be another tag. [sent-158, score-0.235]
28 The indicator φcpi (X, Y) has value 1 if p is the tag of the parent of the ith occurrence of c, and value 0 otherwise. [sent-159, score-0.246]
29 The number of unique dependency types is then given by ∑_cp max_i φcpi (X, Y) (Equation 3), where we sum over child-parent types cp, computing the maximum (logical or) over possible occurrences of c ← p dependencies. [sent-160, score-0.166]
30 Note that there is an asymmetry in this way of counting types: occurrences of the child type c are enumerated with i, but all occurrences of the parent type p are or-ed in φcpi , that is, φcpi is 1 if any occurrence of tag p is the parent of the ith occurrence of tag c. [sent-161, score-0.6]
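A direct way to compute this count on hard parse trees is sketched below; the corpus representation (a list of tag sequences paired with 1-based head indices, 0 for the root) is an assumption made for illustration.

```python
def unique_dependency_types(corpus):
    """corpus: list of (tags, heads) pairs, where heads[i] is the 1-based
    index of the parent of token i (0 for the root).
    Returns sum over (c, p) of max over i of phi_cpi, i.e., the number of
    unique child-parent tag pairs realized anywhere in the corpus."""
    seen = set()
    for tags, heads in corpus:
        for i, h in enumerate(heads):
            parent_tag = "ROOT" if h == 0 else tags[h - 1]
            seen.add((tags[i], parent_tag))  # the max/OR over occurrences
    return len(seen)
```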
31 Instead of counting pairs of a child token and a parent type, we could have counted pairs of a child token and a parent token by letting p range over all tokens rather than types. [sent-171, score-0.502]
32 On actual dependency trees, where each child has a unique parent, PR-AS and PR-S always yield the same value. [sent-175, score-0.274]
33 For PR-AS all parent tokens are collapsed, but we could also consider the case where all child tokens are collapsed. [sent-180, score-0.323]
34 For any distribution q over latent variables, we can define a penalty as the β-norm of the feature expectations, ||Eq [φ(X, Y)]||_β, where Y represents an assignment of parse trees for all sentences in the corpus X. [sent-237, score-0.261]
35 More precisely, we will penalize the following quantity: the sum (ℓ1 norm) over c of the maximum (ℓ∞ norm) over occurrences of c of the posterior probability of selecting a parent with tag p for that child. [sent-251, score-0.292]
36 For intermediate values of σ, the constraints work to decrease the confidence of the highest probability parent tags for each child instance. [sent-265, score-0.342]
37 For parent tags that are supported by many high-probability instances, this pressure is distributed among many instances and has little effect. [sent-266, score-0.234]
38 For parent tags that are supported by only a few high-probability instances, however, the probability of these instances is more severely reduced, which can (after several iterations of the algorithm) effectively eliminate that parent tag as a possibility for the given child tag. [sent-267, score-0.588]
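The penalized quantity can be made concrete with a short sketch that computes the ℓ1/ℓ∞ value from posterior edge probabilities; the per-sentence dictionary format for the posteriors is an assumed representation, and the aggregation over parent tokens follows the PR-AS convention described above.

```python
from collections import defaultdict

def l1_linf_penalty(corpus, posteriors):
    """corpus: list of tag sequences; posteriors: parallel list of dicts
    mapping (child_pos, parent_pos) -> q(parent_pos is the parent of
    child_pos), with 1-based positions and parent_pos 0 for the root.
    Returns the sum over (child tag, parent tag) pairs of the maximum,
    over child occurrences, of the posterior mass on that parent tag."""
    max_per_type = defaultdict(float)
    for tags, post in zip(corpus, posteriors):
        # Collapse parent tokens carrying the same tag (PR-AS style):
        # mass[(child_pos, parent_tag)] = sum of q over those tokens.
        mass = defaultdict(float)
        for (c, p), q in post.items():
            parent_tag = "ROOT" if p == 0 else tags[p - 1]
            mass[(c, parent_tag)] += q
        for (c, parent_tag), q in mass.items():
            key = (tags[c - 1], parent_tag)
            max_per_type[key] = max(max_per_type[key], q)
    return sum(max_per_type.values())
```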
39 (2008) use this method for dependency parsing with the DMV and achieve improvements over basic EM. [sent-330, score-0.25]
40 In particular we will show that while it achieves parameter sparsity, this is not the optimal sparsity to aim for in dependency parsing. [sent-343, score-0.172]
41 Intuitively, sparsity of pchild (c | p, d) means requiring that each parent tag has few unique child tags. [sent-344, score-0.553]
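The two notions of sparsity can be contrasted with a few lines of code: one counts unique child tags per parent (what parameter priors push toward), the other counts unique parent-child pairs overall (what the posterior penalty targets). The pair-count input format is assumed for illustration.

```python
from collections import defaultdict

def sparsity_profiles(pair_counts):
    """pair_counts: dict mapping (child_tag, parent_tag) -> corpus count.
    Returns (unique child tags per parent, total unique pairs): the first
    is what per-parent parameter sparsity limits, the second is what the
    posterior penalty limits."""
    children_per_parent = defaultdict(set)
    for (c, p), n in pair_counts.items():
        if n > 0:
            children_per_parent[p].add(c)
    per_parent = {p: len(children) for p, children in children_per_parent.items()}
    total_unique_pairs = sum(per_parent.values())
    return per_parent, total_unique_pairs
```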
42 For example, LN can tie the parameters pchild (c1 | p, d) and pchild (c2 | p, d). [sent-363, score-0.292]
43 In this case, parameters such as pchild (c | p1 , d) and pchild (c | p2 , d) can be tied even though they correspond to PCFG rules with different left-hand sides. [sent-365, score-0.26]
44 The fourth method has not yet been applied to the dependency parsing task we evaluate on in this work, so we defer direct comparison. [sent-371, score-0.25]
45 For example, for dependency parsing Smith and Eisner (2005a) formed neighborhoods by deleting any one word from x(i) , or transposing any two words. [sent-376, score-0.25]
46 One natural way to adapt it to dependency parsing would be to have an integer program that minimizes the number of parent-child tag pairs subject to the constraint that every sentence can still be assigned a complete parse tree. [sent-384, score-0.482]
47 Extending Stop Probabilities: The first extension conditions whether to stop generating dependents in a given direction on a larger set of previous decisions. [sent-395, score-0.164]
48 Specifically, the probability of stopping in a particular direction depends not only on whether there are any dependents in that direction already, but also on how many. [sent-396, score-0.164]
49 In the example of Figure 1, this corresponds to changing pstop ( f | V, r, f ) to pstop ( f | V, r, 0) and similarly for all the other stop probabilities. [sent-397, score-0.19]
50 The 0 in this case indicates that V has no other right dependents when it decides whether to continue generating right dependents. [sent-398, score-0.164]
51 A model with maximum stop valency S thus distinguishes between having 0, 1, . . . , S − 2, and ≥ S − 1 dependents in a given direction. [sent-402, score-0.164]
52 The basic DMV has maximum stop valency 2 because it distinguishes between having zero dependents and at least one dependent in a given direction. [sent-403, score-0.205]
53 A model with maximum stop valency of 3 would distinguish between having 0, 1, or at least 2 dependents in a particular direction. [sent-404, score-0.205]
54 In this case, when a head generates more dependents in a particular direction after its second dependent, the stopping distribution it draws from will always be the same—for head p and direction d this will be pstop (· | p, d, 2). [sent-405, score-0.349]
55 Again, what we condition on is how many other dependents were already generated in the same direction. [sent-410, score-0.164]
56 For the example in Figure 1, this means pchild (N | V, r) becomes pchild (N | V, r, 0) and similarly for all other pchild . [sent-411, score-0.39]
57 In later sections of this paper, when we talk about a model with maximum child valency C, this means we distinguish between having 0, 1, . . . , C − 2, and ≥ C − 1 dependents in a particular direction. [sent-412, score-0.149]
59 The basic DMV has maximum child valency 1 because it does not make these distinctions. [sent-416, score-0.149]
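Both valency extensions reduce to bucketing the number of already-generated dependents before indexing the parameter table; a minimal sketch of that bucketing (with hypothetical context tuples) follows.

```python
def stop_context(head, direction, n_deps_so_far, max_stop_valency):
    """Stop decisions condition on min(count, S - 1); the basic DMV has
    S = 2, i.e., it only distinguishes zero vs. at least one dependent."""
    return (head, direction, min(n_deps_so_far, max_stop_valency - 1))

def child_context(head, direction, n_deps_so_far, max_child_valency):
    """Child decisions condition on min(count, C - 1); the basic DMV has
    C = 1, so the bucket is always 0 (no valency dependence)."""
    return (head, direction, min(n_deps_so_far, max_child_valency - 1))

# For the example of Figure 1 with S = 2:
# stop_context("V", "right", 0, 2) == ("V", "right", 0)
```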
60 Thus, the third and final model extension we implement is to add a backoff for the child probabilities that does not condition on the identity of the parent POS (see Equation 10). [sent-419, score-0.268]
61 With this model extension, the order in which dependents are generated becomes relevant to the probability of an overall parse tree. [sent-420, score-0.212]
62 In cases where the identity of the rightmost and leftmost dependents have a greater influence on the true stop probability than the inner dependents, this ordering will work to the model’s advantage. [sent-422, score-0.164]
63 We do not investigate in this work which languages this holds true for, though changing this ordering might be one additional way to increase parsing accuracy for some languages. [sent-423, score-0.285]
64 To formally define these last four variables, first let Vc denote the model’s maximum child valency and let Vs denote maximum stop valency. [sent-427, score-0.149]
65 Further, let acpd be the number of yp's dependents that are further in direction yd than yc, and let axl (axr) be the total number of dependents of parent x to the left (right). [sent-428, score-0.525]
66 In the third model extension, the backoff for the child probability to a probability that does not depend on the parent POS, pchild(yc | yd, yvc), can formally be expressed as λ pchild(yc | yp, yd, yvc) + (1 − λ) pchild(yc | yd, yvc) (Equation 10), for λ ∈ [0, 1]. [sent-430, score-0.629]
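A hedged sketch of Equation 10, with hypothetical dictionaries p_child_full and p_child_backoff standing in for the two child distributions:

```python
def interpolated_child_prob(c, parent, direction, valency,
                            p_child_full, p_child_backoff, lam):
    """Equation 10: lam * p(c | parent, direction, valency)
                  + (1 - lam) * p(c | direction, valency),
    where the backoff term drops the parent POS and lam is in [0, 1]."""
    full = p_child_full[(parent, direction, valency)].get(c, 0.0)
    backoff = p_child_backoff[(direction, valency)].get(c, 0.0)
    return lam * full + (1.0 - lam) * backoff
```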
67 To capture the intuition that events seen fewer times should be more strongly smoothed, this prior has hyperparameter value K for the standard child probability and value 2K for the backoff probability, where K is the number of PCFG rules with a particular nonterminal on the left-hand side. [sent-438, score-0.149]
68 Table 2: Training corpus statistics for sentences with lengths ≤ 10, after stripping punctuation. [sent-446, score-0.177]
69 We conclude this overview of the experiments with two key points that we feel show PR to be a very useful and robust method for improving unsupervised dependency parsing: • All except one of the 60 PR settings we try for English result in higher accuracy than the best SDP setting. [sent-461, score-0.196]
70 Following Smith and Eisner (2006), we stripped punctuation from the sentences and kept only those sentences of length ≤ 10. [sent-476, score-0.196]
71 Figure 7 shows the accuracies on the English corpus broken down by POS tag category. [sent-515, score-0.206]
72 The plot shows that sharp changes in overall accuracy are in fact caused by even sharper changes in the attachment accuracies of the tag categories. [sent-516, score-0.21]
73 The problem continues to be very underspecified, and without knowing the “true” sparsity pattern of a language, we can ultimately only achieve limited parsing accuracy. [sent-518, score-0.216]
74 Figure 7: The accuracy (overall and for the Noun, Det, Number, Adj, Conj, and Prt tag types) in the English corpus as a function of ℓ1/ℓ∞ as we vary the constraint strength. [sent-588, score-0.31]
75 (In all experiments, we held out the last 100 sentences of each training corpus for development; the numbers in Table 2 correspond to this reduced training set size. [sent-594, score-0.177]
76 Our hypothesis was that the entropy of the distribution over contexts for each constituent should be small when parsing accuracy was high. [sent-608, score-0.199]
77 This might not select the ideal parameters for any particular language, but provides a more realistic test setting: a user has available a labeled corpus in one language, and would like to induce grammars for other languages of interest. [sent-683, score-0.203]
78 In this case, the strength σ_x for a particular language was found by the formula σ_x = σ_en · |tokens_en| / |tokens_x|, where σ_en is the best strength for English, |tokens_en| is the number of tokens in the English corpus, and |tokens_x| is the number of tokens in language x. [sent-687, score-0.149]
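Under the assumption that corpus token counts are available as plain integers, the transfer rule is a one-line helper; the numbers in the usage example are illustrative only.

```python
def transfer_strength(sigma_en, n_tokens_en, n_tokens_x):
    """sigma_x = sigma_en * |tokens_en| / |tokens_x|."""
    return sigma_en * n_tokens_en / n_tokens_x

# Illustrative numbers only: a strength tuned on a 50K-token English corpus,
# transferred to a 25K-token corpus, doubles.
assert transfer_strength(100.0, 50_000, 25_000) == 200.0
```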
79 Figure 13: Comparing DMV grammar ambiguities on the training data by computing the average number of parent tags per child tag (ℓ1 /ℓ∞ divided by number of child tags) and normalizing it by the theoretical maximum for each language. [sent-744, score-0.671]
80 PR tends not to make this error. [sent-754, score-0.189]
82 Figure 14: Posterior edge probabilities for an example sentence from the Spanish test corpus ("Una papelera es un objeto civilizado", tagged d nc vs d nc aq). [sent-769, score-0.246]
83 For example, consider the sentence “Lleva tiempo entenderlos” (translation: “It takes time to understand (them)”) with tags “main-verb common-noun main-verb”. [sent-773, score-0.172]
84 So, if the key is pt /pg → c, then accuracy is: acc = # of pt → c in Viterbi parses . [sent-779, score-0.176]
85 Presented with such scenarios, where there is no TO present to be the parent of VB, PR chooses the first VB as the parent of the second. [sent-970, score-0.238]
86 It maintains this preference for making the first VB a parent of the second when it encounters "VB TO VB" sequences, such as "used to eliminate", because it would have to pay an additional penalty to make TO the parent of the second VB. [sent-971, score-0.274]
87 Thus, if PR chose CD as parent of NN, it would have to pay an additional penalty to select another parent for NN in sentences where no CDs exist. [sent-975, score-0.372]
88 The tag M stands for “numeral” in the Bulgarian corpus, so this correction is similar to the English correction involving the tag CD. [sent-995, score-0.314]
89 The tag C stands for “conjunction” in the Bulgarian corpus, so this correction means the model is realizing verbs should usually be sentence roots rather than children of conjunctions. [sent-997, score-0.214]
90 Following the same reasoning about PR that we used before, we note that sentences with verbs but no conjunctions are very common, so if PR chose C as the parent of V, it would have to pay a penalty to give V a different parent in such sentences. [sent-998, score-0.372]
91 In contrast with previous approaches that impose a sparsity bias on the model parameters using sparsifying Dirichlet distributions, we impose a sparsity bias on the model posteriors. [sent-1007, score-0.185]
92 The regularization strength is strongly dependent on the corpus, both on the number of parent-child pairs being constrained and on the number of tokens for each parent and child. [sent-1023, score-0.22]
93 Headden et al. (2009) achieve significant improvements by conditioning the edge probabilities on the parent word together with the parent POS. [sent-1034, score-0.238]
94 That is, we would like to use POS tags induced in an unsupervised manner, instead of assuming gold POS tags, and see how robust our method is under these conditions. [sent-1037, score-0.156]
95 Further, it would be even more interesting to see how our method performs if we applied it to aid in the more complex task of joint induction of POS tags and dependency parses. [sent-1041, score-0.218]
96 Table 5: Directed attachment accuracy results on the test corpus (for sentences of lengths ≤ 10, no punctuation). [sent-1099, score-0.26]
97 The Baby Steps (BS) method starts by training the model on sentences of length 1; the parameters of this model are then used to initialize a training run over sentences of length 2, and so on. [sent-1125, score-0.196]
98 The second method, Less is More (LsM), uses information from the BS method to pick a sentence length that includes enough sentences to train a model with good predictive power, but leaves out longer sentences that do not add much information. [sent-1126, score-0.253]
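A sketch of the Baby Steps schedule described above; train_em is a placeholder for an EM training routine, and the cumulative length filter reflects one reading of the description rather than code from the cited work.

```python
def baby_steps(corpus, train_em, max_len=10):
    """Curriculum training: start with length-1 sentences, then reuse the
    learned parameters to initialize training on longer sentences."""
    params = None  # train_em is assumed to fall back to a default
                   # initializer when params is None
    for length in range(1, max_len + 1):
        subset = [sentence for sentence in corpus if len(sentence) <= length]
        if subset:
            params = train_em(subset, init_params=params)
    return params
```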
99 The results are similar to the best shared logistic normal prior when tested on sentences of length up to ten, but when tested on longer sentences the PR-trained models perform significantly better than all other approaches. [sent-1150, score-0.196]
100 Improving unsupervised dependency parsing with richer contexts and smoothing. [sent-1584, score-0.291]
wordName wordTfidf (topN-words)
[('dmv', 0.505), ('pr', 0.245), ('em', 0.236), ('sdp', 0.196), ('headden', 0.177), ('dependents', 0.164), ('parsing', 0.147), ('cpi', 0.136), ('pchild', 0.13), ('tag', 0.127), ('pos', 0.125), ('anchev', 0.123), ('ereira', 0.123), ('illenwater', 0.123), ('parent', 0.119), ('ependency', 0.116), ('nsupervised', 0.116), ('english', 0.116), ('tags', 0.115), ('child', 0.108), ('dependency', 0.103), ('sentences', 0.098), ('pstop', 0.095), ('grammar', 0.094), ('gra', 0.089), ('osterior', 0.089), ('languages', 0.086), ('prp', 0.082), ('corpus', 0.079), ('nn', 0.077), ('parsity', 0.076), ('ganchev', 0.073), ('ra', 0.071), ('sparsity', 0.069), ('cohen', 0.068), ('sln', 0.068), ('vs', 0.067), ('taskar', 0.067), ('vb', 0.064), ('qt', 0.063), ('cp', 0.063), ('adj', 0.061), ('bulgarian', 0.061), ('eq', 0.06), ('sentence', 0.057), ('dirichlet', 0.057), ('smith', 0.056), ('dev', 0.055), ('strength', 0.053), ('accuracy', 0.052), ('initialization', 0.049), ('bg', 0.048), ('tokens', 0.048), ('randomp', 0.048), ('parse', 0.048), ('dt', 0.048), ('sparsifying', 0.047), ('posterior', 0.046), ('head', 0.045), ('nc', 0.044), ('cz', 0.043), ('pt', 0.043), ('ambiguity', 0.043), ('se', 0.042), ('yc', 0.042), ('unsupervised', 0.041), ('interjection', 0.041), ('sda', 0.041), ('spitkovsky', 0.041), ('yvc', 0.041), ('valency', 0.041), ('backoff', 0.041), ('iii', 0.04), ('acc', 0.038), ('annealing', 0.038), ('grammars', 0.038), ('posteriors', 0.036), ('penalty', 0.036), ('yd', 0.036), ('viterbi', 0.035), ('cative', 0.035), ('spanish', 0.035), ('eqt', 0.034), ('pools', 0.034), ('proot', 0.034), ('aq', 0.034), ('vm', 0.034), ('particle', 0.034), ('kl', 0.033), ('nl', 0.033), ('tie', 0.032), ('gains', 0.032), ('abbreviation', 0.031), ('attachment', 0.031), ('correction', 0.03), ('priors', 0.029), ('nv', 0.029), ('eisner', 0.029), ('mcclosky', 0.029), ('noun', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
2 0.17092445 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
Author: Dorota Głowacka, John Shawe-Taylor, Alex Clark, Colin de la Higuera, Mark Johnson
Abstract: Grammar induction refers to the process of learning grammars and languages from data; this finds a variety of applications in syntactic pattern recognition, the modeling of natural language acquisition, data mining and machine translation. This special topic contains several papers presenting some of recent developments in the area of grammar induction and language learning, as applied to various problems in Natural Language Processing, including supervised and unsupervised parsing and statistical machine translation. Keywords: machine translation, Bayesian inference, grammar induction, natural language parsing
3 0.14234102 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
4 0.085017465 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
5 0.07626155 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
6 0.049509998 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
7 0.046190076 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
8 0.045299791 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
9 0.044349156 55 jmlr-2011-Learning Multi-modal Similarity
10 0.042144686 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
11 0.039664388 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
12 0.039639901 58 jmlr-2011-Learning from Partial Labels
13 0.038278703 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
14 0.036994763 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
15 0.035306223 6 jmlr-2011-A Simpler Approach to Matrix Completion
16 0.035224609 59 jmlr-2011-Learning with Structured Sparsity
17 0.034158695 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
18 0.034022965 54 jmlr-2011-Learning Latent Tree Graphical Models
19 0.033691306 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
20 0.032740518 61 jmlr-2011-Logistic Stick-Breaking Process
topicId topicWeight
[(0, 0.191), (1, -0.1), (2, -0.035), (3, 0.027), (4, -0.168), (5, 0.052), (6, 0.062), (7, 0.139), (8, 0.054), (9, -0.141), (10, -0.231), (11, -0.303), (12, -0.118), (13, -0.045), (14, 0.109), (15, 0.028), (16, 0.169), (17, 0.015), (18, -0.073), (19, 0.019), (20, -0.01), (21, 0.018), (22, 0.017), (23, -0.099), (24, 0.092), (25, 0.03), (26, 0.013), (27, -0.094), (28, 0.11), (29, 0.095), (30, 0.211), (31, 0.023), (32, -0.035), (33, 0.099), (34, 0.023), (35, -0.013), (36, -0.024), (37, 0.016), (38, -0.062), (39, -0.052), (40, -0.088), (41, -0.059), (42, 0.103), (43, -0.046), (44, 0.052), (45, -0.019), (46, -0.119), (47, 0.047), (48, 0.045), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.95632356 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
2 0.63964075 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
Author: Dorota Głowacka, John Shawe-Taylor, Alex Clark, Colin de la Higuera, Mark Johnson
Abstract: Grammar induction refers to the process of learning grammars and languages from data; this finds a variety of applications in syntactic pattern recognition, the modeling of natural language acquisition, data mining and machine translation. This special topic contains several papers presenting some of recent developments in the area of grammar induction and language learning, as applied to various problems in Natural Language Processing, including supervised and unsupervised parsing and statistical machine translation. Keywords: machine translation, Bayesian inference, grammar induction, natural language parsing
3 0.60886401 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
4 0.39825362 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
5 0.34701815 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
6 0.29802212 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
7 0.28866559 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
8 0.24413124 6 jmlr-2011-A Simpler Approach to Matrix Completion
9 0.22233215 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
10 0.20781094 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
11 0.18727328 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
12 0.186138 58 jmlr-2011-Learning from Partial Labels
13 0.1845586 12 jmlr-2011-Bayesian Co-Training
14 0.18388113 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
15 0.1806684 61 jmlr-2011-Logistic Stick-Breaking Process
16 0.17986132 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
17 0.17929944 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
18 0.17688262 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
19 0.17664376 55 jmlr-2011-Learning Multi-modal Similarity
20 0.17028469 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
topicId topicWeight
[(4, 0.046), (7, 0.083), (9, 0.024), (10, 0.027), (24, 0.056), (31, 0.09), (32, 0.028), (41, 0.025), (48, 0.255), (60, 0.015), (67, 0.01), (71, 0.019), (73, 0.033), (78, 0.056), (82, 0.049), (90, 0.027), (96, 0.016), (99, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.73024142 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
2 0.47380245 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
3 0.46712503 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
4 0.44135743 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
5 0.43891484 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
Author: Dorota Głowacka, John Shawe-Taylor, Alex Clark, Colin de la Higuera, Mark Johnson
Abstract: Grammar induction refers to the process of learning grammars and languages from data; this finds a variety of applications in syntactic pattern recognition, the modeling of natural language acquisition, data mining and machine translation. This special topic contains several papers presenting some of recent developments in the area of grammar induction and language learning, as applied to various problems in Natural Language Processing, including supervised and unsupervised parsing and statistical machine translation. Keywords: machine translation, Bayesian inference, grammar induction, natural language parsing
6 0.42721671 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
7 0.42363116 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
8 0.41985959 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
9 0.41927126 12 jmlr-2011-Bayesian Co-Training
10 0.41901317 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
11 0.41663048 16 jmlr-2011-Clustering Algorithms for Chains
12 0.41465336 91 jmlr-2011-The Sample Complexity of Dictionary Learning
13 0.41409209 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
14 0.41371334 48 jmlr-2011-Kernel Analysis of Deep Networks
15 0.41331899 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
16 0.41292503 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
17 0.41132042 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
18 0.40857083 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
19 0.40771839 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
20 0.40635899 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models