nips nips2012 nips2012-156 knowledge-graph by maker-knowledge-mining

156 nips-2012-Identifiability and Unmixing of Latent Parse Trees


Source: pdf

Author: Percy Liang, Daniel J. Hsu, Sham M. Kakade

Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Kakade (Microsoft Research), Percy Liang (Stanford University). Abstract: This paper explores unsupervised learning of parsing models along two directions. [sent-2, score-0.332]

2 We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. [sent-4, score-0.795]

3 EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. [sent-6, score-0.664]

4 We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. [sent-7, score-0.256]

5 1 Introduction Generative parsing models, which define joint distributions over sentences and their parse trees, are one of the core techniques in computational linguistics. [sent-8, score-0.462]

6 We are interested in the unsupervised learning of these models [2–6], where the goal is to estimate the model parameters given only examples of sentences. [sent-9, score-0.101]

7 In doing so, we confront a central challenge of parsing models—that the topology of the parse tree is unobserved and varies across sentences. [sent-12, score-0.86]

8 This is in contrast to standard phylogenetic models [8] and other latent tree models for which there is a single fixed global tree across all examples [9]. [sent-13, score-0.482]

9 A classic result of Kruskal [10] has been employed to prove the identifiability of a wide class of latent variable models, including hidden Markov models and certain restricted mixtures of latent tree models [11–13]. [sent-15, score-0.407]

10 However, these techniques cannot be directly applied to parsing models since the tree topology varies over an exponential set of possible topologies. [sent-16, score-0.832]

11 Instead, we turn to techniques from algebraic geometry [14–17]; we show that a simple numerical procedure can be used to check identifiability for a wide class of models in NLP. [sent-17, score-0.14]

12 Using this tool, we discover that probabilistic context-free grammars (PCFGs) are non-identifiable, but that simpler PCFG variants and dependency models are identifiable. [sent-18, score-0.3]

13 The most common way to estimate unsupervised parsing models is by using local techniques such as EM [18] or MCMC sampling [19], but these methods can suffer from local optima and slow mixing. [sent-19, score-0.391]

14 Meanwhile, recent work [1,20–23] has shown that spectral methods can be used to estimate mixture models and HMMs with provable guarantees. [sent-20, score-0.107]

15 These techniques express low-order moments of the observable distribution as a product of matrix parameters and use eigenvalue decomposition to recover these matrices. [sent-21, score-0.428]

16 However, these methods are not directly applicable to parsing models because the tree topology again varies non-trivially. [sent-22, score-0.8]

17 The main idea is to express moments of the observable distribution as a mixture over the possible topologies. [sent-24, score-0.278]

18 For restricted parsing models, the moments for a fixed tree structure can be written as a product of parameter matrices. [sent-25, score-0.686]

19 (a) A constituency tree consists of a hierarchical grouping of the words with a latent state zv for each node v. [sent-30, score-0.549]

20 (b) A dependency tree consists of a collection of directed edges between the words. [sent-31, score-0.333]

21 Importantly, our unmixing technique does not require the training sentences to be annotated with the tree topologies a priori, in contrast to recent extensions of [21] to learning PCFGs [24] and dependency trees [25, 26], which work on a fixed topology. [sent-34, score-0.891]

22 2 Notation. For a positive integer n, define [n] := {1, . . . , n}. [sent-35, score-0.311]

23 3 Parsing models. A sentence is a sequence of L words, x = (x_1, . . . , x_L). [sent-45, score-0.149]

24 Each word x_i ∈ [d] is one of d possible word types. [sent-48, score-0.098]

25 A (generative) parsing model defines a joint distribution Pθ (x, z) over a sentence x and its parse tree z (to be made precise later), where θ are the model parameters (a collection of multinomials). [sent-49, score-0.642]

26 Each parse tree z has a topology Topology(z) ∈ Topologies, which is unobserved and varies across sentences. [sent-50, score-0.56]

27 Two important classes of models of natural language syntax are constituency models, which represent a hierarchical grouping and labeling of the phrases of a sentence (e.g., Figure 1(a)). [sent-52, score-0.42]

28 Dependency models represent pairwise relationships between the words of a sentence (e.g., Figure 1(b)). [sent-54, score-0.316]

29 3.1 Constituency models. A constituency tree z = (V, s) consists of a set of nodes V and a collection of hidden states s = {s_v}_{v∈V}. [sent-58, score-0.529]

30 These nodes form a binary tree as follows: the root node is [0 : L] ∈ V , and for each node [i : j] ∈ V with j − i > 1, there exists a unique m with i < m < j defining the two children nodes [i : m] ∈ V and [m : j] ∈ V . [sent-61, score-0.301]

31 Perhaps the most well-known constituency parsing model is the probabilistic context-free grammar (PCFG). [sent-64, score-0.527]

32 The parameters of a PCFG are θ = (π, B, O), where π ∈ R^k specifies the initial state distribution, B ∈ R^{k²×k} specifies the binary production distributions, and O ∈ R^{d×k} specifies the emission distributions. [sent-65, score-0.098]

33 This generative process defines a joint probability over a sentence x and a parse tree z: P_θ(x, z) = |Topologies|^{-1} π_{s_[0:L]} ∏_{[i:m],[m:j]∈V} B_{(s_[i:m] ⊗_k s_[m:j]) s_[i:j]} ∏_{i=1}^{L} O_{x_i s_[i-1:i]} (1). We will also consider two variants of the PCFG with additional restrictions: PCFG-I. [sent-67, score-0.396]
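To make the generative story in (1) concrete, here is a minimal sampler sketch. It assumes the uniform-over-topologies variant written in (1), column-stochastic parameter matrices, and a child-pair index convention s_left · k + s_right for the rows of B; these conventions and the function names are mine, not the paper's.

```python
import numpy as np

def all_topologies(i, j):
    # Enumerate the binary bracketings of the span [i:j] as nested tuples
    # ((i, j), left_subtree, right_subtree); a leaf span is just (i, j).
    if j - i == 1:
        return [(i, j)]
    out = []
    for m in range(i + 1, j):
        for lt in all_topologies(i, m):
            for rt in all_topologies(m, j):
                out.append(((i, j), lt, rt))
    return out

def sample_sentence(pi, B, O, L, rng):
    # pi: (k,) root-state distribution; B: (k*k, k) child-pair given parent state;
    # O: (d, k) word given leaf state (columns are distributions).
    # The topology is drawn uniformly at random, as in equation (1).
    k = len(pi)
    topologies = all_topologies(0, L)
    tree = topologies[rng.integers(len(topologies))]
    words = [None] * L

    def expand(node, state):
        if len(node) == 2:                          # leaf span [i-1:i]: emit word x_i
            words[node[1] - 1] = rng.choice(O.shape[0], p=O[:, state])
            return
        pair = rng.choice(k * k, p=B[:, state])     # joint index of the two child states
        expand(node[1], pair // k)                  # left child state
        expand(node[2], pair % k)                   # right child state

    expand(tree, rng.choice(k, p=pi))
    return words, tree

# Usage: rng = np.random.default_rng(0); x, z = sample_sentence(pi, B, O, L=4, rng=rng)
```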

34 3.2 Dependency tree models. In contrast to constituency trees, which posit internal nodes with latent states, dependency trees connect the words directly. [sent-72, score-0.803]

35 A dependency tree z is a set of directed edges (i, j), where i, j ∈ [L] are distinct positions in the sentence. [sent-73, score-0.333]

36 We consider only projective dependency trees [27]: z is projective if for every path from i to j to k in z, we have that j and k are on the same side of i (that is, j − i and k − i have the same sign). [sent-75, score-0.321]

37 The generative process is as follows: choose a topology Topology(z) uniformly at random, generate the root word using π, and recursively generate argument words to the left and to the right given the parent word using A and A′, respectively. [sent-80, score-0.501]

38 Usually a PCFG induces a topology via a state-dependent probability of choosing a binary production versus an emission. [sent-86, score-0.347]

39 4 Identifiability. Our goal is to estimate model parameters θ_0 ∈ Θ given only access to sentences x ∼ P_{θ_0}. [sent-88, score-0.092]

40 We ask a basic question: in the limit of infinite data, is it information-theoretically possible to identify θ_0 from the observed moments µ(θ_0) := E_{θ_0}[φ(x)]? [sent-90, score-0.312]

41 To be more precise, define the equivalence class of θ0 to be the set of parameters θ that yield the same observed moments: SΘ (θ0 ) = {θ ∈ Θ : µ(θ) = µ(θ0 )}. [sent-91, score-0.086]

42 For example, PCFGs where all states have the same emission distribution O are indistinguishable regardless of the production distributions B. [sent-101, score-0.099]

43 Note that A diag(π) is an arbitrary d × d matrix whose entries sum to 1, which has d² − 1 degrees of freedom, whereas µ(θ) is a symmetric matrix whose entries sum to 1, which has d(d+1)/2 − 1 degrees of freedom. [sent-110, score-0.14]

44 The observed moments with respect to φ(x) = x_1 ⊗ x_2 form a d × d matrix, which places d² constraints on the k² + (d − 1)k parameters. [sent-115, score-0.348]

45 4.1 Observation functions. An observation function φ(x) and its associated observed moments µ(θ_0) = E_{θ_0}[φ(x)] reveal aspects of the distribution P_{θ_0}(x). [sent-119, score-0.352]

46 There is a tradeoff: Higher-order moments provide more information, but are harder to estimate reliably given finite data, and are also computationally more expensive. [sent-121, score-0.278]

47 4.2 Automatically checking identifiability. One immediate goal is to determine which models in Section 3 are identifiable from which of the observed moments (Section 4.1). [sent-126, score-0.389]

48 A powerful analytic tool that has been successfully applied in previous work is Kruskal’s theorem [10, 11], but (i) it does not immediately apply to models with random topologies, and (ii) it only gives sufficient conditions for identifiability, and cannot be used to determine non-identifiability. [sent-128, score-0.079]

49 We can write these constraints as h_{θ_0}(θ) = 0, where h_{θ_0}(θ) := µ(θ) − µ(θ_0) is a vector of m polynomials in θ. [sent-133, score-0.344]

50 Assume the parameter space Θ is a nonempty open connected subset of [0, 1]^n, and that the observed moments µ : R^n → R^m, with respect to the observation function φ, form a polynomial map. [sent-148, score-0.352]

51 The rows of J correspond to ∂Eθ [φj (x)]/∂θ and can be computed efficiently by adapting dynamic programs used in the E-step of an EM algorithm for parsing models. [sent-154, score-0.256]

52 Specifically, we adapt the CKY algorithm for constituency models and the algorithm of [27] for dependency models. [sent-156, score-0.499]
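As a rough illustration of this check, the sketch below estimates the Jacobian of a user-supplied moment map by central differences and inspects its rank at a few random parameter points. The paper computes the Jacobian exactly via the dynamic programs above and handles simplex constraints on θ; here moment_map, the tolerances, and the finite-difference step are all placeholders, so this is only an assumed-setup sketch of the rank test, not the paper's implementation.

```python
import numpy as np

def check_identifiability(moment_map, n_params, n_trials=5, eps=1e-6, tol=1e-7, seed=0):
    # moment_map: maps a parameter vector theta in (0, 1)^n_params to the vector of
    # observed moments mu(theta).  We declare local identifiability when the Jacobian
    # d mu / d theta has full column rank at a generic random point.
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        theta = rng.uniform(0.05, 0.95, size=n_params)
        m0 = np.asarray(moment_map(theta))
        J = np.zeros((m0.size, n_params))
        for i in range(n_params):
            step = np.zeros(n_params)
            step[i] = eps
            J[:, i] = (np.asarray(moment_map(theta + step))
                       - np.asarray(moment_map(theta - step))) / (2 * eps)
        if np.linalg.matrix_rank(J, tol=tol) == n_params:
            return True      # full rank at a generic point: locally identifiable
    return False             # rank-deficient at every sampled point
```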

53 Figure 2: Local identifiability of parsing models (table; rows: PCFG, PCFG-I / PCFG-IE, DEP-I, DEP-IE / DEP-IES; columns: observation functions φ12, φ123, φ123e1, φ**, φ***, φ***e1; representative entries: PCFG is "No, even from φall for L ∈ {3, 4, 5}", while the restricted variants are "Yes iff L ≥ 4" or "Yes iff L ≥ 3"). [sent-165, score-0.48]

54 4.4 Identifiability of constituency and dependency tree models. We checked the identifiability status of various constituency and dependency tree models using our implementation of CheckIdentifiability. [sent-178, score-1.333]

55 For these classes, we found that the sentence length L and choice of observation function can influence identifiability: Both models are identifiable for large enough L (e.g., L ≥ 4; see Figure 2). [sent-183, score-0.189]

56 The dependency models, DEP-I and DEP-IE, were all found to be identifiable for L ≥ 3 from second-order moments φ∗∗ . [sent-188, score-0.459]

57 The conditions for identifiability are less stringent than those for their constituency counterparts (PCFG-I and PCFG-IE), which is natural since dependency models are simpler, having no latent states. [sent-189, score-0.55]

58 Note that in all identifiable models, second-order moments suffice to determine the distribution—this is good news because low-order moments are easier to estimate. [sent-190, score-0.556]

59 5 Unmixing algorithms. Having established which parsing models are identifiable, we now turn to parameter estimation for these models. [sent-191, score-0.303]

60 For instance, for an HMM, the sliced third-order moments µ123η (θ) can be written as a product of parameter matrices in θ, and each matrix can be recovered by decomposing the product [1]. [sent-196, score-0.313]

61 For parsing models, the challenge is that the topology is random, so the moments are not a single product, but a mixture over products. [sent-197, score-0.835]

62 We will first present the general idea of unmixing (Section 5.1). [sent-199, score-0.213]

63 5.1 General case. We assume the observation function φ(x) consists of a collection of observation matrices {φ_o(x)}_{o∈O} (e.g., φ_12(x) = x_1 ⊗ x_2). [sent-204, score-0.08]

64 Given an observation matrix φ_o(x) and a topology t ∈ Topologies, consider the mapping that computes the observed moment conditioned on that topology. (Note that these subclasses occupy measure-zero subsets of the PCFG parameter space, which is expected given the non-identifiability of the general PCFG.) [sent-207, score-0.471]

65 We will develop our algorithms assuming true moments, i.e., that the empirical moments equal µ(θ_0). [sent-208, score-0.278]

66 The empirical moments converge to the true moments at O_p(n^{-1/2}), and standard matrix perturbation arguments can then be used to obtain finite-sample guarantees. [sent-209, score-0.591]

67 We call these mappings compound parameters, denoted {Ψp }p∈P . [sent-214, score-0.137]

68 Now write the observed moments as a weighted sum: µ_o(θ) = Σ_{p∈P} P(Ψ_{o,Topology} = Ψ_p) Ψ_p for all o ∈ O (4), where we define M_op := P(Ψ_{o,Topology} = Ψ_p) to be the probability mass over tree topologies that yield compound parameter Ψ_p. [sent-215, score-1.063]

69 Note that (4) defines a system of equations µ = M Ψ, where the variables are the compound parameters and the constraints are the observed moments. [sent-217, score-0.274]

70 In a sense, we have replaced the original system of polynomial equations (in θ) with a system of linear equations (in Ψ). [sent-218, score-0.084]

71 The key to the utility of this technique is that the number of compound parameters can be polynomial in L even when the number of possible topologies is exponential in L. [sent-219, score-0.372]

72 Previous analytic techniques [13] based on Kruskal’s theorem [10] cannot be applied here because the possible topologies are too many and too varied. [sent-220, score-0.213]

73 Note that the mixing equation µ = M Ψ holds for each sentence length L, but many compound parameters p appear in the equations for multiple values of L. [sent-221, score-0.362]

74 Therefore, we can combine the equations across all observed sentence lengths, yielding a more constrained system than if we considered the equations of each L separately. [sent-222, score-0.22]

75 The following proposition shows how we can recover θ by unmixing the observed moments µ: Proposition 1 (Unmixing). [sent-223, score-0.583]

76 Suppose that there exists an efficient base algorithm to recover θ from some subset of compound parameters {Ψp (θ) : p ∈ P0 }, and that ep is in the row space of M for each p ∈ P0 . [sent-224, score-0.22]

77 Retrieve the compound parameters Ψp (θ) = (M † µ)p for each p ∈ P0 . [sent-228, score-0.162]

78 For all our parsing models, M can be computed efficiently using dynamic programming (Appendix C). [sent-231, score-0.256]
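The retrieval step of Proposition 1 is just a linear solve against the mixing matrix. Below is a minimal sketch, assuming the observed moment matrices have already been estimated and stacked in the same row order as M (possibly pooled over several sentence lengths); the shapes, stacking convention, and names are mine, not the paper's.

```python
import numpy as np

def unmix_compound_parameters(M, mu, wanted):
    # M:      (num_observations, num_compound) mixing matrix of topology probabilities.
    # mu:     (num_observations, d, d) stacked observed moment matrices.
    # wanted: indices p in P0 whose compound parameters the base algorithm needs
    #         (each e_p must lie in the row space of M for the solve to be valid).
    num_obs, d, _ = mu.shape
    psi_flat = np.linalg.pinv(M) @ mu.reshape(num_obs, d * d)   # Psi = M^dagger mu
    psi = psi_flat.reshape(M.shape[1], d, d)
    return {p: psi[p] for p in wanted}
```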

79 For any η ∈ R^d, we can express the observed moments as a sum over the two possible topologies in Figure 1(a): µ_123η := E[x_1 ⊗ x_2 (η^⊤ x_3)], µ_132η := E[x_1 ⊗ x_3 (η^⊤ x_2)], and µ_231η := E[x_2 ⊗ x_3 (η^⊤ x_1)] are each an average 0.5 Ψ_i;η + 0.5 Ψ_j;η of two of the compound parameters Ψ_1;η = A diag(T diag(π) A^⊤ η) A^⊤, Ψ_2;η = A diag(π) T diag(A^⊤ η) A^⊤, Ψ_3;η = A diag(A^⊤ η) T diag(π) A^⊤; compactly in matrix form, the observed moments µ_η = (µ_123η, µ_132η, µ_231η) satisfy µ_η = M Ψ_η for the compound parameters Ψ_η = (Ψ_1;η, Ψ_2;η, Ψ_3;η). [sent-237–243]

83 Let us observe µ_η at two different values of η, say at η = 1 and η = τ for some random τ. [sent-250, score-0.162]

84 Since the mixing matrix M is invertible, we can obtain the compound parameters Ψ2;1 = (M −1 µ1 )2 and Ψ2;τ = (M −1 µτ )2 . [sent-251, score-0.253]

85 Now we will recover θ from Ψ_2;1 and Ψ_2;τ by first extracting A = OT via an eigenvalue decomposition, and then recovering π, T, and O in turn (all up to the same unknown permutation) via elementary matrix operations. [sent-252, score-0.093]

86 Let M1 , M2 ∈ Rd×k have full column rank and D be a diagonal matrix with distinct diagonal entries. [sent-254, score-0.127]

87 This produces AΠS for some permutation matrix Π and diagonal scaling S. [sent-264, score-0.101]
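A minimal sketch of this generic decomposition step, stated for observed matrices of the form P1 = M1 M2^⊤ and P2 = M1 D M2^⊤ as in the lemma; how Ψ_2;1 and Ψ_2;τ are mapped onto (P1, P2) follows the paper's lemma and is not spelled out here. The top-k selection assumes D's entries are nonzero, and the use of NumPy's eig is my own choice.

```python
import numpy as np

def decompose(P1, P2, k):
    # If P1 = M1 @ M2.T and P2 = M1 @ np.diag(D) @ M2.T with M1, M2 (d x k) of full
    # column rank and D having distinct, nonzero entries, then
    # P2 @ pinv(P1) = M1 @ diag(D) @ pinv(M1), so the columns of M1 are eigenvectors
    # of that product associated with its k nonzero eigenvalues (the entries of D).
    prod = P2 @ np.linalg.pinv(P1)
    eigvals, eigvecs = np.linalg.eig(prod)
    keep = np.argsort(-np.abs(eigvals))[:k]        # drop the (d - k) zero eigenvalues
    return np.real(eigvecs[:, keep])               # = M1 * Pi * S (permutation, scaling)
```

Applied to the running example, this returns the matrix AΠS referred to in sentence 87, from which π, T, and O are then recovered by elementary matrix operations.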

88 To exploit sentences of different lengths, we can compute a mixing matrix M which includes constraints from sentences of length 1 ≤ L ≤ L_max, for some upper bound L_max. [sent-272, score-0.261]

89 We can retrieve the same compound parameters (Ψ_2;1 and Ψ_2;τ) from the pseudoinverse of M and proceed as before. [sent-274, score-0.2]

90 Our goal is to recover the parameters θ = (π, A). [sent-277, score-0.083]

91 Define µ_1 = E[x_1] = π; µ_12 = E[x_1 ⊗ x_2] and µ_13 = E[x_1 ⊗ x_3], each equal to 7^{-1} times a sum over the seven possible topologies of products of A, A^⊤, and D = diag(π) (terms such as DA^⊤, AD, ADA^⊤, and AAD); and µ̃_12 = Ẽ[x_1 ⊗ x_2] = 2^{-1}(DA^⊤ + AD), where Ẽ[·] is taken with respect to length-2 sentences. [sent-279, score-1.124]

92 By selectively combining the moments above, we can compute AA + A = [7(µ_13 − µ_12) + 2µ̃_12] diag(µ_1)^{-1}. [sent-281, score-0.278]
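For concreteness, here is a small sketch that forms this combination from empirical moments. The input format (lists of word-id tuples), the assumption that µ_1 is estimated from the length-3 sentences and that every word id occurs as x_1, and the variable names are mine; the remaining step of extracting A from the resulting matrix, which the paper handles separately, is not shown.

```python
import numpy as np

def moment_combination(sentences3, sentences2, d):
    # sentences3: list of (x1, x2, x3) word ids from length-3 sentences.
    # sentences2: list of (x1, x2) word ids from length-2 sentences.
    # Returns the matrix [7*(mu13 - mu12) + 2*mu12_tilde] @ diag(mu1)^{-1}.
    mu1 = np.zeros(d)
    mu12 = np.zeros((d, d))
    mu13 = np.zeros((d, d))
    for x1, x2, x3 in sentences3:
        mu1[x1] += 1.0
        mu12[x1, x2] += 1.0
        mu13[x1, x3] += 1.0
    n3 = max(1, len(sentences3))
    mu1, mu12, mu13 = mu1 / n3, mu12 / n3, mu13 / n3
    mu12_tilde = np.zeros((d, d))
    for x1, x2 in sentences2:
        mu12_tilde[x1, x2] += 1.0
    mu12_tilde /= max(1, len(sentences2))
    return (7.0 * (mu13 - mu12) + 2.0 * mu12_tilde) @ np.diag(1.0 / mu1)
```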

93 6 Discussion. In this work, we have shed some light on the identifiability of standard generative parsing models using our numerical identifiability checker. [sent-289, score-0.365]

94 Given the ease with which this checker can be applied, we believe it should be a useful tool for analyzing more sophisticated models [6], as well as developing new ones which are expressive yet identifiable. [sent-290, score-0.079]

95 We have made some progress on closing the gap between identifiability and efficient estimation with our unmixing technique, which can deal with models where the tree topology varies non-trivially. [sent-292, score-0.757]

96 A method of moments for mixture models and hidden Markov models. [sent-298, score-0.358]

97 Two experiments on learning probabilistic dependency grammars from corpora. [sent-308, score-0.253]

98 Corpus-based induction of syntactic structure: Models of dependency and constituency. [sent-325, score-0.21]

99 Identifiability of parameters in latent structure models with many observed variables. [sent-358, score-0.157]

100 Three new probabilistic models for dependency parsing: An exploration. [sent-467, score-0.228]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('topology', 0.301), ('def', 0.281), ('moments', 0.278), ('constituency', 0.271), ('parsing', 0.256), ('identi', 0.231), ('unmixing', 0.213), ('pcfg', 0.212), ('heck', 0.189), ('topologies', 0.181), ('dependency', 0.181), ('tree', 0.152), ('dentifiability', 0.142), ('compound', 0.137), ('diag', 0.134), ('ability', 0.116), ('pcfgs', 0.115), ('yes', 0.107), ('parse', 0.107), ('jacobian', 0.103), ('sentence', 0.102), ('da', 0.095), ('grammars', 0.072), ('ecompose', 0.071), ('lmax', 0.071), ('mop', 0.071), ('trees', 0.068), ('sentences', 0.067), ('kruskal', 0.063), ('spectral', 0.06), ('recover', 0.058), ('mixing', 0.056), ('iff', 0.056), ('ad', 0.056), ('hsu', 0.055), ('latent', 0.051), ('word', 0.049), ('allman', 0.047), ('rhodes', 0.047), ('models', 0.047), ('production', 0.046), ('ot', 0.046), ('varies', 0.044), ('able', 0.043), ('equations', 0.042), ('node', 0.042), ('ada', 0.042), ('sv', 0.041), ('observation', 0.04), ('rd', 0.04), ('xl', 0.038), ('retrieve', 0.038), ('acl', 0.036), ('projective', 0.036), ('constraints', 0.036), ('rm', 0.036), ('matrix', 0.035), ('generative', 0.035), ('degrees', 0.035), ('root', 0.034), ('permutation', 0.034), ('kakade', 0.034), ('algebraic', 0.034), ('observed', 0.034), ('words', 0.033), ('subclasses', 0.033), ('nlp', 0.033), ('phylogenetic', 0.033), ('freedom', 0.033), ('hidden', 0.033), ('tool', 0.032), ('diagonal', 0.032), ('techniques', 0.032), ('anandkumar', 0.032), ('appendix', 0.032), ('children', 0.031), ('checked', 0.031), ('integer', 0.03), ('foster', 0.03), ('checking', 0.03), ('technique', 0.029), ('unsupervised', 0.029), ('syntactic', 0.029), ('rank', 0.028), ('moment', 0.028), ('klein', 0.028), ('liang', 0.027), ('polynomials', 0.027), ('optima', 0.027), ('equivalence', 0.027), ('emission', 0.027), ('collins', 0.027), ('numerical', 0.027), ('multiply', 0.026), ('aa', 0.026), ('parents', 0.026), ('mixtures', 0.026), ('rk', 0.026), ('states', 0.026), ('parameters', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

Author: Percy Liang, Daniel J. Hsu, Sham M. Kakade

Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. 1

2 0.21376412 334 nips-2012-Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs

Author: Michael Collins, Shay B. Cohen

Abstract: We describe an approach to speed-up inference with latent-variable PCFGs, which have been shown to be highly effective for natural language parsing. Our approach is based on a tensor formulation recently introduced for spectral estimation of latent-variable PCFGs coupled with a tensor decomposition algorithm well-known in the multilinear algebra literature. We also describe an error bound for this approximation, which gives guarantees showing that if the underlying tensors are well approximated, then the probability distribution over trees will also be well approximated. Empirical evaluation on real-world natural language parsing data demonstrates a significant speed-up at minimal cost for parsing performance. 1

3 0.18247375 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade

Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1

4 0.15642057 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

Author: Odalric-ambrym Maillard

Abstract: This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiositydriven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. 1

5 0.12155712 180 nips-2012-Learning Mixtures of Tree Graphical Models

Author: Anima Anandkumar, Furong Huang, Daniel J. Hsu, Sham M. Kakade

Abstract: We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable is hidden and each mixture component can have a potentially different Markov graph structure and parameters over the observed variables. We propose a novel method for estimating the mixture components with provable guarantees. Our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. The sample and computational requirements for our method scale as poly(p, r), for an r-component mixture of pvariate graphical models, for a wide class of models which includes tree mixtures and mixtures over bounded degree graphs. Keywords: Graphical models, mixture models, spectral methods, tree approximation.

6 0.11105148 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

7 0.10015754 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

8 0.090682529 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

9 0.084312573 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

10 0.083643444 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

11 0.078215033 260 nips-2012-Online Sum-Product Computation Over Trees

12 0.07500869 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

13 0.069753237 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

14 0.066081345 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

15 0.062668212 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

16 0.062431842 206 nips-2012-Majorization for CRFs and Latent Likelihoods

17 0.062363561 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

18 0.058109712 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

19 0.056247711 346 nips-2012-Topology Constraints in Graphical Models

20 0.055814255 267 nips-2012-Perceptron Learning of SAT


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.169), (1, 0.027), (2, 0.018), (3, -0.045), (4, -0.088), (5, -0.03), (6, -0.02), (7, -0.049), (8, -0.139), (9, 0.054), (10, 0.071), (11, 0.054), (12, 0.043), (13, -0.063), (14, -0.123), (15, 0.057), (16, 0.069), (17, 0.113), (18, 0.122), (19, 0.035), (20, -0.023), (21, 0.107), (22, -0.016), (23, -0.07), (24, -0.038), (25, 0.094), (26, -0.022), (27, -0.126), (28, 0.072), (29, 0.062), (30, 0.113), (31, 0.017), (32, -0.001), (33, 0.025), (34, -0.017), (35, -0.048), (36, -0.094), (37, -0.069), (38, 0.125), (39, 0.06), (40, -0.134), (41, -0.088), (42, -0.089), (43, -0.055), (44, 0.039), (45, -0.144), (46, 0.049), (47, -0.035), (48, 0.127), (49, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94484037 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

Author: Percy Liang, Daniel J. Hsu, Sham M. Kakade

Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. 1

2 0.74547666 334 nips-2012-Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs

Author: Michael Collins, Shay B. Cohen

Abstract: We describe an approach to speed-up inference with latent-variable PCFGs, which have been shown to be highly effective for natural language parsing. Our approach is based on a tensor formulation recently introduced for spectral estimation of latent-variable PCFGs coupled with a tensor decomposition algorithm well-known in the multilinear algebra literature. We also describe an error bound for this approximation, which gives guarantees showing that if the underlying tensors are well approximated, then the probability distribution over trees will also be well approximated. Empirical evaluation on real-world natural language parsing data demonstrates a significant speed-up at minimal cost for parsing performance. 1

3 0.5834983 180 nips-2012-Learning Mixtures of Tree Graphical Models

Author: Anima Anandkumar, Furong Huang, Daniel J. Hsu, Sham M. Kakade

Abstract: We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable is hidden and each mixture component can have a potentially different Markov graph structure and parameters over the observed variables. We propose a novel method for estimating the mixture components with provable guarantees. Our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. The sample and computational requirements for our method scale as poly(p, r), for an r-component mixture of pvariate graphical models, for a wide class of models which includes tree mixtures and mixtures over bounded degree graphs. Keywords: Graphical models, mixture models, spectral methods, tree approximation.

4 0.5542863 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade

Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1

5 0.54876769 267 nips-2012-Perceptron Learning of SAT

Author: Alex Flint, Matthew Blaschko

Abstract: Boolean satisfiability (SAT) as a canonical NP-complete decision problem is one of the most important problems in computer science. In practice, real-world SAT sentences are drawn from a distribution that may result in efficient algorithms for their solution. Such SAT instances are likely to have shared characteristics and substructures. This work approaches the exploration of a family of SAT solvers as a learning problem. In particular, we relate polynomial time solvability of a SAT subset to a notion of margin between sentences mapped by a feature function into a Hilbert space. Provided this mapping is based on polynomial time computable statistics of a sentence, we show that the existence of a margin between these data points implies the existence of a polynomial time solver for that SAT subset based on the Davis-Putnam-Logemann-Loveland algorithm. Furthermore, we show that a simple perceptron-style learning rule will find an optimal SAT solver with a bounded number of training updates. We derive a linear time computable set of features and show analytically that margins exist for important polynomial special cases of SAT. Empirical results show an order of magnitude improvement over a state-of-the-art SAT solver on a hardware verification task. 1

6 0.52613413 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

7 0.50540048 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

8 0.48838377 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

9 0.48097196 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

10 0.47110906 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

11 0.46924618 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

12 0.46839938 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

13 0.46074533 260 nips-2012-Online Sum-Product Computation Over Trees

14 0.41151494 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

15 0.41111678 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

16 0.40455922 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

17 0.40207484 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

18 0.39768544 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

19 0.38461637 206 nips-2012-Majorization for CRFs and Latent Likelihoods

20 0.36812806 34 nips-2012-Active Learning of Multi-Index Function Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.066), (9, 0.3), (14, 0.021), (21, 0.012), (38, 0.092), (39, 0.016), (42, 0.021), (54, 0.014), (55, 0.022), (74, 0.057), (76, 0.125), (80, 0.109), (92, 0.056), (98, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.80805337 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process

Author: Lloyd Elliott, Yee W. Teh

Abstract: We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. The partitions at consecutive locations in the genome are related by the splitting and merging of their clusters. Our model can be thought of as a discrete analogue of the continuous fragmentation-coagulation process [Teh et al 2011], preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable. We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining accuracies comparable to other state-of-the-art genotype imputation methods. 1

same-paper 2 0.74103034 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

Author: Percy Liang, Daniel J. Hsu, Sham M. Kakade

Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. 1

3 0.71276009 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters

Author: Argyris Kalogeratos, Aristidis Likas

Abstract: Learning the number of clusters is a key problem in data clustering. We present dip-means, a novel robust incremental method to learn the number of data clusters that can be used as a wrapper around any iterative clustering algorithm of k-means family. In contrast to many popular methods which make assumptions about the underlying cluster distributions, dip-means only assumes a fundamental cluster property: each cluster to admit a unimodal distribution. The proposed algorithm considers each cluster member as an individual ‘viewer’ and applies a univariate statistic hypothesis test for unimodality (dip-test) on the distribution of distances between the viewer and the cluster members. Important advantages are: i) the unimodality test is applied on univariate distance vectors, ii) it can be directly applied with kernel-based methods, since only the pairwise distances are involved in the computations. Experimental results on artificial and real datasets indicate the effectiveness of our method and its superiority over analogous approaches.

4 0.67489696 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

Author: Minjie Xu, Jun Zhu, Bo Zhang

Abstract: We present a probabilistic formulation of max-margin matrix factorization and build accordingly a nonparametric Bayesian model which automatically resolves the unknown number of latent factors. Our work demonstrates a successful example that integrates Bayesian nonparametrics and max-margin learning, which are conventionally two separate paradigms and enjoy complementary advantages. We develop an efficient variational algorithm for posterior inference, and our extensive empirical studies on large-scale MovieLens and EachMovie data sets appear to justify the aforementioned dual advantages. 1

5 0.66481602 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

Author: Abner Guzmán-rivera, Dhruv Batra, Pushmeet Kohli

Abstract: We address the problem of generating multiple hypotheses for structured prediction tasks that involve interaction with users or successive components in a cascaded architecture. Given a set of multiple hypotheses, such components/users typically have the ability to retrieve the best (or approximately the best) solution in this set. The standard approach for handling such a scenario is to first learn a single-output model and then produce M -Best Maximum a Posteriori (MAP) hypotheses from this model. In contrast, we learn to produce multiple outputs by formulating this task as a multiple-output structured-output prediction problem with a loss-function that effectively captures the setup of the problem. We present a max-margin formulation that minimizes an upper-bound on this lossfunction. Experimental results on image segmentation and protein side-chain prediction show that our method outperforms conventional approaches used for this type of scenario and leads to substantial improvements in prediction accuracy. 1

6 0.57874978 75 nips-2012-Collaborative Ranking With 17 Parameters

7 0.57427526 197 nips-2012-Learning with Recursive Perceptual Representations

8 0.57120079 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

9 0.5677864 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

10 0.56740868 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

11 0.56618768 168 nips-2012-Kernel Latent SVM for Visual Recognition

12 0.56356215 200 nips-2012-Local Supervised Learning through Space Partitioning

13 0.56331748 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

14 0.56169975 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

15 0.56153315 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

16 0.56143188 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

17 0.56041801 65 nips-2012-Cardinality Restricted Boltzmann Machines

18 0.55935764 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

19 0.55854785 188 nips-2012-Learning from Distributions via Support Measure Machines

20 0.55771595 279 nips-2012-Projection Retrieval for Classification