nips nips2011 nips2011-267 knowledge-graph by maker-knowledge-mining

267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure


Source: pdf

Author: Animashree Anandkumar, Kamalika Chaudhuri, Daniel J. Hsu, Sham M. Kakade, Le Song, Tong Zhang

Abstract: This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i. [sent-13, score-0.33]

2 , the graph of how the underlying hidden variables are connected to each other and to the observed variables). [sent-15, score-0.244]

3 We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. [sent-16, score-0.259]

4 Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution. [sent-17, score-0.246]

5 At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics. [sent-19, score-1.1]

6 While the body of techniques for probabilistic inference in graphical models is rather rich [1], current methods for tackling the more challenging problems of parameter and structure estimation are less developed and understood, especially in the presence of latent (hidden) variables. [sent-23, score-0.16]

7 Figure 1: The four possible (undirected) tree topologies over leaves {z1, z2, z3, z4}: (a) {{z1, z2}, {z3, z4}}; (b) {{z1, z3}, {z2, z4}}; (c) {{z1, z4}, {z2, z3}}; (d) the star topology {{z1, z2, z3, z4}}. [sent-29, score-0.247]

8 This work focuses on learning the structure of multivariate latent tree graphical models. [sent-30, score-0.388]

9 Here, the underlying graph is a directed tree (e. [sent-31, score-0.219]

10 , hidden Markov model, binary evolutionary tree), and only samples from a set of (multivariate) observed variables (the leaves of the tree) are available for learning the structure. [sent-33, score-0.346]

11 Generally speaking, methods for learning latent tree structure exploit structural properties afforded by the tree that are revealed through certain statistical tests over every choice of four variables in the tree. [sent-35, score-0.675]

12 Some early methods for learning tree structure are based on the use of exact correlation statistics or distance measurements (e. [sent-37, score-0.248]

13 Subsequent work in the area of mathematical phylogenetics has focused on the sample complexity of evolutionary tree reconstruction [17, 15, 18, 19]. [sent-42, score-0.309]

14 Finally, recent work in machine learning has developed structure learning methods for latent tree graphical models that extend beyond the discrete distributions of evolutionary trees [21], thereby widening their applicability to other problem domains. [sent-44, score-0.485]

15 This work extends beyond previous studies, which have focused on latent tree models with either discrete or scalar Gaussian variables, by directly addressing the multivariate setting where hidden and observed nodes may be random vectors rather than scalars. [sent-45, score-0.581]

16 , the random vector associated with a node need not follow a distribution that corresponds to a tree model), as well as other characteristics of the node distributions (e. [sent-48, score-0.238]

17 , some nodes in the tree could have discrete state spaces and others continuous, as in a Gaussian mixture model). [sent-50, score-0.264]

18 We propose the Spectral Recursive Grouping algorithm for learning multivariate latent tree structure. [sent-51, score-0.312]

19 The algorithm has at its core a multivariate spectral quartet test, which extends the classical quartet tests for scalar variables by applying spectral techniques from multivariate statistics (specifically canonical correlation analysis [22, 23]). [sent-52, score-1.361]

20 We use the spectral quartet test in a simple modification of the recursive grouping algorithm of [21] to perform the tree reconstruction. [sent-54, score-0.897]

21 The algorithm is essentially a robust method for reasoning about the results of quartet tests (viewed simply as hypothesis tests); the tests either confirm or reject hypotheses about the relative topology over quartets of variables. [sent-55, score-0.675]

22 By carefully choosing which tests to consider and properly interpreting their results, the algorithm is able to recover the correct latent tree structure (with high probability) in a provably efficient manner, in terms of both computational and sample complexity. [sent-56, score-0.411]

23 The recursive grouping procedure is similar to the short quartet method from phylogenetics [15], which also guarantees efficient reconstruction in the context of evolutionary trees. [sent-57, score-0.661]

24 Finally, we note that while we do not directly address the question of parameter estimation, provable parameter estimation methods may be derived using the spectral techniques from [24, 25]. [sent-59, score-0.14]

25 1 Preliminaries Latent variable tree models Let T be a connected, directed tree graphical model with leaves Vobs := {x1 , x2 , . [sent-61, score-0.54]

26 The leaves are termed the observed variables and the internal nodes hidden variables. [sent-68, score-0.319]

27 Each observed variable x ∈ Vobs is modeled as a random vector in Rd, and each hidden variable h ∈ Vhid as a random vector in Rk. [sent-71, score-0.229]

28 The joint distribution over all the variables VT := Vobs ∪ Vhid is assumed to satisfy conditional independence properties specified by the tree structure over the variables. [sent-72, score-0.288]

29 For each hidden child g ∈ ChildrenT (h) ∩ Vhid , there exists a matrix A(g|h) ∈ Rk×k such that E[g|h] = A(g|h) h; and for each observed child x ∈ ChildrenT (h) ∩ Vobs , there exists a matrix C(x|h) ∈ Rd×k such that E[x|h] = C(x|h) h. [sent-78, score-0.349]
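
To make this condition concrete, the following is a minimal sketch (not taken from the paper; the dimensions, noise levels, and the matrices A and C_i are arbitrary illustrative choices) of sampling the four observed vectors of Figure 1(a) from a small linear-Gaussian tree with hidden nodes h and g, so that E[g|h] = A h and E[z_i|parent] = C_i · parent:

    import numpy as np

    rng = np.random.default_rng(0)
    k, d, N = 2, 5, 50000                            # hidden dim, observed dim, sample size

    A = rng.normal(size=(k, k))                      # A(g|h); full rank k almost surely
    C = [rng.normal(size=(d, k)) for _ in range(4)]  # C(z_i|parent); rank k almost surely

    def sample_once():
        h = rng.normal(size=k)                       # root hidden vector
        g = A @ h + 0.1 * rng.normal(size=k)         # E[g|h] = A h  (Condition 1)
        z1 = C[0] @ h + 0.1 * rng.normal(size=d)     # E[z1|h] = C_1 h
        z2 = C[1] @ h + 0.1 * rng.normal(size=d)
        z3 = C[2] @ g + 0.1 * rng.normal(size=d)     # E[z3|g] = C_3 g
        z4 = C[3] @ g + 0.1 * rng.normal(size=d)
        return z1, z2, z3, z4

    samples = [sample_once() for _ in range(N)]
    Z = [np.stack(cols) for cols in zip(*samples)]   # four (N, d) arrays of observations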

30 We refer to the class of tree graphical models satisfying Condition 1 as linear tree models. [sent-79, score-0.443]

31 Such models include a variety of continuous and discrete tree distributions (as well as hybrid combinations of the two, such as Gaussian mixture models) which are widely used in practice. [sent-80, score-0.247]

32 Continuous linear tree models include linear-Gaussian models and Kalman filters. [sent-81, score-0.228]

33 In the discrete case, suppose that the observed variables take on d values, and hidden variables take k values. [sent-82, score-0.354]

34 Thus, discrete linear tree models include discrete hidden Markov models [25] and Markovian evolutionary trees [24]. [sent-84, score-0.535]

35 In addition to the linearity, the following conditions are assumed in order to recover the hidden tree structure. [sent-85, score-0.319]

36 For all h ∈ Vhid and hidden child g ∈ ChildrenT (h) ∩ Vhid , A(g|h) has rank k. [sent-94, score-0.219]

37 The rank condition is a generalization of parameter identifiability conditions in latent variable models [28, 24, 25] which rules out various (provably) hard instances in discrete variable settings [24]. [sent-97, score-0.265]

38 Furthermore, there exists ρmax < 1 such that for each pair of distinct hidden variables h, g ∈ Vhid, det(E[hg′])² / (det(E[hh′]) det(E[gg′])) ≤ ρ²max < 1. [sent-101, score-0.284]

39 The requirement for each hidden node to have three neighbors is natural; otherwise, the hidden node can be eliminated. [sent-102, score-0.371]
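
As a numerical illustration of this non-redundancy condition (a sketch only: hidden variables are not observed in practice, so the statistic can only be evaluated on simulated data where samples of h and g are available), the left-hand side can be computed as:

    import numpy as np

    def non_redundancy(H, G):
        # H, G: (N, k) arrays of samples of the hidden vectors h and g.
        # Returns det(E[h g'])^2 / (det(E[h h']) det(E[g g'])); values close to 1 mean
        # h and g are nearly deterministically related and should be merged into one node.
        N = H.shape[0]
        E_hg = H.T @ G / N
        E_hh = H.T @ H / N
        E_gg = G.T @ G / N
        return np.linalg.det(E_hg) ** 2 / (np.linalg.det(E_hh) * np.linalg.det(E_gg))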

40 First, note that ρmax ≤ 1, and that if ρmax = 1 is achieved with some h and g, then h and g are completely correlated, implying the existence of a deterministic map between hidden nodes h and g; hence simply merging the two nodes into a single node h (or g) resolves this issue. [sent-104, score-0.23]

41 Therefore the non-redundancy condition simply means that any two hidden nodes h and g cannot be further reduced to a single node. [sent-105, score-0.23]

42 Clearly, this condition is necessary for the goal of identifying the correct tree structure, and it is satisfied as soon as h and g have limited correlation in just a single direction. [sent-106, score-0.326]

43 Previous works [13, 29] show that an analogous condition ensures identifiability for general latent tree models (and in fact, the conditions are identical in the Gaussian case). [sent-107, score-0.334]

44 Our learning guarantees also require a correlation condition that generalizes the explicit depth conditions considered in the phylogenetics literature [15, 24]. [sent-109, score-0.199]

45 To state this condition, first define Fh to be the set of subtrees of T that remain after a hidden variable h ∈ Vhid is removed from T (see Figure 2). [sent-110, score-0.237]

46 Also, for any subtree T′ of T, let Vobs[T′] ⊆ Vobs be the observed variables in T′. [sent-111, score-0.196]

47 There exists γmin > 0 such that for all hidden variables h ∈ Vhid and all triples of subtrees {T1, T2, T3} ⊆ Fh in the forest obtained if h is removed from T, max_{x1∈Vobs[T1], x2∈Vobs[T2], x3∈Vobs[T3]} min_{{i,j}⊂{1,2,3}} σk(E[xi xj′]) ≥ γmin. [sent-113, score-0.339]

48 The quantity γmin is related to the effective depth of T, which is the maximum graph distance between a hidden variable and its closest observed variable [15, 21]. [sent-114, score-0.278]

49 The effective depth is at most logarithmic in the number of variables (as achieved by a complete binary tree), though it can also be a constant if every hidden variable is close to an observed variable (e. [sent-115, score-0.349]

50 , in a hidden Markov model, the effective depth is 1, even though the true depth, or diameter, is m + 1). [sent-117, score-0.18]

51 Finally, also define γmax := max_{{x1,x2}⊆Vobs} σ1(E[x1 x2′]) to be the largest spectral norm of any second-moment matrix between observed variables. [sent-119, score-0.217]

52 In this work, the Euclidean norm of a vector x is denoted by ‖x‖, and the (induced) spectral norm of a matrix A is denoted by ‖A‖, i.e., ‖A‖ = max_{‖x‖=1} ‖Ax‖. [sent-121, score-0.14]

53 Input: For each pair {i, j} ⊂ {1, 2, 3, 4}, an empirical estimate Σ̂i,j of the second-moment matrix E[zi zj′] and a corresponding confidence parameter ∆i,j > 0. [sent-125, score-0.362]

54 Output: Either a pairing {{zi, zj}, {zi′, zj′}} or ⊥. [sent-126, score-0.817]

55 1: if there exists a partition of {z1, z2, z3, z4} = {zi, zj} ∪ {zi′, zj′} such that ∏_{s=1}^{k} [σs(Σ̂i,j) − ∆i,j]+ [σs(Σ̂i′,j′) − ∆i′,j′]+ > ∏_{s=1}^{k} [σs(Σ̂i,i′) + ∆i,i′] [σs(Σ̂j,j′) + ∆j,j′], then return the pairing {{zi, zj}, {zi′, zj′}}. [sent-127, score-1.569]
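
A minimal sketch of this test (an illustrative reimplementation, not the authors' code; for safety it requires the lower-bounded product for a pairing to beat the upper-bounded products of both alternative pairings, a slightly more conservative check than the single comparison in step 1 above):

    import numpy as np

    def detk_lower(S, delta, k):
        # product over the top-k singular values of [sigma_s(S) - delta]_+
        sv = np.linalg.svd(S, compute_uv=False)[:k]
        return np.prod(np.clip(sv - delta, 0.0, None))

    def detk_upper(S, delta, k):
        # product over the top-k singular values of [sigma_s(S) + delta]
        sv = np.linalg.svd(S, compute_uv=False)[:k]
        return np.prod(sv + delta)

    def spectral_quartet_test(Sigma, Delta, k):
        # Sigma[(i, j)]: empirical second-moment matrix of (z_i, z_j), i < j;
        # Delta[(i, j)]: its confidence width.  Returns a pairing such as
        # ((0, 1), (2, 3)), or None in place of the abstention symbol.
        pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
        for cand in pairings:
            (i, j), (a, b) = cand
            lhs = (detk_lower(Sigma[(i, j)], Delta[(i, j)], k) *
                   detk_lower(Sigma[(a, b)], Delta[(a, b)], k))
            rhs = max(detk_upper(Sigma[(p, q)], Delta[(p, q)], k) *
                      detk_upper(Sigma[(r, s)], Delta[(r, s)], k)
                      for (p, q), (r, s) in pairings if ((p, q), (r, s)) != cand)
            if lhs > rhs:
                return cand
        return None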

56 3 Spectral quartet tests This section describes the core of our learning algorithm, a spectral quartet test that determines the topology of the subtree induced by four observed variables {z1, z2, z3, z4}. [sent-129, score-1.382]

57 Our quartet test either returns the correct induced subtree among possibilities in Figure 1(a)–(c); or it outputs ⊥ to indicate abstinence. [sent-131, score-0.652]

58 If the test returns ⊥, then no guarantees are provided on the induced subtree topology. [sent-132, score-0.22]

59 If it does return a subtree, then the output is guaranteed to be the correct induced subtree (with high probability). [sent-133, score-0.252]

60 The quartet test proposed is described in Algorithm 1 (SpectralQuartetTest). [sent-134, score-0.416]

61 The quartet test is defined with respect to four observed variables Z := {z1 , z2 , z3 , z4 }. [sent-139, score-0.55]

62 The output of the test is either ⊥ or a pairing of the variables {{zi, zj}, {zi′, zj′}}. [sent-143, score-0.94]

63 For example, if the output is the pairing {{z1, z2}, {z3, z4}}, then Figure 1(a) is the output topology. [sent-144, score-0.185]

64 Even though the configuration in Figure 1(d) is a possibility, the spectral quartet test never returns {{z1 , z2 , z3 , z4 }}, as there is no correct pairing of Z. [sent-145, score-0.797]

65 The topology {{z1 , z2 , z3 , z4 }} can be viewed as a degenerate case of {{z1 , z2 }, {z3 , z4 }} (say) where the hidden variables h and g are deterministically identical, and Condition 3 fails to hold with respect to h and g. [sent-146, score-0.288]

66 3.1 Properties of the spectral quartet test With exact second moments: The spectral quartet test is motivated by the following lemma, which shows the relationship between the singular values of second-moment matrices of the zi's and the induced topology among them in the latent tree. [sent-148, score-1.653]

67 Let detk(M) := ∏_{s=1}^{k} σs(M) denote the product of the k largest singular values of a matrix M. [sent-149, score-0.408]

68 Suppose that the observed variables Z = {z1 , z2 , z3 , z4 } have the true induced tree topology shown in Figure 1(a), and the tree model satisfies Condition 1 and Condition 2. [sent-151, score-0.626]

69 Then detk(E[z1 z3′]) detk(E[z2 z4′]) / (detk(E[z1 z2′]) detk(E[z3 z4′])) = detk(E[z1 z4′]) detk(E[z2 z3′]) / (detk(E[z1 z2′]) detk(E[z3 z4′])) = det(E[hg′])² / (det(E[hh′]) det(E[gg′])) ≤ 1 (1), and detk(E[z1 z3′]) detk(E[z2 z4′]) = detk(E[z1 z4′]) detk(E[z2 z3′]). [sent-152, score-2.226]
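
This identity can be checked numerically with exact second moments on a small linear-Gaussian instance (a sketch with arbitrary random parameters; z1, z2 are children of h and z3, z4 are children of g, as in Figure 1(a)):

    import numpy as np

    rng = np.random.default_rng(1)
    k, d = 2, 5
    A = rng.normal(size=(k, k))                      # g = A h + noise
    C = [rng.normal(size=(d, k)) for _ in range(4)]  # z_i = C_i * parent + noise

    E_hh = np.eye(k)                                 # h ~ N(0, I)
    E_gg = A @ E_hh @ A.T + 0.01 * np.eye(k)         # second moment of g (noise variance 0.01)
    E_hg = E_hh @ A.T                                # E[h g'] = E[h h'] A'
    second = {("h", "h"): E_hh, ("g", "g"): E_gg, ("h", "g"): E_hg, ("g", "h"): E_hg.T}
    par = ["h", "h", "g", "g"]                       # parents of z1..z4

    def E_zz(i, j):
        return C[i] @ second[(par[i], par[j])] @ C[j].T   # exact E[z_i z_j']

    def detk(M):
        return np.prod(np.linalg.svd(M, compute_uv=False)[:k])

    lhs = detk(E_zz(0, 2)) * detk(E_zz(1, 3)) / (detk(E_zz(0, 1)) * detk(E_zz(2, 3)))
    rhs = np.linalg.det(E_hg) ** 2 / (np.linalg.det(E_hh) * np.linalg.det(E_gg))
    print(lhs, rhs)                                  # the two ratios agree and are <= 1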

70 If for all pairs {zi, zj} ⊂ Z and all s ∈ [k], σs(Σ̂i,j) − ∆i,j ≤ σs(E[zi zj′]) ≤ σs(Σ̂i,j) + ∆i,j, and if SpectralQuartetTest returns a pairing {{zi, zj}, {zi′, zj′}}, then {{zi, zj}, {zi′, zj′}} = {{z1, z2}, {z3, z4}}. [sent-157, score-2.251]

71 In other words, the spectral quartet test never returns an incorrect pairing as long as the singular values of E[zi zj′] lie in an interval of length 2∆i,j around the singular values of Σ̂i,j. [sent-158, score-1.165]

72 The lemma below shows how to set the ∆i,j s as a function of N , δ and properties of the distributions of zi and zj so that this required event holds with probability at least 1 − δ. [sent-159, score-0.656]

73 If each empirical second-moment matrix Σ̂i,j is computed using N iid copies of zi and zj, and if d̄i,j := (E[‖zi‖² ‖zj‖²] − tr(E[zi zj′] E[zi zj′]′)) / max{‖E[‖zj‖² zi zi′]‖, ‖E[‖zi‖² zj zj′]‖} and ti,j := 1.55 ln(24 d̄i,j / δ), [sent-164, score-1.324]

74 and if ∆i,j ≥ √( 2 max{‖E[‖zj‖² zi zi′]‖, ‖E[‖zi‖² zj zj′]‖} ti,j / N ) + Mi Mj ti,j / (3N), then with probability 1 − δ, for all pairs {zi, zj} ⊂ Z and all s ∈ [k], σs(Σ̂i,j) − ∆i,j ≤ σs(E[zi zj′]) ≤ σs(Σ̂i,j) + ∆i,j. (2) [sent-165, score-2.883]

75 Conditions for returning a correct pairing: The conditions under which SpectralQuartetTest returns an induced topology (as opposed to ⊥) are now provided. [sent-166, score-0.239]
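
A sketch of how the width ∆i,j of this bound could be estimated in practice, plugging empirical moments in place of the population quantities (an approximation not analyzed here) and taking Mi, Mj to be empirical bounds on ‖zi‖ and ‖zj‖:

    import numpy as np

    def confidence_width(Zi, Zj, delta):
        # Zi, Zj: (N, d) arrays of iid samples of z_i and z_j; delta: failure probability.
        # Returns a plug-in estimate of Delta_{i,j} following the displayed bound.
        N = Zi.shape[0]
        E_ij = Zi.T @ Zj / N                               # empirical E[z_i z_j']
        ni2, nj2 = (Zi ** 2).sum(axis=1), (Zj ** 2).sum(axis=1)
        B = max(np.linalg.norm((Zi.T * nj2) @ Zi / N, 2),  # ~ ||E[||z_j||^2 z_i z_i']||
                np.linalg.norm((Zj.T * ni2) @ Zj / N, 2))  # ~ ||E[||z_i||^2 z_j z_j']||
        d_bar = ((ni2 * nj2).mean() - np.trace(E_ij @ E_ij.T)) / B
        t = 1.55 * np.log(24.0 * d_bar / delta)
        M_i, M_j = np.sqrt(ni2.max()), np.sqrt(nj2.max())  # empirical bounds on ||z_i||, ||z_j||
        return np.sqrt(2.0 * B * t / N) + M_i * M_j * t / (3.0 * N)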

76 An important quantity in this analysis is the level of non-redundancy between the hidden variables h and g. [sent-167, score-0.202]

77 Define ρ := det(E[hg′])² / (det(E[hh′]) det(E[gg′])). (3) If Figure 1(a) is the correct induced topology among {z1, z2, z3, z4}, then the smaller ρ is, the greater the gap between detk(E[z1 z2′]) detk(E[z3 z4′]) and either of detk(E[z1 z3′]) detk(E[z2 z4′]) and detk(E[z1 z4′]) detk(E[z2 z3′]). [sent-169, score-1.295]

78 Therefore, ρ also governs how small the ∆i,j need to be for the quartet test to return a correct pairing; this is quantified in Lemma 4. [sent-170, score-0.534]

79 Suppose that (i) the observed variables Z = {z1 , z2 , z3 , z4 } have the true induced tree topology shown in Figure 1(a); (ii) the tree model satisfies Condition 1, Condition 2, and ρ < 1 (where ρ is defined in (3)), and (iii) the confidence bounds in (2) hold for all {i, j} and all s ∈ [k]. [sent-173, score-0.626]

80 If ∆i,j < (1/(8k)) · min{1, 1/ρ − 1} · min_{{i,j}} σk(E[zi zj′]) for each pair {i, j}, then SpectralQuartetTest returns the correct pairing {{z1, z2}, {z3, z4}}. [sent-174, score-0.603]

81 4 The Spectral Recursive Grouping algorithm The Spectral Recursive Grouping algorithm, presented as Algorithm 2, uses the spectral quartet test discussed in the previous section to estimate the structure of a multivariate latent tree distribution from iid samples of the observed leaf variables. [sent-175, score-1.005]

82 In particular, we assume the spectral quartet tests use these quantities. [sent-177, score-0.612]

83 1: let R := Vobs , and for all x ∈ R, T [x] := rooted single-node tree x and L[x] := {x}. [sent-181, score-0.226]

84 5: if result = “siblings” then 6: Create a new variable h, create subtree T [h] rooted at h by joining T [u] and T [v] to h with edges {h, u} and {h, v}, and set L[h] := L[u] ∪ L[v]. [sent-185, score-0.193]

85 8: else if result = “u is parent of v” then 9: Modify subtree T [u] by joining T [v] to u with an edge {u, v}, and modify L[u] := L[u] ∪ L[v]. [sent-187, score-0.218]

86 RG builds the tree in a bottom-up fashion, where the initial working set of variables are the observed variables. [sent-192, score-0.353]

87 The variables in the working set always correspond to roots of disjoint subtrees of T discovered by the algorithm. [sent-193, score-0.224]

88 If the variables are combined as siblings, then a new hidden variable is introduced as their parent and is added to the working set, and its children are removed. [sent-196, score-0.337]

89 If the variables are combined as neighbors (parent/child), then the child is removed from the working set. [sent-197, score-0.211]

90 The process repeats until the entire tree is constructed. [sent-198, score-0.188]

91 Our modification of RG uses the spectral quartet tests from Section 3 to decide which subtree roots in the current working set to combine. [sent-199, score-0.77]

92 The subroutine is also used by the subroutine Relationship (Algorithm 4) which determines whether a candidate pair of variables should be merged as neighbors (parent/child) or as siblings: essentially, to check if u is a parent of v, it checks if v is a sibling of each child of u. [sent-202, score-0.361]

93 The use of unreliable estimates of long-range correlations is avoided by only considering highly-correlated variables as candidate pairs to merge (where correlation is measured using observed variables in their corresponding subtrees as proxies). [sent-203, score-0.314]

94 This leads to a sample-efficient algorithm for recovering the hidden tree structure. [sent-204, score-0.319]
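
The overall control flow of the procedure can be summarized with the following skeleton (an illustrative paraphrase of Algorithm 2, not the authors' code; candidate_pairs and relationship stand in for the correlation-based candidate selection and the quartet-test-based Relationship subroutine described above):

    def spectral_recursive_grouping(observed, candidate_pairs, relationship):
        # observed: iterable of observed variable ids; candidate_pairs(R, L) yields root
        # pairs to try, most correlated first; relationship(R, L, T, u, v) returns one of
        # "siblings", "u is parent of v", "v is parent of u", or "not mergeable".
        R = set(observed)                        # working set: roots of discovered subtrees
        T = {x: ("leaf", x) for x in observed}   # subtree rooted at each working variable
        L = {x: {x} for x in observed}           # observed leaves below each working variable
        fresh = 0
        while len(R) > 1:
            merged = False
            for u, v in candidate_pairs(R, L):
                result = relationship(R, L, T, u, v)
                if result == "siblings":         # introduce a new hidden parent h
                    fresh += 1
                    h = ("hidden", fresh)
                    T[h], L[h] = ("join", h, T[u], T[v]), L[u] | L[v]
                    R -= {u, v}; R.add(h)
                    merged = True
                elif result == "u is parent of v":
                    T[u], L[u] = ("join", u, T[u], T[v]), L[u] | L[v]
                    R.discard(v)
                    merged = True
                elif result == "v is parent of u":
                    T[v], L[v] = ("join", v, T[v], T[u]), L[v] | L[u]
                    R.discard(u)
                    merged = True
                if merged:
                    break                        # rescan candidates over the new working set
            if not merged:
                break                            # no pair can be merged; stop
        return T, R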

95 Assume the directed tree graphical model T over variables (random vectors) VT = Vobs ∪ Vhid satisfies Conditions 1, 2, 3, and 4. [sent-208, score-0.337]

96 Input: Set of nodes R; leaf sets L[v] for all v ∈ R; rooted subtrees T [v] for all v ∈ R; distinct u, v ∈ R. [sent-217, score-0.195]

97 7: else if “u ￿→ v” was asserted then return “v is parent of u” (“v → u”). [sent-229, score-0.173]

98 If N > 200·k²·B·t·(γmax / (γ²min·(1 − ρmax)))² + 7·k·M²·t·(γmax / (γ²min·(1 − ρmax))), then with probability at least 1 − η, the Spectral Recursive Grouping algorithm returns a tree T̂ with the same undirected graph structure as T. [sent-233, score-0.274]

99 The theorem reveals that the sample complexity of the algorithm depends solely on intrinsic spectral properties of the distribution. [sent-235, score-0.14]

100 Identifiability of parameters in latent structure models with many observed variables. [sent-430, score-0.155]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('quartet', 0.387), ('detk', 0.371), ('vobs', 0.355), ('zj', 0.339), ('zi', 0.283), ('vhid', 0.242), ('tree', 0.188), ('spectral', 0.14), ('pairing', 0.139), ('hidden', 0.131), ('mergeable', 0.113), ('siblings', 0.097), ('spectralquartettest', 0.097), ('topology', 0.086), ('tests', 0.085), ('subtree', 0.083), ('childrent', 0.081), ('grouping', 0.078), ('det', 0.078), ('subtrees', 0.078), ('recursive', 0.075), ('variables', 0.071), ('evolutionary', 0.064), ('latent', 0.064), ('child', 0.064), ('condition', 0.062), ('subroutine', 0.062), ('multivariate', 0.06), ('returns', 0.057), ('phylogenetics', 0.057), ('txi', 0.057), ('hh', 0.055), ('parent', 0.055), ('assert', 0.052), ('working', 0.052), ('induced', 0.051), ('return', 0.05), ('depth', 0.049), ('graphical', 0.047), ('correct', 0.045), ('joining', 0.044), ('mxi', 0.043), ('maxxi', 0.043), ('leaf', 0.042), ('observed', 0.042), ('steel', 0.039), ('discrete', 0.039), ('rooted', 0.038), ('leaves', 0.038), ('singular', 0.037), ('nodes', 0.037), ('gg', 0.037), ('else', 0.036), ('max', 0.035), ('trees', 0.034), ('lemma', 0.034), ('hg', 0.033), ('vt', 0.033), ('asserted', 0.032), ('kely', 0.032), ('quartets', 0.032), ('directed', 0.031), ('dence', 0.031), ('correlation', 0.031), ('rg', 0.03), ('structural', 0.029), ('test', 0.029), ('structure', 0.029), ('kamalika', 0.028), ('variable', 0.028), ('anandkumar', 0.026), ('bxi', 0.026), ('markov', 0.026), ('node', 0.025), ('logs', 0.024), ('mossel', 0.024), ('rutgers', 0.024), ('neighbors', 0.024), ('iid', 0.024), ('rank', 0.024), ('exists', 0.024), ('hsu', 0.024), ('chaudhuri', 0.023), ('enjoyed', 0.023), ('governs', 0.023), ('fh', 0.023), ('roots', 0.023), ('pair', 0.023), ('output', 0.023), ('xj', 0.022), ('kakade', 0.022), ('erd', 0.022), ('pennsylvania', 0.021), ('siddiqi', 0.021), ('con', 0.021), ('four', 0.021), ('pairs', 0.021), ('models', 0.02), ('relationship', 0.02), ('sz', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

Author: Animashree Anandkumar, Kamalika Chaudhuri, Daniel J. Hsu, Sham M. Kakade, Le Song, Tong Zhang

Abstract: This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics. 1

2 0.19773071 276 nips-2011-Structured sparse coding via lateral inhibition

Author: Arthur D. Szlam, Karol Gregor, Yann L. Cun

Abstract: This work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified. 1

3 0.15109302 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

Author: Le Song, Eric P. Xing, Ankur P. Parikh

Abstract: Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach. 1

4 0.1264199 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li

Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1

5 0.11725656 288 nips-2011-Thinning Measurement Models and Questionnaire Design

Author: Ricardo Silva

Abstract: Inferring key unobservable features of individuals is an important task in the applied sciences. In particular, an important source of data in fields such as marketing, social sciences and medicine is questionnaires: answers in such questionnaires are noisy measures of target unobserved features. While comprehensive surveys help to better estimate the latent variables of interest, aiming at a high number of questions comes at a price: refusal to participate in surveys can go up, as well as the rate of missing data; quality of answers can decline; costs associated with applying such questionnaires can also increase. In this paper, we cast the problem of refining existing models for questionnaire data as follows: solve a constrained optimization problem of preserving the maximum amount of information found in a latent variable model using only a subset of existing questions. The goal is to find an optimal subset of a given size. For that, we first define an information theoretical measure for quantifying the quality of a reduced questionnaire. Three different approximate inference methods are introduced to solve this problem. Comparisons against a simple but powerful heuristic are presented. 1 Contribution A common goal in the applied sciences is to measure concepts of interest that are not directly observable (Bartholomew et al., 2008). Such is the case in the social sciences, medicine, economics and other fields, where quantifying key attributes such as “consumer satisfaction,” “anxiety” and “recession” requires the development of indicators: observable variables that are postulated to measure the target latent variables up to some measurement error (Bollen, 1989; Carroll et al., 1995). In a probabilistic framework, this often boils down to a latent variable model (Bishop, 1998). One common setup is to assume each observed indicator Yi as being generated independently given the set of latent variables X. Conditioning on any given observed data point Y gives information about the distribution of the latent vector X, which can then be used for ranking, clustering, visualization or smoothing, among other tasks. Figure 1 provides an illustration. Questionnaires from large surveys are sometimes used to provide such indicators, each Yi recording an answer that typically corresponds to a Bernoulli or ordinal variable. For instance, experts can be given questions concerning whether there is freedom of press in a particular nation, as a way of measuring its democratization level (Bollen, 1989; Palomo et al., 2007). Nations can then be clustering or ranked within an interpretable latent space. Long questionnaires have nevertheless drawbacks, as summarized by Stanton et al. (2002) in the context of psychometric studies: Longer surveys take more time to complete, tend to have more missing data, and have higher refusal rates than short surveys. Arguably, then, techniques to reducing the length of scales while maintaining psychometric quality are wortwhile. 1 Factor scores: countries in the latent space Y2 Y3 Y4 Y5 0 Y1 5 X2 (Democratization) X1 (Industrialization) Democratization 10 Dem1960 Dem1965 1 5 9 13 18 23 28 33 38 43 48 53 58 63 68 73 Country (ordered by industrialization factor) (a) (b) Figure 1: (a) A graphical representation of a latent variable model. Notice that in general latent variables will be dependent. 
Here, the question is how to quantify democratization and industrialization levels of nations given observed indicators Y such as freedom of press and gross national product, among others (Bollen, 1989; Palomo et al., 2007). (b) An example of a result implied by the model (adapted from Palomo et al. (2007)): barplots of the conditional distribution of democratization levels given the observed indicators at two time points, ordered by the posterior mean industrialization level. The distribution of the latent variables given the observations is the basis of the analysis. Our contribution is a methodology for choosing which indicators to preserve (e.g., which items to keep in a questionnaire) given: i.) a latent variable model specification of the domain of interest; ii.) a target number of indicators that should be preserved. To accomplish this, we provide: i.) a target objective function that quantifies the amount of information preserved by a choice of a subset of indicators, with respect to the full set; ii.) algorithms for optimizing this choice of subset with respect to the objective function. The general idea is to start with a target posterior distribution of latent variables, defined by some latent variable measurement model M (i.e., PM (X | Y)). We want to choose a subset Yz ⊂ Y so that the resulting conditional distribution PM (X | Yz ) is as close as possible to the original one according to some metric. Model M is provided either by expertise or by numerous standard approaches that can be applied to learn it from data (e.g., methods in Bishop, 2009). We call this task measurement model thinning. Notice that the size of Yz is a domain-dependent choice. Assuming M is a good model for the data, choosing a subset of indicators will incur some information loss. It is up to the analyst to choose a trade-off between loss of information and the design of simpler, cheaper ways of measuring latent variables. Even if a shorter questionnaire is not to be deployed, the outcome of measurement model thinning provides a formal sensitivity analysis of the target latent distribution with respect to the available indicators. The result is useful to generate different insights into the domain. This paper is organized as follows: Section 2 defines a formal criterion to quantify how appropriate a subset Yz is. Section 3 describes different approaches in which this criterion can be optimized. Related work is briefly discussed in Section 4. Experiments with synthetic and real data are discussed in Section 5, followed by the conclusion. 2 An Information-Theoretical Criterion Our focus is on domains where latent variables are not a by-product of a dimensionality reduction technique, but the target of the analysis as in the example of Figure 1. That is, measurement error problems where the variables to be recorded are designed specifically to obtain information concerning such unknowns (Carroll et al., 1995; Bartholomew et al., 2008). As such, we postulate that the outcome of any analysis should be a functional of PM (X | Y), the conditional distribution of unobservables X given observables Y within a model M. It is assumed that M specifies the joint PM (X, Y). We further assume that observed variables are conditionally independent given X, i.e. p PM (X, Y) = PM (X) i=1 PM (Yi | X), with p being the number of observed indicators. 2 If z ≡ (z1 , z2 , . . . 
, zp ) is a binary vector of the same dimensionality as Y, and Yz is the subset of Y corresponding the non-zero entries of z, we can assess z by the KL divergence PM (X | Y) dX KL(PM (X | Y) || PM (X | Yz )) ≡ PM (X | Y) log PM (X | Yz ) This is well-defined, since both distributions lie in the same sample space despite the difference of dimensionality between Y and Yz . Moreover, since Y is itself a random vector, our criterion becomes the expected KL divergence KL(PM (X | Y) || PM (X | Yz )) PM (Y) where · denotes expectation. Our goal is to minimize this function with respect to z. Rearranging this expression to drop all constants that do not depend on z, and multiplying it by −1 to get a maximization problem, we obtain the problem of finding z⋆ such that z⋆ = argmaxz log(PM (Yz | X)) PM (X,Yz ) − log(PM (Yz )) PM (Yz ) p zi log(PM (Yi | X)) = argmaxz ≡ + HM (Yz ) i=1 argmaxz FM (z) PM (X,Yi ) p i=1 zi = K for a choice of K, and zi ∈ {0, 1}. HM (·) denotes here the entropy of subject to a distribution parameterized by M. Notice we used the assumption that indicators are mutually independent given X. There is an intuitive appeal of having a joint entropy term to reward not only marginal relationships between indicators and latent variables, but also selections that are jointly diverse. Notice that optimizing this objective function turns out to be equivalent to minimizing the conditional entropy of latent variables given Yz . Motivating conditional entropy from a more fundamental principle illustrates that other functions can be obtained by changing the divergence. 3 Approaches for Approximate Optimization The problem of optimizing FM (z) subject to the constraints p zi = K, zi ∈ {0, 1}, is hard i=1 not only for its combinatorial nature, but due to the entropy term. This needs to be approximated, and the nature of the approximation should depend on the form taken by M. We will assume that it is possible to efficiently compute any marginals of PM (Y) of modest dimensionality (say, 10 dimensions). This is the case, for instance, in the probit model for binary data: X ∼ N (0, Σ), Yi⋆ ∼ N (ΛT X + λi;0 , 1), i Yi = 1, if Yi⋆ > 0, and 0 otherwise where N (m, S) is the multivariate Gaussian distribution with mean m and covariance matrix S. The probit model is one of the most common latent variable models for questionnaire data (Bartholomew et al., 2008), with a straigthforward extension to ordinal data. In this model, marginals for a few dozen variables can be obtained efficiently since this corresponds to calculating multivariate Gaussian probabilities (Genz, 1992). Parameters can be fit by a variety of methods (Hahn et al., 2010). We also assume that M allows for the computation of log(PM (Yi | X)) PM (X,Yi ) at little cost. Again, in the binary probit model this is simple, since this requires integrating away a single binary variable Yi and a univariate Gaussian ΛT X. i 3.1 Gaussian Entropy One approximation to FM (z) is to replace its entropy term by the corresponding entropy from some Gaussian distribution PN (Yz ). The entropy of a Gaussian distribution is proportional to the logarithm of the determinant of its covariance matrix, and hence can be computed in O(p3 ) steps. This Gaussian can be chosen as the one closest to PM (Yz ) in a KL(PM || PN ) sense: that is, the one with the same first and second moments as PM (Yz ). In our case, computing these moments can be done deterministically (up to numerical error) using standard bivariate quadrature methods. 
No expectation-propagation (Minka, 2001) is necessary. The corresponding objective function is p zi log(PM (Yi | X)) FM;N (z) ≡ i=1 3 PM (X,Yi ) + 0.5 log |Σz | where Σz is the covariance matrix of Yz – which for binary and ordinal data has a sensible interpretation. This function is also an upper bound on the exact function, FM (z), since the Gaussian is the distribution with the largest entropy for a given mean vector and covariance matrix. The resulting function is non-linear in z. In our experiments, we optimize for z using a greedy scheme: for all possible pairs (i, j) such that zi = 1 and zj = 0, we swap its values (so that i zi is always K). We choose the pair with the highest increase in FM;N (z) and repeat the process until convergence. 3.2 Entropy with Bounded Neighborhoods An alternative bound can be derived from a standard fact in information theory: H(Y | S) ≤ H(Y | S ′ ) for S ′ ⊆ S, where H(· | ·) denotes conditional entropy. This was exploited by Globerson and Jaakkola (2007) to define an upper bound in the entropy of a distribution as follows: consider a permutation e of the set {1, 2, . . . , p}, with e(i) being the i-th element of e. Denote by e(1 : i) the first i elements of this permutation (an empty set if i < 1). Moreover, let N (e, i) be a subset of e(1 : i − 1). For a given set variables Y = {Y1 , Y2 , . . . , Yp } the following bound holds: p n H(Ye(i) | YN (e,i) ) H(Ye(i) | Ye(1:i−1) ) ≤ H(Y1 , Y2 , . . . Yp ) = (1) i=1 i=1 If each set N (e, i) is no larger than some constant D, then this bound can be computed in O(p · 2D ) steps for binary probit models. The bound holds for any choice of e, but we want it to be as tight as possible so that it gets weighted in a reasonable way against the other terms in FM (·). Since the entropy function is decomposable as a sum of functions that depend on i and N (e, i) only, one can minimize this bound with respect to e by using permutation optimization methods such as (Jaakkola et al., 2010). In our implementation, we use a method similar to Teyssier and Koller (2005) that shuffles neighboring entries of e to generate candidates, chooses the optimal N (e, i) for each i given the candidate e, and picks as the next permutation the candidate e with the greatest decrease in the bound. Notice that a permutation choice e and neighborhood choices N (e, i) define a Bayesian network where N (e, i) are the parents of Ye(i) . Therefore, if this Bayesian network model provides a good approximation to PM (Y), the bound will be reasonably tight. Given e, we will further relax this bound with the goal of obtaining an integer programming formulation for the problem of optimizing an upper bound to FM (z). For any given z, we define the local term HL (z, i) as    HL (z, i) ≡ HM (Ye(i) | Yz ∩N (e, i)) = S∈P (N (e,i))  j∈S zj   k∈N (e,i)\S (1 − zk ) HM (Ye(i) | S) (2) where P (·) denotes the power set of a set. The new approximate objective function becomes p p zi log(PM (Yi | X)) FM;D (z) ≡ PM (X,Yi ) ze(i) HL (z, i) + (3) i=1 i=1 Notice that HL (z, i) is still an upper bound on HM (Ye(i) | Ye(1:i−1) ). The intuition is that we are bounding HM (Yz ) by the entropy of a Bayesian network where a vertex Ye(i) is included if ze(i) = 1, with corresponding parents given by Yz ∩ N (e, i). This is a well-defined Bayesian network for any choice of z. The shortcoming is that ideally we would like this Bayesian network to be the actual marginal of the model given by e and N (e, i). 
It is not: if the network implied by e and N (e, i) was, for instance, Y1 → Y2 → Y3 , the choice of z = (1, 0, 1) would result on the entropy of the disconnected graph {Y1 , Y3 }, while the true marginal would correspond instead to the graph Y1 → Y3 . However, our simplified marginalization has the advantage of avoiding an intractable problem. Moreover, it allows us to redefine the problem as an integer linear program (ILP). Each product ze(i) j zj k (1−zk ) appearing in (3) results in a sum of O(2D ) terms, each of which has (up to a sign) the form qM ≡ m∈M zm for some set M . It is still the case that qM ∈ {0, 1}. Therefore, objective function (3) can be interpreted as being linear on a set of binary variables {{z}, {q}}. We need further to enforce the constraints coming from qM = 1 ⇒ {∀m ∈ M, zm = 1}; qM = 0 ⇒ {∃m ∈ M s.t. zm = 0} 4 It is well-known (Glover and Woolsey, 1974) that this corresponds to the linear constraints qM = 1 ⇒ {∀m ∈ M, zm = 1} ⇔ ∀m ∈ M, qM − zm ≤ 0 qM = 0 ⇒ {∃m ∈ M s.t. zm = 0} ⇔ m∈M zm − qM ≤ |M | − 1 p which combined with the linear constraint i=1 zi = K implies that optimizing FM;D (z) is an ILP with O(p · 2D ) variables and O(p2 · 2D ) constraints. In our experiments in Section 5, we were able to solve essentially all of such ILPs exactly using linear programming relaxations with branch-and-bound. 3.3 Entropy with Tree-Structured Bounds The previous bound simplifies marginalization, which might badly overestimate entropies where the corresponding Yz are uniformly spread out in permutation e. We now propose a different type of bound which treats different marginalizations on an equal footing. It comes from the following observation: since H(Ye(i) | Ye(1:i−1) ) is less than or equal to any conditional entropy H(Ye(i) | Yj ) for j ∈ e(1 : i − 1), we have that the tighest bound given by singleton conditioning sets is H(Ye(i) | Ye(1:i−1) ) ≤ min j∈e(1:i−1) HM (Ye(i) | Yj ), resulting in the objective function p p zi log(PM (Yi | X)) FM;tree (z) ≡ ze(i) · PM (X,Yi ) + i=1 i=1 min {Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) (4) where min{Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) ≡ H(Ye(i) ) if Ye(1:i−1) ∩ Yz = ∅. The intuition is that we are bounding the exact entropy using the entropy of a directed tree rooted at Yez (1) , the first element of Yz according to e. That is, all variables are marginally dependent in the approximation regardless of what z is, and for a fixed z the tree is, by construction, the one obtained by the usual greedy algorithm of adding edges corresponding to the next legal pair of vertices with maximum mutual information (following an ordering, in this case). It turns out we can also write (4) as a linear objective function of a polynomial number of 0\1 (i−1) (2) (1) be the values of set variables and constraints. Let zi ≡ 1 − zi . Let Hi , Hi , . . . , Hi ¯ (i−1) (1) be{HM (Ye(i) | Ye(1) ), . . . , HM (Ye(i) | Ye(i−1) )} sorted in ascending order, with zi , . . . , zi ing the corresponding permutation of {ze(1) , . . . , ze(i−1) }. We have min{Yj ∈Ye(1:i−1) ∩Yz } H(Ye(i) | Yj ) (j) (j) j−1 (1) (1) (1) (2) (2) (1) (2) (3) (3) ¯ ¯ ¯ = z i Hi + z i z i H i + z i z i z i H i + . . . (i−2) (i−1) (i−1) (1) i−1 (j) + j=1 zi HM (Ye(i) ) ¯ Hi zi ¯ zi . . . zi ¯ (i) i−1 (j) (j) + qi HM (Ye(i) ) ≡ j=1 qi Hi (k) ¯ where qi ≡ zi k=1 zi , and also a binary 0\1 variable. Plugging this expression into (4) gives a linear objective function in this extended variable space. The corresponding constraints are (j−1) (j) (1) (j) , zi }, zm = 1} ¯ z qi = 1 ⇒ {∀zm ∈ {¯i , . . . 
, zi (j−1) (j) (1) (j) , zi } s.t. zm = 0} ¯ z qi = 0 ⇒ {∃zm ∈ {¯i , . . . , zi which, as shown in the previous section, can be written as linear constraints (substituting each zi ¯ by 1 − zi ). The total number of constraints is however O(p3 ), which can be expensive, and often a linear relaxation procedure with branch-and-bound fails to provide guarantees of optimality. 3.4 The Reliability Score Finally, it is important to design cheap, effective criteria whose maxima correlate with the maxima of FM (·). Empirically, we have found high quality selections in binary probit models using the solution to the problem p p wi zi , subject to zi ∈ {0, 1}, maximize FM;R (z) = zi = K i=1 i=1 5 where wi = ΛT ΣΛi . This can be solved by picking the corresponding indicators with the highest i K weights wi . Assuming a probit model where the measurement error for each Yi⋆ has the same variance of 1, this score is related to the “reliability” of an indicator. Simply put, the reliability Ri of an indicator is the proportion of its variance that is due to the latent variables (Bollen, 1989, Chapter 6): Ri = wi /(wi + 1) for each Yi⋆ . There is no current theory linking this solution to the problem of maximizing FM (·): since there is no entropy term, we can set an adversarial problem to easily defeat this method. For instance, this happens in a model where the K indicators of highest reliability all measure the same latent variable Xi and nothing else – much information about Xi would be preserved, but little about other variables. In any case, we found this criterion to be fairly competitive even if at times it produces extreme failures. An honest account of more sophisticated selection mechanisms cannot be performed without including it, as we do in Section 5. 4 Related Work The literature on survey analysis, in the context of latent variable models, contains several examples of guidelines on how to simplify questionnaires (sometimes described as providing “shortened versions” of scales). Much of the literature, however, consists of describing general guidelines and rules-of-thumb to accomplish this task (e.g, Richins, 2004; Stanton et al., 2002). One possible exception is Leite et al. (2008), which uses different model fitness criteria with respect to a given dataset to score candidate solutions, along with an expensive combinatorial optimization method. This conflates model selection and questionnaire thinning, and there is no theory linking the score functions to the amount of information preserved. In the machine learning and statistics literature, there is a large body of research in active learning, which is related to our task. One of the closest approaches is the one by Liang et al. (2009), which casts the classical problem of measurement selection within a Bayesian graphical model perspective. In that work, one has to choose which measurements to add. This is done sequentially, partially motivated by problems where collecting new measurements can be done relatively fast and cheap (say, by paying graduate students to annotate text data), and so the choice of next measurement can make use of fresh data. In our case, it not might be realistic to expect we can perform a large number of iterations of data collection – and as such the task of reducing the number of measurements from a large initial collection might be more relevant in practice. Liang et al. also focus on (multivariate) supervised learning instead of purely unsupervised learning. 
In statistics there is also a considerable body of literature on sufficient dimension reduction and its sparse variants (e.g., Chen et al., 2010). Such techniques create a bottleneck between two sets of variables in a regression problem (say, the mapping from Y to X) while eliminating some of the input variables. In principle one might want to adapt such models to take a latent variable model M as the target mapping. Besides some loss of interpretability, the computational implications might be problematic, though. Moreover, this framework has another free parameter corresponding to the dimensionality of the bottleneck that has to be set. It is not clear how this parameter, along with a choice of sparsity level, would interact with a fixed choice K of indicators to be kept. 5 Experiments In this section, we first describe some synthetic experiments to provide insights about the different methods, followed by one brief description of a case study. In all of the experiments, the target models M are binary probit. We set the neighborhood parameter for FM;N (·) to 9. The ordering e for the tree-structured method is obtained by the same greedy search of Section 3.2, where now the score is the average of all H(Yi | Yj ) for all Yj preceding Yi . Finally, all ordering optimization methods were initialized by sorting indicators in a descending order according to their reliability scores, and the initial solution for all entropy-based optimization methods was given by the reliability score solution of Section 3.4. The integer program solver G UROBI 4.02 was used in all experiments. 5.1 Synthetic studies We start with a batch of synthetic experiments. We generated 80 models with 40 indicators and 10 latent variables1 . We further preprocess such models into two groups: in 40 of them, we select a 1 Details on the model generation: we generate 40 models by sampling the latent covariance matrix from an inverse Wishart distribution with 10 degrees of freedom and scale matrix 10I, I being the identity matrix. 6 Improvement ratio: high signal Mean error: high signal Improvement ratio: low signal Mean error: low signal 0.6 0.5 0.5 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.25 Reliability score Reliability score 0.4 0.2 0.15 0.25 0.2 0.15 −0.1 −0.1 0.1 N/R T/R G/R N/S T/S G/S N/R T/R G/R N/S T/S 0.1 G/S 0.1 0.15 0.2 0.25 0.3 0.1 0.15 0.2 Tree bound (a) (c) (b) 0.25 0.3 Tree bound (d) Figure 2: (a) A comparison of the bounded neighborhood (N ), tree-based (T ) and Gaussian (G) methods with respect to a random solution (R) and the reliability score (S). (b) A similar comparison for models where indicators are more weakly correlated to the latent variables than in (a). (c) and (d) Scatterplots of the average absolute deviance for the tree-based method (horizontal axis) against the reliability method (vertical axis). The bottom-left clouds correspond to the K = 32 trials. target reliability ri for each indicator Yi , uniformly at random from the interval [0.4 0.7]. We then rescale coefficients Λi such that the reliability (defined in Section 3.4) of the respective Yi⋆ becomes ri . For the remaining 40 models, we sample ri uniformly at random from the interval [0.2 0.4]. We perform two choices of subsets: sets Yz of size 20 and 32 (50% and 80% of the total number of indicators). Our evaluation is as follows: since the expected value is perhaps the most common functional of the posterior distribution PM (X | Y), we calculate the expected value of the latent variables for a sample {y(1) , y(2) , . . . 
, y(1000) } of size 1000 taken from the respective synthetic models. This is done for the full set of 40 indicators, and for each set chosen by our four criteria: for each data point i and each objective function F , we evaluate the average distance (i) (i) (i) (i) ˆ ˆ dF ≡ 10 |ˆj − xj;F |/10. In this case, xj is the expected value of Xj obtained by conditioning j=1 x (i) on all indicators, while xj;F is the one obtained with the subset selected by optimizing F . We denote ˆ (1) (2) (1000) by mF the average of {dF , dF , . . . , dF }. Finally, we compare the three main methods with respect to the reliability score method using the improvement ratio statistic sF = 1 − mF /mFM;R , the proportion of average error decrease with respect to the reliability score. In order to provide a sense of scale on the difficulty of each problem, we compute the same ratios with respect to a random selection, obtained by choosing K = 20 and K = 32 indicators uniformly at random. Figure 2 provides a summary of the results. In Figure 2(a), each boxplot shows the distribution over the 40 probit models where reliabilities were sampled between [0.4 0.7] (the “high signal” models). The first three boxplots show the scores sF of the bounded neighborhood, tree-structured and Gaussian methods, respectively, compared against random selections. The last three boxplots are comparisons against the reliability heuristic. The tree-based method easily beats the Gaussian method, with about 75% of its outcomes being better than the median Gaussian outcome. The Gaussian approach is also less reliable, with results showing a long lower tail. Although the reliability score is on average a good approach, in only a handful of cases it was better than the tree-based method, and by considerably smaller magnitudes compared to the upper tails in the tree-based outcome distribution. A separate panel (Figure 2(b)) is shown for the 40 models with lower reliabilities. In this case, all methods show stronger improvements over the reliability score, although now there is a less clear difference between the tree method and the Gaussian one. Finally, in panels (c) and (d) we present scatterplots for the average deviances mF of the tree-based method against the reliability score. The two clouds correspond to the solutions with 20 and 32 indicators. Notice that in the vast majority of the cases the tree-based method does better. We then rescale the matrix to make all variances equal to 1. We also generate 40 models using as the inverse Wishart scale matrix the correlation matrix will all off-diagonal entries set to 0.5. Coefficients linking indicators to latent variables were set to zero with probability 0.8, and sampled from a standard Gaussian otherwise. If some latent variable ends up with no child, or an indicator ends up with no parent, we uniformly choose one child/parent to be linked to it. Code to fully replicate the synthetic experiments is available at HTTP :// WWW. HOMEPAGES . UCL . AC . UK /∼ UCGTRBD /. 7 5.2 Case study The National Health Service (NHS) is the public health system of the United Kingdom. In 2009, a major survey called the National Health Service National Staff Survey was deployed with the goal of “collect(ing) staff views about working in their local NHS trust” (Care Quality Comission and Aston University, 2010). A questionnaire of 206 items was filled by 156, 951 respondents. We designed a measurement model based on the structure of the questionnaire and fit it by the posterior expected value estimator. 
Gaussian and inverse Wishart priors were used, along with Gibbs sampling and a random subset of 50, 000 respondents. See the Supplementary Material for more details. Several items in this survey asked for the NHS staff member to provide degrees of agreement in a Likert scale (Bartholomew et al., 2008) to questions such as • • • • . . . have you ever come to work despite not feeling well enough to perform . . . ? Have you felt pressure from your manager to come to work? Have you felt pressure from colleagues to come to work? Have you put yourself under pressure to come to work? as different probes into an unobservable self-assessed level of work pressure. We preprocessed and binarized the data to first narrow it down to 63 questions. We compare selections of 32 (50%) and 50 (80%) items using the same statistics of the previous section. sF ;D sF ;tree sF ;N sF ;random mF ;tree mF ;R K = 32 K = 50 7.8% 10.5% 6.3% 11.9% 0.01% 7.6% −16.0% −0.05% 0.238 0.123 0.255 0.140 Although gains were relatively small (as measured by the difference between reconstruction errors mF ;tree − mF ;R and the good performance of a random selection), we showed that: i.) we do improve results over a popular measure of indicator quality; ii.) we do provide some guarantees about the diversity of the selected items via a information-theoretical measure with an entropy term, with theoretically sound approximations to such a measure. For more details on the preprocessing, and more insights into the different selections, please refer to the Supplementary Material. 6 Conclusion There are problems where one posits that the relevant information is encoded in the posterior distribution of a set of latent variables. Questionnaires (and other instruments) can be used as evidence to generate this posterior, but there is a cost associated with complex questionnaires. One problem is how to simplify such instruments of measurement. To the best of our knowledge, we provide the first formal account on how to solve it. Nevertheless, we would like to stress there is no substitute for common sense. While the tools we provide here can be used for a variety of analyses – from deploying simpler questionnaires to sensitivity analysis – the value and cost of keeping particular indicators can go much beyond the information contained in the latent posterior distribution. How to combine this criterion with other domain-dependent criteria is a matter of future research. Another problem of importance is how to deal with model specification and transportability across studies. A measurement model built for a very specific population of respondents might transfer poorly to another group, and therefore taking into account model uncertainty will be important. The Bayesian setup discussed by Liang et al. (2009) might provide some directions on this issue. Also, there is further structure in real-world questionnaires we are not exploiting in the current work. Namely, it is not uncommon to have questionnaires with branching questions and other dynamic behaviour more commonly associated with Web based surveys and/or longitudinal studies. Finally, hybrid approaches combining the bounded neighborhood and the tree-structured methods, along with more sophisticated ordering optimization procedures and the use of other divergence measures and determinant-based criteria (e.g. Kulesza and Taskar, 2011), will also be studied in the future. 
Acknowledgments

The author would like to thank James Cussens and Simon Lacoste-Julien for helpful discussions, as well as the anonymous reviewers for further comments.

References

D. Bartholomew, F. Steele, I. Moustaki, and J. Galbraith. Analysis of Multivariate Social Science Data, 2nd edition. Chapman & Hall, 2008.
C. Bishop. Latent variable models. In M. Jordan (editor), Learning in Graphical Models, pages 371–403, 1998.
C. Bishop. Pattern Recognition and Machine Learning. Springer, 2009.
K. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989.
R. Carroll, D. Ruppert, and L. Stefanski. Measurement Error in Nonlinear Models. Chapman & Hall, 1995.
X. Chen, C. Zou, and R. Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. Annals of Statistics, 38:3696–3723, 2010.
Care Quality Commission and Aston University. Aston Business School, National Health Service National Staff Survey, 2009 [computer file]. Colchester, Essex: UK Data Archive [distributor], October 2010. Available at https://www.esds.ac.uk, SN: 6570, 2010.
A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–149, 1992.
A. Globerson and T. Jaakkola. Approximate inference using conditional entropy decompositions. Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007), pages 141–149, 2007.
F. Glover and E. Woolsey. Converting the 0-1 polynomial programming problem to a 0-1 linear program. Operations Research, 22:180–182, 1974.
P. Hahn, J. Scott, and C. Carvalho. A sparse factor-analytic probit model for congressional voting patterns. Duke University Department of Statistical Science, Discussion Paper 2009-22, 2010.
T. Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning Bayesian network structure using LP relaxations. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pages 366–373, 2010.
A. Kulesza and B. Taskar. k-DPPs: fixed-size determinantal point processes. Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1193–1200, 2011.
W. Leite, I-C. Huang, and G. Marcoulides. Item selection for the development of short forms of scales using an ant colony optimization algorithm. Multivariate Behavioral Research, 43:411–431, 2008.
P. Liang, M. Jordan, and D. Klein. Learning from measurements in exponential families. Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), 2009.
T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, Massachusetts Institute of Technology, 2001.
J. Palomo, D. Dunson, and K. Bollen. Bayesian structural equation modeling. In Sik-Yum Lee (ed.), Handbook of Latent Variable and Related Models, pages 163–188, 2007.
M. Richins. The material values scale: Measurement properties and development of a short form. The Journal of Consumer Research, 31:209–219, 2004.
J. Stanton, E. Sinar, W. Balzer, and P. Smith. Issues and strategies for reducing the length of self-reported scales. Personnel Psychology, 55:167–194, 2002.
M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. Proceedings of the Twenty-first Conference on Uncertainty in AI (UAI '05), pages 584–590, 2005.

6 0.10407554 226 nips-2011-Projection onto A Nonnegative Max-Heap

7 0.10224251 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations

8 0.087075554 186 nips-2011-Noise Thresholds for Spectral Clustering

9 0.073008232 54 nips-2011-Co-regularized Multi-view Spectral Clustering

10 0.06496612 157 nips-2011-Learning to Search Efficiently in High Dimensions

11 0.064736798 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression

12 0.064622268 177 nips-2011-Multi-armed bandits on implicit metric spaces

13 0.064187072 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

14 0.063540161 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

15 0.062147014 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

16 0.057471201 240 nips-2011-Robust Multi-Class Gaussian Process Classification

17 0.05658282 249 nips-2011-Sequence learning with hidden units in spiking neural networks

18 0.055817701 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices

19 0.055224579 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions

20 0.054827459 258 nips-2011-Sparse Bayesian Multi-Task Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.165), (1, 0.021), (2, -0.035), (3, -0.047), (4, -0.041), (5, -0.097), (6, -0.02), (7, -0.042), (8, 0.065), (9, -0.194), (10, -0.008), (11, -0.009), (12, -0.12), (13, 0.032), (14, -0.009), (15, 0.035), (16, -0.066), (17, -0.058), (18, -0.011), (19, -0.046), (20, 0.048), (21, -0.115), (22, -0.097), (23, 0.091), (24, -0.081), (25, 0.032), (26, 0.06), (27, 0.002), (28, 0.021), (29, 0.027), (30, -0.097), (31, -0.055), (32, 0.108), (33, -0.156), (34, -0.021), (35, -0.091), (36, 0.026), (37, 0.014), (38, -0.043), (39, 0.009), (40, -0.095), (41, -0.152), (42, 0.097), (43, -0.022), (44, 0.072), (45, -0.039), (46, 0.008), (47, -0.1), (48, 0.023), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94713646 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

Author: Animashree Anandkumar, Kamalika Chaudhuri, Daniel J. Hsu, Sham M. Kakade, Le Song, Tong Zhang

Abstract: This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics. 1

2 0.71242744 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations

Author: Flavio Chierichetti, David Liben-Nowell, Jon M. Kleinberg

Abstract: Motivated by the spread of on-line information in general and on-line petitions in particular, recent research has raised the following combinatorial estimation problem. There is a tree T that we cannot observe directly (representing the structure along which the information has spread), and certain nodes randomly decide to make their copy of the information public. In the case of a petition, the list of names on each public copy of the petition also reveals a path leading back to the root of the tree. What can we conclude about the properties of the tree we observe from these revealed paths, and can we use the structure of the observed tree to estimate the size of the full unobserved tree T ? Here we provide the first algorithm for this size estimation task, together with provable guarantees on its performance. We also establish structural properties of the observed tree, providing the first rigorous explanation for some of the unusual structural phenomena present in the spread of real chain-letter petitions on the Internet. 1

3 0.66229236 226 nips-2011-Projection onto A Nonnegative Max-Heap

Author: Jun Liu, Liang Sun, Jieping Ye

Abstract: We consider the problem of computing the Euclidean projection of a vector of length p onto a non-negative max-heap—an ordered tree where the values of the nodes are all nonnegative and the value of any parent node is no less than the value(s) of its child node(s). This Euclidean projection plays a building block role in the optimization problem with a non-negative max-heap constraint. Such a constraint is desirable when the features follow an ordered tree structure, that is, a given feature is selected for the given regression/classification task only if its parent node is selected. In this paper, we show that such Euclidean projection problem admits an analytical solution and we develop a top-down algorithm where the key operation is to find the so-called maximal root-tree of the subtree rooted at each node. A naive approach for finding the maximal root-tree is to enumerate all the possible root-trees, which, however, does not scale well. We reveal several important properties of the maximal root-tree, based on which we design a bottom-up algorithm with merge for efficiently finding the maximal root-tree. The proposed algorithm has a (worst-case) linear time complexity for a sequential list, and O(p^2) for a general tree. We report simulation results showing the effectiveness of the max-heap for regression with an ordered tree structure. Empirical results show that the proposed algorithm has an expected linear time complexity for many special cases including a sequential list, a full binary tree, and a tree with depth 1. 1

4 0.61424613 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

Author: Le Song, Eric P. Xing, Ankur P. Parikh

Abstract: Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach. 1

5 0.60937458 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression

Author: Ali Tofigh, Erik Sjölund, Mattias Höglund, Jens Lagergren

Abstract: Cancer has complex patterns of progression that include converging as well as diverging progressional pathways. Vogelstein’s path model of colon cancer was a pioneering contribution to cancer research. Since then, several attempts have been made at obtaining mathematical models of cancer progression, devising learning algorithms, and applying these to cross-sectional data. Beerenwinkel et al. provided, what they coined, EM-like algorithms for Oncogenetic Trees (OTs) and mixtures of such. Given the small size of current and future data sets, it is important to minimize the number of parameters of a model. For this reason, we too focus on tree-based models and introduce Hidden-variable Oncogenetic Trees (HOTs). In contrast to OTs, HOTs allow for errors in the data and thereby provide more realistic modeling. We also design global structural EM algorithms for learning HOTs and mixtures of HOTs (HOT-mixtures). The algorithms are global in the sense that, during the M-step, they find a structure that yields a global maximum of the expected complete log-likelihood rather than merely one that improves it. The algorithm for single HOTs performs very well on reasonable-sized data sets, while that for HOT-mixtures requires data sets of sizes obtainable only with tomorrow’s more cost-efficient technologies. 1

6 0.57257813 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm

7 0.56646025 288 nips-2011-Thinning Measurement Models and Questionnaire Design

8 0.51976013 276 nips-2011-Structured sparse coding via lateral inhibition

9 0.50573295 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

10 0.49707237 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

11 0.4937776 177 nips-2011-Multi-armed bandits on implicit metric spaces

12 0.48377696 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness

13 0.4455463 157 nips-2011-Learning to Search Efficiently in High Dimensions

14 0.43750638 274 nips-2011-Structure Learning for Optimization

15 0.42500004 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices

16 0.4152281 27 nips-2011-Advice Refinement in Knowledge-Based SVMs

17 0.41003036 111 nips-2011-Hashing Algorithms for Large-Scale Learning

18 0.35850215 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

19 0.35377902 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound

20 0.34667194 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.03), (4, 0.048), (20, 0.028), (26, 0.026), (31, 0.084), (33, 0.025), (43, 0.09), (45, 0.073), (48, 0.015), (57, 0.064), (74, 0.05), (83, 0.042), (96, 0.273), (99, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74083626 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

Author: Animashree Anandkumar, Kamalika Chaudhuri, Daniel J. Hsu, Sham M. Kakade, Le Song, Tong Zhang

Abstract: This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics. 1

2 0.55054176 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

Author: Patrick O. Perry, Michael W. Mahoney

Abstract: Recently, Mahoney and Orecchia demonstrated that popular diffusion-based procedures to compute a quick approximation to the first nontrivial eigenvector of a data graph Laplacian exactly solve certain regularized Semi-Definite Programs (SDPs). In this paper, we extend that result by providing a statistical interpretation of their approximation procedure. Our interpretation will be analogous to the manner in which ℓ2-regularized or ℓ1-regularized ℓ2-regression (often called Ridge regression and Lasso regression, respectively) can be interpreted in terms of a Gaussian prior or a Laplace prior, respectively, on the coefficient vector of the regression problem. Our framework will imply that the solutions to the Mahoney-Orecchia regularized SDP can be interpreted as regularized estimates of the pseudoinverse of the graph Laplacian. Conversely, it will imply that the solution to this regularized estimation problem can be computed very quickly by running, e.g., the fast diffusion-based PageRank procedure for computing an approximation to the first nontrivial eigenvector of the graph Laplacian. Empirical results are also provided to illustrate the manner in which approximate eigenvector computation implicitly performs statistical regularization, relative to running the corresponding exact algorithm. 1

3 0.54604053 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

Author: Le Song, Eric P. Xing, Ankur P. Parikh

Abstract: Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach. 1

4 0.54497665 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

Author: Kamiar R. Rad, Liam Paninski

Abstract: Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We find that for large neural populations carrying a finite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufficient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we find that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched firing rates. We argue that our findings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design.

5 0.5437175 258 nips-2011-Sparse Bayesian Multi-Task Learning

Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau

Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1

6 0.54306936 102 nips-2011-Generalised Coupled Tensor Factorisation

7 0.54303658 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

8 0.537157 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

9 0.53672653 66 nips-2011-Crowdclustering

10 0.53516006 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

11 0.5345847 219 nips-2011-Predicting response time and error rates in visual search

12 0.53433561 75 nips-2011-Dynamical segmentation of single trials from population neural data

13 0.53371096 24 nips-2011-Active learning of neural response functions with Gaussian processes

14 0.53280848 186 nips-2011-Noise Thresholds for Spectral Clustering

15 0.53273553 301 nips-2011-Variational Gaussian Process Dynamical Systems

16 0.53175724 55 nips-2011-Collective Graphical Models

17 0.53160077 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

18 0.53081161 276 nips-2011-Structured sparse coding via lateral inhibition

19 0.53080755 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

20 0.53076369 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems