jmlr jmlr2005 jmlr2005-43 knowledge-graph by maker-knowledge-mining

43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach


Source: pdf

Author: Gal Elidan, Nir Friedman

Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods
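
As a concrete illustration of the continuation idea described in the abstract, the sketch below runs a smoothed E-step for a naive Bayes model with a single hidden variable and slowly moves the trade-off parameter gamma from 0 to 1, at which point the update reduces to a standard EM iteration. This is only an illustrative sketch of the general recipe: the gamma schedule, the number of inner iterations, the smoothing constants, and the random initialization are assumptions, not the authors' settings.

import numpy as np

def ib_em_naive_bayes(X, n_states=3, gammas=None, inner_iters=30, seed=0):
    # X: (M, n) binary data matrix; T: one hidden variable with n_states values.
    rng = np.random.default_rng(seed)
    M, n = X.shape
    if gammas is None:
        gammas = np.linspace(0.0, 1.0, 11)             # assumed continuation schedule
    Q = rng.dirichlet(np.ones(n_states), size=M)       # soft assignments Q(t | y)
    p_t = np.full(n_states, 1.0 / n_states)            # P(T = t)
    p_x = rng.uniform(0.25, 0.75, size=(n_states, n))  # P(X_j = 1 | t)
    for gamma in gammas:
        for _ in range(inner_iters):
            # log P(x[y], t) for every instance y and every state t
            logP = (np.log(p_t)[None, :]
                    + X @ np.log(p_x).T
                    + (1 - X) @ np.log(1 - p_x).T)
            # smoothed E-step: Q(t|y) proportional to Q(t)^(1-gamma) * P(x[y], t)^gamma
            q_t = Q.mean(axis=0)
            logits = (1 - gamma) * np.log(q_t + 1e-12)[None, :] + gamma * logP
            logits -= logits.max(axis=1, keepdims=True)
            Q = np.exp(logits)
            Q /= Q.sum(axis=1, keepdims=True)
            # M-step: maximum likelihood parameters from the expected counts
            w = Q.sum(axis=0)
            p_t = (w + 1e-6) / (w + 1e-6).sum()
            p_x = (Q.T @ X + 1e-6) / (w[:, None] + 2e-6)
    return Q, p_t, p_x

# Toy usage on random binary data
X = np.random.default_rng(1).integers(0, 2, size=(200, 8))
Q, p_t, p_x = ib_em_naive_bayes(X)

At gamma = 0 the assignments are pulled toward the marginal Q(t), giving a smooth and easy target; as gamma approaches 1 the E-step becomes the usual posterior and the inner loop behaves like standard EM.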

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. [sent-11, score-1.057]

2 As such, hidden variables can simplify the network structure and consequently lead to better generalization. [sent-28, score-0.526]

3 In particular, in hard real-life learning problems, structural EM will perform poorly if no prior knowledge of the interaction between the hidden and observed variables exists, or if the cardinality of the hidden variables is not (at least approximately) known. [sent-47, score-1.042]

4 We show how to pose the learning problem within the multivariate information bottleneck framework and derive a target Lagrangian for the hidden variables. [sent-61, score-0.628]

5 We generalize our information bottleneck expectation maximization (IB-EM) framework for multiple hidden variables and any Bayesian network structure. [sent-70, score-0.658]

6 The first operator can adapt the cardinality of a hidden variable. [sent-78, score-0.61]

7 Specifically, the cardinality of a hidden variable can increase during the continuation process, increasing the likelihood as long as it is beneficial to do so. [sent-79, score-0.882]

8 The second operator introduces new hidden variables into the network structure. [sent-80, score-0.489]

9 Intuitively, a hidden variable is introduced as a parent of a subset of nodes whose interactions are poorly explained by the current model. [sent-81, score-0.516]

10 We then show how our operator for enriching the network structure with new hidden variables leads to significantly superior models for several complex real-life problems. [sent-85, score-0.579]

11 In Section 7, we show how our method can be combined with the structural EM algorithm to learn the structure of a network with hidden variables. [sent-94, score-0.522]

12 In Section 8, we take advantage of emergent structure during the continuation process, and present a method for learning the cardinality of the hidden variables. [sent-95, score-0.906]

13 This involves empirical sufficient statistics in the form of joint counts N(x_i, pa_i) = ∑_m 1{X_i[m] = x_i, Pa_i[m] = pa_i},  (1) where 1{·} is the indicator function. [sent-129, score-0.654]
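
A minimal numpy sketch of the joint counts of Eq. (1); the toy data layout (one column per variable) and the example values are assumptions made purely for illustration.

import numpy as np

# Toy data: M = 6 instances; column 0 holds X_i, column 1 holds its single parent Pa_i.
data = np.array([[0, 1],
                 [1, 1],
                 [1, 0],
                 [1, 1],
                 [0, 0],
                 [1, 1]])

def joint_count(data, xi_col, pai_cols, xi_val, pai_vals):
    # N(x_i, pa_i) = sum_m 1{X_i[m] = x_i, Pa_i[m] = pa_i}
    mask = data[:, xi_col] == xi_val
    for c, v in zip(pai_cols, pai_vals):
        mask &= data[:, c] == v
    return int(mask.sum())

print(joint_count(data, xi_col=0, pai_cols=[1], xi_val=1, pai_vals=[1]))  # prints 3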

14 In the learned network, the hidden variables will serve to summarize some part of the data while retaining the relevant information on (some) of the observed variables X. [sent-240, score-0.48]

15 We start by considering this task for the case of a single hidden variable T and then, in Section 5, extend the framework to several hidden variables. [sent-246, score-0.798]

16 The Information Bottleneck EM Lagrangian: If we were only interested in the training data and the cardinality of the hidden variable allowed it, each state of the hidden variable would have been assigned to a different instance. [sent-248, score-1.024]

17 We now formalize this idea for the task of learning a generative model over the variables X and the hidden variable T . [sent-256, score-0.482]

18 Naively, we could allow a large cardinality for the hidden variable, set γ to a high value and find the solution of the bottleneck problem. [sent-336, score-0.755]

19 Second, we often do not know the cardinality that should be assigned to the hidden variable. [sent-339, score-0.556]

20 Multiple Hidden Variables: The framework we described in the previous sections can easily accommodate learning networks with multiple hidden variables, simply by treating T as a vector of hidden variables. [sent-448, score-0.79]

21 Figure 4: Definition of the networks Gin = Q and Gout = P for the multivariate information bottleneck framework with multiple hidden variables. [sent-458, score-0.597]

22 It is easy to see that when a single hidden variable is considered, and EP (ti , y) ≡ log P(x[y],t), the two forms coincide. [sent-484, score-0.514]

23 Figure 5: (a) A quadrant-based hierarchy structure with 21 hidden variables for modeling 16 × 16 images in the Digit domain. [sent-504, score-0.648]

24 In each architecture, we consider networks with hidden variables of different cardinality, where for now we use the same cardinality for all hidden variables in the same network. [sent-517, score-1.008]

25 We trained a Naive Bayes hidden variable model where the hidden variable is a parent of all the observations. [sent-522, score-0.938]

26 In these models we introduce a hidden parent to each quadrant of the image recursively. [sent-532, score-0.471]

27 The 3-level hierarchy has a hidden parent for each 8 × 8 quadrant, and then another hidden variable that is the parent of these four hidden variables. [sent-533, score-1.399]

28 Every 4 of these are joined into an 8 × 8 quadrant by another level of hidden variables, totaling 21 hidden variables, as illustrated in Figure 5(a). [sent-535, score-0.775]

29 The model we use for this data set has a hierarchical structure with 19 hidden variables in a 4-level hierarchy that was determined by the biological expert based on the nature of the different experiments, as illustrated schematically in Figure 6. [sent-543, score-0.584]

30 In this structure, 5–24 similar conditions (filled nodes) such as different hypo-osmotic shocks are children of a common hidden parent (unfilled nodes). [sent-544, score-0.538]

31 These hidden parents are in their turn children of further abstractions of conditions. [sent-545, score-0.513]

32 For example, the heat shock and the heat shock with oxidative stress hidden nodes are both children of a common, more abstract heat node. [sent-546, score-0.571]

33 A root hidden variable is the common parent of these high-level abstractions. [sent-547, score-0.469]

34 As a first sanity check, for each model (and each cardinality of hidden variables) we performed 50 runs of EM with random starting points. [sent-549, score-0.641]

35 For example, Figure 5 compares the test set performance (log-likelihood per instance) of these runs on the Digit data set with a 4-level hierarchy of 21 hidden variables with 2 states each. [sent-554, score-0.581]

36 These hidden nodes are themselves children of further abstraction nodes of similar experiments, which in their turn are children of the single root node. [sent-569, score-0.556]

37 In this section we consider the case where the set of hidden variables is fixed and their cardinalities are known, and we want to learn the network structure. [sent-719, score-0.577]

38 Black circles illustrate the progress of the continuation procedure by denoting the value of γ at the end of each continuation step. [sent-732, score-0.468]

39 Instead of the counts of Eq. (1), we use the expected counts E_{Q(T|Y)}[N(x_i, pa_i)] = ∑_m ∑_t Q(X_i[m] = x_i, Pa_i[m] = pa_i, t | Y = m). [sent-746, score-0.654]
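
These expected counts can be computed by weighting the indicator of Eq. (1) with the posterior Q(t | y). The short sketch below assumes the hidden variable T is part of the family of X_i, that the remaining parents are observed, and that the posterior is stored as an (M x |T|) array Q of soft assignments; these layout choices are assumptions for illustration only.

import numpy as np

def expected_count(data, Q, xi_col, obs_pa_cols, xi_val, obs_pa_vals, t):
    # E_{Q(T|Y)}[N(x_i, pa_i, t)] = sum_m Q(t | Y = m) * 1{X_i[m] = x_i, observed parents match}
    mask = data[:, xi_col] == xi_val
    for c, v in zip(obs_pa_cols, obs_pa_vals):
        mask &= data[:, c] == v
    return float(Q[mask, t].sum())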

40 Although the method described above applies for any Bayesian network structure, for concreteness we focus on learning hierarchies of hidden variables in the following sections. [sent-775, score-0.459]

41 In this sub-class of networks, each variable has at most one parent, and that parent has to be a hidden variable. [sent-776, score-0.494]

42 This implies that the hierarchy of hidden variables captures the dependencies between the observed attributes. [sent-777, score-0.52]

43 Since we are dealing with hierarchies, we consider search steps that replace the parent of a variable by one of the hidden variables. [sent-778, score-0.53]
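
A schematic version of one such replace-parent step is sketched below. The structure interface (observed_nodes, with_parent) and the score function are hypothetical placeholders for whatever structure representation and score (for example a BDe-style score) are used; this is not the authors' implementation.

def replace_parent_pass(model, data, hidden_vars, score):
    # One greedy pass: for every observed node, try each hidden variable as its
    # single parent and keep the best-scoring choice.
    for x in list(model.observed_nodes()):          # hypothetical API
        best, best_score = model, score(model, data)
        for h in hidden_vars:
            candidate = model.with_parent(x, h)     # hypothetical API
            s = score(candidate, data)
            if s > best_score:
                best, best_score = candidate, s
        model = best
    return model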

44 Learning Cardinality: In real life, it is often the case that we do not know the cardinality of a hidden variable. [sent-782, score-0.556]

45 When examining the models during the continuation process, we observe that for lower values of γ the effective cardinality of the hidden variable is smaller than its cardinality in the model (we elaborate on how this is measured below). [sent-793, score-1.038]

46 Thus, limiting the cardinality of the hidden variable is in effect similar to stopping the continuation process at some γ < 1. [sent-795, score-0.836]

47 The most straightforward approach to learning the cardinality of a hidden variable is simply to try a few values, and for each value apply IB-EM independently. [sent-797, score-0.602]

48 If we want to try K different cardinalities for each hidden variable, we have to carry out |H|K independent IB-EM runs, where |H| is the number of hidden variables. [sent-806, score-0.87]

49 The intuition that the “effective” cardinality of the hidden variable will increase as we consider larger values of γ suggests that we increase the model complexity during the continuation process. [sent-807, score-0.858]

50 The task we face is to determine when all the states of a hidden variable are being utilized and therefore a new state is needed. [sent-812, score-0.463]

51 We start with a binary cardinality for the hidden variables at γ = 0. [sent-846, score-0.594]

52 At each stage, before γ is increased, we determine for each hidden variable whether all its states are utilized: for each pair of states, we evaluate the BDe score difference between the model with the two states and the model with the states merged. [sent-847, score-0.608]
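
A hedged sketch of that check: for every pair of states of a hidden variable, compare the score of the current model with the score of the model in which the two states are merged. The model and scoring interface below (states, merge_states, bde_score) is a hypothetical placeholder rather than the authors' code.

import itertools

def all_states_utilized(model, data, hidden_var, bde_score):
    # Return True if no pair of states of hidden_var can be merged without
    # lowering the BDe score, i.e. all current states are being used.
    base = bde_score(model, data)
    for s1, s2 in itertools.combinations(model.states(hidden_var), 2):
        merged = model.merge_states(hidden_var, s1, s2)   # hypothetical API
        if bde_score(merged, data) >= base:
            return False      # merging does not hurt the score: a state is redundant
    return True               # every merge hurts the score: consider adding a state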

53 In this hierarchy there are 6 hidden variables that aggregate similar experiments, a Heat node that aggregates 5 of these hidden variables and a root node that is the parent of both Heat and the additional Nitrogen Depletion node. [sent-872, score-0.981]

54 Figure 11 shows the structure along with the cardinalities of the hidden variables learned by our method, and compares the performance of our method to models learned with different fixed cardinalities. [sent-873, score-0.617]

55 However, as in the case of cardinality adaptation discussed in Section 8, we can use emergent cues of the continuation process to suggest an effective method. [sent-883, score-0.463]

56 To do so, we consider a search operator that enriches the structure of the hierarchy with a new hidden variable. [sent-893, score-0.59]

57 Suppose that we want to consider the addition of a new hidden variable into the network structure. [sent-895, score-0.519]

58 For simplicity, consider the scenario shown in Figure 12, where we start with a Naive Bayes network with a hidden variable T1 as root and want to add a hidden variable T2 as a parent of a subset C of T1's children. [sent-896, score-1.021]

59 Figure 12: Example of enrichment with a new hidden variable T2 as parent of a subset C of the observed variables X1, . . . , Xn. [sent-911, score-0.6]

60 Proposition 8 Let P be a Bayesian network model with a hidden variable T1 and denote by C an observed subset of T1's children. [sent-915, score-0.489]

61 The result is intuitive — the contribution of insertion of a new hidden variable cannot exceed the entropy of its children given their current hidden parent. [sent-925, score-0.916]

62 However, the scenarios we are interested in are those in which the information between the hidden variable and its children is high and the entropy of some of the children given the hidden variable remains high. [sent-927, score-0.54]

63 (b) shows the structure used in learning without the hidden variable T2 . [sent-938, score-0.489]

64 Instead, we use the following greedy approach: first, for each hidden variable, we limit our attention to up to K (we use 20) of its children with the highest individual entropy. [sent-946, score-0.494]

65 (c) shows the information between the hidden variable T1 and the observed children (solid), its direct children in the generating distribution (dotted) and the children of T2 (dashed). [sent-955, score-0.692]

66 Up to some point in the annealing process, the information content of the hidden variable is low and the information with both subsets of variables is low. [sent-956, score-0.616]

67 When the hidden variable starts to capture the distribution of the observed variables, the two subsets diverge, and while T1 captures its original direct children significantly better, the children of T2 still have high entropy given T1. [sent-957, score-0.655]

68 Thus, we want to start considering our information “cue” only when the hidden parent becomes meaningful, that is, only when I_Q(Y; T1) passes some threshold. [sent-958, score-0.508]
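
The quantity I_Q(Y; T1) used as a threshold here can be computed directly from the matrix of soft assignments, assuming a uniform empirical distribution Q(y) = 1/M over training instances (the array layout is an assumption):

import numpy as np

def info_y_t(Q):
    # I_Q(Y; T) = (1/M) * sum_y sum_t Q(t|y) * log(Q(t|y) / Q(t)), with Q(t) = mean_y Q(t|y)
    q_t = Q.mean(axis=0)
    ratio = np.clip(Q, 1e-12, None) / np.clip(q_t, 1e-12, None)[None, :]
    return float((Q * np.log(ratio)).mean(axis=0).sum())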

69 If this is the case, we greedily search for subsets of children of the hidden variable that have high entropy. [sent-961, score-0.548]

70 For the subset with the highest entropy, we suggest a putative new hidden variable that is the parent of the variables in the subset. [sent-963, score-0.532]
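
The two steps above (restrict attention to K high-entropy children, then propose a candidate subset C for a putative new parent) can be sketched as follows; the entropy estimate, the threshold value, and the dictionary layout are assumptions for illustration only.

import numpy as np

def cond_entropy(joint):
    # H(X_c | T1) from a joint table joint[t, x] that sums to one
    p_t = joint.sum(axis=1, keepdims=True)
    p_x_given_t = joint / np.clip(p_t, 1e-12, None)
    return float(-(joint * np.log(np.clip(p_x_given_t, 1e-12, None))).sum())

def candidate_subset(entropies, K=20, threshold=0.5):
    # entropies: {child_name: H(X_c | T1)}; keep the K highest, then those above a threshold
    top = sorted(entropies, key=entropies.get, reverse=True)[:K]
    return [c for c in top if entropies[c] > threshold]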

71 If, after structure search, a hidden variable has at most one child, it is in fact redundant and can be removed from the structure. [sent-966, score-0.515]

72 We iterate the entire procedure until no more hidden variables are added and the structure learning procedure converges. [sent-967, score-0.489]

73 Full Learning — Experimental Validation: We want to evaluate the effectiveness of our method when learning structure with and without the introduction of new hidden variables into the model. [sent-969, score-0.541]

74 To create a baseline for hierarchy performance, we train a Naive hierarchy with a single hidden variable and a cardinality of 3, totaling 122 parameters. [sent-973, score-0.764]

75 To do this, we generated 25 random hierarchies with 5 binary hidden variables that are parents of the observed variables and a common root parent, totaling 91 parameters. [sent-975, score-0.571]

76 We then use structural EM (Friedman, 1997) to adapt the structure by using a replace-parent operator, where at each local step an observed node can replace its hidden parent. [sent-976, score-0.568]

77 Next, we evaluate the ability of the new hidden variable enrichment operator to improve the model. [sent-979, score-0.528]

78 Figure 14: Comparison of performance on the Stock data set of the Naive hierarchy (Naive), 25 hierarchies with replace-parent search (Search), the hierarchy learned with the enrichment operator (Enrich), and the hierarchy learned with enrichment and replace-parent search (Enrich+). [sent-990, score-0.616]

79 The learned hierarchy had only two hidden variables (requiring only 85 parameters). [sent-994, score-0.523]

80 These results show that the enrichment operator effectively added useful hidden variables, and that the ability to adapt the structure of the network is crucial for utilizing these hidden variables to the fullest extent. [sent-995, score-1.07]

81 First, as noted in Section 10, due to the nature of the annealing process, we consider adding a new hidden variable only when the information I_Q(Y; T) of a hidden variable T in the current structure passes some threshold. [sent-997, score-1.067]

82 Finally, we applied runs that combine both automatic cardinality adaptation and enrichment of the structure with new hidden variables. [sent-1021, score-0.762]

83 Also shown is the automatic cardinality method using the BDe score along with the different cardinalities of the 6 hidden variables introduced into the network structure. [sent-1025, score-0.747]

84 For the large problems with hidden variables we explored in this paper, weight annealing proved inferior with similar running times, and impractical with the settings of Elidan et al. [sent-1051, score-0.57]

85 Table 2: Effect of cardinality when inserting new hidden variables into the network structure with the Enrich operator for the Stock data set (# of hidden variables: 5, 5, 6, 5, 5, 6; # of parameters: 89, 146, 304, 769, 2340, 526). [sent-1066, score-0.736]

86 For the automatic method, the cardinality of each hidden variable is noted. [sent-1070, score-0.48]

87 In particular, Whiley and Titterington (2002) and Smith and Eisner (2004) show why applying deterministic annealing to standard unsupervised learning of Bayesian networks with hidden variables is problematic. [sent-1076, score-0.605]

88 Our method, both for learning the cardinality of a hidden variable and for introducing new hidden variables into the network structure, relies on the annealing process and utilizes emergent signals. [sent-1087, score-1.22]

89 Our contribution in enriching the structure with new hidden variables is twofold. [sent-1102, score-0.481]

90 As in cardinality learning, we are able to utilize emergent signals, allowing the introduction of new hidden variables into simpler models and rendering them more effective. [sent-1106, score-0.643]

91 We presented a general approach for learning the parameters of hidden variables in Bayesian networks and introduced model selection operators that allow learning of new hidden variables and their cardinality. [sent-1109, score-0.85]

92 Third, we introduced model enrichment operators for inserting new hidden variables into the network structure and adapting their cardinality. [sent-1126, score-0.624]

93 First, we can improve the introduction of new hidden variables into the structure by formulating better “signals” that can be efficiently calculated for larger clusters. [sent-1130, score-0.481]

94 We start with the case of a single hidden variable and then address the more general scenario of multiple hidden variables. [sent-1151, score-0.79]

95 When computing the derivative with respect to the parameters of a specific variable Ti, the only change from the case of a single hidden variable is in the derivative of E_Q[log P(X, T)] given fixed P. [sent-1179, score-0.462]

96 Again using Lemma 9 with f(Y, X, T) ≡ log P(X, T), we get ∂E_Q[log P(X, T)] / ∂Q(t_{i0} | y_0) = E_{Q(T | t_{i0}, y_0)}[log P(x[y_0], T)], from which we get the change from Proposition 4 to Proposition 6 for the case of multiple hidden variables. [sent-1180, score-0.468]

97 Starting from Eq. (7), where we now write the normalization term Z(y, γ) explicitly, we have G_{t,y}(Q, γ) = −log Q(t | y) + (1 − γ) log Q(t) + γ log P(x[y], t) − log ∑_{t'} exp[(1 − γ) log Q(t') + γ log P(x[y], t')]. [sent-1186, score-0.552]

98 The derivatives of Eq. (19) are ∂θ_{x_i|pa_i,t} / ∂Q(t_0 | y_0) = Q(y_0) [1{x_i[y_0] = x_i, pa_i[y_0] = pa_i} N(pa_i, t) − 1{pa_i[y_0] = pa_i} N(x_i, pa_i, t)] / N(pa_i, t)^2 = Q(y_0) 1{pa_i[y_0] = pa_i} [1{x_i[y_0] = x_i} N(pa_i, t) − N(x_i, pa_i, t)] / N(pa_i, t)^2,  (21) for t = t_0, and are zero otherwise. [sent-1195, score-1.635]

99 The log-probability of a specific instance can be written as log P(x[y], t) = log θ_{t|pa_t}[y] + ∑_{i ∈ Ch_t} log θ_{x_i|pa_i,t}[y] + ∑_{i ∉ {t} ∪ Ch_t} log θ_{x_i|pa_i}[y],  (23) where Ch_t denotes the children of T in Gout and θ_{t|pa_t}[y] is the parameter corresponding to the values appearing in instance y. [sent-1198, score-0.504]
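
Concretely, only the CPD of T and the CPDs of T's children in Gout depend on the value t, so for a single instance y the decomposition in Eq. (23) can be evaluated as a vector over the states of T; the input layout below (parameters already looked up for instance y) is an assumption for illustration.

import numpy as np

def log_prob_instance(log_theta_t, log_theta_children, log_theta_rest):
    # log_theta_t:        array of shape (|T|,), the term log theta_{t|pa_t}[y]
    # log_theta_children: list of arrays of shape (|T|,), one per child of T in Gout
    # log_theta_rest:     scalar, the sum of the terms that do not depend on t
    return log_theta_t + np.sum(log_theta_children, axis=0) + log_theta_rest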

100 Multiple Hidden Variables: When computing the derivative with respect to the parameters associated with a specific hidden variable ti, the only change in G_{t,y}(Q, γ) is that log P(x[y], t) is replaced by E_{Q(T|ti,y)}[log P(x[y], T)]. [sent-1223, score-0.656]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hidden', 0.376), ('pai', 0.327), ('gout', 0.279), ('continuation', 0.234), ('gin', 0.23), ('em', 0.212), ('lem', 0.204), ('bottleneck', 0.199), ('cardinality', 0.18), ('idden', 0.16), ('lidan', 0.16), ('annealing', 0.156), ('lagrangian', 0.135), ('pat', 0.135), ('riedman', 0.134), ('ti', 0.118), ('stock', 0.114), ('etworks', 0.1), ('elidan', 0.099), ('yeast', 0.099), ('log', 0.092), ('children', 0.09), ('bde', 0.082), ('hierarchy', 0.081), ('enrichment', 0.076), ('slonim', 0.075), ('parent', 0.072), ('enrich', 0.07), ('structure', 0.067), ('bayesian', 0.063), ('runs', 0.063), ('want', 0.06), ('earning', 0.059), ('cardinalities', 0.058), ('friedman', 0.057), ('score', 0.05), ('naive', 0.05), ('digit', 0.049), ('maxima', 0.049), ('emergent', 0.049), ('tishby', 0.048), ('rose', 0.047), ('parents', 0.047), ('variable', 0.046), ('likelihood', 0.046), ('network', 0.045), ('heckerman', 0.039), ('variables', 0.038), ('local', 0.037), ('ep', 0.037), ('search', 0.036), ('proposition', 0.036), ('heat', 0.035), ('omputation', 0.035), ('deterministic', 0.035), ('structural', 0.034), ('uai', 0.034), ('neal', 0.034), ('eld', 0.034), ('compress', 0.033), ('perturbation', 0.032), ('target', 0.031), ('variational', 0.031), ('operator', 0.03), ('hinton', 0.029), ('entropy', 0.028), ('cht', 0.028), ('ihq', 0.028), ('kirkpatrick', 0.028), ('scorebde', 0.028), ('learned', 0.028), ('graphical', 0.027), ('ll', 0.027), ('hours', 0.026), ('objective', 0.026), ('regularization', 0.026), ('redundant', 0.026), ('captures', 0.025), ('equations', 0.025), ('derivative', 0.024), ('adapt', 0.024), ('pac', 0.024), ('fixed', 0.024), ('compression', 0.023), ('clustering', 0.023), ('thiesson', 0.023), ('calibration', 0.023), ('quadrant', 0.023), ('marks', 0.023), ('complex', 0.023), ('states', 0.023), ('instance', 0.023), ('francisco', 0.023), ('cluster', 0.022), ('path', 0.022), ('model', 0.022), ('direction', 0.022), ('stationary', 0.022), ('multivariate', 0.022), ('interactions', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999827 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

Author: Gal Elidan, Nir Friedman

Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods

2 0.082446575 71 jmlr-2005-Variational Message Passing

Author: John Winn, Christopher M. Bishop

Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing

3 0.08126609 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

Author: Dmitry Rusakov, Dan Geiger

Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)

4 0.077097543 44 jmlr-2005-Learning Module Networks

Author: Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, Nir Friedman

Abstract: Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper, we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules’ composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks. 1. A preliminary version of this paper appeared in the Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, 2003 (UAI ’03).

5 0.073786497 39 jmlr-2005-Information Bottleneck for Gaussian Variables

Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss

Abstract: The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, through (soft) clustering one variable while preserving information about another - relevance - variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σ_{x|y} Σ_x^{−1}, which is also the basis obtained in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information-curve”), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. Keywords: information bottleneck, Gaussian processes, dimensionality reduction, canonical correlation analysis

6 0.07090234 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

7 0.062378656 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

8 0.060592368 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

9 0.059050858 40 jmlr-2005-Inner Product Spaces for Bayesian Networks

10 0.053323004 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

11 0.051556926 36 jmlr-2005-Gaussian Processes for Ordinal Regression

12 0.051332138 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

13 0.048949067 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs

14 0.045018587 54 jmlr-2005-Managing Diversity in Regression Ensembles

15 0.044844583 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

16 0.044422943 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

17 0.044111043 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

18 0.040308457 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

19 0.039384216 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks

20 0.035758238 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.239), (1, -0.043), (2, -0.038), (3, -0.124), (4, 0.035), (5, 0.106), (6, -0.192), (7, -0.153), (8, 0.015), (9, -0.036), (10, 0.199), (11, -0.022), (12, 0.162), (13, 0.079), (14, 0.2), (15, 0.111), (16, 0.015), (17, -0.215), (18, -0.05), (19, -0.071), (20, 0.109), (21, -0.13), (22, 0.19), (23, -0.041), (24, -0.102), (25, 0.041), (26, -0.06), (27, 0.091), (28, 0.087), (29, 0.084), (30, -0.084), (31, 0.146), (32, 0.058), (33, 0.146), (34, 0.101), (35, -0.106), (36, -0.159), (37, 0.029), (38, -0.273), (39, 0.013), (40, 0.175), (41, -0.022), (42, 0.087), (43, 0.061), (44, 0.063), (45, 0.154), (46, -0.135), (47, 0.046), (48, 0.071), (49, 0.193)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96890777 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

Author: Gal Elidan, Nir Friedman

Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods

2 0.37371713 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

Author: Asela Gunawardana, William Byrne

Abstract: The EM algorithm is widely used to develop iterative parameter estimation procedures for statistical models. In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. In some instances there are practical reasons to develop procedures that do not strictly fall within the EM framework. We study EM variants in which the E-step is not performed exactly, either to obtain improved rates of convergence, or due to approximations needed to compute statistics under a model family over which E-steps cannot be realized. Since these variants are not EM procedures, the standard (G)EM convergence results do not apply to them. We present an information geometric framework for describing such algorithms and analyzing their convergence properties. We apply this framework to analyze the convergence properties of incremental EM and variational EM. For incremental EM, we discuss conditions under which these algorithms converge in likelihood. For variational EM, we show how the E-step approximation prevents convergence to local maxima in likelihood. Keywords: EM, variational EM, incremental EM, convergence, information geometry

3 0.33329228 44 jmlr-2005-Learning Module Networks

Author: Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, Nir Friedman

Abstract: Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper, we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules’ composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks. 1. A preliminary version of this paper appeared in the Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, 2003 (UAI ’03).

4 0.31718749 39 jmlr-2005-Information Bottleneck for Gaussian Variables

Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss

Abstract: The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, through (soft) clustering one variable while preserving information about another - relevance - variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σ_{x|y} Σ_x^{−1}, which is also the basis obtained in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information-curve”), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. Keywords: information bottleneck, Gaussian processes, dimensionality reduction, canonical correlation analysis

5 0.31007874 71 jmlr-2005-Variational Message Passing

Author: John Winn, Christopher M. Bishop

Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing

6 0.27499568 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

7 0.26568231 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

8 0.22445424 40 jmlr-2005-Inner Product Spaces for Bayesian Networks

9 0.20568663 54 jmlr-2005-Managing Diversity in Regression Ensembles

10 0.20222327 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

11 0.20052294 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

12 0.17066872 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs

13 0.16341466 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

14 0.16192061 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions

15 0.16117707 72 jmlr-2005-What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks

16 0.15726505 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

17 0.15425397 20 jmlr-2005-Clustering with Bregman Divergences

18 0.15345941 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

19 0.15249127 12 jmlr-2005-An MDP-Based Recommender System

20 0.14384642 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.013), (17, 0.03), (19, 0.02), (36, 0.028), (37, 0.032), (42, 0.016), (43, 0.038), (47, 0.018), (52, 0.065), (59, 0.022), (70, 0.033), (88, 0.583), (90, 0.018), (94, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99663025 16 jmlr-2005-Asymptotics in Empirical Risk Minimization

Author: Leila Mohammadi, Sara van de Geer

Abstract: In this paper, we study a two-category classification problem. We indicate the categories by labels Y = 1 and Y = −1. We observe a covariate, or feature, X ∈ X ⊂ R^d. Consider a collection {h_a} of classifiers indexed by a finite-dimensional parameter a, and the classifier h_{a∗} that minimizes the prediction error over this class. The parameter a∗ is estimated by the empirical risk minimizer â_n over the class, where the empirical risk is calculated on a training sample of size n. We apply the Kim Pollard Theorem to show that under certain differentiability assumptions, â_n converges to a∗ with rate n^{−1/3}, and also present the asymptotic distribution of the renormalized estimator. For example, let V_0 denote the set of x on which, given X = x, the label Y = 1 is more likely (than the label Y = −1). If X is one-dimensional, the set V_0 is the union of disjoint intervals. The problem is then to estimate the thresholds of the intervals. We obtain the asymptotic distribution of the empirical risk minimizer when the classifiers have K thresholds, where K is fixed. We furthermore consider an extension to higher-dimensional X, assuming basically that V_0 has a smooth boundary in some given parametric class. We also discuss various rates of convergence when the differentiability conditions are possibly violated. Here, we again restrict ourselves to one-dimensional X. We show that the rate is n^{−1} in certain cases, and then also obtain the asymptotic distribution for the empirical prediction error. Keywords: asymptotic distribution, classification theory, estimation error, nonparametric models, threshold-based classifiers

2 0.99421775 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

Author: Savina Andonova Jaeger

Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes

same-paper 3 0.98173344 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

Author: Gal Elidan, Nir Friedman

Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods

4 0.97265637 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

Author: Petros Drineas, Michael W. Mahoney

Abstract: A problem for many kernel-based methods is that the amount of computation required to find the solution scales as O(n^3), where n is the number of training examples. We develop and analyze an algorithm to compute an easily-interpretable low-rank approximation to an n × n Gram matrix G such that computations of interest may be performed more rapidly. The approximation is of the form G̃_k = C W_k^+ C^T, where C is a matrix consisting of a small number c of columns of G and W_k is the best rank-k approximation to W, the matrix formed by the intersection between those c columns of G and the corresponding c rows of G. An important aspect of the algorithm is the probability distribution used to randomly sample the columns; we will use a judiciously-chosen and data-dependent nonuniform probability distribution. Let ‖·‖_2 and ‖·‖_F denote the spectral norm and the Frobenius norm, respectively, of a matrix, and let G_k be the best rank-k approximation to G. We prove that by choosing O(k/ε^4) columns, ‖G − C W_k^+ C^T‖_ξ ≤ ‖G − G_k‖_ξ + ε ∑_{i=1}^n G_ii^2, both in expectation and with high probability, for both ξ = 2, F, and for all k: 0 ≤ k ≤ rank(W). This approximation can be computed using O(n) additional space and time, after making two passes over the data from external storage. The relationships between this algorithm, other related matrix decompositions, and the Nyström method from integral equation theory are discussed. Keywords: kernel methods, randomized algorithms, Gram matrix, Nyström method

5 0.85101974 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve

Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled, Dan Roth

Abstract: We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter. Keywords: generalization bounds, area under the ROC curve, ranking, large deviations, uniform convergence ∗. Parts of the results contained in this paper were presented at the 18th Annual Conference on Neural Information Processing Systems in December, 2004 (Agarwal et al., 2005a) and at the 10th International Workshop on Artificial Intelligence and Statistics in January, 2005 (Agarwal et al., 2005b).

6 0.83740336 11 jmlr-2005-Algorithmic Stability and Meta-Learning

7 0.83260673 20 jmlr-2005-Clustering with Bregman Divergences

8 0.811387 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error

9 0.81046361 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

10 0.80751365 48 jmlr-2005-Learning the Kernel Function via Regularization

11 0.79882932 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

12 0.78758705 67 jmlr-2005-Stability of Randomized Learning Algorithms

13 0.78710139 47 jmlr-2005-Learning from Examples as an Inverse Problem

14 0.77832425 44 jmlr-2005-Learning Module Networks

15 0.77552366 71 jmlr-2005-Variational Message Passing

16 0.76835144 39 jmlr-2005-Information Bottleneck for Gaussian Variables

17 0.7630617 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

18 0.76013458 3 jmlr-2005-A Classification Framework for Anomaly Detection

19 0.74136484 64 jmlr-2005-Semigroup Kernels on Measures

20 0.73848921 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)