jmlr jmlr2005 jmlr2005-44 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, Nir Friedman
Abstract: Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper,1 we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules’ composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks. 1. A preliminary version of this paper appeared in the Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, 2003 (UAI ’03).
Reference: text
sentIndex sentText sentNum sentScore
1 We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. [sent-16, score-1.841]
2 Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks. [sent-18, score-2.138]
3 By making the modular structure explicit, the module network representation provides insights about the domain that are often obscured by the intricate details of a large Bayesian network structure. [sent-47, score-1.034]
4 A module network can be viewed simply as a Bayesian network in which variables in the same module share parents and parameters. [sent-48, score-1.898]
5 (b) A simple module network; the boxes illustrate modules, where stock price variables share CPDs and parameters. [sent-55, score-1.023]
6 Note that in a module network, variables in the same module have the same CPDs but may have different descendants. [sent-56, score-1.674]
7 Our results show that our learned module network generalizes to unseen test data much better than a Bayesian network. [sent-59, score-0.95]
8 They also illustrate the ability of the learned module network to reveal high-level structure that provides important insights. [sent-60, score-0.983]
9 To model this type of situation, we might divide stock price variables into groups, which we call modules, and require that variables in the same module have the same probabilistic model; that is, all variables in the module have the same set of parents and the same CPD. [sent-77, score-1.912]
10 This notion of a module is the key idea underlying the module network formalism. [sent-81, score-1.731]
11 As described above, a module represents a set of variables that share the same set of parents and the same CPD. [sent-88, score-0.923]
12 The first defines a template probabilistic model for each module in C ; all of the variables assigned to the module will share this probabilistic model. [sent-97, score-1.782]
13 Definition 1 A module network template T = (S , θ) for C defines, for each module M j ∈ C : • a set of parents PaM j ⊂ X ; • a conditional probability distribution template P(M j | PaM j ) which specifies a distribution over Val(M j ) for each assignment in Val(PaM j ). [sent-98, score-1.909]
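To make Definition 1 concrete, the sketch below shows one possible plain-Python encoding of the pieces just defined; the dict/tuple layout, the module contents, and every probability value are illustrative assumptions (loosely following the stock example of Figure 1), not the paper's implementation.

```python
# One possible plain-Python encoding of M = (C, T, A) for the toy stock example.
# The template T = (S, θ) is split into `parents` (the structure S) and `cpds`
# (the parameters θ); `assignment` is the function A.

# Pa_{M_j} for each module (indices 0, 1, 2 stand in for the figure's M1, M2, M3).
parents = {
    0: (),            # the module containing MSFT has no parents
    1: ("MSFT",),     # the module containing AMAT, MOT, INTL
    2: ("INTL",),     # the module containing DELL, HPQ; this parent choice is an assumption
}

# CPD template P(M_j | Pa_{M_j}): one distribution over Val(M_j) per parent assignment.
cpds = {
    0: {(): {"up": 0.5, "down": 0.5}},
    1: {("up",): {"up": 0.7, "down": 0.3}, ("down",): {"up": 0.2, "down": 0.8}},
    2: {("up",): {"up": 0.6, "down": 0.4}, ("down",): {"up": 0.3, "down": 0.7}},
}

# Assignment function A: every variable in the same module shares the parents and CPD above.
assignment = {"MSFT": 0, "AMAT": 1, "MOT": 1, "INTL": 1, "DELL": 2, "HPQ": 2}
```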
14 A module network defines a probabilistic model by using the formal random variables M j and their associated CPDs as templates that encode the behavior of all of the variables assigned to that module. [sent-111, score-0.968]
15 Specifically, we define the semantics of a module network by “unrolling” a Bayesian network where all of the variables assigned to module M j share the parents and conditional probability template assigned to M j in T . [sent-112, score-1.969]
16 Acyclicity can be guaranteed by the following simple condition on the module network: Definition 3 Let M be a triple (C , T , A ), where C is a module set, T is a module network template for C , and A is a module assignment function for C . [sent-114, score-3.501]
17 M defines a directed module graph GM as follows: • the nodes in GM correspond to the modules in C ; • GM contains an edge M j → Mk if and only if there is a variable X ∈ X so that A (X) = j and X ∈ PaMk . [sent-115, score-1.179]
18 We say that M is a module network if the module graph GM is acyclic. [sent-116, score-1.731]
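A sketch of Definition 3 under the same kind of dict encoding (assignment: variable → module index, parents: module index → set of parent variables): build the module graph G_M and test it for directed cycles with a depth-first search. The particular parent variables in the example are assumptions chosen only to reproduce the M1 → M2 → M3 shape.

```python
def module_graph(assignment, parents):
    # Edges M_j -> M_k of the module graph G_M: an edge is present iff some
    # variable X with A(X) = j appears in the parent set Pa_{M_k} of module k.
    edges = {j: set() for j in set(assignment.values()) | set(parents)}
    for k, pa_k in parents.items():
        for x in pa_k:
            edges[assignment[x]].add(k)
    return edges

def is_acyclic(edges):
    # Depth-first search for a directed cycle; M = (C, T, A) is a module
    # network exactly when this returns True for its module graph.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {j: WHITE for j in edges}

    def dfs(j):
        color[j] = GRAY
        for k in edges[j]:
            if color[k] == GRAY or (color[k] == WHITE and not dfs(k)):
                return False            # back edge => directed cycle
        color[j] = BLACK
        return True

    for j in list(edges):
        if color[j] == WHITE and not dfs(j):
            return False
    return True

# Figure 1(b)-style example: the module graph is 0 -> 1 -> 2 (the figure's M1 -> M2 -> M3).
A = {"MSFT": 0, "AMAT": 1, "MOT": 1, "INTL": 1, "DELL": 2, "HPQ": 2}
Pa = {0: set(), 1: {"MSFT"}, 2: {"INTL"}}
assert is_acyclic(module_graph(A, Pa))
```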
19 For example, for the module network of Figure 1(b), the module graph has the structure M1 → M2 → M3 . [sent-117, score-1.764]
20 Returning to our example, the Bayesian network of Figure 1(a) is the ground Bayesian network of the module network of Figure 1(b). [sent-120, score-1.069]
21 Using the acyclicity of the module graph, we can now show that the semantics for a module network is well-defined. [sent-121, score-1.761]
22 By definition of the module graph, we must have an edge M_{A(X_i)} → M_{A(X_j)} in the module graph. [sent-125, score-1.658]
23 Recall that a module network is specified by a set of modules C , an assignment function A of nodes to modules, the parent structure S specified in T , and the parameters θ for the local probability distributions P(M j | PaM j ). [sent-148, score-1.424]
24 Such distinctions can be made within the module network framework through more elaborate prior probability functions that take the module type into account. [sent-154, score-1.731]
25 One can consider several learning tasks for module networks, depending on which of the remaining aspects of the module network specification are known. [sent-155, score-1.731]
26 Our primary goal is to learn a module network structure and assignment function for this distribution. [sent-162, score-1.025]
27 As the semantics of a module network is defined via the ground Bayesian network, we have that, in the case of complete data, the likelihood decomposes into a product of local likelihood functions, one for each variable. [sent-171, score-1.002]
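A minimal sketch of that decomposition for discrete CPDs, using the same illustrative dict encoding as above: the complete-data log-likelihood of the ground network is a sum over instances and variables, and each variable simply looks up the CPD and parents of its module. The toy data and probabilities are made up.

```python
import math

def log_likelihood(data, assignment, parents, cpds):
    # Complete-data log-likelihood of the ground Bayesian network: a sum over
    # instances and variables, where each variable X uses the parents and CPD
    # of its module A(X).
    ll = 0.0
    for inst in data:                                    # inst: variable -> observed value
        for x, val in inst.items():
            j = assignment[x]
            u = tuple(inst[p] for p in parents[j])       # shared parent assignment Pa_{M_j}
            ll += math.log(cpds[j][u][val])
    return ll

# Tiny example with two variables and two instances.
data = [{"MSFT": "up", "AMAT": "up"}, {"MSFT": "down", "AMAT": "down"}]
A, Pa = {"MSFT": 0, "AMAT": 1}, {0: (), 1: ("MSFT",)}
cpds = {0: {(): {"up": 0.5, "down": 0.5}},
        1: {("up",): {"up": 0.7, "down": 0.3}, ("down",): {"up": 0.2, "down": 0.8}}}
print(log_likelihood(data, A, Pa, cpds))   # = log(0.5) + log(0.7) + log(0.5) + log(0.8)
```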
28 The CPD template of each module is connected to all variables assigned to that module. [sent-176, score-1.712]
29 The sufficient statistics of each CPD template are the sum of the sufficient statistics of each variable assigned to the module and the module parents. [sent-179, score-1.696]
30 In a module network, all of the variables in the same module share the same parameters. [sent-186, score-1.708]
31 The second module contains three variables; thus, the sufficient statistics for the module CPD are the sum of the statistics we would collect in the ground Bayesian network of Figure 1(a): Ŝ[M_2, MSFT] = Ŝ[AMAT, MSFT] + Ŝ[MOT, MSFT] + Ŝ[INTL, MSFT]. [sent-198, score-1.752]
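A sketch of how the shared sufficient statistics are pooled (same illustrative encoding; the toy data are made up): for a discrete module CPD, the count for each (parent assignment, value) pair is summed over every variable assigned to the module, exactly as in the Ŝ[M_2, MSFT] example above.

```python
from collections import Counter

def module_sufficient_stats(data, assignment, parents):
    # For each module j, count (parent assignment, value) pairs pooled over every
    # variable assigned to j -- the per-variable statistics of the ground network
    # simply add up.
    stats = {j: Counter() for j in parents}
    for inst in data:
        for x, val in inst.items():
            j = assignment[x]
            u = tuple(inst[p] for p in parents[j])
            stats[j][(u, val)] += 1
    return stats

# Two toy instances over module 0 = {MSFT} and module 1 = {AMAT, MOT, INTL} with parent MSFT.
data = [{"MSFT": "up", "AMAT": "up", "MOT": "up", "INTL": "down"},
        {"MSFT": "down", "AMAT": "down", "MOT": "up", "INTL": "down"}]
A = {"MSFT": 0, "AMAT": 1, "MOT": 1, "INTL": 1}
Pa = {0: (), 1: ("MSFT",)}
print(module_sufficient_stats(data, A, Pa)[1])
# -> pooled counts such as (('up',), 'up'): 2 and (('down',), 'down'): 2 over the three variables
```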
32 The prior is nonzero only when the dependency structure is legal (i.e., the module graph induced by the assignment A and structure S is acyclic), and 0 otherwise. [sent-216, score-0.952]
33 • ρ(S) satisfies structure modularity if ρ(S) = ∏_j ρ_j(S_j), where S_j denotes the choice of parents for module M_j and ρ_j is a non-negative measure over these choices. [sent-220, score-0.954]
34 • κ(A) satisfies assignment modularity if κ(A) = ∏_j κ_j(A_j), where A_j denotes the choice of variables assigned to module M_j and κ_j is a non-negative measure over these choices. [sent-221, score-0.999]
35 Then, the Bayesian score decomposes into local module scores: score(S, A : D) = ∑_{j=1}^{K} score_{M_j}(Pa_{M_j}, A⁻¹(j) : D), where score_{M_j}(U, X : D) = log ∫ L_j(U, X, θ_{M_j|U} : D) P(θ_{M_j} | U) dθ_{M_j|U} + log ρ_j(U) + log κ_j(X).  (4) [sent-238, score-0.925]
36 Proof Recall that we defined the Bayesian score of a module network as: score(S, A : D) = log P(D | S, A) + log P(S, A). [sent-239, score-0.968]
37 Using the global modularity, structure modularity, and assignment modularity assumptions of Definition 7, log P(S, A) decomposes by modules, resulting in the second and third terms of Equation (4), which capture the preferences for the parents of module M_j and the variables assigned to it. [sent-240, score-1.154]
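A structural sketch of Equation (4): the total score is a sum of per-module terms, each combining an integrated local likelihood with the structure-modular and assignment-modular priors. The integrated term is left as a pluggable callable because its closed form depends on the CPD family and parameter priors; all names and numeric values below are placeholders.

```python
def module_network_score(modules, local_marginal_loglik, log_rho, log_kappa):
    # score(S, A : D) as a sum of per-module terms (Equation 4). The integrated
    # local likelihood is passed in, e.g. a Dirichlet or Normal-Gamma integral.
    return sum(local_marginal_loglik(pa, xs, d) + log_rho(pa) + log_kappa(xs)
               for pa, xs, d in modules)

# Structural usage example (numbers are placeholders, not real scores):
score = module_network_score(
    modules=[(("MSFT",), ("AMAT", "MOT", "INTL"), None)],
    local_marginal_loglik=lambda pa, xs, d: -12.3,   # stand-in for log ∫ L_j(...) dθ
    log_rho=lambda pa: -0.5 * len(pa),               # structure-modular prior over parents
    log_kappa=lambda xs: 0.0,                        # assignment-modular prior (uniform here)
)
print(score)
```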
38 (Note that edge reversal is not a well-defined operator for module networks, as an edge from a variable to a module represents a one-to-many relation between the variable and all of the variables in the module.) [sent-263, score-1.674]
39 When an operator causes a parent Xi to be added to the parent set of module M j , we need to verify that the resulting module graph remains acyclic, relative to the current assignment A. [sent-264, score-1.82]
40 When updating the dependency structure for a module M j , the module score for another module Mk does not change, nor do the changes in score induced by various operators applied to the dependency structure of Mk . [sent-267, score-2.766]
41 Moreover, only the delta score of operators that add or remove a parent from module M j need to be recomputed after a change to the dependency structure of module M j , resulting in additional savings. [sent-269, score-1.844]
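One way to exploit this in an implementation is a small delta-score cache keyed by (module, operator, parent); the sketch below assumes some external `compute` callable that evaluates a delta score, and does not reproduce the paper's actual data structures.

```python
class DeltaScoreCache:
    # Caches the score change ("delta score") of each add/remove-parent operator.
    # Because the score decomposes per module, a change to module j's dependency
    # structure only invalidates module j's entries; all other cached deltas stay valid.
    def __init__(self):
        self._cache = {}                          # (module, op, parent) -> delta score

    def get(self, module, op, parent, compute):
        key = (module, op, parent)
        if key not in self._cache:
            self._cache[key] = compute(module, op, parent)
        return self._cache[key]

    def invalidate_module(self, module):
        for key in [k for k in self._cache if k[0] == module]:
            del self._cache[key]

# Usage sketch with a placeholder delta-score evaluator.
cache = DeltaScoreCache()
delta = cache.get(1, "add", "MSFT", compute=lambda j, op, p: 3.7)   # placeholder value
cache.invalidate_module(1)    # after actually changing module 1's parent set
```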
42 Overall, if the maximum number of parents per module is d, the cost of evaluating each operator applied to the module is, as usual, at most O(Md), for accumulating the necessary sufficient statistics. [sent-271, score-1.702]
43 During the structure learning phase, each change to the parent set of module M j requires that we re-evaluate at most n operators (one for each existing or potential parent of M j ), at a total cost of O(nMd). [sent-274, score-0.955]
44 This type of step occurs in two places in our algorithm: once at the very beginning of the algorithm, in order to initialize the modules, and once at each iteration, given a module network structure S learned in the previous structure learning step. [sent-277, score-1.016]
45 We can therefore view the module assignment task as the task of clustering variables into sets, so that variables in the same set have a similar behavior across all instances. [sent-284, score-0.975]
46 Moreover, our setting places certain constraints on the clustering, so that the resulting assignment function will induce a legal (acyclic) module network. [sent-294, score-0.945]
47 Our clustering algorithm uses an objective function that evaluates a partition of variables into modules by measuring the extent to which the module model is a good fit to the features (instances) of the module variables. [sent-298, score-2.048]
48 We then consider all possible legal module mergers (those corresponding to modules with the same domain), where we change the assignment function to replace two modules j1 and j2 by a new module j1,2 . [sent-308, score-2.474]
49 More subtly, the Bayesian score for each module depends non-additively on the sufficient statistics of all the variables assigned to the module. [sent-326, score-0.927]
50 Thus, we can only compute the delta score for moving a variable from one module to another given a fixed assignment of the other variables to these two modules. [sent-328, score-1.001]
51 We start with an initial assignment function A 0 , and in a “round-robin” fashion iterate over all of the variables one at a time, and consider changing their module assignment. [sent-331, score-0.935]
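A sketch of one such round-robin pass; `local_score` and `is_legal` are assumed callables (for instance, the acyclicity check sketched earlier), and the bookkeeping of a real implementation is omitted.

```python
def reassign_pass(variables, assignment, parents, local_score, is_legal):
    # One round-robin pass: for each variable, tentatively move it to every other
    # module and keep the best legal move that improves the score. Decomposability
    # means only the source and target modules' local scores need recomputing.
    changed = False
    for x in variables:
        old = assignment[x]
        best_new, best_delta = old, 0.0
        for new in parents:                        # candidate target modules
            if new == old:
                continue
            before = local_score(old, assignment) + local_score(new, assignment)
            assignment[x] = new                    # tentative move
            if is_legal(assignment, parents):
                after = local_score(old, assignment) + local_score(new, assignment)
                if after - before > best_delta:
                    best_new, best_delta = new, after - before
            assignment[x] = old                    # undo the tentative move
        if best_new != old:
            assignment[x] = best_new
            changed = True
    return changed
```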
52 Our algorithm terminates when it can no longer improve the score. [sent-336, score-1.649]
[Algorithm outline from the accompanying figure (Learn-Module-Network). Input: data set D, number of modules K. Output: a module network M. A0 = cluster X into K modules; S0 = empty structure; loop t = 1, 2, ..., alternating structure search and module reassignment.]
53 Greedy-Structure-Search successively applies operators that change the structure as long as each such operator results in a legal structure and improves the module network score. [sent-340, score-1.081]
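Putting the phases together, a high-level sketch of the Learn-Module-Network loop outlined above; every helper (the clustering initializer, the greedy per-module parent search, the sequential reassignment pass, and the scoring function) is passed in as an assumed callable rather than implemented here.

```python
def learn_module_network(data, variables, K, cluster_init,
                         greedy_structure_search, sequential_update, score,
                         max_iters=20):
    # Initialize A_0 by clustering the variables into K modules, start from an
    # empty structure S_0, then alternate structure search and module reassignment
    # while the overall score keeps improving.
    assignment = cluster_init(data, variables, K)          # A_0
    parents = {j: set() for j in range(K)}                 # S_0: empty structure
    best = score(parents, assignment, data)
    for _ in range(max_iters):
        parents = greedy_structure_search(parents, assignment, data)
        assignment = sequential_update(parents, assignment, data)
        current = score(parents, assignment, data)
        if current <= best:                                # can no longer improve the score
            return parents, assignment
        best = current
    return parents, assignment
```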
54 Once again, the decomposition of the score plays a key role in reducing the complexity of this computation: When reassigning a variable Xi from one module M_old to another M_new, only the local scores of these modules change. [sent-343, score-1.273]
55 The module score of all other modules remains unchanged. [sent-344, score-1.245]
56 (c) Initialization of the assignment function for the module network procedure for the data in (a). [sent-428, score-0.992]
57 Many of the domains suited for module network models contain continuous valued variables, such as gene expression or price changes in the stock market. [sent-440, score-1.165]
58 For each leaf ℓ of module j's regression tree, the pooled sufficient statistics collect the count together with the sum Ŝ_1^{j,ℓ} = ∑_m ∑_{X_i ∈ X^j} x_i[m] and the sum of squares Ŝ_2^{j,ℓ} = ∑_m ∑_{X_i ∈ X^j} x_i[m]², taken over the instances m that reach leaf ℓ and the variables X_i assigned to module j. (5) The local module score further decomposes into independent components, one for each leaf. [sent-482, score-0.968]
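A sketch of the pooled Gaussian leaf statistics and the resulting per-leaf term; the list-based encoding and the toy numbers are assumptions, and a full implementation would integrate over parameters rather than plug in the maximum-likelihood estimates.

```python
import math

def gaussian_leaf_stats(values):
    # Pooled sufficient statistics for one leaf: `values` collects x_i[m] over all
    # instances m reaching the leaf and all variables X_i assigned to the module.
    s0 = len(values)
    s1 = sum(values)
    s2 = sum(v * v for v in values)
    mean = s1 / s0
    var = s2 / s0 - mean ** 2
    return s0, s1, s2, mean, var

def gaussian_leaf_loglik(values):
    # Maximum-likelihood Gaussian log-likelihood of one leaf; the module score is
    # (up to the Bayesian integration and priors) a sum of independent per-leaf terms.
    s0, _, _, _, var = gaussian_leaf_stats(values)
    return -0.5 * s0 * (math.log(2 * math.pi * var) + 1)

# Example leaf pooling three module variables over two instances (values made up).
print(gaussian_leaf_loglik([0.2, -0.1, 0.4, 0.3, 0.0, -0.2]))
```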
59 When performing structure search for module networks with regression-tree CPDs, in addition to choosing the parents of each module, we must also choose the associated tree structure. [sent-485, score-0.971]
60 Experimental Results We evaluated our module network learning procedure on synthetic data and on two real data sets — gene expression data, and stock market data. [sent-498, score-1.127]
61 Figure 9: (a) Score of the model (normalized by the number of variables/genes) across the iterations of the algorithm for a module network learned with 50 modules on the gene expression data. [sent-538, score-1.394]
62 (b) Changes in the assignment of genes to modules for the module network learned in (a) across the iterations of the algorithm. [sent-540, score-1.465]
63 We then selected 2355 genes that varied significantly in the data and learned a module network over these genes. [sent-558, score-1.025]
64 Figure 10: Score of 100 module networks (normalized by the number of variables/genes) each learned with 50 modules from a random clustering initialization, where the runs are sorted according to their score. [sent-566, score-1.279]
65 The score of a module network learned using the deterministic clustering initialization described in Section 4. [sent-567, score-1.055]
66 We generated 100 random assignments of variables to modules, and learned a module network starting from each initialization. [sent-583, score-0.989]
67 In Figure 11(a), we show the difference in test-data log-likelihood between module networks of different sizes and the baseline Bayesian network, demonstrating that module networks generalize much better to unseen data for almost all choices of the number of modules. [sent-588, score-1.714]
68 We evaluated a learned module network with 50 modules, where we selected 50 modules due to the biological plausibility of having, on average, 40–50 genes per module. [sent-594, score-1.393]
69 Suppose we find l genes with a certain annotation in a module of size N. [sent-599, score-0.935]
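A standard way to turn such counts into a significance value is the hypergeometric tail probability; the SciPy-based sketch below is a common choice and not necessarily the paper's exact calculation (which may also correct for testing many annotations).

```python
from scipy.stats import hypergeom

def enrichment_pvalue(l, N, K, G):
    # P(X >= l) when a module of size N is drawn from G genes in total, K of which
    # carry the annotation: the upper tail of a hypergeometric distribution.
    return hypergeom.sf(l - 1, G, K, N)

# Example: 12 of the 40 genes in a module carry an annotation held by 80 of 2355 genes.
print(enrichment_pvalue(l=12, N=40, K=80, G=2355))
```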
70 A comparison of the overall enrichments of the modules learned by module networks to the enrichments obtained for clusters using AutoClass is shown in Figure 11(b), indicating that there are many annotations that are much more significantly enriched in module networks. [sent-612, score-2.149]
71 Another module regulated by the cell cycle module is the nitrogen catabolite repression (NCR) module, a cellular response activated when nitrogen sources are scarce. [sent-616, score-1.672]
72 The y-axis denotes the difference in log-likelihood on held out data between the learned module network and the learned Bayesian network, averaged over 10 folds; the error bars show the standard deviation. [sent-620, score-0.998]
73 (b) Comparison of the enrichment for functional annotations between the modules learned using the module network procedure and the clusters learned by the AutoClass clustering algorithm (Cheeseman et al. [sent-621, score-1.496]
74 The y-axis denotes the difference in log-likelihood on held out data between the learned module network and the learned Bayesian network, averaged over 10 folds; the error bars show the standard deviation. [sent-637, score-0.998]
75 (b) Comparison of the enrichment for sector annotations between the modules learned using the module network procedure and the clusters learned by the AutoClass clustering algorithm (Cheeseman et al. [sent-638, score-1.45]
76 To test the quality of our modules, we measured the enrichment of the modules in the network with 50 modules for annotations representing various sectors to which each stock belongs (based on sector classifications from http://finance. [sent-641, score-0.968]
77 In 20 of the 21 cases, the enrichment was far more significant in the modules learned using module networks than in the clusters learned by AutoClass, as can be seen in Figure 12(b). [sent-648, score-1.335]
78 Finally, we also looked at the structure of the module network, and found several cases where the structure fit our (limited) understanding of the stock domain. [sent-649, score-1.012]
79 One can easily extend module networks with ideas from the hierarchical Bayesian framework, allowing the parameters of different variables in the same module to be correlated but not necessarily equal. [sent-664, score-1.702]
80 Each module can be viewed as a class, so that the variables in a single module share the same probabilistic model. [sent-668, score-1.726]
81 As the module assignments are not known in advance, module networks correspond most closely to the variant of these frameworks where there is type uncertainty — uncertainty about the class assignment of objects. [sent-669, score-1.799]
82 By contrast, in a module network, all of the variables in a module (class) must have the same specific parents. [sent-672, score-1.674]
83 To better relate the PRM approach to module networks, recall the relationship between module networks and clustering, as described in Section 4. [sent-686, score-1.686]
84 As we discussed, we can view the module network learning procedure as grouping variables into clusters that share the same probabilistic dependency model. [sent-688, score-1.0]
85 From this perspective, the module network framework can be viewed as being closely related to a PRM where the module assignment is a hidden attribute of a row. [sent-695, score-1.821]
86 A key difference between the PRM-based approach and our module network framework is that, in the PRM, the regulators cannot themselves participate in the probabilistic model without leading to cycles. [sent-702, score-0.943]
87 Overall, the module network framework places strong restrictions on the richness of the set of objects and on the dependency structures that can be represented. [sent-705, score-0.961]
88 Discussion and Conclusions We have introduced the framework of module networks, an extension of Bayesian networks that includes an explicit representation of modules — subsets of variables that share a statistical model. [sent-709, score-1.257]
89 We have presented a Bayesian learning framework for module networks, which learns both the partitioning of variables into modules and the dependency structure of each module. [sent-710, score-1.258]
90 Our results show that our learned module networks have much higher generalization performance than a Bayesian network learned from the same data. [sent-712, score-1.026]
91 There are several reasons why a learned module network is a better model than a learned Bayesian network. [sent-713, score-0.998]
92 Conversely, in a module network setting, a spurious correlation would have to arise between a possible parent and a large number of other variables before the algorithm would find it worthwhile to introduce the dependency. [sent-721, score-0.954]
93 In such domains, a module network would force variables into sharing dependency structures and CPDs and may result in poor representations of the underlying domain properties. [sent-725, score-0.976]
94 Even in domains where the modularity assumption is warranted, the module network models we presented here may not be ideal. [sent-726, score-0.975]
95 In later work (2004), we presented one possible extension to the module network framework presented in this paper, which allows genes to be assigned to several modules. [sent-733, score-0.993]
96 The expression of a gene in a particular array is then modeled as a sum of its expression in each of the modules in which it participates, and each module can potentially have a different set of regulators. [sent-734, score-1.293]
97 For instance, in the biological domain, the number of regulatory modules of an organism in an expression data set is obviously not known, and thus determining the number of modules should be part of the regulatory module discovery task. [sent-738, score-1.637]
98 We showed that, at least in synthetic data, where the number of modules is known, we can use the score of the model to select the correct number of modules by choosing the model with the smallest number of modules from among the highest scoring models. [sent-740, score-1.137]
99 In related work (2003b), we use the module network learned from the gene expression data described above to predict gene regulation relationships. [sent-768, score-1.118]
100 Thus, we have demonstrated that the module network model is robust enough to learn a good approximation of the dependency structure between 2355 genes using only 173 instances. [sent-773, score-1.04]
wordName wordTfidf (topN-words)
[('module', 0.829), ('modules', 0.35), ('pam', 0.15), ('stock', 0.117), ('intl', 0.096), ('cpd', 0.092), ('msft', 0.092), ('assignment', 0.09), ('amat', 0.089), ('genes', 0.075), ('gene', 0.074), ('network', 0.073), ('bayesian', 0.071), ('odule', 0.069), ('score', 0.066), ('egal', 0.062), ('egev', 0.062), ('riedman', 0.052), ('koller', 0.051), ('modularity', 0.048), ('learned', 0.048), ('annotations', 0.046), ('mot', 0.046), ('parents', 0.044), ('leaf', 0.043), ('cpds', 0.042), ('dell', 0.039), ('hpq', 0.038), ('val', 0.037), ('etworks', 0.036), ('parent', 0.036), ('segal', 0.036), ('regulatory', 0.035), ('share', 0.034), ('structure', 0.033), ('enrichment', 0.032), ('annotation', 0.031), ('bm', 0.031), ('reassignment', 0.031), ('dependency', 0.03), ('networks', 0.028), ('price', 0.027), ('prms', 0.027), ('trading', 0.027), ('friedman', 0.026), ('relational', 0.026), ('legal', 0.026), ('domains', 0.025), ('clustering', 0.024), ('autoclass', 0.023), ('regulators', 0.023), ('assignments', 0.023), ('heckerman', 0.022), ('template', 0.022), ('earning', 0.021), ('scoring', 0.021), ('intel', 0.021), ('ground', 0.021), ('operators', 0.021), ('stocks', 0.02), ('expression', 0.02), ('tree', 0.02), ('manufacturers', 0.019), ('plate', 0.019), ('chip', 0.019), ('enriched', 0.019), ('oobns', 0.019), ('scorem', 0.019), ('pe', 0.018), ('probabilistic', 0.018), ('biological', 0.018), ('decomposes', 0.017), ('semantics', 0.017), ('search', 0.017), ('instances', 0.017), ('likelihood', 0.016), ('variables', 0.016), ('assigned', 0.016), ('regev', 0.016), ('cheeseman', 0.016), ('gm', 0.016), ('genetics', 0.015), ('pfeffer', 0.015), ('reassigning', 0.015), ('structures', 0.015), ('initialization', 0.015), ('er', 0.015), ('market', 0.014), ('objects', 0.014), ('thousands', 0.014), ('stanford', 0.014), ('cluster', 0.014), ('cell', 0.014), ('local', 0.013), ('priors', 0.013), ('dependencies', 0.013), ('domain', 0.013), ('modular', 0.013), ('acyclicity', 0.013), ('preferences', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 44 jmlr-2005-Learning Module Networks
Author: Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, Nir Friedman
Abstract: Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper,1 we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules’ composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks. 1. A preliminary version of this paper appeared in the Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, 2003 (UAI ’03).
2 0.077097543 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
Author: Gal Elidan, Nir Friedman
Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods
3 0.039901622 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
4 0.039648399 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
Author: Atsuyoshi Nakamura, Michael Schmitt, Niels Schmitt, Hans Ulrich Simon
Abstract: Bayesian networks have become one of the major models used for statistical inference. We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. We focus on two-label classification tasks over the Boolean domain. As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. In particular, these bounds are tight up to a factor of 2. For some nontrivial cases of Bayesian networks we even determine the exact values of this dimension. We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. We also derive the bound 2Ω(n) for an artificial variant of this network, thereby demonstrating the limits of our approach and raising an interesting open question. As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. Keywords: Bayesian network, inner product space, embedding, linear arrangement, Euclidean dimension
5 0.034801882 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
Author: Aapo Hyvärinen
Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence
6 0.033030171 36 jmlr-2005-Gaussian Processes for Ordinal Regression
7 0.031742387 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
8 0.031596854 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
9 0.030481117 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
10 0.028650874 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
11 0.026345979 71 jmlr-2005-Variational Message Passing
12 0.021622123 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
13 0.019645514 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
14 0.019080473 72 jmlr-2005-What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks
15 0.015696676 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
16 0.01547694 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
17 0.014808424 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
18 0.014759164 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
19 0.013635128 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
20 0.013433851 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
topicId topicWeight
[(0, 0.105), (1, -0.028), (2, -0.019), (3, -0.094), (4, 0.052), (5, 0.06), (6, -0.127), (7, -0.089), (8, 0.049), (9, -0.086), (10, 0.148), (11, 0.032), (12, 0.176), (13, 0.013), (14, 0.207), (15, 0.064), (16, 0.022), (17, -0.22), (18, -0.122), (19, 0.026), (20, 0.107), (21, -0.003), (22, 0.29), (23, 0.006), (24, 0.071), (25, -0.052), (26, -0.243), (27, 0.296), (28, 0.128), (29, -0.052), (30, -0.071), (31, 0.146), (32, -0.227), (33, -0.304), (34, 0.148), (35, 0.013), (36, 0.258), (37, -0.188), (38, 0.074), (39, -0.169), (40, -0.135), (41, 0.063), (42, -0.11), (43, -0.188), (44, 0.034), (45, -0.064), (46, 0.123), (47, -0.049), (48, -0.049), (49, -0.092)]
simIndex simValue paperId paperTitle
same-paper 1 0.96999741 44 jmlr-2005-Learning Module Networks
Author: Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, Nir Friedman
Abstract: Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper,1 we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules’ composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks. 1. A preliminary version of this paper appeared in the Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, 2003 (UAI ’03). c 2005 Eran Segal, Dana Pe’er, Aviv Regev, Daphne Koller and Nir Friedman. S EGAL , P E ’ ER , R EGEV, KOLLER AND F RIEDMAN
2 0.23480329 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
Author: Gal Elidan, Nir Friedman
Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods
3 0.1618109 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
Author: Aapo Hyvärinen
Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence
4 0.1466583 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
Author: Robert G. Cowell
Abstract: This paper describes a scheme for local computation in conditional Gaussian Bayesian networks that combines the approach of Lauritzen and Jensen (2001) with some elements of Shachter and Kenley (1989). Message passing takes place on an elimination tree structure rather than the more compact (and usual) junction tree of cliques. This yields a local computation scheme in which all calculations involving the continuous variables are performed by manipulating univariate regressions, and hence matrix operations are avoided. Keywords: Bayesian networks, conditional Gaussian distributions, propagation algorithm, elimination tree
5 0.11371844 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
Author: Dmitry Rusakov, Dan Geiger
Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)
6 0.11276133 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
8 0.092438094 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
9 0.085061289 36 jmlr-2005-Gaussian Processes for Ordinal Regression
10 0.084327415 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
11 0.075042054 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
12 0.07254833 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
13 0.070685782 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
14 0.06979949 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
15 0.069142073 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
16 0.056025691 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
17 0.055852644 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
18 0.055252708 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
19 0.049543738 71 jmlr-2005-Variational Message Passing
20 0.048007611 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
topicId topicWeight
[(13, 0.014), (17, 0.041), (19, 0.017), (36, 0.048), (37, 0.033), (41, 0.374), (42, 0.016), (43, 0.043), (47, 0.012), (52, 0.078), (59, 0.021), (70, 0.024), (88, 0.136), (90, 0.011), (94, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.72189099 44 jmlr-2005-Learning Module Networks
Author: Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, Nir Friedman
Abstract: Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper,1 we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules’ composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks. 1. A preliminary version of this paper appeared in the Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, 2003 (UAI ’03).
2 0.40796757 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
3 0.40427426 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
Author: Gal Elidan, Nir Friedman
Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods
4 0.40420231 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
5 0.40418461 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechnism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learners performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for metaalgorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
6 0.40411967 39 jmlr-2005-Information Bottleneck for Gaussian Variables
7 0.40252981 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
8 0.40234092 3 jmlr-2005-A Classification Framework for Anomaly Detection
9 0.40196517 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning
10 0.40172222 64 jmlr-2005-Semigroup Kernels on Measures
11 0.40030929 20 jmlr-2005-Clustering with Bregman Divergences
12 0.40014172 48 jmlr-2005-Learning the Kernel Function via Regularization
13 0.39472765 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
14 0.39117858 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
15 0.38962194 36 jmlr-2005-Gaussian Processes for Ordinal Regression
16 0.38521579 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
17 0.38476706 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
18 0.38293198 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
19 0.38267308 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
20 0.38237581 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions