nips nips2000 nips2000-17 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Simon Tong, Daphne Koller
Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.
Reference: text
sentIndex sentText sentNum sentScore
1 In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. [sent-7, score-0.281]
2 In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. [sent-8, score-0.48]
3 This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. [sent-9, score-0.404]
4 We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. [sent-10, score-0.536]
5 We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations. [sent-11, score-0.393]
6 One possible method for reducing the number of instances needed is to choose better instances from which to learn. [sent-14, score-0.38]
7 Almost universally, the machine learning literature assumes that we are given a set of instances chosen randomly from the underlying distribution. [sent-15, score-0.281]
8 In this paper, we assume that the learner has the ability to guide the instances it gets, selecting instances that are more likely to lead to more accurate models. [sent-16, score-0.543]
9 The possibility of active learning can arise naturally in a variety of domains, in several variants. [sent-18, score-0.359]
10 In selective active learning, we have the ability to ask explicitly for an example of a certain "type", i.e., an instance in which certain attributes take on particular values. [sent-19, score-0.433]
11 For example, if our domain involves webpages, the learner might be able to ask a human teacher for examples of homepages of graduate students in a Computer Science department. [sent-22, score-0.266]
12 A variant of selective active learning is pool-based active learning, where the learner has access to a large pool of instances, about which it knows only the value of certain attributes. [sent-23, score-0.902]
13 It can then ask for instances in this pool for which these known attributes take on certain values. [sent-24, score-0.361]
14 For example, we could run a census in which everyone fills out only the short form; the active learner could then select among the respondents those who should fill out the more detailed long form. [sent-27, score-0.538]
15 In such active learning settings, we need a mechanism that tells us which instances to select. [sent-29, score-0.584]
16 We present a formal framework for active learning in Bayesian networks (BNs). [sent-32, score-0.392]
17 At first sight, the applicability of active learning to density estimation is unclear. [sent-35, score-0.438]
18 Given that we are not simply sampling, it is initially not clear that an active learning algorithm even learns the correct density. [sent-36, score-0.393]
19 Furthermore, it is not clear that active learning is necessarily beneficial in this setting. [sent-40, score-0.359]
20 We assume that the CPD of each node X_i consists of a separate multinomial distribution over Dom[X_i] for each instantiation u of the parents U_i. [sent-51, score-0.483]
21 Hence, we have a parameter θ_{x_ij|u} for each x_ij ∈ Dom[X_i]; we use θ_{X_i|u} to represent the vector of parameters associated with the multinomial P(X_i | u). [sent-52, score-0.202]
22 Our focus is on the parameter estimation task: we are given the network structure G, and our goal is to use data to estimate the network parameters θ. [sent-53, score-0.297]
23 As usual [5], we make the assumption of parameter independence, which allows us to represent the joint distribution p(θ) as a set of independent distributions, one for each multinomial θ_{X_i|u}. [sent-55, score-0.243]
24 For multinomials, the conjugate prior is a Dirichlet distribution [4], which is parameterized by hyperparameters α_j ∈ ℝ⁺, with α_* = Σ_j α_j. [sent-56, score-0.26]
25 If we obtain a new instance x = x_j sampled from this distribution, then our posterior p(θ) is also Dirichlet, with hyperparameters (α_1, …, α_j + 1, …, α_r). [sent-62, score-0.36]
26 In a BN with the parameter independence assumption, we have a Dirichlet distribution for every multinomial distribution θ_{X_i|u}. [sent-69, score-0.259]
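The Dirichlet-multinomial updating described above can be sketched in a few lines; the function names below are illustrative, not taken from the paper:

```python
# Hedged sketch of conjugate Dirichlet updating for one multinomial CPD entry.

def dirichlet_update(alphas, observed_j):
    """Posterior hyperparameters after observing outcome j once."""
    post = list(alphas)
    post[observed_j] += 1
    return post

def dirichlet_mean(alphas):
    """Predictive probability of each outcome: alpha_j / alpha_*."""
    total = sum(alphas)
    return [a / total for a in alphas]

prior = [1.0, 1.0, 1.0]           # uniform Dirichlet over a 3-valued node
posterior = dirichlet_update(prior, 0)
print(dirichlet_mean(posterior))  # [0.5, 0.25, 0.25]
```

Under parameter independence, one such update is applied to the single Dirichlet matching the observed parent instantiation; all other Dirichlets are untouched.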
27 3 Active Learning Assume we start out with a network structure G and a prior distribution p(θ) over the parameters of G. [sent-71, score-0.236]
28 In a standard machine learning framework, data instances are independently, randomly sampled from some underlying distribution. [sent-72, score-0.341]
29 In an active learning setting, we have the ability to request certain types of instances. [sent-73, score-0.44]
30 The learner can select a subset of variables Q ⊆ C and a particular instantiation q to Q. [sent-75, score-0.329]
31 The result of such a query is a randomly sampled instance x conditioned on Q = q. [sent-77, score-0.563]
32 A (myopic) active learner is a querying function that takes G and p(θ), and selects a query Q = q. [sent-78, score-0.986]
33 It takes the resulting instance x, and uses it to update its distribution p(θ) to obtain a posterior p′(θ). [sent-79, score-0.246]
34 To fully specify the algorithm, we need to address two issues: we need to describe how our parameter distribution is updated given that x is not a random sample, and we need to construct a mechanism for selecting the next query based on p. [sent-82, score-0.53]
35 To address the first issue, assume for simplicity that our query is Q = q for a single node Q. [sent-83, score-0.529]
36 First, it is clear that we cannot use the resulting instance x to update the parameters of the node Q itself. [sent-84, score-0.273]
37 More generally, we define a variable Y to be updateable in the context of a selective query Q if it is neither in Q nor an ancestor of a node in Q. [sent-90, score-0.731]
38 Given a prior distribution p(θ) and an instance x from a query Q = q, we do standard Bayesian updating, as in the case of randomly sampled instances, but we update only the Dirichlet distributions of updateable nodes. [sent-92, score-0.793]
39 We use p(θ | Q = q, x) to denote the distribution p′(θ) obtained from this algorithm; this can be read as "the density of θ after asking query q and obtaining the response x". [sent-93, score-0.569]
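The updateable-node rule can be sketched as follows; the parent-map representation and helper names are assumptions made for illustration:

```python
# Sketch: a node is updateable w.r.t. a selective query Q if it is
# neither in Q nor an ancestor of a node in Q.

def ancestors(node, parents):
    """All ancestors of `node` under the parent map {node: [parents]}."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen

def updateable_nodes(all_nodes, parents, query_nodes):
    blocked = set(query_nodes)
    for q in query_nodes:
        blocked |= ancestors(q, parents)
    return [n for n in all_nodes if n not in blocked]

# Example chain X -> Y -> Z: querying Y blocks Y and its ancestor X.
parents = {"Y": ["X"], "Z": ["Y"]}
print(updateable_nodes(["X", "Y", "Z"], parents, ["Y"]))  # ['Z']
```

Only the Dirichlets of the returned nodes would then receive the standard conjugate update from the query's answer.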
40 Our second task is to construct an algorithm for deciding on our next query given our current distribution p. [sent-94, score-0.498]
41 This allows us to evaluate the extent to which various instances would improve the quality of our model, thereby providing us with an approach for selecting the next query to perform. [sent-96, score-0.784]
42 However, our posterior distribution p(θ) represents our "optimal" beliefs about the different possible values of θ*, given our prior knowledge and the evidence. [sent-103, score-0.197]
43 Therefore, we can define the risk of a particular θ̃ with respect to p as: E_{θ∼p(θ)}[Loss(θ ‖ θ̃)] = ∫_Θ Loss(θ ‖ θ̃) p(θ) dθ. [sent-104, score-0.323]
44 Since we consider only the Bayesian point estimate, we define the risk of a density p, Risk(p(θ)), to be the risk of the optimal θ̃ with respect to p. [sent-106, score-0.693]
45 The risk of our density p(θ) is our measure for the quality of our current state of knowledge, as represented by p(θ). [sent-107, score-0.405]
46 In a greedy scheme, our goal is to obtain an instance x such that the risk of the posterior p′ obtained by updating p with x is lowest. [sent-108, score-0.413]
47 Our expected posterior risk is therefore: ExPRisk(p(θ) | Q = q) = E_{θ∼p(θ)} E_{x∼P_θ(X|Q=q)} Risk(p(θ | Q = q, x)). [sent-111, score-0.476]
48 (2) This definition leads immediately to the following simple algorithm: For each candidate query Q = q, we evaluate the expected posterior risk, and then select the query for which it is lowest. [sent-112, score-1.098]
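The greedy selection rule of Eq. (2) can be sketched with a toy uncertainty proxy standing in for the paper's KL-based risk; the variance-style `dirichlet_risk`, the two-query state, and all names are illustrative assumptions:

```python
# Sketch of greedy query selection: pick the query whose expected
# posterior risk (Eq. (2)) is lowest.

def dirichlet_risk(alphas):
    # proxy for uncertainty: sum of posterior variances of the parameters
    a0 = sum(alphas)
    return sum(a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in alphas)

def risk(state):
    return sum(dirichlet_risk(a) for a in state.values())

def predictive(state, q, x):
    a = state[q]
    return a[x] / sum(a)

def update(state, q, x):
    new = {k: list(v) for k, v in state.items()}
    new[q][x] += 1
    return new

def expected_posterior_risk(state, q, outcomes):
    return sum(predictive(state, q, x) * risk(update(state, q, x))
               for x in outcomes)

# q1's parameter is very uncertain; q2's is already well estimated.
state = {"q1": [1.0, 1.0], "q2": [100.0, 100.0]}
best = min(state, key=lambda q: expected_posterior_risk(state, q, [0, 1]))
print(best)  # 'q1': the uncertain configuration is the one worth querying
```

This mirrors the paper's qualitative conclusion: when we are already confident about common configurations, the expected risk reduction favors querying the uncertain (often rare) ones.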
49 4 Active Learning Algorithm To obtain a concrete algorithm from the active learning framework shown in the previous section, we must pick a loss function. [sent-113, score-0.486]
50 (An analogous analysis can be carried through for another very natural loss function, the negative log-likelihood of future data; in the case of multinomial CPDs with Dirichlet densities over the parameters, this results in an identical final algorithm. [sent-117, score-0.196]
51 The first is that the value θ̃ that minimizes the risk relative to p is the mean value of the parameters, E_{θ∼p(θ)}[θ]. [sent-120, score-0.323]
52 Then the risk decomposes as: Risk(p(θ)) = Σ_i Σ_{u∈Dom[U_i]} P̂(u) δ(α_{x_i1|u}, …, α_{x_ir|u}), where P̂(u) is the probability of the parent instantiation u under the mean parameters and δ is a function of the Dirichlet hyperparameters alone. [sent-132, score-0.378]
53 Eq. (4) gives us a concrete expression for evaluating the risk of p(θ). [sent-134, score-0.403]
54 However, to evaluate a potential query, we also need its expected posterior risk. [sent-135, score-0.193]
55 Recall that this is the expectation, over all possible answers to the query, of the risk of the posterior distribution p'. [sent-136, score-0.489]
56 Consider a BN with only one child node X and its parents U; i.e., the nodes in U are roots. [sent-139, score-0.222]
57 We also restrict attention to queries where we control all and only the parents U. [sent-142, score-0.216]
58 In this case, a query q is an instantiation to U, and the possible outcomes to the query are the possible values of the variable X. [sent-143, score-0.945]
59 The expected posterior risk contains a term for each variable Xi and each instantiation to its parents. [sent-144, score-0.595]
60 However, as these variables are not updateable, their hyperparameters remain the same following any query q. [sent-146, score-0.565]
61 Hence, their contribution to the risk is the same in every p(θ | U = q, x) and in our prior p(θ). [sent-147, score-0.386]
62 The second observation is that the hyperparameters corresponding to an instantiation u are the same in p and p' except for u = q. [sent-153, score-0.231]
63 lq (7) If we now select our query q so as to maximize the difference between our current risk and the expected posterior risk, we get a very natural behavior: We will select the query q that leads to the greatest reduction in the entropy of X given its parents. [sent-165, score-1.454]
64 It is also here that we can gain an insight as to where active learning has an edge over random sampling. [sent-166, score-0.359]
65 Consider a situation with two queries, q1 and q2, where q1 is 100 times less likely than q2; q1 will lead us to update a parameter whose current density is Dirichlet(1, 1), whereas q2 will lead us to update a parameter whose current density is Dirichlet(100, 100). [sent-167, score-0.512]
66 In other words, if we are confident about commonly occurring situations, it is worth more to ask about the rare cases. [sent-169, score-0.379]
67 Here, our average over possible query answers encompasses exponentially many terms. [sent-171, score-0.445]
68 2 For an arbitrary BN and an arbitrary query Q = q, the expected KL posterior risk decomposes as: ExPRisk(p(θ) | Q = q) = Σ_i Σ_{u∈Dom[U_i]} P̂(u | Q = q) ExPRisk_{X_i}(p(θ) | U_i = u). [sent-174, score-1.03]
69 In other words, the expected posterior risk is a weighted sum of expected posterior risks for the conditional distributions of individual nodes X_i, where for each node we consider "queries" that are complete instantiations to the parents U_i of X_i. [sent-175, score-0.964]
70 We now have similar decompositions for the risk and the expected posterior risk. [sent-176, score-0.476]
71 In the case of a single node and a full parent query, the hyperparameters of the parents could not change, so these two weights were necessarily the same. [sent-182, score-0.372]
72 In the more general setting, an instantiation x can change hyperparameters all through the network, leading to different weights. [sent-183, score-0.231]
73 The above analysis provides us with an efficient implementation of our general active learning scheme. [sent-191, score-0.394]
74 We simply choose a set of variables in the Bayesian network that we wish to control, and for each instantiation of the controllable variables we compute the expected change in risk given by Eq. [sent-192, score-0.807]
75 We then ask the query with the greatest expected change and update the parameters of the updateable nodes. [sent-194, score-0.881]
76 Our algorithm (approximately) finds the query that reduces the expected risk the most. [sent-199, score-0.84]
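A toy end-to-end run, under the same simplifications as before (one binary node X, one binary controllable parent U, a variance-style uncertainty proxy in place of the KL risk, and an assumed ground-truth CPT), illustrates the overall query-then-update loop:

```python
# Toy active-learning loop: repeatedly query the parent instantiation
# whose Dirichlet is most uncertain, observe X, and update that Dirichlet.
import random

random.seed(0)
true_cpt = {0: 0.9, 1: 0.3}              # P(X=1 | U=u), assumed ground truth
alphas = {0: [1.0, 1.0], 1: [1.0, 1.0]}  # Dirichlet prior per parent value

def var_risk(a):
    a0 = sum(a)
    return sum(x * (a0 - x) / (a0 ** 2 * (a0 + 1)) for x in a)

for _ in range(200):
    u = max(alphas, key=lambda k: var_risk(alphas[k]))  # greedy query choice
    x = 1 if random.random() < true_cpt[u] else 0       # simulated answer
    alphas[u][x] += 1                                   # conjugate update

est = {u: a[1] / sum(a) for u, a in alphas.items()}
print(est)  # estimates of P(X=1 | U=u) should land near 0.9 and 0.3
```

Choosing the most uncertain instantiation is only a stand-in for minimizing Eq. (2), but it reproduces the qualitative behavior: query effort flows toward the conditional distributions that are still poorly estimated.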
77 3 Let U be the set of nodes which are updateable for at least one candidate query at each querying step. [sent-220, score-0.839]
78 Assuming that the underlying true distribution is not deterministic, our querying algorithm produces consistent estimates for the CPD parameters of every member of U. [sent-221, score-0.272]
79 Alarm has 37 nodes and 518 independent parameters, Asia has eight nodes and 18 independent parameters, and Cancer has five nodes and 11 independent parameters. [sent-223, score-0.249]
80 To simulate this, we obtained our prior by sampling a few hundred instances from the true network and used the counts (together with smoothing from a uniform prior) as our prior. [sent-227, score-0.33]
81 This is akin to asking for a prior network from a domain expert, or using an existing set of complete data to find initial settings of the parameters. [sent-228, score-0.296]
82 We then compared refining the parameters either by using active learning or by random sampling. [sent-229, score-0.451]
83 We permitted the active learner to abstain from choosing a value for a controlled node if it did not wish to -- that node is then sampled as usual. [sent-230, score-0.759]
84 We see that active learning provides a substantial improvement in all three networks. [sent-233, score-0.397]
85 The extent of the improvement depends on the extent to which queries allow us to reach rare events. [sent-235, score-0.351]
86 Although there was a significant gain by using active learning in this network, we found that there was a greater increase in performance if we altered the generating network to have P(Smoking) = 0. [sent-239, score-0.436]
87 One possible reason is that the improvement is "washed out" by randomness, as the active learner and standard learner are learning from different instances. [sent-243, score-0.651]
88 Overall, we found that in almost all situations active learning performed as well as or better than random sampling. [sent-247, score-0.404]
89 The situations where active learning produced most benefit were, unsurprisingly, those in which the prior was confident and correct about the commonly occurring cases and uncertain and incorrect about the rare ones. [sent-248, score-0.786]
90 By experimenting with forcing different priors we found that active learning was worse in one type of situation: where the prior was confident yet incorrect about the commonly occurring cases and uncertain but actually correct about the rare ones. [sent-250, score-0.73]
91 Another factor affecting the performance was the degree to which the controllable nodes could influence the updateable nodes. [sent-252, score-0.35]
92 6 Discussion and Conclusions We have presented a formal framework and resulting querying algorithm for parameter estimation in Bayesian networks. [sent-253, score-0.307]
93 To our knowledge, this is one of the first applications of active learning in an unsupervised context. [sent-254, score-0.359]
94 Our algorithm uses parameter distributions to guide the learner to ask queries that will improve the quality of its estimate the most. [sent-255, score-0.468]
95 BN active learning can also be performed in a causal setting. [sent-256, score-0.4]
96 A query now acts as an experiment: it intervenes in the model and forces variables to take particular values. [sent-257, score-0.453]
97 The only difference is that the notion of an updateable node is even simpler - any node that is not part of a query is updateable. [sent-259, score-0.81]
98 We have demonstrated that active learning can have significant advantages for the task of parameter estimation in BNs, particularly in the case where our parameter prior is of the type that a human expert is likely to provide. [sent-261, score-0.663]
99 Although it is less important to estimate the probabilities of rare events accurately, randomly sampling from the distribution still does not yield enough instances of them. [sent-263, score-0.373]
100 A further direction that we are pursuing is active learning for the causal structure of a domain. [sent-266, score-0.4]
wordName wordTfidf (topN-words)
[('query', 0.413), ('risk', 0.323), ('active', 0.304), ('pij', 0.208), ('instances', 0.19), ('dirichlet', 0.166), ('updateable', 0.165), ('querying', 0.142), ('learner', 0.127), ('bn', 0.124), ('instantiation', 0.119), ('exprisk', 0.118), ('node', 0.116), ('hyperparameters', 0.112), ('queries', 0.11), ('parents', 0.106), ('controllable', 0.102), ('asia', 0.102), ('rare', 0.096), ('ask', 0.096), ('bayesian', 0.092), ('multinomial', 0.091), ('xi', 0.086), ('posterior', 0.083), ('nodes', 0.083), ('network', 0.077), ('bns', 0.074), ('dom', 0.071), ('oxi', 0.071), ('smoking', 0.071), ('uedom', 0.071), ('expected', 0.07), ('ij', 0.069), ('kl', 0.069), ('parameter', 0.066), ('prior', 0.063), ('confident', 0.061), ('loss', 0.06), ('sampled', 0.06), ('update', 0.058), ('asking', 0.058), ('learning', 0.055), ('alarm', 0.055), ('cancer', 0.055), ('decomposes', 0.055), ('settings', 0.055), ('instance', 0.054), ('distribution', 0.051), ('cpds', 0.047), ('expriskx', 0.047), ('join', 0.047), ('refining', 0.047), ('request', 0.047), ('riskx', 0.047), ('density', 0.047), ('xij', 0.045), ('expression', 0.045), ('situations', 0.045), ('parameters', 0.045), ('occurring', 0.043), ('benefit', 0.043), ('domain', 0.043), ('select', 0.043), ('commonly', 0.042), ('causal', 0.041), ('expert', 0.041), ('pool', 0.041), ('worth', 0.041), ('variables', 0.04), ('evaluate', 0.04), ('improvement', 0.038), ('parent', 0.038), ('ee', 0.038), ('selective', 0.037), ('cpd', 0.037), ('randomly', 0.036), ('xj', 0.036), ('likely', 0.036), ('extent', 0.036), ('candidate', 0.036), ('wish', 0.036), ('updating', 0.036), ('quality', 0.035), ('graphical', 0.035), ('us', 0.035), ('aj', 0.034), ('certain', 0.034), ('algorithm', 0.034), ('uncertain', 0.034), ('greatest', 0.034), ('hyperparameter', 0.034), ('framework', 0.033), ('estimation', 0.032), ('priors', 0.032), ('fill', 0.032), ('answers', 0.032), ('lq', 0.032), ('ql', 0.032), ('conditional', 0.03), ('koller', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
Author: Simon Tong, Daphne Koller
Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.
2 0.16359825 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
Author: Gal Elidan, Noam Lotner, Nir Friedman, Daphne Koller
Abstract: A serious problem in learning probabilistic models is the presence of hidden variables. These variables are not observed, yet interact with several of the observed variables. As such, they induce seemingly complex dependencies among the latter. In recent years, much attention has been devoted to the development of algorithms for learning parameters, and in some cases structure, in the presence of hidden variables. In this paper, we address the related problem of detecting hidden variables that interact with the observed variables. This problem is of interest both for improving our understanding of the domain and as a preliminary step that guides the learning procedure towards promising models. A very natural approach is to search for
3 0.14261691 148 nips-2000-`N-Body' Problems in Statistical Learning
Author: Alexander G. Gray, Andrew W. Moore
Abstract: We present efficient algorithms for all-point-pairs problems , or 'Nbody '-like problems , which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection , and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N 2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric t echniques which are applicable in principle to any 'N-body ' computation including large-scale mixtures of Gaussians, RBF neural networks, and HMM 's. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. 1
4 0.13370207 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
Author: Zoubin Ghahramani, Matthew J. Beal
Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1
5 0.11546011 144 nips-2000-Vicinal Risk Minimization
Author: Olivier Chapelle, Jason Weston, Léon Bottou, Vladimir Vapnik
Abstract: The Vicinal Risk Minimization principle establishes a bridge between generative models and methods derived from the Structural Risk Minimization Principle such as Support Vector Machines or Statistical Regularization. We explain how VRM provides a framework which integrates a number of existing algorithms, such as Parzen windows, Support Vector Machines, Ridge Regression, Constrained Logistic Classifiers and Tangent-Prop. We then show how the approach implies new algorithms for solving problems usually associated with generative models. New algorithms are described for dealing with pattern recognition problems with very different pattern distributions and dealing with unlabeled data. Preliminary empirical results are presented.
6 0.10286661 133 nips-2000-The Kernel Gibbs Sampler
7 0.10087503 114 nips-2000-Second Order Approximations for Probability Models
8 0.094654731 4 nips-2000-A Linear Programming Approach to Novelty Detection
9 0.090438306 21 nips-2000-Algorithmic Stability and Generalization Performance
10 0.087387189 18 nips-2000-Active Support Vector Machine Classification
11 0.080225229 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
12 0.079071842 115 nips-2000-Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks
13 0.076866433 122 nips-2000-Sparse Representation for Gaussian Process Models
14 0.075879335 127 nips-2000-Structure Learning in Human Causal Induction
15 0.074336879 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
16 0.074198849 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
17 0.072403945 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks
18 0.071985342 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
19 0.070582472 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
20 0.070305377 23 nips-2000-An Adaptive Metric Machine for Pattern Classification
topicId topicWeight
[(0, 0.263), (1, 0.04), (2, 0.082), (3, -0.082), (4, 0.13), (5, 0.011), (6, 0.061), (7, 0.005), (8, -0.068), (9, 0.097), (10, 0.084), (11, 0.076), (12, 0.076), (13, -0.042), (14, 0.089), (15, -0.004), (16, -0.079), (17, 0.056), (18, -0.084), (19, 0.157), (20, -0.222), (21, -0.077), (22, 0.099), (23, -0.175), (24, 0.051), (25, -0.007), (26, 0.161), (27, 0.089), (28, 0.043), (29, -0.144), (30, 0.157), (31, -0.249), (32, -0.069), (33, 0.1), (34, 0.016), (35, -0.037), (36, -0.015), (37, -0.063), (38, 0.031), (39, -0.067), (40, -0.018), (41, 0.075), (42, -0.044), (43, -0.035), (44, -0.17), (45, -0.102), (46, -0.04), (47, 0.124), (48, -0.037), (49, 0.062)]
simIndex simValue paperId paperTitle
same-paper 1 0.95457947 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
Author: Simon Tong, Daphne Koller
Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.
2 0.64739394 148 nips-2000-`N-Body' Problems in Statistical Learning
Author: Alexander G. Gray, Andrew W. Moore
Abstract: We present efficient algorithms for all-point-pairs problems , or 'Nbody '-like problems , which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection , and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N 2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric t echniques which are applicable in principle to any 'N-body ' computation including large-scale mixtures of Gaussians, RBF neural networks, and HMM 's. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. 1
3 0.64131296 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
Author: Gal Elidan, Noam Lotner, Nir Friedman, Daphne Koller
Abstract: A serious problem in learning probabilistic models is the presence of hidden variables. These variables are not observed, yet interact with several of the observed variables. As such, they induce seemingly complex dependencies among the latter. In recent years, much attention has been devoted to the development of algorithms for learning parameters, and in some cases structure, in the presence of hidden variables. In this paper, we address the related problem of detecting hidden variables that interact with the observed variables. This problem is of interest both for improving our understanding of the domain and as a preliminary step that guides the learning procedure towards promising models. A very natural approach is to search for
4 0.45875344 144 nips-2000-Vicinal Risk Minimization
Author: Olivier Chapelle, Jason Weston, Léon Bottou, Vladimir Vapnik
Abstract: The Vicinal Risk Minimization principle establishes a bridge between generative models and methods derived from the Structural Risk Minimization Principle such as Support Vector Machines or Statistical Regularization. We explain how VRM provides a framework which integrates a number of existing algorithms, such as Parzen windows, Support Vector Machines, Ridge Regression, Constrained Logistic Classifiers and Tangent-Prop. We then show how the approach implies new algorithms for solving problems usually associated with generative models. New algorithms are described for dealing with pattern recognition problems with very different pattern distributions and dealing with unlabeled data. Preliminary empirical results are presented.
5 0.42462784 44 nips-2000-Efficient Learning of Linear Perceptrons
Author: Shai Ben-David, Hans-Ulrich Simon
Abstract: We consider the existence of efficient algorithms for learning the class of half-spaces in ~n in the agnostic learning model (Le., making no prior assumptions on the example-generating distribution). The resulting combinatorial problem - finding the best agreement half-space over an input sample - is NP hard to approximate to within some constant factor. We suggest a way to circumvent this theoretical bound by introducing a new measure of success for such algorithms. An algorithm is IL-margin successful if the agreement ratio of the half-space it outputs is as good as that of any half-space once training points that are inside the IL-margins of its separating hyper-plane are disregarded. We prove crisp computational complexity results with respect to this success measure: On one hand, for every positive IL, there exist efficient (poly-time) IL-margin successful learning algorithms. On the other hand, we prove that unless P=NP, there is no algorithm that runs in time polynomial in the sample size and in 1/ IL that is IL-margin successful for all IL> O. 1
6 0.42284608 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
7 0.40796989 18 nips-2000-Active Support Vector Machine Classification
8 0.35919219 16 nips-2000-Active Inference in Concept Learning
9 0.34849438 115 nips-2000-Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks
10 0.33974639 133 nips-2000-The Kernel Gibbs Sampler
11 0.3328132 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach
12 0.33269796 127 nips-2000-Structure Learning in Human Causal Induction
13 0.31374139 21 nips-2000-Algorithmic Stability and Generalization Performance
14 0.30328155 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
15 0.30313212 35 nips-2000-Computing with Finite and Infinite Networks
16 0.29933408 103 nips-2000-Probabilistic Semantic Video Indexing
17 0.29552934 27 nips-2000-Automatic Choice of Dimensionality for PCA
18 0.29337963 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
19 0.28439406 114 nips-2000-Second Order Approximations for Probability Models
20 0.2777639 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
topicId topicWeight
[(10, 0.061), (17, 0.091), (32, 0.03), (33, 0.043), (55, 0.058), (62, 0.048), (65, 0.017), (67, 0.068), (75, 0.013), (76, 0.035), (79, 0.354), (81, 0.023), (90, 0.036), (91, 0.018), (97, 0.021)]
simIndex simValue paperId paperTitle
1 0.95188808 29 nips-2000-Bayes Networks on Ice: Robotic Search for Antarctic Meteorites
Author: Liam Pedersen, Dimitrios Apostolopoulos, William Whittaker
Abstract: A Bayes network based classifier for distinguishing terrestrial rocks from meteorites is implemented onboard the Nomad robot. Equipped with a camera, spectrometer and eddy current sensor, this robot searched the ice sheets of Antarctica and autonomously made the first robotic identification of a meteorite, in January 2000 at the Elephant Moraine. This paper discusses rock classification from a robotic platform, and describes the system onboard Nomad.
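Nomad's classifier is a full Bayes network over camera, spectrometer and eddy-current readings; as a stand-in for intuition, here is the naive-Bayes special case over generic continuous sensor features. The feature dimensions and the synthetic two-class data are invented for illustration, not Nomad's actual sensor model:

```python
import numpy as np

class GaussianNB:
    """Gaussian naive Bayes: a Bayes network in which every feature
    depends only on the class node."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(0) for c in self.classes])
        self.var = np.array([X[y == c].var(0) + 1e-6 for c in self.classes])
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # log p(c) + sum_i log N(x_i; mu_ci, var_ci), maximized over classes.
        ll = -0.5 * (((X[:, None, :] - self.mu) ** 2) / self.var
                     + np.log(2 * np.pi * self.var)).sum(-1)
        return self.classes[np.argmax(ll + np.log(self.prior), axis=1)]
```

A real deployment would add the dependency structure between sensors that the Bayes network encodes; the naive model only illustrates the probabilistic classification step.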
same-paper 2 0.88923645 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
Author: Simon Tong, Daphne Koller
Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.
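A caricature of the paper's setting, not Tong and Koller's actual query-selection criterion: for a two-node network X → Y with Beta posteriors over each conditional parameter, an active learner can clamp X to whichever value has the most uncertain posterior, observe Y, and update the corresponding counts. The true conditionals and query budget below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# True (unknown) conditionals P(Y=1 | X=x) for a two-node network X -> Y.
true_p = {0: 0.9, 1: 0.3}

# Beta(1, 1) posterior over each conditional parameter: [alpha, beta] counts.
counts = {x: [1.0, 1.0] for x in (0, 1)}

def beta_var(a, b):
    # Variance of a Beta(a, b) distribution.
    return a * b / ((a + b) ** 2 * (a + b + 1))

for _ in range(200):
    # Query: clamp X to the value whose parameter posterior is most uncertain.
    x = max(counts, key=lambda x: beta_var(*counts[x]))
    y = rng.random() < true_p[x]
    counts[x][0 if y else 1] += 1

# Posterior-mean estimates of each conditional parameter.
estimates = {x: counts[x][0] / sum(counts[x]) for x in counts}
```

The variance-greedy rule automatically spends more queries on the harder-to-pin-down conditional, which is the flavour of data saving the experimental results report.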
3 0.84830517 27 nips-2000-Automatic Choice of Dimensionality for PCA
Author: Thomas P. Minka
Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.
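Minka's estimator evaluates the Stiefel-manifold integral with Laplace's method; as a cruder stand-in for intuition, a BIC-style score on the probabilistic-PCA maximum likelihood already picks the dimensionality from the eigenvalue spectrum. The parameter count `m` is the standard PPCA count, and the whole score is an approximation, not the paper's estimator:

```python
import numpy as np

def ppca_bic_dimension(X):
    """Pick the PCA dimensionality k by a BIC-style approximation to the
    probabilistic-PCA marginal likelihood (a rough stand-in for Minka's
    Laplace approximation)."""
    n, d = X.shape
    Xc = X - X.mean(0)
    lam = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / n))[::-1]
    scores = []
    for k in range(1, d):
        sigma2 = lam[k:].mean()  # discarded variance lumped into noise
        # PPCA maximum log-likelihood at dimensionality k.
        ll = -0.5 * n * (np.log(lam[:k]).sum() + (d - k) * np.log(sigma2)
                         + d * np.log(2 * np.pi) + d)
        m = d * k - k * (k - 1) / 2 + 1  # free parameters of the k-dim model
        scores.append(ll - 0.5 * m * np.log(n))
    return int(np.argmax(scores)) + 1
```

On data with a genuine low-dimensional signal plus isotropic noise, the score peaks at the true dimensionality; Minka's full Laplace estimator improves on this in the small-sample regime.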
4 0.59262502 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
Author: Zoubin Ghahramani, Matthew J. Beal
Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set.
5 0.55466276 92 nips-2000-Occam's Razor
Author: Carl Edward Rasmussen, Zoubin Ghahramani
Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear-in-the-parameters models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.
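The Occam's razor behaviour is easy to reproduce with exact marginal likelihoods of Bayesian polynomial models: the evidence is highest for the model order that matches the data-generating process. The prior scale `alpha`, the noise level, and the quadratic toy data are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1 + 2 * x - 3 * x**2 + 0.1 * rng.standard_normal(30)

def log_evidence(x, y, degree, alpha=4.0, noise=0.1):
    """Exact marginal likelihood of a Bayesian polynomial model:
    w ~ N(0, alpha I) and y ~ N(Phi w, noise^2 I), so marginally
    y ~ N(0, C) with C = noise^2 I + alpha Phi Phi^T."""
    Phi = np.vander(x, degree + 1)
    C = noise**2 * np.eye(len(x)) + alpha * Phi @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y)
                   + len(x) * np.log(2 * np.pi))

evidences = {d: log_evidence(x, y, d) for d in range(7)}
best = max(evidences, key=evidences.get)
# On this quadratic toy data the evidence peaks at degree 2: lower orders
# cannot fit the curvature, higher orders pay an Occam factor.
```

Measuring complexity through the marginal likelihood of functions, as the abstract argues, is exactly what this computation does: each extra parameter only helps if its improved fit outweighs the spread of prior mass over more functions.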
6 0.50680888 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
7 0.49573702 133 nips-2000-The Kernel Gibbs Sampler
8 0.47294751 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
9 0.47081545 94 nips-2000-On Reversing Jensen's Inequality
10 0.46558669 127 nips-2000-Structure Learning in Human Causal Induction
11 0.46472788 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
12 0.45244676 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates
13 0.44676307 57 nips-2000-Four-legged Walking Gait Control Using a Neuromorphic Chip Interfaced to a Support Vector Learning Algorithm
14 0.44629893 102 nips-2000-Position Variance, Recurrence and Perceptual Learning
15 0.44401118 80 nips-2000-Learning Switching Linear Models of Human Motion
16 0.44326106 49 nips-2000-Explaining Away in Weight Space
17 0.44211772 4 nips-2000-A Linear Programming Approach to Novelty Detection
18 0.43793169 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition
19 0.43384832 74 nips-2000-Kernel Expansions with Unlabeled Examples
20 0.42811504 82 nips-2000-Learning and Tracking Cyclic Human Motion