nips nips2007 nips2007-212 knowledge-graph by maker-knowledge-mining

212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes


Source: pdf

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. [sent-3, score-0.736]

2 We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. [sent-4, score-0.522]

3 If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. [sent-5, score-0.543]

4 Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel. [sent-6, score-0.355]

5 1 Introduction Gaussian processes (GP’s) are a widely used method for Bayesian non-linear non-parametric regression and classification [13, 16]. [sent-7, score-0.124]

6 GP’s are based on defining a similarity or kernel function that encodes prior knowledge of the smoothness of the underlying process that is being modeled. [sent-8, score-0.293]

7 Many real-world applications are characterized by high-dimensional, highly-structured data with a large supply of unlabeled data but a very limited supply of labeled data. [sent-10, score-0.529]

8 Applications such as information retrieval and machine vision are examples where unlabeled data is readily available. [sent-11, score-0.232]

9 GP’s are discriminative models by nature and within the standard regression or classification scenario, unlabeled data is of no use. [sent-12, score-0.319]

10 Given labeled input vectors $X_l = \{x_n\}_{n=1}^N$ and their associated target labels $\{y_n\}_{n=1}^N \in \mathbb{R}$ (regression) or $\{y_n\}_{n=1}^N \in \{-1, 1\}$ (classification), GP's model $p(y_n|x_n)$ directly. [sent-16, score-0.334]

11 Unless some assumptions are made about the underlying distribution of the input data X = [Xl , Xu ], unlabeled data, Xu , cannot be used. [sent-17, score-0.36]

12 Many researchers have tried to use unlabeled data by incorporating a model of p(X). [sent-18, score-0.261]

13 For classification tasks, [11] model p(X) as a mixture $\sum_{y_n} p(x_n|y_n)p(y_n)$ and then infer $p(y_n|x_n)$, [15] attempts to learn covariance kernels based on p(X), and [10] assumes that the decision boundaries should occur in regions where the data density, p(X), is low. [sent-19, score-0.489]

14 When faced with high-dimensional, highly-structured data, however, none of the existing approaches have proved to be particularly successful. [sent-20, score-0.042]

15 First, they can be learned efficiently from unlabeled data and the top-level features generally capture significant, high-order correlations in the data. [sent-22, score-0.327]

16 We first learn a DBN model of p(X) in an entirely unsupervised way using the fast, greedy learning algorithm introduced by [7] and further investigated in [2, 14, 6]. [sent-24, score-0.245]

17 We then use this generative model to initialize a multi-layer, non-linear mapping F (x|W ), parameterized by W , with F : X → Z mapping the input vectors in X into a feature space Z. [sent-25, score-0.391]

18 Typically the mapping F (x|W ) will contain millions of parameters. [sent-26, score-0.127]

19 The top-level features produced by this mapping allow fairly accurate reconstruction of the input, so they must contain most of the information in the input vector, but they express this information in a way that makes explicit a lot of the higher-order structure in the input data. [sent-27, score-0.406]

20 After learning F(x|W), a natural way to define a kernel function is to set $K(x_i, x_j) = \exp\left(-\|F(x_i|W) - F(x_j|W)\|^2\right)$. [sent-28, score-0.255]
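
As a rough sketch (not the authors' code), a kernel of this form can be computed for any feature mapping; the hypothetical `dbn_features` function below stands in for the DBN's top-layer activations F(x|W), and plain NumPy is assumed.

```python
import numpy as np

def feature_space_kernel(X1, X2, feature_fn):
    """K(x_i, x_j) = exp(-||F(x_i|W) - F(x_j|W)||^2) between two input sets,
    given an arbitrary feature mapping F."""
    F1 = feature_fn(X1)          # shape (n1, d_feat)
    F2 = feature_fn(X2)          # shape (n2, d_feat)
    # squared Euclidean distances between all pairs of feature vectors
    sq_dists = (
        np.sum(F1**2, axis=1)[:, None]
        + np.sum(F2**2, axis=1)[None, :]
        - 2.0 * F1 @ F2.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0))

if __name__ == "__main__":
    # Stand-in feature map; a real DBN's top-layer activations would go here.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(10, 4))
    dbn_features = lambda X: 1.0 / (1.0 + np.exp(-X @ W))  # hypothetical F(x|W)
    X = rng.normal(size=(5, 10))
    K = feature_space_kernel(X, X, dbn_features)
    print(K.shape)  # (5, 5)
```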

21 Note that the kernel is initialized in an entirely unsupervised way. [sent-29, score-0.218]

22 The parameters W of the covariance kernel can then be fine-tuned using the labeled data by maximizing the log probability of the labels with respect to W. [sent-30, score-0.449]
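
A minimal sketch of this fine-tuning objective for the regression case, assuming a Gaussian noise model: the quantity maximized is the GP log marginal likelihood of the labels, and here a single scalar kernel width is adjusted by finite differences as a toy stand-in for backpropagating the same objective through all of W.

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, noise_var=0.1):
    """log p(y|X) = -0.5 y^T C^{-1} y - 0.5 log|C| - (N/2) log(2*pi),
    with C = K + noise_var * I."""
    N = len(y)
    C = K + noise_var * np.eye(N)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * N * np.log(2.0 * np.pi))

def finetune_lengthscale(features, y, ell=1.0, lr=0.1, steps=50, eps=1e-4):
    """Gradient ascent on one kernel parameter via finite differences
    (a toy stand-in for backpropagating the objective through W)."""
    def kernel(ell_):
        d2 = np.sum((features[:, None, :] - features[None, :, :])**2, axis=-1)
        return np.exp(-d2 / ell_**2)
    for _ in range(steps):
        grad = (gp_log_marginal_likelihood(kernel(ell + eps), y)
                - gp_log_marginal_likelihood(kernel(ell - eps), y)) / (2 * eps)
        ell += lr * grad
    return ell
```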

23 In the final model most of the information for learning a covariance kernel will have come from modeling the input data. [sent-31, score-0.383]

24 The very limited information in the labels will be used only to slightly adjust the layers of features already discovered by the DBN. [sent-32, score-0.271]

25 2 Gaussian Processes for Regression and Binary Classification For a regression task, we are given a data set D of i.i.d. [sent-33, score-0.117]

26 labeled input vectors $X_l = \{x_n\}_{n=1}^N$ and their corresponding target labels $\{y_n\}_{n=1}^N \in \mathbb{R}$. [sent-36, score-0.334]

27 We are interested in the following probabilistic regression model: $y_n = f(x_n) + \epsilon$, $\epsilon \sim \mathcal{N}(\epsilon|0, \sigma^2)$ (1). A Gaussian process regression places a zero-mean GP prior over the underlying latent function f we are modeling, so that a priori $p(f|X_l) = \mathcal{N}(f|0, K)$, where $f = [f(x_1), \ldots, f(x_n)]^T$. [sent-37, score-0.545]

28 K is the covariance matrix, whose entries are specified by the covariance function $K_{ij} = K(x_i, x_j)$. [sent-40, score-0.348]

29 The covariance function encodes our prior notion of the smoothness of f , or the prior assumption that if two input vectors are similar according to some distance measure, their labels should be highly correlated. [sent-41, score-0.549]

30 Given a new test point x∗ , a prediction is obtained by conditioning on the observed data and θ. [sent-43, score-0.03]
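
The conditioning step has the usual closed form for GP regression; a brief sketch under the standard assumptions (Gaussian noise with variance sigma^2, kernel matrices precomputed by any covariance function, such as the feature-space kernel above):

```python
import numpy as np

def gp_predict(K_train, k_star, k_star_star, y, noise_var=0.1):
    """Predictive mean and variance at a test point x*, obtained by
    conditioning the GP on the observed (X_l, y):
        mean = k_*^T (K + sigma^2 I)^{-1} y
        var  = k_** - k_*^T (K + sigma^2 I)^{-1} k_*
    K_train:     (N, N) kernel matrix on training inputs
    k_star:      (N,)   kernel values between training inputs and x*
    k_star_star: scalar K(x*, x*)
    """
    C = K_train + noise_var * np.eye(len(y))
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = k_star @ alpha
    v = np.linalg.solve(L, k_star)
    var = k_star_star - v @ v
    return mean, var
```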

31 For a binary classification task, we similarly place a zero-mean GP prior over the underlying latent function f, which is then passed through the logistic function $g(x) = 1/(1 + \exp(-x))$ to define a prior $p(y_n = 1|x_n) = g(f(x_n))$. [sent-45, score-0.314]

32 In our experiments, we approximate the non-Gaussian posterior p(f |Xl , y) with a Gaussian one using expectation propagation [12]. [sent-48, score-0.049]

33 For more thorough reviews and implementation details refer to [13, 16]. [sent-49, score-0.067]

34 3 Learning Deep Belief Networks (DBN’s) In this section we describe an unsupervised way of learning a DBN model of the input data X = [Xl , Xu ], that contains both labeled and unlabeled data sets. [sent-50, score-0.488]

35 A DBN can be trained efficiently by using a Restricted Boltzmann Machine (RBM) to learn one layer of hidden features at a time [7]. [sent-51, score-0.378]
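
Greedy layer-wise pretraining can be sketched as: train one RBM on the data, use its hidden activations as the "data" for the next RBM, and repeat. The sketch below assumes a hypothetical `train_rbm` routine (for instance, the CD-1 step sketched after Eq. 10 further down) and treats the upward pass deterministically.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dbn(X, layer_sizes, train_rbm):
    """Greedy layer-wise pretraining: each layer's RBM is trained on the
    hidden activations produced by the layer below (X itself for layer 1)."""
    weights, data = [], X
    for n_hidden in layer_sizes:
        W, b_vis, b_hid = train_rbm(data, n_hidden)   # learn one RBM at a time
        weights.append((W, b_hid))
        data = sigmoid(data @ W + b_hid)              # input for the next RBM
    return weights

def dbn_features(X, weights):
    """Deterministic upward pass: top-layer feature vector F(x|W)."""
    h = X
    for W, b_hid in weights:
        h = sigmoid(h @ W + b_hid)
    return h
```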

36 [18] introduced a class of two-layer undirected graphical models that generalize RBM’s to exponential family distributions. [sent-54, score-0.033]

37 This framework will allow us to model real-valued images of face patches and word-count vectors of documents. [sent-55, score-0.084]

38 3.1 Modeling Real-valued Data We use a conditional Gaussian distribution for modeling observed “visible” pixel values x (e.g. [sent-57, score-0.051]

39 images of faces) and a conditional Bernoulli distribution for modeling “hidden” features h (Fig. 1). [sent-59, score-0.146]

40 The top layer represents stochastic binary hidden features h and the bottom layer is composed of linear visible units x with Gaussian noise. [sent-61, score-0.841]

41 When using a Constrained Poisson Model, the top layer represents stochastic binary latent topic features h and the bottom layer represents the Poisson visible word-count vector x. [sent-62, score-0.828]

42 Middle panel: Pretraining consists of learning a stack of RBM’s. [sent-63, score-0.042]

43 Right panel: After pretraining, the RBM’s are used to initialize a covariance function of the Gaussian process, which is then fine-tuned by backpropagation. [sent-64, score-0.175]

44 where $g(x) = 1/(1 + \exp(-x))$ is the logistic function, $w_{ij}$ is a symmetric interaction term between input i and feature j, $\sigma_i^2$ is the variance of input i, and $b_i$, $b_j$ are biases. [sent-65, score-0.596]

45 The marginal distribution over the visible vector x is $p(x) = \frac{\sum_h \exp(-E(x,h))}{\int_u \sum_g \exp(-E(u,g))\,du}$ (9), where E(x, h) is an energy term: $E(x,h) = \sum_i \frac{(x_i - b_i)^2}{2\sigma_i^2} - \sum_j b_j h_j - \sum_{i,j} h_j w_{ij} x_i/\sigma_i$. [sent-66, score-1.16]

46 From Eq. 9: $\Delta w_{ij} = \epsilon \, \frac{\partial \log p(x)}{\partial w_{ij}} = \epsilon \left( \langle z_i h_j \rangle_{\text{data}} - \langle z_i h_j \rangle_{\text{model}} \right)$ (10), where $\epsilon$ is the learning rate, $z_i = x_i/\sigma_i$, $\langle \cdot \rangle_{\text{data}}$ denotes an expectation with respect to the data distribution and $\langle \cdot \rangle_{\text{model}}$ is an expectation with respect to the distribution defined by the model. [sent-68, score-0.335]
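
A minimal sketch of a single contrastive-divergence (CD-1) update for a Gaussian-visible, binary-hidden RBM of this kind, approximating the model expectation in Eq. 10 by one reconstruction step; it assumes unit visible variances (sigma_i = 1, so z_i = x_i), which is a common simplification rather than necessarily the paper's exact setting.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(X, W, b_vis, b_hid, lr=0.01, rng=None):
    """One CD-1 step for a Gaussian-visible / binary-hidden RBM with
    unit visible variances: dW ~ <x_i h_j>_data - <x_i h_j>_recon."""
    rng = np.random.default_rng() if rng is None else rng
    # positive phase: hidden probabilities and a stochastic binary sample
    h_prob = sigmoid(X @ W + b_hid)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # negative phase: Gaussian reconstruction of the visibles, then hiddens again
    x_recon = h_sample @ W.T + b_vis          # mean of the Gaussian visibles
    h_recon = sigmoid(x_recon @ W + b_hid)
    n = X.shape[0]
    dW = (X.T @ h_prob - x_recon.T @ h_recon) / n
    db_vis = np.mean(X - x_recon, axis=0)
    db_hid = np.mean(h_prob - h_recon, axis=0)
    return W + lr * dW, b_vis + lr * db_vis, b_hid + lr * db_hid
```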


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xl', 0.356), ('dbn', 0.321), ('rbm', 0.295), ('gp', 0.267), ('hj', 0.23), ('yn', 0.229), ('wij', 0.226), ('unlabeled', 0.202), ('layer', 0.171), ('visible', 0.157), ('covariance', 0.136), ('deep', 0.126), ('xn', 0.124), ('kernel', 0.105), ('pretraining', 0.105), ('supply', 0.096), ('features', 0.095), ('input', 0.091), ('xu', 0.09), ('bj', 0.09), ('regression', 0.087), ('discriminatively', 0.084), ('kij', 0.08), ('xi', 0.079), ('xj', 0.076), ('labeled', 0.075), ('gaussian', 0.074), ('exp', 0.074), ('labels', 0.072), ('df', 0.071), ('wt', 0.071), ('panel', 0.07), ('zi', 0.067), ('mapping', 0.065), ('latent', 0.061), ('unsupervised', 0.06), ('poisson', 0.057), ('hidden', 0.056), ('learn', 0.056), ('smoothness', 0.056), ('vectors', 0.055), ('entirely', 0.053), ('ruslan', 0.052), ('encodes', 0.051), ('modeling', 0.051), ('classi', 0.05), ('logistic', 0.05), ('belief', 0.05), ('expectation', 0.049), ('bi', 0.048), ('units', 0.048), ('backpropagation', 0.048), ('binary', 0.046), ('salakhutdinov', 0.045), ('param', 0.045), ('parameterized', 0.044), ('prior', 0.044), ('greedy', 0.043), ('faced', 0.042), ('recon', 0.042), ('activate', 0.042), ('stack', 0.042), ('spherical', 0.042), ('target', 0.041), ('contrastive', 0.04), ('initialize', 0.039), ('geoffrey', 0.038), ('layers', 0.038), ('stochastically', 0.038), ('kernels', 0.038), ('circumvent', 0.037), ('king', 0.037), ('top', 0.037), ('processes', 0.037), ('underlying', 0.037), ('toronto', 0.036), ('discovered', 0.036), ('cation', 0.035), ('thorough', 0.034), ('lot', 0.033), ('nets', 0.033), ('reviews', 0.033), ('du', 0.033), ('introduced', 0.033), ('passed', 0.032), ('welling', 0.032), ('boltzmann', 0.032), ('generative', 0.032), ('contain', 0.031), ('log', 0.031), ('reconstruct', 0.031), ('millions', 0.031), ('net', 0.031), ('represents', 0.03), ('bottom', 0.03), ('data', 0.03), ('adjust', 0.03), ('hinton', 0.03), ('researchers', 0.029), ('patches', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.

2 0.40912354 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields

Author: Simon Osindero, Geoffrey E. Hinton

Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1

3 0.31462699 182 nips-2007-Sparse deep belief net model for visual area V2

Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng

Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from the Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1

4 0.22832209 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

5 0.19973743 32 nips-2007-Bayesian Co-Training

Author: Shipeng Yu, Balaji Krishnapuram, Harald Steck, R. B. Rao, Rómer Rosales

Abstract: We propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, unlike some previous multi-view learning methods. Furthermore, it can automatically estimate how much each view should be trusted, and thus accommodate noisy or unreliable views. Experiments on toy data and real world data sets illustrate the benefits of this approach. 1

6 0.18463026 170 nips-2007-Robust Regression with Twinned Gaussian Processes

7 0.18255731 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

8 0.16636334 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

9 0.15219705 135 nips-2007-Multi-task Gaussian Process Prediction

10 0.14009002 69 nips-2007-Discriminative Batch Mode Active Learning

11 0.13596183 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data

12 0.12690622 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

13 0.10623258 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

14 0.10589744 175 nips-2007-Semi-Supervised Multitask Learning

15 0.10336244 195 nips-2007-The Generalized FITC Approximation

16 0.10084089 166 nips-2007-Regularized Boost for Semi-Supervised Learning

17 0.09679386 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

18 0.087666504 97 nips-2007-Hidden Common Cause Relations in Relational Learning

19 0.084167033 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

20 0.080834277 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.286), (1, 0.181), (2, -0.065), (3, 0.084), (4, -0.065), (5, 0.115), (6, -0.43), (7, -0.128), (8, 0.323), (9, 0.034), (10, 0.138), (11, -0.112), (12, -0.054), (13, -0.019), (14, -0.053), (15, 0.006), (16, -0.032), (17, -0.174), (18, -0.035), (19, 0.084), (20, 0.14), (21, -0.107), (22, -0.089), (23, -0.033), (24, 0.101), (25, -0.01), (26, -0.011), (27, -0.064), (28, 0.026), (29, -0.008), (30, 0.053), (31, 0.058), (32, -0.029), (33, 0.097), (34, 0.085), (35, -0.03), (36, -0.037), (37, -0.029), (38, -0.065), (39, 0.011), (40, -0.024), (41, -0.022), (42, 0.012), (43, 0.038), (44, 0.033), (45, 0.029), (46, 0.048), (47, 0.022), (48, 0.015), (49, -0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94743276 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.

2 0.8069123 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields

Author: Simon Osindero, Geoffrey E. Hinton

Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1

3 0.63398105 182 nips-2007-Sparse deep belief net model for visual area V2

Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng

Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from the Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1

4 0.59956509 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

Author: Marc'aurelio Ranzato, Y-lan Boureau, Yann L. Cun

Abstract: Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input observed variables can be captured. 1

5 0.55165166 32 nips-2007-Bayesian Co-Training

Author: Shipeng Yu, Balaji Krishnapuram, Harald Steck, R. B. Rao, Rómer Rosales

Abstract: We propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, unlike some previous multi-view learning methods. Furthermore, it can automatically estimate how much each view should be trusted, and thus accommodate noisy or unreliable views. Experiments on toy data and real world data sets illustrate the benefits of this approach. 1

6 0.51631916 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

7 0.41022813 170 nips-2007-Robust Regression with Twinned Gaussian Processes

8 0.39548701 97 nips-2007-Hidden Common Cause Relations in Relational Learning

9 0.37985203 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

10 0.37795445 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

11 0.36734837 135 nips-2007-Multi-task Gaussian Process Prediction

12 0.35405749 195 nips-2007-The Generalized FITC Approximation

13 0.35157332 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

14 0.34082681 175 nips-2007-Semi-Supervised Multitask Learning

15 0.33663535 166 nips-2007-Regularized Boost for Semi-Supervised Learning

16 0.33308136 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

17 0.3235406 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data

18 0.31880811 69 nips-2007-Discriminative Batch Mode Active Learning

19 0.31561428 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

20 0.28880176 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.05), (9, 0.139), (13, 0.016), (16, 0.057), (21, 0.137), (34, 0.025), (35, 0.027), (47, 0.057), (49, 0.012), (83, 0.27), (85, 0.022), (87, 0.025), (90, 0.073)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93898219 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.

2 0.92763627 30 nips-2007-Bayes-Adaptive POMDPs

Author: Stephane Ross, Brahim Chaib-draa, Joelle Pineau

Abstract: Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can tradeoff between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and agent’s return improve over time, as the agent learns better model estimates. 1

3 0.9268564 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray

Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n2 ) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 1014 , many orders of magnitude beyond the previous state of the art. 1

4 0.90351993 96 nips-2007-Heterogeneous Component Analysis

Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii

Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1

5 0.88214195 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

Author: Stephane Ross, Joelle Pineau, Brahim Chaib-draa

Abstract: Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ǫ-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. We also provide empirical evidence showing that our approach is also practical, and can find (provably) near-optimal solutions in reasonable time. 1

6 0.88055933 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

7 0.87988627 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

8 0.87814856 196 nips-2007-The Infinite Gamma-Poisson Feature Model

9 0.87564653 175 nips-2007-Semi-Supervised Multitask Learning

10 0.87337339 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

11 0.87275171 46 nips-2007-Cluster Stability for Finite Samples

12 0.87239987 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

13 0.87181813 147 nips-2007-One-Pass Boosting

14 0.8703528 187 nips-2007-Structured Learning with Approximate Inference

15 0.8685503 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

16 0.86803722 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

17 0.86572623 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

18 0.86566389 116 nips-2007-Learning the structure of manifolds using random projections

19 0.86526531 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

20 0.86435628 180 nips-2007-Sparse Feature Learning for Deep Belief Networks