nips nips2012 nips2012-13 knowledge-graph by maker-knowledge-mining

13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function


Source: pdf

Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun

Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. [sent-22, score-0.189]

2 To this end, we devise a non-parametric conjugate prior based on a kernel regressor. [sent-25, score-0.298]

3 The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. [sent-26, score-0.189]

4 Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. [sent-27, score-0.242]

5 1 Introduction Historically, the fields of statistical inference and stochastic optimization have often developed their own specific methods and approaches. [sent-29, score-0.121]

6 Here we consider stochastic optimization problems where we observe noise-contaminated values from an unknown nonlinear function and we want to find the input that maximizes the expected value of this function. [sent-31, score-0.121]

7 Consider a stochastic function f : X → R mapping a test point x ∈ X to real values y ∈ R characterized by the conditional pdf P(y|x). [sent-34, score-0.235]

8 (1) The goal consists in modeling the optimal test point x∗ := arg max{f̄(x)}. [sent-36, score-0.107]

9 Figure 1: a) Given an estimate h of the mean function f̄ (left), a simple probability density function over the location of the maximum x∗ is obtained using the transformation P(x∗) ∝ exp{αh(x∗)}, where α > 0 plays the role of the precision (right). [sent-47, score-0.426]

10 b) Illustration of the Gramian matrix for different test locations. [sent-48, score-0.074]

11 Within the context of statistical inference, Bayesian optimization methods have been developed where a prior distribution over the space of functions is assumed and uncertainty is tracked during the entire optimization process [6, 7]. [sent-51, score-0.253]

12 In particular, non-parametric Bayesian approaches such as Gaussian Processes have been applied for derivative-free optimization [8, 9], also within the context of the continuum-armed bandit problem [10]. [sent-52, score-0.101]

13 Typically, these Bayesian approaches aim to explicitly represent the unknown objective function of (1) by entertaining a posterior distribution over the space of objective functions. [sent-53, score-0.189]

14 Let h(x) : X → R be an estimate of the mean f̄(x) constructed from data Dt := {(xi, yi)}i=1..t (Figure 1a, left). [sent-56, score-0.108]

15 This estimate can easily be converted into a posterior pdf over the location of the maximum by first multiplying it with a precision parameter α > 0 and then taking the normalized exponential (Figure 1a, right): P(x∗|Dt) ∝ exp{α · h(x∗)}. [sent-57, score-0.63]
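
As a rough illustration of this normalized-exponential transformation, the sketch below evaluates P(x∗|Dt) ∝ exp{α · h(x∗)} on a one-dimensional grid; the grid, the toy estimate h and the value of α are hypothetical choices, not taken from the paper.

# Minimal sketch (not the authors' released code) of the transformation
# P(x*|D_t) ∝ exp{α · h(x*)} evaluated on a 1-D grid.
import numpy as np

def argmax_posterior_on_grid(h_values, alpha):
    """Turn an estimate h of the mean function into a density over its maximizer."""
    logp = alpha * np.asarray(h_values, dtype=float)
    logp -= logp.max()            # subtract the max for numerical stability
    p = np.exp(logp)
    return p / p.sum()            # normalize over the grid points

grid = np.linspace(-3.0, 3.0, 301)
h = -(grid - 1.0) ** 2            # hypothetical estimate whose maximum is near x = 1
p = argmax_posterior_on_grid(h, alpha=5.0)
print(grid[p.argmax()])           # mode of the induced density, close to 1.0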

16 In this transformation, the precision parameter α controls the certainty we have over our estimate of the maximizing argument: α ≈ 0 expresses almost no certainty, while α → ∞ expresses full certainty. [sent-58, score-0.497]

17 The rationale for the precision is: the more distinct inputs we test, the higher the precision—testing the same (or similar) inputs only provides local information and therefore should not increase our knowledge about the global maximum. [sent-59, score-0.359]

18 In (3), the mean function f̄ is estimated with a kernel regressor [11] that combines the function observations with a prior estimate of the function, and the total effective number of locations is calculated as the sum of the prior locations ξ and the number of distinct locations in the data Dt. [sent-61, score-1.115]

19 The latter is estimated by multiplying the number of data points t with the coefficient (1/t) Σi K(xi, xi) / Σj K(xi, xj) ∈ (0, 1]. Implementations can be downloaded from http://www. [sent-62, score-0.188]
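
The coefficient above can be sketched directly from a Gramian matrix; the Gaussian kernel and its width in the snippet below are illustrative assumptions rather than the paper's settings.

# Sketch of the distinct-locations count: each observation i contributes
# K(x_i, x_i) / Σ_j K(x_i, x_j), so repeated or overlapping inputs share weight.
import numpy as np

def gaussian_kernel(a, b, sigma=0.5):
    return np.exp(-0.5 * ((a - b) / sigma) ** 2)

def effective_distinct_locations(xs, kernel=gaussian_kernel):
    xs = np.asarray(xs, dtype=float)
    G = kernel(xs[:, None], xs[None, :])        # Gramian matrix K(x_i, x_j)
    return float((np.diag(G) / G.sum(axis=1)).sum())

# 12 observations taken at 3 well-separated inputs -> roughly 3 distinct locations
xs = np.repeat([0.0, 5.0, 10.0], 4)
print(effective_distinct_locations(xs))         # close to 3.0
# The coefficient in (4) is this quantity divided by the number of data points t.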

20 Figure 2: Illustration of the posterior distribution over the maximizing argument for 10, 100 and 1000 observations drawn from a function with varying noise. [sent-85, score-0.422]

21 The top-left panel illustrates the function and the variance bounds (one standard deviation). [sent-86, score-0.13]

22 It can be seen that the prior gets progressively washed out with more observations. [sent-89, score-0.131]

23 Inputs that are very close to each other will have overlapping kernels, resulting in large off-diagonal entries of the Gramian matrix—hence decreasing the number of distinct locations (Figure 1b). [sent-93, score-0.22]

24 For example, if we have t observations from n ≪ t locations, and each location has t/n observations, then the coefficient (4) is equal to n/t and hence the number of distinct locations is exactly n, as expected. [sent-94, score-0.362]

25 Figure 2 illustrates the behavior of the posterior distribution. [sent-95, score-0.224]

26 The expression for the posterior can be calculated up to a constant factor in O(t) time. [sent-96, score-0.226]

27 Therefore, our proposed posterior can be easily combined with Markov chain Monte Carlo methods (MCMC) to implement stochastic optimizers as will be illustrated in Section 4. [sent-98, score-0.311]

28 1 Function-Based, Indirect Model Our first task is to derive an indirect Bayesian model for the optimal test point that builds its estimate via the underlying function space. [sent-100, score-0.274]

29 Let G be the set of hypotheses, and assume that each hypothesis g ∈ G corresponds to a stochastic mapping g : X → R. [sent-101, score-0.139]

30 Let P(g) be the prior over G and let the likelihood be P({yt}|g, {xt}) = ∏t P(yt|g, xt). [sent-102, score-0.384]

31 Then, the posterior of g is given by P(g|{yt}, {xt}) = P(g)P({yt}|g, {xt}) / P({yt}|{xt}) = P(g) ∏t P(yt|g, xt) / P({yt}|{xt}). [sent-103, score-0.49]

32 Note that we assume that the mean function g is bounded and that it has a unique maximizing test point. [sent-107, score-0.239]

33 2 Domain-Based, Direct Model We want to arrive at a Bayesian model that bypasses the integration step suggested by (6) and directly models the location of optimal test point x∗ . [sent-109, score-0.23]

34 The Bayesian model for the optimal test point x∗ is given by P(x∗) = ∫G(x∗) P(g) dg (prior) and P(yt|x∗, xt, Dt−1) = ∫G(x∗) P(yt|g, xt) P(g) ∏k=1..t−1 P(yk|g, xk) dg / ∫G(x∗) P(g) ∏k=1..t−1 P(yk|g, xk) dg (likelihood), where Dt := {(xk, yk)}k=1..t is the set of past tests. [sent-112, score-1.367]

35 Using Bayes’ rule, the posterior distribution P(x∗|{yt}, {xt}) can be rewritten as P(x∗) ∏t P(yt|x∗, xt, Dt−1) / P({yt}|{xt}). (7) [sent-114, score-0.49]

36 Since this posterior is equal to (6), one concludes (using (5)) that P(x∗) ∏t P(yt|x∗, xt, Dt−1) = ∫G(x∗) P(g) ∏t P(yt|g, xt) dg. [sent-115, score-0.791]

37 t Note that this expression corresponds to the joint P (x∗ , {yt }|{xt }). [sent-116, score-0.075]

38 The likelihood is obtained as the fraction P(yt|x∗, xt, Dt−1) = P(x∗, {yk}k=1..t | {xk}k=1..t) / P(x∗, {yk}k=1..t−1 | {xk}k=1..t−1), where it shall be noted that the denominator P(x∗, {yk}k=1..t−1 | {xk}k=1..t−1) doesn’t change if we add the condition xt. [sent-118, score-0.685]

39 From Theorem 1 it is seen that although the likelihood model P(yt|g, xt) for the indirect model is i.i.d. [sent-119, score-0.495]

40 at each test point, the likelihood model P (yt |x∗ , xt , Dt−1 ) for the direct model depends on the past tests Dt−1 , that is, it is adaptive. [sent-122, score-0.529]

41 More critically though, the internal structure of the direct model’s likelihood function corresponds to an integration over function space as well, thus inheriting all the difficulties of the indirect model. [sent-123, score-0.297]

42 Then, the crucial insight consists in realizing that the value of the mean function ḡ inside a sufficiently small neighborhood of x∗ is larger than the value outside of it (see Figure 3a). [sent-127, score-0.214]

43 Furthermore, we impose a symmetry condition on the likelihood function. [sent-133, score-0.083]

44 There is no reason for them to be very different: in fact, they should virtually be indistinguishable outside of the neighborhoods of x∗1 and x∗2. [sent-135, score-0.098]

45 It is only inside of the neighborhood of x∗1 when G(x∗1) becomes distinguishable from the other equivalence classes, because the functions in G(x∗1) systematically predict higher values. Figure 3: Illustration of assumptions. [sent-136, score-0.106]

46 b) Schematic representation of the likelihood function of x∗ ∈ X conditioned on a few observations. [sent-139, score-0.083]

47 The curve corresponds to the mean and the shaded area to the confidence bounds. [sent-140, score-0.183]

48 The density inside of the neighborhood is unique to the hypothesis x∗ , while the density outside is shared amongst all the hypotheses. [sent-141, score-0.281]

49 c) The log-likelihood ratio of the hypotheses x∗1 and x∗2 as a function of the test point x. [sent-142, score-0.174]

50 In fact, taking the log-likelihood ratio of two competing hypotheses, log[ P(yt|x∗1, xt, Dt−1) / P(yt|x∗2, xt, Dt−1) ], for a given test location xt should give a value equal to zero unless xt is inside of the vicinity of x∗1 or x∗2 (see Figure 3c). [sent-146, score-1.489]

51 In other words, the amount of evidence a hypothesis gets when the test point is outside of its neighborhood is essentially zero (i.e. [sent-147, score-0.255]

52 4 Likelihood and Conjugate Prior Following our previous discussion, we propose the following likelihood model. [sent-151, score-0.083]

53 We have chosen the precision αt as αt := ρ · (ξ + Σi K(xi, xi) / Σj K(xi, xj)), where ρ > 0 is a scaling parameter; ξ > 0 is a parameter representing the number of prior locations tested; and K : X × X → R+ is a symmetric kernel function. [sent-153, score-0.701]

54 For the estimate ht, we have chosen a Nadaraya-Watson kernel regressor [11]: ht(x∗) := (Σi=1..t K(xi, x∗) yi + K0(x∗) y0(x∗)) / (Σi=1..t K(xi, x∗) + K0(x∗)). [sent-154, score-0.508]
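
A minimal sketch of this kernel-regression estimate follows; the kernel, prior precision K0 and prior estimate y0 below are illustrative placeholders, not values from the paper.

# Sketch of the Nadaraya-Watson estimate h_t(x*) with a prior term: observed
# values are kernel-smoothed and blended with the prior estimate y0, whose
# weight is the prior precision K0. Function names here are illustrative.
import numpy as np

def h_t(x_star, xs, ys, kernel, K0, y0):
    w = kernel(np.asarray(xs, dtype=float), x_star)       # weights K(x_i, x*)
    num = np.dot(w, np.asarray(ys, dtype=float)) + K0(x_star) * y0(x_star)
    den = w.sum() + K0(x_star)
    return num / den

# Example with a Gaussian kernel, a flat prior precision and a quadratic prior mean
kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
K0 = lambda x: 1.0
y0 = lambda x: -0.5 * x ** 2
print(h_t(0.3, xs=[0.0, 0.5, 1.0], ys=[0.2, 0.8, 0.1], kernel=kernel, K0=K0, y0=y0))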

55 In the last expression, y0 corresponds to a prior estimate of f̄ with prior precision K0. [sent-155, score-0.546]

56 Inspecting (8), we see that the likelihood model favours positive changes to the estimated mean function from new, unseen test locations. [sent-156, score-0.277]

57 We propose the conjugate prior P(x∗) = (1/Z0) exp{α0 · g0(x∗)} = (1/Z0) exp{ξ · y0(x∗)}. [sent-159, score-0.199]

58 We refer the reader to the kernel regression literature for an analysis of the choice of kernel functions. [sent-160, score-0.198]

59 (9) The conjugate prior just encodes a prior estimate of the mean function. [sent-161, score-0.438]

60 In a practical optimization application, it serves the purpose of guiding the exploration of the domain, as locations x∗ with high prior value y0 (x∗ ) are more likely to contain the maximizing argument. [sent-162, score-0.495]

61 The posterior (10) then takes the form P(x∗|Dt) = exp{αt · ht(x∗)} / ∫ exp{αt · ht(x′)} dx′; thus, the particular choice of the likelihood function guarantees an analytically compact posterior expression. [sent-164, score-0.482]

62 In general, the normalizing constant in (10) is intractable, which is why the expression is only practical for relative comparisons of test locations. [sent-165, score-0.155]

63 Substituting the precision αt and the mean function estimate ht yields P(x∗|Dt) ∝ exp{ ρ · (ξ + Σi K(xi, xi)/Σj K(xi, xj)) · (Σi K(xi, x∗) yi + K0(x∗) y0(x∗)) / (Σi K(xi, x∗) + K0(x∗)) }. [sent-166, score-0.628]
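
Combining the two ingredients gives the unnormalized log-posterior; the self-contained sketch below follows the formula above under assumed parameter values (ρ, ξ, kernel width) and, as noted in the text, is only meaningful for relative comparisons of candidate locations.

# Self-contained sketch of the unnormalized posterior over the maximizing
# argument, P(x*|D_t) ∝ exp{α_t · h_t(x*)}, with α_t = ρ·(ξ + Σ_i K_ii / Σ_j K_ij)
# and h_t the Nadaraya-Watson estimate. All parameter values are illustrative.
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-0.5 * ((a - b) / sigma) ** 2)

def log_posterior_unnormalized(x_star, xs, ys, rho=1.0, xi=1.0,
                               kernel=gaussian_kernel,
                               K0=lambda x: 1.0, y0=lambda x: 0.0):
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # precision alpha_t from the effective number of distinct locations
    G = kernel(xs[:, None], xs[None, :])
    alpha_t = rho * (xi + (np.diag(G) / G.sum(axis=1)).sum())
    # kernel-regression estimate h_t(x*), blended with the prior (K0, y0)
    w = kernel(xs, x_star)
    h = (np.dot(w, ys) + K0(x_star) * y0(x_star)) / (w.sum() + K0(x_star))
    return alpha_t * h

# Relative comparison of two candidate locations given three noisy observations
xs, ys = [0.0, 1.0, 2.0], [0.1, 0.9, 0.3]
print(log_posterior_unnormalized(1.0, xs, ys) - log_posterior_unnormalized(2.0, xs, ys))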

64 We have investigated the influence of the parameters on the resulting posterior probability distribution. [sent-169, score-0.189]

65 We have used the Gaussian kernel K(x, x∗) = exp{−(x − x∗)² / (2σ²)}. (12) [sent-170, score-0.158]

66 The prior precision K0 and the prior estimate of the mean function y0 were chosen as K0(x) = 1 and y0(x) = −(x − µ0)² / (2σ0²), (13) where the latter corresponds to the logarithm of a Gaussian with mean µ0 = 1. [sent-173, score-0.65]

67 Figure 4 shows how the choice of the precision scale ρ and the kernel width σ affects the shape of the posterior probability density. [sent-176, score-0.426]

68 Here, it is seen that a larger kernel width σ increases the region of influence of a particular data point, and hence produces smoother posterior densities. [sent-177, score-0.35]

69 The precision scale parameter ρ controls the precision per distinct data point: higher values for ρ lead to sharper updates of the posterior distribution. [sent-178, score-0.628]

70 The main motivation behind our proposed model is its application to the optimization of noisy functions. [sent-181, score-0.169]

71 Because of the noise, choosing new test locations requires carefully balancing explorative and exploitative tests—a problem well known in the multiarmed bandits literature. [sent-182, score-0.235]

72 To overcome this, one can apply the Bayesian control rule/Thompson sampling [12, 13]: the next test location is chosen by sampling it from the posterior. [sent-183, score-0.163]

73 Figure 4: Effect of the change of parameters on the posterior density over the location of the maximizing test point. [sent-185, score-0.504]

74 Panel (a) shows the 7 data points drawn from the noisy function (solid curve). [sent-186, score-0.138]

75 Panel (b) shows the effect of increasing the width of the kernel (here, Gaussian). [sent-187, score-0.161]

76 Panel (c) shows the effect of diminishing the precision on the posterior, where solid and shaded curves correspond to ρ = 0. [sent-191, score-0.295]

77 Figure 5: Observation values obtained by sampling from the posterior over the maximizing argument (left panel) and according to GP-UCB (right panel). [sent-202, score-0.369]

78 The solid blue curve corresponds to the time-averaged function value, averaged over ten runs. [sent-203, score-0.219]

79 The gray area corresponds to the error bounds (1 standard deviation), and the dashed curve in red shows the time-average of a single run. [sent-204, score-0.084]

80 This is done by sampling the next test point xt directly from the posterior density over the optimum location P (x∗ |Dt ), and then using the resulting pair (xt , yt ) to recursively update the model. [sent-207, score-1.123]
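
A toy version of this loop on a one-dimensional grid is sketched below; the objective, kernel width and constants are placeholders chosen for illustration, not the paper's experimental setup.

# Sketch of Thompson-sampling-style optimization with the arg-max posterior on a
# 1-D grid: sample the next test point from P(x*|D_t), observe a noisy value,
# update the data, repeat. Objective, kernel width and all constants are toy choices.
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(-3.0, 3.0, 241)

def noisy_objective(x):                      # hypothetical function to optimize
    return 1.0 - (x - 1.2) ** 2 + 0.3 * rng.normal()

def kernel(a, b, sigma=0.4):
    return np.exp(-0.5 * ((a - b) / sigma) ** 2)

def posterior_on_grid(xs, ys, rho=1.0, xi=1.0):
    xs, ys = np.asarray(xs), np.asarray(ys)
    G = kernel(xs[:, None], xs[None, :])
    alpha_t = rho * (xi + (np.diag(G) / G.sum(axis=1)).sum())
    W = kernel(xs[:, None], grid[None, :])   # K(x_i, x*) for every grid point
    h = (W.T @ ys) / (W.sum(axis=0) + 1.0)   # flat prior: K0 = 1, y0 = 0
    logp = alpha_t * h
    p = np.exp(logp - logp.max())
    return p / p.sum()

xs, ys = [0.0], [noisy_objective(0.0)]       # a single seed observation
for t in range(50):
    x_next = rng.choice(grid, p=posterior_on_grid(xs, ys))   # Thompson-style draw
    xs.append(x_next)
    ys.append(noisy_objective(x_next))       # observe and update the model

print(np.mean(xs[-10:]))                     # recent test locations tend toward x ≈ 1.2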

81 We have implemented our proposed method with a Gaussian kernel as in (11) with width σ 2 = 0. [sent-215, score-0.161]

82 The prior sufficient statistics are exactly as in (13). [sent-217, score-0.131]

83 We show the time-averaged observation values y of the noisy function evaluated at test locations sampled from the posterior. [sent-221, score-0.45]

84 To test our proposed method on a challenging problem, we have designed a non-convex, high-dimensional noisy function with multiple local optima. [sent-224, score-0.182]

85 This Noisy Ripples function is defined as f(x) = −(1/1000)‖x − µ‖² + cos((2/3)π‖x − µ‖), where µ ∈ X is the location of the global maximum, and where observations have additive Gaussian noise with zero mean and variance 0. [sent-225, score-0.24]
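
For concreteness, a sketch of this objective as reconstructed above; the (2/3)π ripple frequency, the noise level and the dimensionality are assumptions filling gaps in the extracted text.

# Sketch of the Noisy Ripples objective as reconstructed above; ripple frequency,
# noise standard deviation and dimensionality below are assumptions.
import numpy as np

def noisy_ripples(x, mu, noise_std=0.3, rng=np.random.default_rng()):
    """Shallow quadratic bowl plus a cosine ripple, observed with Gaussian noise."""
    r = np.linalg.norm(np.asarray(x) - np.asarray(mu))
    return -r ** 2 / 1000.0 + np.cos(2.0 / 3.0 * np.pi * r) + noise_std * rng.normal()

mu = np.full(50, 2.0)                     # hypothetical location of the global maximum
print(noisy_ripples(np.zeros(50), mu))    # one noisy evaluation far from the optimum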

86 b) The time-averaged value and the regret obtained by the optimization algorithm on a 50-dimensional version of the Noisy Ripples function. [sent-229, score-0.108]

87 This function is difficult to optimize because it requires averaging the noisy observations and smoothing the ridged landscape in order to detect the underlying quadratic form. [sent-231, score-0.161]

88 We optimized the 50-dimensional version of this function using a Metropolis-Hastings scheme to sample the next test locations from the posterior over the maximizing argument. [sent-232, score-0.537]
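
A generic random-walk Metropolis step of the kind described here can be sketched as follows; the proposal scale, step count and toy target are placeholders, not the paper's tuned settings.

# Generic random-walk Metropolis sketch for drawing test locations from an
# unnormalized log-posterior log_p in d dimensions; proposal scale, step count
# and the toy target below are illustrative only.
import numpy as np

def metropolis_sample(log_p, x0, n_steps=500, step=0.5, rng=np.random.default_rng(0)):
    x = np.array(x0, dtype=float)
    lp = log_p(x)
    for _ in range(n_steps):
        proposal = x + step * rng.normal(size=x.shape)     # symmetric random walk
        lp_prop = log_p(proposal)
        if np.log(rng.random()) < lp_prop - lp:            # Metropolis acceptance test
            x, lp = proposal, lp_prop
    return x                                               # next candidate test location

# Toy unnormalized log-posterior peaked at 5 in every coordinate
toy_log_p = lambda x: -0.5 * np.sum((x - 5.0) ** 2)
sample = metropolis_sample(toy_log_p, x0=np.zeros(10))
print(sample.mean())                                       # the chain drifts toward the mode at 5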

89 07 before the point was used as an actual test location. [sent-234, score-0.107]

90 For the arg-max prior, we used a Gaussian kernel with lengthscale l = 2, precision factor ρ = 1. [sent-235, score-0.289]

91 5, prior precision K0(x∗) = 1 and prior mean estimate y0(x∗) = −(2/1000)‖x∗ + 5‖². [sent-236, score-0.56]

92 Crucial for this success was the choice of a kernel that is wide enough to accurately estimate the mean function. [sent-240, score-0.207]

93 5 Conclusions Our goal was to design a probabilistic model over the maximizing argument that is algorithmically efficient and statistically robust even for large, high-dimensional noisy functions. [sent-242, score-0.288]

94 To this end, we have derived a Bayesian model that directly captures the uncertainty over the maximizing argument, thereby bypassing the need to model the underlying function space, which is a much harder problem. [sent-243, score-0.147]

95 Our proposed model is computationally very efficient when compared to Gaussian process-based models (which have cubic time complexity) or models based on upper confidence bounds (which require finding the input maximizing the bound, a generally intractable operation). [sent-244, score-0.148]

96 In our model, evaluating the posterior up to a constant factor scales quadratically with the size of the data. [sent-245, score-0.189]

97 As in any kernel-based estimation method, choosing the appropriate kernel bandwidth can significantly change the estimate and affect the performance of optimizers that rely on the model. [sent-247, score-0.217]

98 A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. [sent-254, score-0.141]

99 Application of bayesian approach to numerical methods of global and stochastic optimization. [sent-283, score-0.186]

100 Gaussian process optimization in the bandit setting: No regret and experimental design. [sent-310, score-0.148]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('yt', 0.398), ('dt', 0.347), ('xt', 0.301), ('planck', 0.2), ('precision', 0.19), ('posterior', 0.189), ('locations', 0.161), ('ht', 0.151), ('yk', 0.137), ('prior', 0.131), ('xk', 0.121), ('gramian', 0.115), ('ripples', 0.115), ('maximizing', 0.113), ('indirect', 0.111), ('noisy', 0.108), ('kernel', 0.099), ('panel', 0.095), ('dg', 0.093), ('location', 0.089), ('cybernetics', 0.085), ('likelihood', 0.083), ('bayesian', 0.08), ('xi', 0.079), ('intelligent', 0.077), ('timeaveraged', 0.077), ('test', 0.074), ('institute', 0.072), ('hk', 0.071), ('pdf', 0.068), ('conjugate', 0.068), ('dk', 0.068), ('extrema', 0.068), ('favours', 0.068), ('ortega', 0.068), ('hypotheses', 0.067), ('argument', 0.067), ('obs', 0.062), ('optimizers', 0.062), ('width', 0.062), ('optimization', 0.061), ('stochastic', 0.06), ('exp', 0.059), ('distinct', 0.059), ('solid', 0.058), ('estimate', 0.056), ('outside', 0.056), ('inside', 0.055), ('observations', 0.053), ('mean', 0.052), ('culties', 0.051), ('regressor', 0.051), ('neighborhood', 0.051), ('gaussian', 0.05), ('certainty', 0.048), ('regret', 0.047), ('shaded', 0.047), ('curve', 0.046), ('global', 0.046), ('expresses', 0.045), ('normalizing', 0.044), ('virtually', 0.042), ('hypothesis', 0.041), ('dx', 0.041), ('xj', 0.041), ('tests', 0.04), ('bandit', 0.04), ('cos', 0.039), ('density', 0.039), ('illustration', 0.038), ('corresponds', 0.038), ('multiplying', 0.038), ('max', 0.037), ('expression', 0.037), ('illustrates', 0.035), ('intractable', 0.035), ('located', 0.034), ('bypassing', 0.034), ('balduzzi', 0.034), ('blackbox', 0.034), ('bypasses', 0.034), ('garnett', 0.034), ('inheriting', 0.034), ('neglect', 0.034), ('point', 0.033), ('dence', 0.033), ('inputs', 0.032), ('ance', 0.031), ('braun', 0.031), ('bristol', 0.031), ('brochu', 0.031), ('cora', 0.031), ('rawlik', 0.031), ('srinivas', 0.031), ('direct', 0.031), ('points', 0.03), ('observation', 0.03), ('guiding', 0.029), ('inspecting', 0.029), ('kappen', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun

Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.

2 0.42752248 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

3 0.37940043 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

4 0.25663713 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, √ the proposed algorithms achieve an O(1/ T ) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

5 0.2256541 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

Author: Marc Deisenroth, Shakir Mohamed

Abstract: Rich and complex time-series data, such as those generated from engineering systems, financial markets, videos, or neural recordings are now a common feature of modern data analysis. Explaining the phenomena underlying these diverse data sets requires flexible and accurate models. In this paper, we promote Gaussian process dynamical systems as a rich model class that is appropriate for such an analysis. We present a new approximate message-passing algorithm for Bayesian state estimation and inference in Gaussian process dynamical systems, a nonparametric probabilistic generalization of commonly used state-space models. We derive our message-passing algorithm using Expectation Propagation and provide a unifying perspective on message passing in general state-space models. We show that existing Gaussian filters and smoothers appear as special cases within our inference framework, and that these existing approaches can be improved upon using iterated message passing. Using both synthetic and real-world data, we demonstrate that iterated message passing can improve inference in a wide range of tasks in Bayesian state estimation, thus leading to improved predictions and more effective decision making. 1

6 0.22405356 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

7 0.22360305 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

8 0.22179282 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

9 0.21139567 293 nips-2012-Relax and Randomize : From Value to Algorithms

10 0.19241241 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

11 0.18045515 292 nips-2012-Regularized Off-Policy TD-Learning

12 0.16567409 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

13 0.1572894 195 nips-2012-Learning visual motion in recurrent neural networks

14 0.12336375 168 nips-2012-Kernel Latent SVM for Visual Recognition

15 0.11803151 300 nips-2012-Scalable nonconvex inexact proximal splitting

16 0.11760175 303 nips-2012-Searching for objects driven by context

17 0.11730887 283 nips-2012-Putting Bayes to sleep

18 0.11100668 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

19 0.10803243 282 nips-2012-Proximal Newton-type methods for convex optimization

20 0.10709167 27 nips-2012-A quasi-Newton proximal splitting method


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.322), (1, 0.001), (2, 0.215), (3, 0.432), (4, 0.09), (5, -0.161), (6, -0.003), (7, -0.064), (8, 0.077), (9, -0.051), (10, -0.081), (11, -0.028), (12, 0.126), (13, -0.019), (14, -0.009), (15, -0.033), (16, 0.027), (17, 0.037), (18, -0.009), (19, 0.044), (20, -0.039), (21, -0.005), (22, 0.107), (23, 0.023), (24, 0.049), (25, -0.015), (26, 0.032), (27, -0.032), (28, 0.059), (29, 0.017), (30, -0.056), (31, -0.029), (32, -0.049), (33, -0.018), (34, -0.008), (35, -0.032), (36, -0.033), (37, 0.026), (38, 0.072), (39, -0.009), (40, -0.009), (41, 0.04), (42, 0.033), (43, 0.012), (44, 0.037), (45, -0.024), (46, -0.003), (47, -0.045), (48, -0.012), (49, -0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95630521 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun

Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.

2 0.86557657 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

3 0.80922574 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

4 0.78664201 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

Author: Liva Ralaivola

Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1

5 0.75605822 293 nips-2012-Relax and Randomize : From Value to Algorithms

Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on the Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 1

6 0.74888009 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

7 0.73263586 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

8 0.69710565 292 nips-2012-Regularized Off-Policy TD-Learning

9 0.6756525 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

10 0.64652443 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

11 0.63315809 11 nips-2012-A Marginalized Particle Gaussian Process Regression

12 0.62827462 283 nips-2012-Putting Bayes to sleep

13 0.62336934 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

14 0.61620188 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

15 0.55961555 324 nips-2012-Stochastic Gradient Descent with Only One Projection

16 0.53323478 305 nips-2012-Selective Labeling via Error Bound Minimization

17 0.50935596 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

18 0.50454956 195 nips-2012-Learning visual motion in recurrent neural networks

19 0.45909518 23 nips-2012-A lattice filter model of the visual pathway

20 0.45594341 41 nips-2012-Ancestor Sampling for Particle Gibbs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (17, 0.016), (21, 0.035), (36, 0.015), (38, 0.165), (39, 0.015), (42, 0.041), (54, 0.042), (55, 0.016), (74, 0.047), (76, 0.197), (80, 0.151), (83, 0.125), (92, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94163668 146 nips-2012-Graphical Gaussian Vector for Image Categorization

Author: Tatsuya Harada, Yasuo Kuniyoshi

Abstract: This paper proposes a novel image representation called a Graphical Gaussian Vector (GGV), which is a counterpart of the codebook and local feature matching approaches. We model the distribution of local features as a Gaussian Markov Random Field (GMRF) which can efficiently represent the spatial relationship among local features. Using concepts of information geometry, proper parameters and a metric from the GMRF can be obtained. Then we define a new image feature by embedding the proper metric into the parameters, which can be directly applied to scalable linear classifiers. We show that the GGV obtains better performance over the state-of-the-art methods in the standard object recognition datasets and comparable performance in the scene dataset. 1

2 0.93890733 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

Author: Nicholas Ruozzi

Abstract: Sudderth, Wainwright, and Willsky conjectured that the Bethe approximation corresponding to any fixed point of the belief propagation algorithm over an attractive, pairwise binary graphical model provides a lower bound on the true partition function. In this work, we resolve this conjecture in the affirmative by demonstrating that, for any graphical model with binary variables whose potential functions (not necessarily pairwise) are all log-supermodular, the Bethe partition function always lower bounds the true partition function. The proof of this result follows from a new variant of the “four functions” theorem that may be of independent interest. 1

same-paper 3 0.93515843 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun

Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.

4 0.93073553 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

Author: Marc Deisenroth, Shakir Mohamed

Abstract: Rich and complex time-series data, such as those generated from engineering systems, financial markets, videos, or neural recordings are now a common feature of modern data analysis. Explaining the phenomena underlying these diverse data sets requires flexible and accurate models. In this paper, we promote Gaussian process dynamical systems as a rich model class that is appropriate for such an analysis. We present a new approximate message-passing algorithm for Bayesian state estimation and inference in Gaussian process dynamical systems, a nonparametric probabilistic generalization of commonly used state-space models. We derive our message-passing algorithm using Expectation Propagation and provide a unifying perspective on message passing in general state-space models. We show that existing Gaussian filters and smoothers appear as special cases within our inference framework, and that these existing approaches can be improved upon using iterated message passing. Using both synthetic and real-world data, we demonstrate that iterated message passing can improve inference in a wide range of tasks in Bayesian state estimation, thus leading to improved predictions and more effective decision making. 1

5 0.91666448 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

6 0.91609502 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

7 0.91385949 279 nips-2012-Projection Retrieval for Classification

8 0.91328484 200 nips-2012-Local Supervised Learning through Space Partitioning

9 0.91155356 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

10 0.91146708 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

11 0.91134089 292 nips-2012-Regularized Off-Policy TD-Learning

12 0.91110706 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

13 0.91000879 197 nips-2012-Learning with Recursive Perceptual Representations

14 0.90963268 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

15 0.90949994 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

16 0.90901178 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

17 0.90857613 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

18 0.90692198 280 nips-2012-Proper losses for learning from partial labels

19 0.90627015 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

20 0.90603459 277 nips-2012-Probabilistic Low-Rank Subspace Clustering