jmlr jmlr2013 jmlr2013-90 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Philipp Hennig, Martin Kiefel
Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes
Reference: text
sentIndex sentText sentNum sentScore
1 We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. [sent-9, score-0.086]
2 Introduction: Quasi-Newton algorithms are arguably the most popular class of nonlinear numerical optimization methods, used widely in numerical applications, not just in machine learning. [sent-12, score-0.086]
3 The algorithm performs consecutive line searches along one-dimensional subspaces x_i(α) = α e_i + x_0, with α ∈ R_+ and a unit-length vector e_i ∈ R^N spanning the i-th line search space starting at x_0. [sent-23, score-0.504]
4 The derivations of classical quasi-Newton algorithms proceed along the following line of argument: we require an update rule incorporating an observation ∇f(x_{i+1}) into a current estimate B_i to get a new estimate B_{i+1}, subject to the following desiderata. [sent-27, score-0.175]
5 Consistency with quadratic model: if f is locally well described to second order, then y_i ≡ ∇f(x_i) − ∇f(x_{i−1}) ≈ B(x_i) s_i, (1) with s_i ≡ x_i − x_{i−1}. [sent-31, score-0.38]
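To make the quadratic-model desideratum concrete, here is a minimal numerical check (a Python/NumPy sketch; the quadratic test function and all variable names are illustrative, not from the paper): for an exactly quadratic f(x) = ½ x^⊺ A x + b^⊺ x, the gradient difference satisfies y_i = A s_i exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
A = rng.standard_normal((N, N))
A = A @ A.T + N * np.eye(N)          # symmetric positive definite "Hessian"
b = rng.standard_normal(N)

grad = lambda x: A @ x + b           # gradient of f(x) = 0.5 x'Ax + b'x

x_prev = rng.standard_normal(N)
x_next = x_prev + 0.1 * rng.standard_normal(N)

s = x_next - x_prev                  # step s_i
y = grad(x_next) - grad(x_prev)      # gradient difference y_i

# For an exactly quadratic f, the relation y_i = A s_i holds to machine precision.
print(np.allclose(y, A @ s))         # True
```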
6 Varying the prior covariance and choosing one of two possible likelihoods gives rise to the different members of the family of quasi-Newton methods. [sent-41, score-0.17]
7 In fact, the posterior arising from the newly identified prior and likelihood assigns nonzero probability mass to non-symmetric matrices (Section 2). [sent-43, score-0.269]
8 A further advantage of the nonparametric formulation is that it allows the use of every gradient observation calculated during a line search instead of just the last one, something that is not easily achievable under the old parametric models. [sent-53, score-0.332]
9 Using s_i = x_i − x_{i−1}, we can write it using Dirac's distribution as p(y_i | B, s_i) = δ(y_i − B s_i) = lim_{β→0} N(y_i; S_▷^⊺ B⃗, V_{i−1} ⊗ β), (4) with an arbitrary N×N matrix V_{i−1}, a scalar β, and the linear operator S_▷ = (I ⊗ s_i) acting on the vectorized matrix B⃗ (the significance of the subscript ▷ will become clear later). [sent-70, score-0.591]
10 Instead of enforcing this point mass likelihood (4), we could equivalently minimize its negative logarithm −log p(y_i | B, s_i) = lim_{β→0} (1/β) (y_i − B s_i)^⊺ V^{−1} (y_i − B s_i) + const. [sent-71, score-0.308]
11 So the posterior is Gaussian, too, even for the limit case of a Dirac likelihood. [sent-77, score-0.151]
12 A first form for the posterior can be found by explicitly multiplying the two Gaussians and “completing the square” in the exponent of the product of Gaussians: posterior covariance and mean are Σ_▷ = (Σ_{i−1}^{−1} + S_▷ (V_{i−1}^{−1} ⊗ β^{−1}) S_▷^⊺)^{−1} and B⃗_▷ = Σ_▷ (S_▷ (V_{i−1}^{−1} ⊗ β^{−1}) Y⃗ + Σ_{i−1}^{−1} B⃗_{i−1}). [sent-79, score-0.271]
13 Using this result, we re-write the posterior mean, using the matrix inversion lemma, as B⃗_▷ = Σ_▷ (S_▷ (V_{i−1}^{−1} ⊗ β^{−1}) Y⃗ + Σ_{i−1}^{−1} B⃗_{i−1}) = B⃗_{i−1} + Σ_{i−1} S_▷ (S_▷^⊺ Σ_{i−1} S_▷ + V_{i−1} ⊗ β)^{−1} (Y⃗ − S_▷^⊺ B⃗_{i−1}). [sent-86, score-0.194]
14 Note that Σ_{i−1} S_▷ = (V_{i−1} ⊗ V_{i−1})(I ⊗ s_i) = (V_{i−1} ⊗ V_{i−1} s_i), and likewise S_▷^⊺ Σ_{i−1} S_▷ = (V_{i−1} ⊗ s_i^⊺ V_{i−1} s_i). [sent-88, score-0.828]
15 The new mean is a rank-1 update of the old mean, and the rank of the new covariance Σi is one less than that of Σi−1 . [sent-90, score-0.229]
16 The posterior mean has maximum posterior probability (minimal regularized loss), and is thus our new point estimate. [sent-91, score-0.338]
17 Choosing a unit-variance prior Σ_{i−1} = I ⊗ I recovers one of the oldest quasi-Newton algorithms, Broyden's method (1965): B_i = B_{i−1} + (y_i − B_{i−1} s_i) s_i^⊺ / (s_i^⊺ s_i). [sent-92, score-0.362]
18 Broyden's method does not satisfy the third requirement of Section 1: the updated estimate is, in general, not a symmetric matrix. [sent-93, score-0.608]
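A minimal sketch of Broyden's update as written above (NumPy; the quadratic test problem is illustrative). It also checks the point just made: the update satisfies the secant equation but generally destroys symmetry.

```python
import numpy as np

def broyden_update(B, s, y):
    """Broyden (1965) rank-1 update: B + (y - B s) s^T / (s^T s)."""
    r = y - B @ s
    return B + np.outer(r, s) / (s @ s)

rng = np.random.default_rng(1)
N = 4
A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)   # true (symmetric) Hessian
B = np.eye(N)                                                   # previous estimate B_{i-1}
s = rng.standard_normal(N)
y = A @ s                                                       # exact quadratic model

B_new = broyden_update(B, s, y)
print(np.allclose(B_new @ s, y))     # secant equation holds: True
print(np.allclose(B_new, B_new.T))   # but the update is not symmetric: False
```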
19 A supposed remedy for this problem, and in fact the only rank-1 update rule that obeys Equation (4) (Dennis and Moré, 1977), is the symmetric rank-1 (SR1) method (Davidon, 1959; Broyden, 1967): B_i = B_{i−1} + (y_i − B_{i−1} s_i)(y_i − B_{i−1} s_i)^⊺ / (s_i^⊺ (y_i − B_{i−1} s_i)). [sent-94, score-0.674]
20 The SR1 update rule has acquired a controversial reputation. [sent-95, score-0.307]
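For comparison, a sketch of the SR1 update just quoted (NumPy; the safeguard against a tiny denominator is a common practical convention, not something stated here). Unlike Broyden's method it preserves symmetry while still satisfying the secant condition.

```python
import numpy as np

def sr1_update(B, s, y, tol=1e-12):
    """Symmetric rank-1 (SR1) update; leaves B unchanged if the denominator is tiny."""
    r = y - B @ s
    denom = s @ r
    if abs(denom) < tol * np.linalg.norm(s) * np.linalg.norm(r):
        return B
    return B + np.outer(r, r) / denom

rng = np.random.default_rng(2)
N = 4
A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)
B = np.eye(N)
s = rng.standard_normal(N)
y = A @ s

B_new = sr1_update(B, s, y)
print(np.allclose(B_new, B_new.T))   # symmetric: True
print(np.allclose(B_new @ s, y))     # secant equation: True
```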
21 Our Bayesian interpretation identifies the SR1 formula as Gaussian regression with a data-dependent prior variance involving V_{i−1} with V_{i−1} s_i = (y_i − B_{i−1} s_i). [sent-99, score-0.638]
22 Given the explicitly Gaussian prior of Equation (6), there is no rank-1 update rule that gives a symmetric posterior. [sent-100, score-0.208]
23 Since this is a linear map, the resulting posterior is analytic, and Gaussian. [sent-107, score-0.151]
24 The posterior has mean B⃗_i = B⃗_▷ + Σ_▷ S_◁ (K_◁ + γ I ⊗ V_▷)^{−1} (y_i^⊺ − S_◁^⊺ B⃗_▷) = B⃗_{i−1} + (I ⊗ V_{i−1} s_i)(y_i − B_{i−1} s_i) / (s_i^⊺ V_{i−1} s_i + β) [sent-117, score-1.567]
25 + (V_{i−1} s_i ⊗ V_i)[(s_i^⊺ V_{i−1} s_i + γ) ⊗ V_i]^{−1} [y_i^⊺ − s_i^⊺ (B_{i−1} + (y_i − B_{i−1} s_i) s_i^⊺ V_{i−1} / (s_i^⊺ V_{i−1} s_i + β))]. The calculation for the posterior covariance can be reduced to a simple symmetry argument. [sent-118, score-1.111]
26 Taking the limits β, γ → 0 gives the update B_i of Equations (10) and (11); the posterior mean is clearly symmetric if B_{i−1} is symmetric (as V_{i−1} is symmetric by definition). [sent-120, score-0.907]
27 Choosing the unit prior Σi−1 = I ⊗ I once more, Equation (10) gives what is known as Powell’s (1970) symmetric Broyden (PSB) update. [sent-121, score-0.142]
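A sketch of the PSB update named here, in its V = I form (NumPy; since Equation (10) itself is garbled in this extract, this follows the textbook Powell-symmetric-Broyden formula, which should correspond to Equation (10) with a unit prior):

```python
import numpy as np

def psb_update(B, s, y):
    """Powell symmetric Broyden (PSB) update, V = I case."""
    r = y - B @ s
    ss = s @ s
    return (B
            + (np.outer(r, s) + np.outer(s, r)) / ss
            - (s @ r) * np.outer(s, s) / ss ** 2)

rng = np.random.default_rng(3)
N = 4
A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)
B = np.eye(N)
s = rng.standard_normal(N)
y = A @ s

B_new = psb_update(B, s, y)
print(np.allclose(B_new, B_new.T))   # symmetric: True
print(np.allclose(B_new @ s, y))     # secant equation: True
```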
28 Equation (10) has previously been known to be the most general form of a symmetric rank-2 update obeying the quasi-Newton equation (1) and minimizing a Frobenius regularizer (Dennis and Moré, 1977). [sent-122, score-0.173]
29 But note that symmetry only extends to the mean, not the entire belief: in contrast to the posterior generated by Equation (8), samples from this posterior are, with probability 1, not symmetric. [sent-124, score-0.35]
30 Since Γ is a symmetric linear operator (Equation 12), the projection of any Gaussian belief N(X; X_0, Σ) onto the space of symmetric matrices is itself a Gaussian N(ΓX; ΓX_0, ΓΣΓ). [sent-126, score-0.163]
31 But symmetrized samples from the posterior of Equations (10), (11) do not necessarily obey the quasi-Newton Equation (1). [sent-127, score-0.151]
32 So quasi-Newton methods ensure symmetry in the maximum of the posterior, but not in the posterior itself. [sent-132, score-0.199]
33 And, if we exchange in the entire preceding derivation s ↔ y, B ↔ B^{−1}, and B_{i−1} ↔ B_{i−1}^{−1}, then we arrive at the BFGS method (Broyden, 1969; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970), which ranks among the most widely used algorithms in numerical optimization. [sent-141, score-0.66]
34 DFP and BFGS owe much of their popularity to the fact that the updated B_{i,DFP} and B_{i,BFGS}^{−1} are guaranteed to be positive definite whenever B_{i−1,DFP} and B_{i−1,BFGS}^{−1} are positive definite, respectively, and additionally y_i^⊺ s_i > 0. [sent-145, score-0.276]
35 If the prior covariance is not to depend on the data, it is thus impossible to guarantee positive definiteness in this framework; BFGS and DFP circumvent this conceptual issue by choosing V_{i−1} = B, then applying Equation (1) a second time. [sent-148, score-0.17]
36 These observations do not rule out any utility of guaranteeing positive definiteness in this way, and the prior (13) deserves closer study. [sent-152, score-0.141]
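As a concrete illustration of the positive-definiteness property of DFP and BFGS discussed above, here is a sketch of the two standard update formulas (NumPy; the random test problem is illustrative). When y^⊺s > 0 and the previous estimate is positive definite, both updates stay positive definite and satisfy the secant equation.

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian estimate B."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def dfp_update(B, s, y):
    """DFP update of the Hessian estimate B."""
    rho = 1.0 / (y @ s)
    W = np.eye(len(s)) - rho * np.outer(y, s)
    return W @ B @ W.T + rho * np.outer(y, y)

rng = np.random.default_rng(4)
N = 5
A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)   # SPD "true" Hessian
B = np.eye(N)
s = rng.standard_normal(N)
y = A @ s                                                       # ensures y^T s > 0 here

for update in (bfgs_update, dfp_update):
    B_new = update(B, s, y)
    eigs = np.linalg.eigvalsh(B_new)
    # secant equation holds and the smallest eigenvalue stays positive
    print(update.__name__, np.allclose(B_new @ s, y), eigs.min() > 0)
```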
37 Rank-M Updates: The classical quasi-Newton algorithms update the mean of the belief at every step in a rank-2 operation, then, implicitly, reset their uncertainty in the next step, thereby discarding information acquired earlier. [sent-159, score-0.153]
38 It is straightforward to extend Equation (4) to observations (Y, S) from several line searches: Y_{nm} = ∇_n f(x_{i_m}) − ∇_n f(x_{i_m−1}), S_{nm} = x_{i_m,n} − x_{i_m−1,n}. [sent-162, score-0.259]
39 Given a prior p(B) = N(B; B_0, V_0), the Gaussian posterior then has mean and covariance B_i = B_0 + (Y − B_0 S)(S^⊺ V_0 S)^{−1} S^⊺ V_0 + V_0 S (S^⊺ V_0 S)^{−1} (Y − B_0 S)^⊺ − V_0 S (S^⊺ V_0 S)^{−1} (S^⊺ (Y − B_0 S)) (S^⊺ V_0 S)^{−1} S^⊺ V_0, (14) and Σ_i = (V_0 − V_0 S (S^⊺ V_0 S)^{−1} S^⊺ V_0) ⊗ (V_0 − V_0 S (S^⊺ V_0 S)^{−1} S^⊺ V_0). [sent-163, score-0.357]
40 Here, the absence of information about the symmetry of the Hessian becomes even more obvious: no matter the prior covariance V_0, because of the term S^⊺ Y in the second line of Equation (14), the posterior mean is not in general symmetric, unless Y = BS. [sent-164, score-0.485]
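A sketch of the rank-M mean update of Equation (14) as reconstructed above (NumPy; the test setup and variable names are illustrative). With observations that are exactly consistent with a symmetric Hessian and a symmetric B_0, the mean reproduces all secant conditions and happens to come out symmetric; perturbing the observations breaks the symmetry, which illustrates the remark above.

```python
import numpy as np

def rank_m_update(B0, V0, S, Y):
    """Posterior mean of Equation (14), as reconstructed above (rank-M parametric update)."""
    R = Y - B0 @ S                          # residuals of the prior mean prediction
    StVS = S.T @ V0 @ S
    G = np.linalg.solve(StVS, S.T @ V0)     # (S^T V0 S)^{-1} S^T V0
    V0S = V0 @ S
    return (B0 + R @ G
            + V0S @ np.linalg.solve(StVS, R.T)
            - V0S @ np.linalg.solve(StVS, S.T @ R) @ G)

rng = np.random.default_rng(5)
N, M = 6, 3
A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)    # true symmetric Hessian
S = rng.standard_normal((N, M))                                  # M observed steps
Y = A @ S                                                        # exact gradient differences
B0, V0 = np.eye(N), np.eye(N)

Bi = rank_m_update(B0, V0, S, Y)
print(np.allclose(Bi @ S, Y), np.allclose(Bi, Bi.T))    # True True for exactly consistent data

Y_noisy = Y + 1e-2 * rng.standard_normal((N, M))         # slightly inconsistent observations
Bi_noisy = rank_m_update(B0, V0, S, Y_noisy)
print(np.allclose(Bi_noisy, Bi_noisy.T))                 # False: the mean is no longer symmetric
```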
41 Two particularly interesting observations concern the way in which the desiderata of symmetry and positive definiteness of the MAP estimator are achieved in these algorithms. [sent-170, score-0.143]
42 Similarly, positive definiteness is just guaranteed for the mode, not the entire support of the posterior distribution. [sent-172, score-0.151]
43 It showed that quasi-Newton methods can be interpreted as Gaussian regressors using algebraic structure to weaken prior knowledge, in exchange for lower computational cost. [sent-178, score-0.119]
44 Old observations collected “far” from the current location (in the sense that a second order expansion is a poor approximation) may thus be useless or even harmful. [sent-181, score-0.09]
45 On an only slightly related point, individual line searches typically involve several evaluations of the objective function f and its gradient; but the algorithms only make use of one of those (the last one). [sent-183, score-0.228]
46 Even the rank-M extension of Section 2.3 has this problem, because the matrix S of several observations along one line search has rank 1, so the inverse of S^⊺ V_0 S is not defined. [sent-185, score-0.217]
47 The mean function is assumed to be an arbitrary integrable function B0 (x) (in our implementation we use a constant function, but the analytic derivations do not need to be so restrictive). [sent-189, score-0.139]
48 The core idea is to assume that the covariance between the element B_{ij} at location x and the entry B_{kℓ} at location x′ is cov(B_{ij}(x), B_{kℓ}(x′)) = k_{ik}(x, x′) k_{jℓ}(x, x′) = (k ⊗ k)_{(ij)(kℓ)}(x, x′), with an N × N matrix of kernels, k. [sent-190, score-0.119]
49 In our treatment, we will replace this approximate statement with its exact version: we observe the value of the line integral along the path r_i : [0, 1] → R^N, r_i(t) = x_{i−1} + t(x_i − x_{i−1}), namely Y_{ni} = Σ_m ∫_{r_i} B_{nm}(x) dx_m. [sent-209, score-0.167]
50 m i rm Note that, for scalar fields φi with Bim = ∇m φi , such as the gradient φi = ∇i f , it follows from the chain rule that (the following derivations again use the sum convention defined in Section 1. [sent-210, score-0.108]
51 dt ∂t Thus, our line integral obeys Yi j = ∫ Bim (x) dxm = ∫ rj =∫ 0 1 0 1 j Bim (r j (t)) ⋅ ∂t rm (t) dt (16) ∂t φi (r j (t)) dt = φi (r j (1)) − φi (r j (0)). [sent-212, score-0.356]
52 This is the classic result that line integrals over the gradients of scalar fields are independent of the path taken; they only depend on the starting and end points of the path. [sent-213, score-0.19]
53 In particular, our path r_j satisfies ∂_t r_m^j(t) = S_{jm} (its derivative is constant), and our line integral can be written as Y_{ij} = ∫_0^1 B_{im}(r_j(t)) S_{jm} dt = δ_{ik} S_{jm} ∫_0^1 B_{km}(t_j) dt_j, or in vectorized form Y⃗ = [I ⊗ (S^⊺ ⊙ ∫)] B⃗ ≡ S_▷^⊺ B⃗. (17) [sent-214, score-0.415]
54 Figure 1: One-dimensional Gaussian process inference from integral observations (squared exponential kernel); the plot shows f(x) against x. [sent-215, score-0.174]
55 Corresponding integrals over the mean, and each sample, are consistent with the integral observations. [sent-219, score-0.126]
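A small, self-contained sketch of the kind of inference illustrated in Figure 1: a one-dimensional GP with an SE kernel, conditioned on noisy integral observations (Python with NumPy/SciPy; the intervals, data values, length scale, and noise level are made up for illustration, and the kernel integrals are computed by numerical quadrature rather than the analytic error-function expressions of the paper's appendix).

```python
import numpy as np
from scipy.integrate import quad, dblquad

ell, theta2, noise = 0.8, 1.0, 1e-4
k = lambda a, b: theta2 * np.exp(-0.5 * (a - b) ** 2 / ell ** 2)   # SE kernel

# integral observations y_j = int_{a_j}^{b_j} f(t) dt + noise (illustrative values)
intervals = [(-3.0, -1.0), (-1.0, 0.5), (0.5, 2.5)]
y = np.array([-1.2, 0.4, 1.1])

def k_int(x, j):
    """Cross-covariance between f(x) and the j-th integral observation."""
    a, b = intervals[j]
    return quad(lambda t: k(x, t), a, b)[0]

M = len(intervals)
# Gram matrix of the integral observations: double integrals of the kernel
K = np.array([[dblquad(lambda t, u: k(t, u), *intervals[i],
                       lambda _: intervals[j][0], lambda _: intervals[j][1])[0]
               for j in range(M)] for i in range(M)])

xs = np.linspace(-4, 4, 9)
kx = np.array([[k_int(x, j) for j in range(M)] for x in xs])
mean = kx @ np.linalg.solve(K + noise * np.eye(M), y)              # GP posterior mean at xs
print(np.round(mean, 3))
```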
56 An interesting aspect to note is that, while path-independence holds for the ground-truth integrals of Equation (16), the prior covariance of Equation (15) does not encode this fact. [sent-225, score-0.24]
57 Assume a Gaussian process prior, with mean function µ(x), covariance function (kernel) k. [sent-240, score-0.12]
58 The posterior can be found as above, by “completing the square”: p(f | y) = N(f; Ψ(K^{−1}µ + σ^{−2} m y), Ψ), with the covariance Ψ = (K^{−1} + σ^{−2} m m^⊺)^{−1} = K − K m m^⊺ K / (m^⊺ K m + σ²). [sent-245, score-0.235]
59 The analogous objects for the matrix-valued case are the integrated kernel k(x) ∈ {R^N → R^{N×M}} of Equation (20) and the integrated mean function S_▷^⊺ B⃗_0. These objects are homologous to concepts in canonical Gaussian process inference: B_{0,nm} is the n-th mean prediction along the m-th line integral observation. [sent-253, score-0.208]
60 knm (x ) is the covariance between the n-th column of the Hessian at location x and the m-th line-integral observation. [sent-254, score-0.119]
61 K pq is the covariance between the p-th and q-th line integral observations. [sent-255, score-0.22]
62 The derivations for the covariance are similar and contain the same terms. [sent-256, score-0.148]
63 Together with the dual observation, we arrive at a posterior with mean and covariance functions B_◇(x) = B_0(x) + (Y − B_0) K^{−1} k^⊺(x) + k(x) K^{−1} (Y − B_0)^⊺ − k(x) K^{−1} S^⊺ (Y − B_0) K^{−1} k^⊺(x) and Σ_◇(x, x′) = [k(x, x′) − k(x) K^{−1} k^⊺(x′)] ⊗ [k(x, x′) − k(x) K^{−1} k^⊺(x′)]. [sent-257, score-0.153]
64 The actual numerical realisation of this nonparametric method involves relatively tedious algebraic derivations, which can be found in Appendix A. [sent-258, score-0.121]
65 An important aspect is that, because k is a positive definite kernel, unless two observations are exactly identical, K has full rank M (the number of function evaluations), even if several observations take place within one shared 1-dimensional subspace. [sent-259, score-0.145]
66 So it is possible to make full use of all function evaluations made during line searches, not just the last one (as in the parametric setting of existing quasi-Newton methods). [sent-260, score-0.167]
67 As in Section 2.3, it is clear that the posterior mean is not in general a symmetric matrix. [sent-263, score-0.243]
68 Where it is not, note that, because optimization proceeds along a trajectory through the parameter space, old observations tend to have low covariance with the Hessian at the current location, and thus a small effect on the local mean estimate. [sent-268, score-0.218]
69 Gaussian process posterior with thick mean and two standard deviations marginal variance as shaded region, as well as three samples as dashed lines. [sent-279, score-0.187]
70 The posterior from all observations captures much more structure, and in particular a different mean estimate at the end of the line search (x = 2. [sent-282, score-0.369]
71 The prior is over the elements of the Hessian, and gradient observations are integrals of this function. [sent-291, score-0.169]
72 One may wonder why we did not just start with a Gaussian process prior on the objective function f and use observations of the gradient to infer the Hessian directly from there. [sent-292, score-0.185]
73 The covariance between elements of the gradient is cov(∂f(x)/∂x_i, ∂f(x′)/∂x_j) = ∂²k(x, x′)/(∂x_i ∂x_j′), (21) which, for an SE kernel, is (δ_ij/λ_j² − (x_i − x_i′)(x_j − x_j′)/(λ_i² λ_j²)) k_SE(x, x′). [sent-294, score-0.095]
74 The covariance between elements of the Hessian and elements of the gradient is cov(∂²f(x)/(∂x_i ∂x_j), ∂f(x′)/∂x_k) = ∂³k(x, x′)/(∂x_i ∂x_j ∂x_k′), which, for an SE kernel, is −(δ_jk(x_i − x_i′)/(λ_i² λ_k²) + δ_ik(x_j − x_j′)/(λ_j² λ_k²) + δ_ij(x_k − x_k′)/(λ_j² λ_k²) − (x_i − x_i′)(x_j − x_j′)(x_k − x_k′)/(λ_i² λ_j² λ_k²)) k_SE(x, x′). [sent-295, score-0.223]
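The gradient-covariance expression (21), as reconstructed above, can be checked numerically by finite-differencing the SE kernel (a sketch; the length scales and test points are arbitrary):

```python
import numpy as np

lam = np.array([0.7, 1.3, 0.9])                        # length scales lambda_d
kSE = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2 / lam ** 2))

def grad_cov_analytic(x, z, i, j):
    """cov(df(x)/dx_i, df(z)/dz_j) = d^2 k / (dx_i dz_j) for the SE kernel, per (21)."""
    return ((i == j) / lam[j] ** 2
            - (x[i] - z[i]) * (x[j] - z[j]) / (lam[i] ** 2 * lam[j] ** 2)) * kSE(x, z)

def grad_cov_numeric(x, z, i, j, h=1e-4):
    """Central finite difference of k(x, z) in x_i and z_j."""
    ei, ej = np.eye(3)[i] * h, np.eye(3)[j] * h
    return (kSE(x + ei, z + ej) - kSE(x + ei, z - ej)
            - kSE(x - ei, z + ej) + kSE(x - ei, z - ej)) / (4 * h ** 2)

x = np.array([0.3, -0.2, 1.1])
z = np.array([-0.5, 0.4, 0.9])
for i in range(3):
    for j in range(3):
        assert abs(grad_cov_analytic(x, z, i, j) - grad_cov_numeric(x, z, i, j)) < 1e-5
print("analytic and finite-difference gradient covariances agree")
```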
75 Experiments: The calculations required by the nonparametric quasi-Newton algorithm using the squared-exponential kernel involve exponential functions, error functions, and numerical integrals (see Appendix A for details). [sent-307, score-0.191]
76 Figure 3: Minimization of a 100-dimensional quadratic; the plot shows f(x)/f(x_0) against the number of line searches for DFP, BFGS, parametric Bayes, and nonparametric Bayes. [sent-315, score-0.118]
77 The Bayesian algorithms converge more regularly and faster initially, but suffer from bad numerical conditioning toward the end of the optimization. [sent-319, score-0.105]
78 The plot also shows the mean belief on one element of the Hessian. [sent-323, score-0.087]
79 It was gathered on the corresponding posterior after the addition of 10 datapoints per problem. [sent-329, score-0.184]
80 This makes the objective function less regular, meaning that the optimal Newton path to the minimum has a more complex shape, and more line searches are necessary to converge to the minimum. [sent-330, score-0.181]
81 Figure 4: Minimizing Rosenbrock's polynomial, a non-convex function with unique minimum at (1,1); the horizontal axis counts line searches. [sent-337, score-0.181]
82 Top left: Function values, line search trajectory of the Bayesian algorithm in white. [sent-340, score-0.127]
83 Middle Row: Two times marginal posterior standard deviation (a.k.a. [sent-342, score-0.151]
84 posterior uncertainty, left) and mean estimate (right) of the Bayesian regressor. [sent-345, score-0.187]
85 The cross after 24 line searches marks the point where the Bayesian method switches to a local parametric model for numerical stability. [sent-348, score-0.264]
86 Right: Minimizing the corresponding posterior after the addition of 10 datapoints sampled from the correct model. [sent-353, score-0.184]
87 The datapoints create a more complicated optimization problem in which line searches tend to be shorter, thus reducing the advantage of the Bayesian method gained from superior Hessian estimates. [sent-354, score-0.214]
88 Averages over 20 sampled problems; plotted is the relative distance from initial function value (shared by all algorithms) to the minimum, as a function of the number of line searches (all algorithms use the same line search method). [sent-355, score-0.308]
89 Figure 6: Minimizing randomly generated 4-dimensional analytic functions; the horizontal axis counts line searches. [sent-369, score-0.22]
90 Right: function values achieved by three numerical optimizers as a function of the number of line searches. [sent-373, score-0.159]
91 A numerical challenge in the implementation arises from the required integrals over squared exponentials. [sent-397, score-0.113]
92 For this purpose, it is helpful to use an explicit notation for individual line searches: we change the index set from m to (jh), and let observation y_m have been taken as the h-th observation of line search number j. [sent-400, score-0.207]
93 If the line search proceeded along unit direction e_j and started from x_{0j}, then the h-th observation was the difference between the gradients at locations x_{0j} + (η_h − ν_h) e_j and x_{0j} + ν_h e_j. [sent-401, score-0.167]
94 Along its block diagonal lie the covariances between observations collected as part of the same line search. [sent-410, score-0.219]
95 These have the form K_{ii}^{hk} = (η_h − ν_h)(η_k − ν_k)(e_i^⊺ V e_i) θ² ∬_0^1 exp[−½ [(η_h − ν_h) t_h e_i + ν_h e_i + x_{0i} − (η_k − ν_k) t_k e_i − ν_k e_i − x_{0i}]^⊺ Λ^{−1} [⋅]] dt_h dt_k [sent-411, score-0.785]
96 = (e_i^⊺ V e_i) θ² ∫_{ν_h}^{η_h} ∫_{ν_k}^{η_k} exp[−½ (u_h − u_k)² e_i^⊺ Λ^{−1} e_i] du_h du_k. [sent-414, score-0.188]
97 Such integrals can be evaluated by integration by parts, leading to an analytic expression that involves only error functions and exponential functions (Peltonen, 2012). [sent-416, score-0.109]
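To illustrate the flavour of these calculations, the inner integral of the double integral above has a closed form in terms of the error function; the following sketch (arbitrary length scale and integration limits) checks that semi-analytic evaluation against brute-force quadrature. The paper's appendix goes further and makes the outer integral analytic as well.

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import quad, dblquad

ell = 0.6                      # length scale (illustrative)
nu_h, eta_h = 0.2, 1.5         # integration limits of the h-th observation
nu_k, eta_k = -0.4, 0.9        # ... and of the k-th observation

def inner(u):
    """Closed form of int_{nu_k}^{eta_k} exp(-(u - v)^2 / (2 ell^2)) dv via the error function."""
    c = np.sqrt(2.0) * ell
    return ell * np.sqrt(np.pi / 2.0) * (erf((eta_k - u) / c) - erf((nu_k - u) / c))

semi_analytic = quad(inner, nu_h, eta_h)[0]
brute_force = dblquad(lambda v, u: np.exp(-(u - v) ** 2 / (2 * ell ** 2)),
                      nu_h, eta_h, lambda _: nu_k, lambda _: eta_k)[0]
print(np.isclose(semi_analytic, brute_force))   # True
```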
98 The most challenging calculations involve elements of K describing the covariance between observations made along different line search directions. [sent-417, score-0.266]
99 We make use, once more, of the closure of the Gaussian exponential family under linear maps, to write K_{ij}^{hk} = (η_h − ν_h)(η_k − ν_k) (e_j^⊺ V e_i) ∬_0^1 exp[−½ (⋅)^⊺ Λ^{−1} (⋅)] dt_h dt_k. [sent-418, score-0.157]
100 Using an argument largely analogous to the derivation of the L-BFGS algorithm (Nocedal, 1980), a diagonal prior mean B_0 lowers the cost to O(NM + M³), linear in N. [sent-427, score-0.122]
wordName wordTfidf (topN-words)
[('bfgs', 0.294), ('vi', 0.289), ('si', 0.276), ('bi', 0.253), ('dfp', 0.251), ('ennig', 0.17), ('ewton', 0.17), ('iefel', 0.17), ('uasi', 0.17), ('ei', 0.157), ('hessian', 0.157), ('broyden', 0.155), ('posterior', 0.151), ('irection', 0.146), ('niteness', 0.138), ('dennis', 0.124), ('mor', 0.124), ('ew', 0.107), ('searches', 0.101), ('davidon', 0.093), ('ethods', 0.088), ('prior', 0.086), ('covariance', 0.084), ('fletcher', 0.083), ('km', 0.083), ('line', 0.08), ('nonparametric', 0.078), ('bim', 0.077), ('bsi', 0.077), ('nocedal', 0.072), ('jm', 0.072), ('integrals', 0.07), ('bayesian', 0.067), ('yi', 0.065), ('dm', 0.064), ('derivations', 0.064), ('dt', 0.063), ('gaussian', 0.062), ('regularly', 0.062), ('xim', 0.062), ('cov', 0.056), ('integral', 0.056), ('symmetric', 0.056), ('observations', 0.055), ('powell', 0.055), ('hennig', 0.053), ('bs', 0.053), ('equation', 0.051), ('belief', 0.051), ('symmetry', 0.048), ('search', 0.047), ('evaluations', 0.047), ('ahh', 0.046), ('akk', 0.046), ('goldfarb', 0.046), ('psb', 0.046), ('shanno', 0.046), ('kronecker', 0.044), ('gradient', 0.044), ('old', 0.043), ('inversion', 0.043), ('numerical', 0.043), ('quadratic', 0.041), ('gradients', 0.04), ('uh', 0.04), ('kkm', 0.04), ('hessians', 0.04), ('mccormick', 0.04), ('aik', 0.04), ('desiderata', 0.04), ('parametric', 0.04), ('analytic', 0.039), ('nm', 0.039), ('xi', 0.039), ('mean', 0.036), ('optimizers', 0.036), ('xq', 0.036), ('rank', 0.035), ('location', 0.035), ('rn', 0.033), ('datapoints', 0.033), ('regressors', 0.033), ('dual', 0.033), ('likelihood', 0.032), ('update', 0.031), ('axb', 0.031), ('bik', 0.031), ('desideratum', 0.031), ('detrk', 0.031), ('dtk', 0.031), ('dxm', 0.031), ('greenstadt', 0.031), ('hestenes', 0.031), ('kiefel', 0.031), ('kmm', 0.031), ('kse', 0.031), ('mpg', 0.031), ('rosenbrock', 0.031), ('tuebingen', 0.031), ('uhi', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 90 jmlr-2013-Quasi-Newton Method: A New Direction
Author: Philipp Hennig, Martin Kiefel
Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes
2 0.09083122 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
Author: Edward Challis, David Barber
Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes
3 0.081904665 86 jmlr-2013-Parallel Vector Field Embedding
Author: Binbin Lin, Xiaofei He, Chiyuan Zhang, Ming Ji
Abstract: We propose a novel local isometry based dimensionality reduction method from the perspective of vector fields, which is called parallel vector field embedding (PFE). We first give a discussion on local isometry and global isometry to show the intrinsic connection between parallel vector fields and isometry. The problem of finding an isometry turns out to be equivalent to finding orthonormal parallel vector fields on the data manifold. Therefore, we first find orthonormal parallel vector fields by solving a variational problem on the manifold. Then each embedding function can be obtained by requiring its gradient field to be as close to the corresponding parallel vector field as possible. Theoretical results show that our method can precisely recover the manifold if it is isometric to a connected open subset of Euclidean space. Both synthetic and real data examples demonstrate the effectiveness of our method even if there is heavy noise and high curvature. Keywords: manifold learning, isometry, vector field, covariant derivative, out-of-sample extension
4 0.07836616 120 jmlr-2013-Variational Algorithms for Marginal MAP
Author: Qiang Liu, Alexander Ihler
Abstract: The marginal maximum a posteriori probability (MAP) estimation problem, which calculates the mode of the marginal posterior distribution of a subset of variables with the remaining variables marginalized, is an important inference problem in many models, such as those with hidden variables or uncertain parameters. Unfortunately, marginal MAP can be NP-hard even on trees, and has attracted less attention in the literature compared to the joint MAP (maximization) and marginalization problems. We derive a general dual representation for marginal MAP that naturally integrates the marginalization and maximization operations into a joint variational optimization problem, making it possible to easily extend most or all variational-based algorithms to marginal MAP. In particular, we derive a set of “mixed-product” message passing algorithms for marginal MAP, whose form is a hybrid of max-product, sum-product and a novel “argmax-product” message updates. We also derive a class of convergent algorithms based on proximal point methods, including one that transforms the marginal MAP problem into a sequence of standard marginalization problems. Theoretically, we provide guarantees under which our algorithms give globally or locally optimal solutions, and provide novel upper bounds on the optimal objectives. Empirically, we demonstrate that our algorithms significantly outperform the existing approaches, including a state-of-the-art algorithm based on local search methods. Keywords: graphical models, message passing, belief propagation, variational methods, maximum a posteriori, marginal-MAP, hidden variable models
5 0.06394583 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
Author: Markus Thom, Günther Palm
Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity
6 0.063219406 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
7 0.061840344 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
8 0.056925718 121 jmlr-2013-Variational Inference in Nonconjugate Models
9 0.056237455 108 jmlr-2013-Stochastic Variational Inference
10 0.05125067 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
11 0.050723847 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
12 0.049351331 15 jmlr-2013-Bayesian Canonical Correlation Analysis
13 0.04875996 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
14 0.046001021 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
15 0.045471057 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
16 0.045094665 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
17 0.045010757 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
18 0.044652615 104 jmlr-2013-Sparse Single-Index Model
19 0.044283811 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
20 0.043626167 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
topicId topicWeight
[(0, -0.235), (1, -0.097), (2, 0.054), (3, -0.05), (4, 0.029), (5, 0.018), (6, -0.023), (7, -0.027), (8, -0.012), (9, -0.023), (10, -0.029), (11, -0.006), (12, -0.006), (13, -0.029), (14, 0.041), (15, -0.035), (16, -0.085), (17, -0.034), (18, -0.039), (19, 0.112), (20, 0.115), (21, 0.02), (22, -0.145), (23, 0.043), (24, 0.123), (25, -0.211), (26, -0.18), (27, -0.0), (28, -0.181), (29, -0.066), (30, 0.027), (31, -0.044), (32, 0.132), (33, -0.06), (34, -0.266), (35, 0.17), (36, 0.076), (37, 0.057), (38, -0.097), (39, 0.016), (40, 0.119), (41, -0.194), (42, 0.088), (43, -0.02), (44, -0.047), (45, -0.069), (46, -0.075), (47, 0.01), (48, -0.045), (49, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.96180785 90 jmlr-2013-Quasi-Newton Method: A New Direction
Author: Philipp Hennig, Martin Kiefel
Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes
2 0.39486462 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
Author: Edward Challis, David Barber
Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes
3 0.39378512 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos
Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm
4 0.38598204 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
Author: Markus Thom, Günther Palm
Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity
5 0.38391328 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
Author: Vinayak Rao, Yee Whye Teh
Abstract: Markov jump processes (or continuous-time Markov chains) are a simple and important class of continuous-time dynamical systems. In this paper, we tackle the problem of simulating from the posterior distribution over paths in these models, given partial and noisy observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea of uniformization. This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path, and then sampling a new path given the set of extant and virtual jump times. The first step involves simulating a piecewise-constant inhomogeneous Poisson process, while for the second, we use a standard hidden Markov model forward filtering-backward sampling algorithm. Our method is exact and does not involve approximations like time-discretization. We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks, and show significant computational benefits over state-ofthe-art MCMC samplers for these models. Keywords: Markov jump process, MCMC, Gibbs sampler, uniformization, Markov-modulated Poisson process, continuous-time Bayesian network
6 0.37947208 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
7 0.37406817 120 jmlr-2013-Variational Algorithms for Marginal MAP
8 0.36558479 15 jmlr-2013-Bayesian Canonical Correlation Analysis
9 0.35069647 83 jmlr-2013-Orange: Data Mining Toolbox in Python
10 0.34961343 40 jmlr-2013-Efficient Program Synthesis Using Constraint Satisfaction in Inductive Logic Programming
11 0.34278399 86 jmlr-2013-Parallel Vector Field Embedding
12 0.33634773 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
13 0.32605341 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
14 0.31628793 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
15 0.31490508 68 jmlr-2013-Machine Learning with Operational Costs
16 0.31277865 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
17 0.31273973 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
18 0.29399481 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes
19 0.28128579 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
20 0.27916166 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization
topicId topicWeight
[(0, 0.541), (5, 0.114), (6, 0.042), (10, 0.07), (20, 0.012), (23, 0.021), (68, 0.012), (70, 0.02), (75, 0.052), (85, 0.019), (87, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.89322597 90 jmlr-2013-Quasi-Newton Method: A New Direction
Author: Philipp Hennig, Martin Kiefel
Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes
2 0.88153124 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
Author: Pekka Parviainen, Mikko Koivisto
Abstract: We consider the problem of finding a directed acyclic graph (DAG) that optimizes a decomposable Bayesian network score. While in a favorable case an optimal DAG can be found in polynomial time, in the worst case the fastest known algorithms rely on dynamic programming across the node subsets, taking time and space 2n , to within a factor polynomial in the number of nodes n. In practice, these algorithms are feasible to networks of at most around 30 nodes, mainly due to the large space requirement. Here, we generalize the dynamic programming approach to enhance its feasibility in three dimensions: first, the user may trade space against time; second, the proposed algorithms easily and efficiently parallelize onto thousands of processors; third, the algorithms can exploit any prior knowledge about the precedence relation on the nodes. Underlying all these results is the key observation that, given a partial order P on the nodes, an optimal DAG compatible with P can be found in time and space roughly proportional to the number of ideals of P , which can be significantly less than 2n . Considering sufficiently many carefully chosen partial orders guarantees that a globally optimal DAG will be found. Aside from the generic scheme, we present and analyze concrete tradeoff schemes based on parallel bucket orders. Keywords: exact algorithm, parallelization, partial order, space-time tradeoff, structure learning
3 0.45805407 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks
Author: Vinod K. Valsalam, Risto Miikkulainen
Abstract: Sorting networks are an interesting class of parallel sorting algorithms with applications in multiprocessor computers and switching networks. They are built by cascading a series of comparisonexchange units called comparators. Minimizing the number of comparators for a given number of inputs is a challenging optimization problem. This paper presents a two-pronged approach called Symmetry and Evolution based Network Sort Optimization (SENSO) that makes it possible to scale the solutions to networks with a larger number of inputs than previously possible. First, it uses the symmetry of the problem to decompose the minimization goal into subgoals that are easier to solve. Second, it minimizes the resulting greedy solutions further by using an evolutionary algorithm to learn the statistical distribution of comparators in minimal networks. The final solutions improve upon half-century of results published in patents, books, and peer-reviewed literature, demonstrating the potential of the SENSO approach for solving difficult combinatorial problems. Keywords: symmetry, evolution, estimation of distribution algorithms, sorting networks, combinatorial optimization
4 0.40836728 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont
Abstract: We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy. Keywords: group feature selection, generalized spike-and-slab priors, expectation propagation, sparse linear model, approximate inference, sequential experimental design, signal reconstruction
5 0.39209145 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
Author: Rami Mahdi, Jason Mezey
Abstract: Constraint-based learning of Bayesian networks (BN) from limited data can lead to multiple testing problems when recovering dense areas of the skeleton and to conflicting results in the orientation of edges. In this paper, we present a new constraint-based algorithm, light mutual min (LMM) for improved accuracy of BN learning from small sample data. LMM improves the assessment of candidate edges by using a ranking criterion that considers conditional independence on neighboring variables at both sides of an edge simultaneously. The algorithm also employs an adaptive relaxation of constraints that, selectively, allows some nodes not to condition on some neighbors. This relaxation aims at reducing the incorrect rejection of true edges connecting high degree nodes due to multiple testing. LMM additionally incorporates a new criterion for ranking v-structures that is used to recover the completed partially directed acyclic graph (CPDAG) and to resolve conflicting v-structures, a common problem in small sample constraint-based learning. Using simulated data, each of these components of LMM is shown to significantly improve network inference compared to commonly applied methods when learning from limited data, including more accurate recovery of skeletons and CPDAGs compared to the PC, MaxMin, and MaxMin hill climbing algorithms. A proof of asymptotic correctness is also provided for LMM for recovering the correct skeleton and CPDAG. Keywords: Bayesian networks, skeleton, constraint-based learning, mutual min
6 0.39048168 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
7 0.38974217 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
8 0.38811392 120 jmlr-2013-Variational Algorithms for Marginal MAP
9 0.38718548 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
10 0.37937182 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
11 0.37685919 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
12 0.37333912 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
13 0.37249699 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
14 0.36980474 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
15 0.36555639 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
16 0.36007616 68 jmlr-2013-Machine Learning with Operational Costs
17 0.3573567 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
19 0.35663864 86 jmlr-2013-Parallel Vector Field Embedding
20 0.35653615 22 jmlr-2013-Classifying With Confidence From Incomplete Information