nips nips2003 nips2003-155 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jason Palmer, Bhaskar D. Rao, David P. Wipf
Abstract: Recently, relevance vector machines (RVM) have been fashioned from a sparse Bayesian learning (SBL) framework to perform supervised learning using a weight prior that encourages sparsity of representation. The methodology incorporates an additional set of hyperparameters governing the prior, one for each weight, and then adopts a specific approximation to the full marginalization over all weights and hyperparameters. Despite its empirical success however, no rigorous motivation for this particular approximation is currently available. To address this issue, we demonstrate that SBL can be recast as the application of a rigorous variational approximation to the full model by expressing the prior in a dual form. This formulation obviates the necessity of assuming any hyperpriors and leads to natural, intuitive explanations of why sparsity is achieved in practice. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Recently, relevance vector machines (RVM) have been fashioned from a sparse Bayesian learning (SBL) framework to perform supervised learning using a weight prior that encourages sparsity of representation. [sent-4, score-0.519]
2 The methodology incorporates an additional set of hyperparameters governing the prior, one for each weight, and then adopts a specific approximation to the full marginalization over all weights and hyperparameters. [sent-5, score-0.26]
3 Despite its empirical success however, no rigorous motivation for this particular approximation is currently available. [sent-6, score-0.135]
4 To address this issue, we demonstrate that SBL can be recast as the application of a rigorous variational approximation to the full model by expressing the prior in a dual form. [sent-7, score-0.89]
5 This formulation obviates the necessity of assuming any hyperpriors and leads to natural, intuitive explanations of why sparsity is achieved in practice. [sent-8, score-0.294]
6 Recently, a sparse Bayesian learning (SBL) framework has been derived to find robust solutions to (1) [3, 7]. [sent-20, score-0.169]
7 The key feature of this development is the incorporation of a prior on the weights that encourages sparsity in representation, i.e., solutions with few nonzero weights. [sent-21, score-0.293]
8 The predictive distribution is p(t∗|t) = (1/p(t)) ∫ p(t∗|w) p(w, t) dw, (2) where the joint density p(w, t) = p(t|w)p(w) combines all relevant information from the training data (likelihood principle) with our prior beliefs about the model weights. [sent-32, score-0.171]
9 The likelihood term p(t|w) is assumed to be Gaussian, p(t|w) = (2πσ²)^(−N/2) exp(−‖t − Φw‖²/(2σ²)), (3) where for now we assume that the noise variance σ² is known. [sent-33, score-0.153]
10 For sparse priors p(w) (possibly improper), the required integrations, including the computation of the normalizing term p(t), are typically intractable, and we are forced to accept some form of approximation to p(w, t). [sent-34, score-0.277]
11 Sparse Bayesian learning addresses this issue by introducing a set of hyperparameters into the specification of the problematic weight prior p(w) before adopting a particular approximation. [sent-35, score-0.275]
12 Proceeding further, by applying Bayes’ rule to this expression, we can exploit the plugin rule [2] via p(t∗|t) = ∫∫ p(t∗|w) p(t|w) p(w|γ) [p(γ|t)/p(t|γ)] dw dγ ≈ ∫∫ p(t∗|w) p(t|w) p(w|γ) [δ(γ − γMAP)/p(t|γ)] dw dγ = (1/p(t; γMAP)) ∫ p(t∗|w) p(w, t; γMAP) dw. [sent-41, score-0.08]
13 Also, the normalizing term becomes ∫ p(w, t; γMAP) dw and we assume that all required integrations can now be handled in closed form. [sent-43, score-0.08]
14 The answer is that the hyperparameters enter as weight prior variances of the form, p(wi |γi ) = N (0, γi ). [sent-45, score-0.246]
15 In practice, we find that many of the estimated γi ’s converge to zero, leading to sparse solutions since the corresponding weights, and therefore columns of Φ, can effectively be pruned from the model. [sent-48, score-0.206]
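To make the closed-form computations above concrete, here is a minimal sketch (our own illustrative code, not from the paper) of the Gaussian posterior moments of w for a fixed γ and σ², together with the pruning step that discards basis functions whose γi has collapsed numerically; the toy data, tolerance, and function names are assumptions for illustration only.

```python
import numpy as np

def posterior_moments(Phi, t, gamma, sigma2):
    """Posterior mean/covariance of w given prior variances gamma (standard Gaussian identities).

    For t = Phi @ w + noise with w_i ~ N(0, gamma_i):
        Sigma_w = (Phi^T Phi / sigma2 + diag(1/gamma))^{-1}
        mu_w    = Sigma_w @ Phi^T t / sigma2
    """
    Sigma_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(1.0 / gamma))
    mu_w = Sigma_w @ Phi.T @ t / sigma2
    return mu_w, Sigma_w

def prune(Phi, gamma, tol=1e-8):
    """Drop columns of Phi whose prior variance gamma_i has effectively collapsed to zero."""
    keep = gamma > tol
    return Phi[:, keep], gamma[keep], keep

# Toy usage: 5 samples, 3 basis functions, the third effectively switched off.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((5, 3))
t = Phi[:, 0] * 2.0 + 0.1 * rng.standard_normal(5)
gamma = np.array([4.0, 0.5, 1e-12])
mu, Sigma = posterior_moments(Phi, t, gamma, sigma2=0.01)
Phi_p, gamma_p, keep = prune(Phi, gamma)
print(mu.round(3), keep)
```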
16 However, no rigorous motivation for this particular claim is currently available nor is it immediately obvious exactly how the mass of this approximate distribution relates to the true mass. [sent-57, score-0.304]
17 Different transformations lead to different modes and, ultimately, to different approximations to p(w, t) and therefore p(t∗|t). [sent-60, score-0.097]
18 The canonical form of SBL, and the one that has displayed remarkable success in the literature, does not in fact find a mode of p(γ|t), but a mode of p(− log γ|t). [sent-62, score-0.146]
19 But again, why should this mode necessarily be more reflective of the desired mass than any other? [sent-63, score-0.262]
20 As already mentioned, SBL often leads to sparse results in practice; namely, the approximation p(w, t; γMAP) is typically nonzero only on a small subspace of M-dimensional w space. [sent-64, score-0.246]
21 The question remains, however, why should an approximation to the full Bayesian treatment necessarily lead to sparse results in practice? [sent-65, score-0.332]
22 To address all of these ambiguities, we will herein demonstrate that the sparse Bayesian learning procedure outlined above can be recast as the application of a rigorous variational approximation to the distribution p(w, t). [sent-66, score-0.783]
23 This will allow us to quantify the exact relationship between the true mass and the approximate mass of this distribution. [sent-67, score-0.478]
24 In effect, we will demonstrate that SBL is attempting to directly capture significant portions of the probability mass of p(w, t), while still allowing us to perform the required integrations. [sent-68, score-0.201]
25 This framework also obviates the necessity of assuming any hyperprior p(γ) and is independent of the (subjective) parameterization (e.g., γ versus − log γ). [sent-69, score-0.134]
26 Moreover, this perspective leads to natural, intuitive explanations of why sparsity is observed in practice and why, in general, this need not be the case. [sent-73, score-0.234]
27 To accomplish this, we appeal to variational methods to find a viable approximation to p(w, t; H) [5]. [sent-76, score-0.511]
28 We may then substitute this approximation into (10), leading to tractable integrations and analytic posterior distributions. [sent-77, score-0.183]
29 To find a class of suitable approximations, we first express p(w; H) in its dual form by introducing a set of variational parameters. [sent-78, score-0.531]
30 We note that the analysis in this paper is different from [1], which derives an alternative SBL algorithm based on variational methods. [sent-80, score-0.383]
31 1 Dual Form Representation of p(w; H) At the heart of this methodology is the ability to represent a convex function in its dual form. [sent-82, score-0.203]
32 For example, given a convex function f(y): ℝ → ℝ, the dual form is given by f(y) = sup_λ [λy − f∗(λ)], (11) where f∗(λ) denotes the conjugate function. [sent-83, score-0.212]
33 If we drop the maximization in (11), we obtain the bound f(y) ≥ λy − f∗(λ). (12) [sent-86, score-0.144]
34 Thus, for any given λ, we have a lower bound on f(y); we may then optimize over λ to find the optimal or tightest bound in a region of interest. [sent-87, score-0.131]
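The conjugate-duality relation (11)-(12) can be sanity-checked on any simple convex function; the sketch below (our choice of f(y) = y², whose conjugate is f∗(λ) = λ²/4, purely for illustration) verifies that each fixed λ gives a lower bound and that optimizing over λ recovers f exactly.

```python
import numpy as np

# Simple convex example: f(y) = y^2 has conjugate f*(lam) = sup_y [lam*y - y^2] = lam^2 / 4.
f = lambda y: y**2
f_star = lambda lam: lam**2 / 4.0

ys = np.linspace(-3.0, 3.0, 61)
for lam in (-2.0, 0.5, 3.0):          # arbitrary fixed slopes
    bound = lam * ys - f_star(lam)    # eq. (12): a lower bound for every lam
    assert np.all(bound <= f(ys) + 1e-12)

# Optimizing over lam recovers f exactly (eq. (11)); for f(y) = y^2 the optimum is lam = 2y.
y0 = 1.7
lam_opt = 2.0 * y0
assert np.isclose(lam_opt * y0 - f_star(lam_opt), f(y0))
print("conjugate bounds verified")
```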
35 To apply this theory to the problem at hand, we specify the form for our sparse prior as p(w; H) = ∏_{i=1}^{M} p(wi; H). [sent-88, score-0.288]
36 Using (7) and (8), we obtain the prior p(wi; H) = ∫ p(wi|γi) p(γi) dγi = C (b + wi²/2)^(−(a+1/2)), (13) which for a, b > 0 is proportional to a Student-t density. [sent-89, score-0.577]
37 The constant C is not chosen to enforce proper normalization; rather, it is chosen to facilitate the variational analysis below. [sent-90, score-0.408]
38 Also, this density function can be seen to encourage sparsity since it has heavy tails and a sharp peak at zero. [sent-91, score-0.146]
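Equations (7) and (8), which define the hyperprior, are not reproduced in this excerpt; assuming the standard SBL choice of a Gamma(a, b) prior on 1/γi (equivalently an inverse-Gamma on γi), the proportionality claimed in (13) can be checked numerically, as in the hedged sketch below.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, invgamma

# Assumed hyperprior (standard SBL setup; eqs. (7)-(8) are not reproduced in this excerpt):
# gamma_i ~ InvGamma(a, b), i.e. 1/gamma_i ~ Gamma(a, b). The values of a, b are arbitrary.
a, b = 2.0, 1.5

def marginal_prior(w):
    """p(w_i) = int N(w_i; 0, gamma) * InvGamma(gamma; a, b) dgamma."""
    integrand = lambda g: norm.pdf(w, scale=np.sqrt(g)) * invgamma.pdf(g, a, scale=b)
    val, _ = quad(integrand, 0.0, np.inf)
    return val

# Eq. (13) claims the marginal is proportional to (b + w^2/2)^(-(a+1/2)).
ws = np.array([0.1, 0.7, 1.5, 3.0])
ratios = [marginal_prior(w) / (b + w**2 / 2) ** (-(a + 0.5)) for w in ws]
print(np.round(ratios, 6))   # constant across w => proportionality holds
```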
39 Clearly p(wi; H) is not convex in wi; however, if we let yi ≜ wi² as suggested in [5] and define f(yi) ≜ log p(wi; H) = log C − (a + 1/2) log(b + yi/2), (14) we see that we now have a convex function in yi amenable to dual representation. [sent-92, score-1.83]
40 By computing the conjugate function f∗(λi), constructing the dual, and then transforming back to p(wi; H), we obtain the representation (see Appendix for details) p(wi; H) = max_{γi ≥ 0} (2πγi)^(−1/2) exp(−wi²/(2γi)) exp(−b/γi) γi^(−a). (15) [sent-93, score-0.715]
41 As a, b → 0, it is readily apparent from (15) that what were straight lines in the yi domain are now Gaussian functions with variance γi in the wi domain. [sent-94, score-0.725]
42 When we drop the maximization, we obtain a lower bound on p(wi; H) of the form p(wi; H) ≥ p̂(wi; H) ≜ (2πγi)^(−1/2) exp(−wi²/(2γi)) exp(−b/γi) γi^(−a), (16) which serves as our approximate prior to p(w; H). [sent-96, score-0.378]
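A quick numerical check of (15)-(16): for any fixed γi the Gaussian-shaped expression lower-bounds C (b + wi²/2)^(−(a+1/2)), and maximizing over γi makes the bound tight. The closed-form maximizer γi∗ = (b + wi²/2)/(a + 1/2) used below is our own calculus (it follows from differentiating (15)), not a quotation from the paper; a and b are arbitrary positive test values.

```python
import numpy as np

a, b = 0.5, 0.1                     # arbitrary positive hyperparameters for the check
C = (2 * np.pi) ** -0.5 * np.exp(-(a + 0.5)) * (a + 0.5) ** (a + 0.5)   # eq. (26)

def p_true(w):                      # eq. (13)
    return C * (b + w**2 / 2) ** (-(a + 0.5))

def p_hat(w, gamma):                # eq. (16): Gaussian-shaped lower bound
    return ((2 * np.pi * gamma) ** -0.5 * np.exp(-w**2 / (2 * gamma))
            * np.exp(-b / gamma) * gamma ** -a)

w = 1.3
for gamma in (0.2, 1.0, 5.0):
    assert p_hat(w, gamma) <= p_true(w) + 1e-12          # bound holds for any gamma
gamma_star = (b + w**2 / 2) / (a + 0.5)                  # stationary point of (15)
assert np.isclose(p_hat(w, gamma_star), p_true(w))       # bound is tight at gamma_star
print("variational bound on p(w_i; H) verified")
```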
43 We will now incorporate these results into an algorithm for finding a good Ĥ, or more accurately Ĥ(γ), since each candidate hypothesis is characterized by a different set of variational parameters. [sent-98, score-0.412]
44 2 Variational Approximation to p(w, t; H) So now that we have a variational approximation to the problematic weight prior, we must return to our original problem of estimating p(t∗ |t; H). [sent-100, score-0.535]
45 Figure 1: Variational approximation example in both yi space and wi space for a, b → 0. [sent-118, score-1.179]
46 The solid line represents the plot of f (yi ) while the dotted lines represent variational lower bounds in the dual representation for three different values of λi . [sent-120, score-0.656]
47 The solid line represents the plot of p(wi ; H) while the dotted lines represent Gaussian distributions with three different variances. [sent-122, score-0.1]
48 In other words, given that different H are distinguished by a different set of variational parameters γ, how do we choose the most appropriate γ? [sent-124, score-0.383]
49 In choosing p̂(w, t; Ĥ), we would therefore like to match, where possible, significant regions of probability mass in the true model p(w, t; H). [sent-126, score-0.201]
50 Ĥ = arg min_Ĥ ∫ |p(w, t; H) − p̂(w, t; Ĥ)| dw = arg max_Ĥ ∫ p(t|w) p̂(w; Ĥ) dw, (17) where the variational assumptions have allowed us to remove the absolute value (since the argument must always be positive). [sent-129, score-0.698]
51 Also, we note that (17) is tantamount to selecting the variational approximation with maximal Bayesian evidence [6]. [sent-130, score-0.551]
52 In other words, we are selecting the Ĥ, out of a class of variational approximations to H, that most probably explains the training data t, marginalized over the weights. [sent-131, score-0.471]
53 From an implementational standpoint, (17) can be reexpressed using (16) as γ = arg max_γ log ∫ p(t|w) ∏_{i=1}^{M} p̂(wi; Ĥ(γi)) dw = arg max_γ [ −(1/2)(log |Σt| + tᵀ Σt⁻¹ t) − ∑_{i=1}^{M} (b/γi + a log γi) ], (18) where Σt ≜ σ²I + Φ diag(γ) Φᵀ. [sent-132, score-1.051]
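The sketch below evaluates the objective in (18) and ascends it with the standard EM-style update γi ← μi² + Σii from the SBL literature (stated here for the limit a, b → 0); the toy problem, variable names, and iteration count are our own choices rather than the procedure of [7] quoted verbatim.

```python
import numpy as np

def sbl_objective(gamma, Phi, t, sigma2, a=0.0, b=0.0):
    """Eq. (18): -0.5*(log|Sigma_t| + t' Sigma_t^{-1} t) - sum_i (b/gamma_i + a*log gamma_i)."""
    Sigma_t = sigma2 * np.eye(len(t)) + Phi @ np.diag(gamma) @ Phi.T
    sign, logdet = np.linalg.slogdet(Sigma_t)
    quad = t @ np.linalg.solve(Sigma_t, t)
    penalty = np.sum(b / gamma + a * np.log(gamma)) if (a or b) else 0.0
    return -0.5 * (logdet + quad) - penalty

def em_update(gamma, Phi, t, sigma2):
    """One EM-style hyperparameter update (a, b -> 0): gamma_i <- mu_i^2 + Sigma_ii."""
    Sigma_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(1.0 / gamma))
    mu_w = Sigma_w @ Phi.T @ t / sigma2
    return mu_w**2 + np.diag(Sigma_w)

# Toy run: only the first two of five basis functions generate t.
rng = np.random.default_rng(1)
Phi = rng.standard_normal((30, 5))
t = Phi[:, 0] - 2.0 * Phi[:, 1] + 0.05 * rng.standard_normal(30)
gamma, sigma2 = np.ones(5), 0.05**2
for _ in range(200):
    gamma = np.maximum(em_update(gamma, Phi, t, sigma2), 1e-12)
print(np.round(gamma, 4))                 # superfluous gamma_i shrink toward zero
print(sbl_objective(gamma, Phi, t, sigma2))
```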
54 This is the same cost function as in [7] only without terms resulting from a prior on σ 2 , which we will address later. [sent-133, score-0.148]
55 Thus, the end result of this analysis is an evidence maximization procedure equivalent to the one in [7]. [sent-134, score-0.138]
56 The difference is that, where before we were optimizing over a somewhat arbitrary model parameterization, now we see that it is actually optimization over the space of variational approximations to a model with a sparse, regularizing prior. [sent-135, score-0.475]
57 Also, we know from (17) that this procedure is effectively matching, as much as possible, the mass of the full model p(w, t; H). [sent-136, score-0.235]
58 3 Analysis While the variational perspective is interesting, two pertinent questions still remain: 1. [sent-137, score-0.417]
59 Why should it be that approximating a sparse prior p(w; H) leads to sparse representations in practice? [sent-138, score-0.457]
60 In Figure 2 below, we have illustrated a 2D example of evidence maximization within the context of variational approximations to the sparse prior p(w; H). [sent-142, score-0.868]
61 On the left, the shaded area represents the region of w space where both p(w; H) and p(t|w) (and therefore p(w, t; H)) have significant probability mass. [sent-144, score-0.163]
62 Maximization of (17) involves finding an approximate distribution p̂(w, t; Ĥ) with a substantial percentage of its mass in this region. [sent-145, score-0.246]
63 Left: Contours of equiprobability density for p(w; H) and constant likelihood p(t|w); the prominent density and likelihood lie within each region respectively. [sent-147, score-0.298]
64 The shaded region represents the area where both have significant mass. [sent-148, score-0.163]
65 The shaded region represents the area where both the likelihood and the approximate prior p̂(w; Ĥa) have significant mass. [sent-152, score-0.365]
66 Note that by the variational bound, each p̂(w; Ĥ) must lie within the contours of p(w; H). [sent-153, score-0.474]
67 In the plot on the right, we have graphed two approximate priors that satisfy the variational bounds, i.e., the priors associated with Ĥa and Ĥb. [sent-154, score-0.459]
68 We see that the narrow prior that aligns with the horizontal spine of p(w; H) places the largest percentage of its mass (and therefore the mass of p̂(w, t; Ĥa)) in the shaded region. [sent-157, score-0.676]
69 This corresponds with a prior of p̂(w; Ĥa) = p(w1, w2; γ1 ≫ 0, γ2 ≈ 0). (19) [sent-158, score-0.119]
70 This creates a long narrow prior since there is minimal variance along the w2 axis. [sent-159, score-0.192]
71 In fact, it can be shown that owing to the infinite density of the variational constraint along each axis (which is allowed as a and b go to zero), the maximum evidence is obtained when γ2 is strictly equal to zero, giving the approximate prior infinite density along this axis as well. [sent-160, score-0.739]
72 In contrast, a model with significant prior variance along both axes, Ĥb, is hampered because it cannot extend directly out (due to the dotted variational boundary) along the spine to penetrate the likelihood. [sent-162, score-0.621]
73 In higher dimensions, the algorithm only retains those weights associated with the prior spines that span a subspace penetrating the most prominent portion of the likelihood mass [sent-164, score-0.482]
74 (i.e., a higher-dimensional analog to the shaded region already mentioned). [sent-166, score-0.133]
75 The approximate prior p̂(w; Ĥ) navigates the variational constraints, placing as much of its mass as possible in this region, driving many of the γi’s to zero. [sent-167, score-0.703]
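The 2D picture can also be checked numerically. In the toy below (our construction: Φ = I, a = b = 0, target along the w1 axis), the evidence term of (18) is strictly larger when γ2 is shut off than when both variances are left open, which is the spine-alignment effect described above.

```python
import numpy as np

def log_evidence(gamma, Phi, t, sigma2):
    """Evidence term of eq. (18) with a = b = 0."""
    Sigma_t = sigma2 * np.eye(len(t)) + Phi @ np.diag(gamma) @ Phi.T
    sign, logdet = np.linalg.slogdet(Sigma_t)
    return -0.5 * (logdet + t @ np.linalg.solve(Sigma_t, t))

# 2D toy: identity design, target along the w1 axis only.
Phi = np.eye(2)
t = np.array([3.0, 0.0])
sigma2 = 0.1

aligned = log_evidence(np.array([8.9, 0.0]), Phi, t, sigma2)   # H_a: narrow spine prior
broad   = log_evidence(np.array([8.9, 1.0]), Phi, t, sigma2)   # H_b: variance on both axes
print(aligned, broad)
assert aligned > broad       # evidence favors gamma_2 driven to zero
```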
76 It is not difficult to show that, assuming a noise variance σ 2 > 0, the variational approximation to p(w, t; H) with maximal evidence cannot have any γi = wi = 0. [sent-169, score-1.022]
77 Intuitively, this occurs because the now finite spines of the prior p(w; H), which bound the variational approximation, do not allow us to place infinite prior density in any region of weight space (as occurred previously when any γi → 0). [sent-170, score-0.858]
78 Consequently, if any γi goes to zero with a, b > 0, the associated approximate prior mass, and therefore the approximate evidence, must also fall to zero by (16). [sent-171, score-0.261]
79 As such, models with all non-zero weights will now be favored when we form the variational approximation. [sent-172, score-0.424]
80 We therefore cannot assume an approximation to a sparse prior will necessarily give us sparse results in practice. [sent-173, score-0.56]
81 SBL assumes the noise variance σ² is unknown and random with prior distribution p(1/σ²) ∝ (σ²)^(1−c) exp(−d/σ²), with c, d > 0. [sent-176, score-0.146]
82 After integrating out the unknown σ², we arrive at the implicit likelihood equation p(t|w) = ∫ p(t|w, σ²) p(σ²) dσ² ∝ (d + (1/2)‖t − Φw‖²)^(−(c̄+1/2)), (20) where c̄ ≜ c + (N − 1)/2. [sent-177, score-0.09]
83 By replacing p(t|w) with the lower bound from (21), we then maximize over the variational parameters γ and σ² via (γ, σ²) = arg max_{γ,σ²} [ −(1/2)(log |Σt| + tᵀ Σt⁻¹ t) − ∑_{i=1}^{M} (b/γi + a log γi) − d/σ² − c̄ log σ² ], (22) which is the exact SBL optimization procedure. [sent-179, score-0.747]
84 Thus, the entire SBL framework, including noise variance estimation, can be understood in variational terms. [sent-180, score-0.425]
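For completeness, a sketch of the extended objective (22): it simply appends the −d/σ² − c̄ log σ² terms to the γ-dependent cost above. The wrapper and its example call are our own illustration; the paper does not prescribe this particular code path.

```python
import numpy as np

def sbl_objective_full(gamma, sigma2, Phi, t, a, b, c, d):
    """Eq. (22): the gamma terms of (18) plus the noise terms -d/sigma2 - c_bar*log(sigma2)."""
    N = len(t)
    c_bar = c + (N - 1) / 2.0
    Sigma_t = sigma2 * np.eye(N) + Phi @ np.diag(gamma) @ Phi.T
    sign, logdet = np.linalg.slogdet(Sigma_t)
    quad = t @ np.linalg.solve(Sigma_t, t)
    return (-0.5 * (logdet + quad)
            - np.sum(b / gamma + a * np.log(gamma))
            - d / sigma2 - c_bar * np.log(sigma2))

# Example call with small (near-improper) hyperpriors on both gamma and 1/sigma^2.
rng = np.random.default_rng(2)
Phi = rng.standard_normal((20, 4))
t = Phi @ np.array([1.0, 0.0, -0.5, 0.0]) + 0.1 * rng.standard_normal(20)
print(sbl_objective_full(np.full(4, 0.5), 0.01, Phi, t, a=1e-6, b=1e-6, c=1e-6, d=1e-6))
```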
85 4 Conclusions The end result of this analysis is an evidence maximization procedure that is equivalent to the one originally formulated in [7]. [sent-181, score-0.138]
86 The difference is that, where before we were optimizing over a somewhat arbitrary model parameterization, we now see that SBL is actually searching a space of variational approximations to find an alternative distribution that captures the significant mass of the full model. [sent-182, score-0.71]
87 Moreover, from the vantage point afforded by this new perspective, we can better understand the sparsity properties of SBL and the relationship between sparse priors and approximations to sparse priors. [sent-183, score-0.553]
88 Appendix: Derivation of the Dual Form of p(wi ; H) To accommodate the variational analysis of Sec. [sent-184, score-0.383]
89 1, we require the dual representation of p(wi ; H). [sent-186, score-0.173]
90 As an intermediate step, we must find the dual representation of f(yi), where yi ≜ wi² and f(yi) ≜ log p(wi; H) = log[ C (b + yi/2)^(−(a+1/2)) ]. (23) [sent-187, score-1.155]
91 To accomplish this, we find the conjugate function f∗(λi) using the duality relation f∗(λi) = max_{yi} [λi yi − f(yi)] = max_{yi} [ λi yi − log C + (a + 1/2) log(b + yi/2) ]. (24) [sent-188, score-0.349]
92 To find the maximizing yi, we take the gradient of the left side and set it to zero, giving us yi^max = −a/λi − 1/(2λi) − 2b. (25) [sent-189, score-1.182]
93 Substituting this value into the expression for f∗(λi) and selecting C = (2π)^(−1/2) exp(−(a + 1/2)) (a + 1/2)^(a+1/2), (26) we arrive at f∗(λi) = (a + 1/2) log(−1/(2λi)) + (1/2) log 2π − 2bλi. (27) [sent-190, score-0.465]
94 We are now ready to represent f(yi) in its dual form, observing first that we only need consider maximization over λi ≤ 0 since f(yi) is a monotonically decreasing function (i.e., every tangent line has nonpositive slope λi). [sent-191, score-0.252]
95 Proceeding forward, we have f(yi) = max_{λi ≤ 0} [λi yi − f∗(λi)] = max_{γi ≥ 0} [ −yi/(2γi) − (a + 1/2) log γi − (1/2) log 2π − b/γi ], (28) where we have used the monotonically increasing transformation λi = −1/(2γi), γi ≥ 0. [sent-194, score-0.18]
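The closed form (27) can be checked by brute force: maximize λi yi − f(yi) over yi numerically and compare with the formula. The check below, with arbitrary a, b and a few negative λi, is our own sanity test, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

a, b = 1.0, 0.5
C = (2 * np.pi) ** -0.5 * np.exp(-(a + 0.5)) * (a + 0.5) ** (a + 0.5)   # eq. (26)

f = lambda y: np.log(C) - (a + 0.5) * np.log(b + y / 2.0)               # eq. (23)

def f_star_formula(lam):
    """Eq. (27): f*(lam) = (a + 1/2) log(-1/(2 lam)) + (1/2) log(2 pi) - 2 b lam."""
    return (a + 0.5) * np.log(-1.0 / (2 * lam)) + 0.5 * np.log(2 * np.pi) - 2 * b * lam

def f_star_numeric(lam):
    # maximize lam*y - f(y) over the domain y > -2b (where b + y/2 > 0)
    obj = lambda y: -(lam * y - f(y))
    res = minimize_scalar(obj, bounds=(-2 * b + 1e-9, 1e6), method="bounded")
    return -res.fun

for lam in (-0.2, -1.0, -4.0):
    assert np.isclose(f_star_numeric(lam), f_star_formula(lam))
print("conjugate function (27) verified")
```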
96 The attendant dual representation of p(wi; H) can then be obtained by exponentiating both sides of (28) and substituting yi = wi², giving p(wi; H) = max_{γi ≥ 0} (2πγi)^(−1/2) exp(−wi²/(2γi)) exp(−b/γi) γi^(−a). (29) [sent-195, score-0.999]
97 Acknowledgments This research was supported by DiMI grant #22-8376 sponsored by Nissan. [sent-196, score-0.336]
98 Tipping, “Analysis of sparse Bayesian learning,” Advances in Neural Information Processing Systems 14, pp. [sent-213, score-0.169]
99 Girolami, “A variational method for learning sparse and overcomplete representations,” Neural Computation, vol. [sent-216, score-0.552]
100 Saul, “An introduction to variational methods for graphical models,” Machine Learning, vol. [sent-226, score-0.383]
wordName wordTfidf (topN-words)
[('wi', 0.458), ('sbl', 0.401), ('variational', 0.383), ('mass', 0.201), ('ap', 0.186), ('yi', 0.186), ('dw', 0.169), ('sparse', 0.169), ('dual', 0.148), ('prior', 0.119), ('bayesian', 0.099), ('ha', 0.099), ('sparsity', 0.094), ('hyperparameters', 0.081), ('dwd', 0.08), ('hb', 0.08), ('integrations', 0.08), ('plugin', 0.08), ('shaded', 0.078), ('approximation', 0.077), ('log', 0.076), ('maximization', 0.076), ('exp', 0.073), ('contours', 0.065), ('evidence', 0.062), ('approximations', 0.059), ('rigorous', 0.058), ('region', 0.055), ('obviates', 0.053), ('density', 0.052), ('relevance', 0.052), ('cant', 0.051), ('accomplish', 0.051), ('max', 0.05), ('tipping', 0.049), ('arg', 0.048), ('parameterization', 0.047), ('spines', 0.046), ('hyperpriors', 0.046), ('improper', 0.046), ('spine', 0.046), ('weight', 0.046), ('approximate', 0.045), ('recast', 0.042), ('rvm', 0.042), ('tt', 0.042), ('explanations', 0.042), ('variance', 0.042), ('weights', 0.041), ('encourages', 0.039), ('practice', 0.039), ('lines', 0.039), ('likelihood', 0.038), ('modes', 0.038), ('bound', 0.038), ('proceeding', 0.037), ('prominent', 0.037), ('pruned', 0.037), ('derivation', 0.037), ('conjugate', 0.036), ('intractable', 0.036), ('substituting', 0.036), ('mode', 0.035), ('perspective', 0.034), ('ective', 0.034), ('necessity', 0.034), ('full', 0.034), ('somewhat', 0.033), ('appendix', 0.032), ('priors', 0.031), ('narrow', 0.031), ('ambiguities', 0.031), ('relationship', 0.031), ('dotted', 0.031), ('drop', 0.03), ('represents', 0.03), ('address', 0.029), ('signi', 0.029), ('problematic', 0.029), ('hypothesis', 0.029), ('nd', 0.029), ('selecting', 0.029), ('monotonically', 0.028), ('convex', 0.028), ('tangent', 0.027), ('unknown', 0.027), ('methodology', 0.027), ('modern', 0.027), ('zero', 0.026), ('necessarily', 0.026), ('lie', 0.026), ('question', 0.026), ('analytic', 0.026), ('giving', 0.026), ('arrive', 0.025), ('outlined', 0.025), ('representation', 0.025), ('facilitate', 0.025), ('intuitive', 0.025), ('mentioned', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 155 nips-2003-Perspectives on Sparse Bayesian Learning
Author: Jason Palmer, Bhaskar D. Rao, David P. Wipf
Abstract: Recently, relevance vector machines (RVM) have been fashioned from a sparse Bayesian learning (SBL) framework to perform supervised learning using a weight prior that encourages sparsity of representation. The methodology incorporates an additional set of hyperparameters governing the prior, one for each weight, and then adopts a specific approximation to the full marginalization over all weights and hyperparameters. Despite its empirical success however, no rigorous motivation for this particular approximation is currently available. To address this issue, we demonstrate that SBL can be recast as the application of a rigorous variational approximation to the full model by expressing the prior in a dual form. This formulation obviates the necessity of assuming any hyperpriors and leads to natural, intuitive explanations of why sparsity is achieved in practice. 1
2 0.19707127 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
Author: David Barber, Felix V. Agakov
Abstract: The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analagous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA. 1
3 0.18552759 193 nips-2003-Variational Linear Response
Author: Manfred Opper, Ole Winther
Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.
4 0.12235247 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
5 0.11308224 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
Author: Michael I. Jordan, Martin J. Wainwright
Abstract: We present a new method for calculating approximate marginals for probability distributions defined by graphs with cycles, based on a Gaussian entropy bound combined with a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant maximization problem that can be solved by efficient interior point methods [8]. As with the Bethe approximation and its generalizations [12], the optimizing arguments of this problem can be taken as approximations to the exact marginals. In contrast to Bethe/Kikuchi approaches, our variational problem is strictly convex and so has a unique global optimum. An additional desirable feature is that the value of the optimal solution is guaranteed to provide an upper bound on the log partition function. In experimental trials, the performance of the log-determinant relaxation is comparable to or better than the sum-product algorithm, and by a substantial margin for certain problem classes. Finally, the zero-temperature limit of our log-determinant relaxation recovers a class of well-known semidefinite relaxations for integer programming [e.g., 3]. 1
6 0.099850208 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles
7 0.08530318 122 nips-2003-Margin Maximizing Loss Functions
8 0.082835905 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
9 0.080158874 32 nips-2003-Approximate Expectation Maximization
10 0.071561351 55 nips-2003-Distributed Optimization in Adaptive Networks
11 0.070992567 51 nips-2003-Design of Experiments via Information Theory
12 0.069919996 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
13 0.067707464 115 nips-2003-Linear Dependent Dimensionality Reduction
14 0.065843999 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
15 0.063674599 176 nips-2003-Sequential Bayesian Kernel Regression
16 0.063598752 189 nips-2003-Tree-structured Approximations by Expectation Propagation
17 0.061293408 100 nips-2003-Laplace Propagation
18 0.060957272 73 nips-2003-Feature Selection in Clustering Problems
19 0.060046688 1 nips-2003-1-norm Support Vector Machines
20 0.057822093 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations
topicId topicWeight
[(0, -0.208), (1, -0.055), (2, -0.039), (3, 0.051), (4, 0.149), (5, -0.009), (6, 0.134), (7, -0.118), (8, -0.065), (9, -0.047), (10, -0.072), (11, -0.115), (12, -0.056), (13, 0.029), (14, -0.102), (15, 0.11), (16, -0.106), (17, 0.018), (18, 0.024), (19, -0.075), (20, -0.137), (21, -0.004), (22, 0.003), (23, 0.013), (24, 0.1), (25, 0.02), (26, 0.054), (27, 0.08), (28, -0.058), (29, 0.074), (30, -0.052), (31, -0.119), (32, 0.096), (33, 0.143), (34, -0.046), (35, 0.005), (36, 0.033), (37, 0.022), (38, -0.194), (39, 0.207), (40, -0.028), (41, 0.06), (42, -0.056), (43, -0.035), (44, -0.205), (45, -0.013), (46, 0.064), (47, -0.093), (48, 0.031), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.97379303 155 nips-2003-Perspectives on Sparse Bayesian Learning
Author: Jason Palmer, Bhaskar D. Rao, David P. Wipf
Abstract: Recently, relevance vector machines (RVM) have been fashioned from a sparse Bayesian learning (SBL) framework to perform supervised learning using a weight prior that encourages sparsity of representation. The methodology incorporates an additional set of hyperparameters governing the prior, one for each weight, and then adopts a specific approximation to the full marginalization over all weights and hyperparameters. Despite its empirical success however, no rigorous motivation for this particular approximation is currently available. To address this issue, we demonstrate that SBL can be recast as the application of a rigorous variational approximation to the full model by expressing the prior in a dual form. This formulation obviates the necessity of assuming any hyperpriors and leads to natural, intuitive explanations of why sparsity is achieved in practice. 1
2 0.72197455 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
Author: David Barber, Felix V. Agakov
Abstract: The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analagous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA. 1
3 0.59936339 193 nips-2003-Variational Linear Response
Author: Manfred Opper, Ole Winther
Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.
4 0.5093047 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
Author: Michael I. Jordan, Martin J. Wainwright
Abstract: We present a new method for calculating approximate marginals for probability distributions defined by graphs with cycles, based on a Gaussian entropy bound combined with a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant maximization problem that can be solved by efficient interior point methods [8]. As with the Bethe approximation and its generalizations [12], the optimizing arguments of this problem can be taken as approximations to the exact marginals. In contrast to Bethe/Kikuchi approaches, our variational problem is strictly convex and so has a unique global optimum. An additional desirable feature is that the value of the optimal solution is guaranteed to provide an upper bound on the log partition function. In experimental trials, the performance of the log-determinant relaxation is comparable to or better than the sum-product algorithm, and by a substantial margin for certain problem classes. Finally, the zero-temperature limit of our log-determinant relaxation recovers a class of well-known semidefinite relaxations for integer programming [e.g., 3]. 1
5 0.4305627 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
6 0.41050157 131 nips-2003-Modeling User Rating Profiles For Collaborative Filtering
7 0.38160378 32 nips-2003-Approximate Expectation Maximization
8 0.37709123 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles
9 0.35812581 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
10 0.35095996 129 nips-2003-Minimising Contrastive Divergence in Noisy, Mixed-mode VLSI Neurons
11 0.33723092 100 nips-2003-Laplace Propagation
12 0.32511479 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
13 0.32495224 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
14 0.31066138 51 nips-2003-Design of Experiments via Information Theory
15 0.30304644 181 nips-2003-Statistical Debugging of Sampled Programs
16 0.29599988 88 nips-2003-Image Reconstruction by Linear Programming
17 0.29559785 73 nips-2003-Feature Selection in Clustering Problems
18 0.29499847 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
19 0.28673956 196 nips-2003-Wormholes Improve Contrastive Divergence
20 0.28625122 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
topicId topicWeight
[(0, 0.043), (11, 0.02), (30, 0.012), (35, 0.042), (53, 0.081), (69, 0.023), (71, 0.048), (76, 0.495), (85, 0.086), (91, 0.06), (99, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.95981997 155 nips-2003-Perspectives on Sparse Bayesian Learning
Author: Jason Palmer, Bhaskar D. Rao, David P. Wipf
Abstract: Recently, relevance vector machines (RVM) have been fashioned from a sparse Bayesian learning (SBL) framework to perform supervised learning using a weight prior that encourages sparsity of representation. The methodology incorporates an additional set of hyperparameters governing the prior, one for each weight, and then adopts a specific approximation to the full marginalization over all weights and hyperparameters. Despite its empirical success however, no rigorous motivation for this particular approximation is currently available. To address this issue, we demonstrate that SBL can be recast as the application of a rigorous variational approximation to the full model by expressing the prior in a dual form. This formulation obviates the necessity of assuming any hyperpriors and leads to natural, intuitive explanations of why sparsity is achieved in practice. 1
2 0.93655598 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation
Author: Chen Yanover, Yair Weiss
Abstract: Loopy belief propagation (BP) has been successfully used in a number of difficult graphical models to find the most probable configuration of the hidden variables. In applications ranging from protein folding to image analysis one would like to find not just the best configuration but rather the top M . While this problem has been solved using the junction tree formalism, in many real world problems the clique size in the junction tree is prohibitively large. In this work we address the problem of finding the M best configurations when exact inference is impossible. We start by developing a new exact inference algorithm for calculating the best configurations that uses only max-marginals. For approximate inference, we replace the max-marginals with the beliefs calculated using max-product BP and generalized BP. We show empirically that the algorithm can accurately and rapidly approximate the M best configurations in graphs with hundreds of variables. 1
3 0.91971815 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
Author: Jan Eichhorn, Andreas Tolias, Alexander Zien, Malte Kuss, Jason Weston, Nikos Logothetis, Bernhard Schölkopf, Carl E. Rasmussen
Abstract: We report and compare the performance of different learning algorithms based on data from cortical recordings. The task is to predict the orientation of visual stimuli from the activity of a population of simultaneously recorded neurons. We compare several ways of improving the coding of the input (i.e., the spike data) as well as of the output (i.e., the orientation), and report the results obtained using different kernel algorithms. 1
4 0.89350557 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
Author: Thomas R. Strohmann, Andrei Belitski, Gregory Z. Grudic, Dennis DeCoste
Abstract: The Minimax Probability Machine Classification (MPMC) framework [Lanckriet et al., 2002] builds classifiers by minimizing the maximum probability of misclassification, and gives direct estimates of the probabilistic accuracy bound Ω. The only assumptions that MPMC makes is that good estimates of means and covariance matrices of the classes exist. However, as with Support Vector Machines, MPMC is computationally expensive and requires extensive cross validation experiments to choose kernels and kernel parameters that give good performance. In this paper we address the computational cost of MPMC by proposing an algorithm that constructs nonlinear sparse MPMC (SMPMC) models by incrementally adding basis functions (i.e. kernels) one at a time – greedily selecting the next one that maximizes the accuracy bound Ω. SMPMC automatically chooses both kernel parameters and feature weights without using computationally expensive cross validation. Therefore the SMPMC algorithm simultaneously addresses the problem of kernel selection and feature selection (i.e. feature weighting), based solely on maximizing the accuracy bound Ω. Experimental results indicate that we can obtain reliable bounds Ω, as well as test set accuracies that are comparable to state of the art classification algorithms.
5 0.60571516 112 nips-2003-Learning to Find Pre-Images
Author: Jason Weston, Bernhard Schölkopf, Gökhan H. Bakir
Abstract: We consider the problem of reconstructing patterns from a feature map. Learning algorithms using kernels to operate in a reproducing kernel Hilbert space (RKHS) express their solutions in terms of input points mapped into the RKHS. We introduce a technique based on kernel principal component analysis and regression to reconstruct corresponding patterns in the input space (aka pre-images) and review its performance in several applications requiring the construction of pre-images. The introduced technique avoids difficult and/or unstable numerical optimization, is easy to implement and, unlike previous methods, permits the computation of pre-images in discrete input spaces. 1
6 0.57317823 189 nips-2003-Tree-structured Approximations by Expectation Propagation
7 0.56716287 176 nips-2003-Sequential Bayesian Kernel Regression
8 0.56327248 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
9 0.54179692 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
10 0.52656382 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
11 0.5208391 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
12 0.51836073 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
13 0.51648843 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
14 0.51411021 122 nips-2003-Margin Maximizing Loss Functions
15 0.51162112 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
16 0.51107204 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
17 0.50650758 152 nips-2003-Pairwise Clustering and Graphical Models
18 0.50505137 17 nips-2003-A Sampled Texture Prior for Image Super-Resolution
19 0.50388235 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
20 0.50147247 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers