nips nips2003 nips2003-94 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Barber, Felix V. Agakov
Abstract: The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analogous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The IM Algorithm : A variational approach to Information Maximization David Barber Felix Agakov Institute for Adaptive and Neural Computation : www. [sent-1, score-0.079]
2 Abstract The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. [sent-7, score-0.248]
3 We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. [sent-8, score-0.278]
4 The resulting IM algorithm is analogous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. [sent-9, score-0.133]
5 1 Introduction The reliable communication of information over noisy channels is a widespread issue, ranging from the construction of good error-correcting codes to feature extraction[3, 12]. [sent-11, score-0.111]
6 The goal is to adjust parameters of the mapping p(y|x) to maximise I(x, y). [sent-14, score-0.146]
7 The key difficulty lies in the computation of the entropy of p(y) (a mixture). [sent-16, score-0.086]
8 One such tractable special case is if the mapping y = g(x; Θ) is deterministic and invertible, for which the difficult entropy term trivially becomes H(y) = ⟨log |J|⟩_{p(y)} + const. [sent-17, score-0.266]
9 Another tractable special case is if the source distribution p(x) is Gaussian and the mapping p(y|x) is Gaussian. [sent-20, score-0.159]
10 Figure 1: An illustration of the form of a more general mixture decoder: source x ∼ p(x), encoder p(y|x), mixture variable z ∼ p(z), decoder q(x|y, z). [sent-21, score-0.672]
11 In neural coding, a popular alternative is to maximise the Fisher ‘Information’[5]. [sent-26, score-0.146]
12 Other approaches use different objective criteria, such as average reconstruction error. [sent-27, score-0.088]
13 2 Variational Lower Bound on Mutual Information Since the MI is a measure of information transmission, our central aim is to maximise a lower bound on the MI. [sent-28, score-0.411]
14 Since we shall generally be interested in optimising MI with respect to the parameters of p(y|x), and p(x) is simply the data distribution, we need to bound H(x|y) suitably. [sent-30, score-0.315]
15 The Kullback-Leibler bound ∑_x p(x|y) log p(x|y) − ∑_x p(x|y) log q(x|y) ≥ 0 gives I(x, y) ≥ H(x) + ⟨log q(x|y)⟩_{p(x,y)} ≡ Ĩ(x, y). (3) [sent-31, score-0.588]
16 The two terms on the right of (3) are the 'entropy' H(x) and the 'energy' ⟨log q(x|y)⟩_{p(x,y)}, and q(x|y) is an arbitrary variational distribution. [sent-32, score-0.079]
17 The form of this bound is convenient since it explicitly includes both the encoder p(y|x) and decoder q(x|y), see fig(1). [sent-34, score-0.741]
18 However, our current experience suggests that the bound considered above is particularly computationally convenient. [sent-36, score-0.199]
19 Since the bound is based on the KL divergence, it is equivalent to a moment matching approximation of p(x|y) by q(x|y). [sent-37, score-0.229]
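As a concrete illustration (not from the paper), the following sketch checks the bound (3) numerically on a toy discrete channel: for any decoder q(x|y), H(x) + ⟨log q(x|y)⟩_{p(x,y)} never exceeds the exact mutual information, and the bound is tight when q(x|y) equals the exact posterior p(x|y). All numbers below are invented for illustration.

```python
import numpy as np

# Toy discrete channel: p(x) over 3 source symbols, p(y|x) over 2 channel outputs.
px = np.array([0.5, 0.3, 0.2])
pyx = np.array([[0.9, 0.1],     # p(y|x=0)
                [0.2, 0.8],     # p(y|x=1)
                [0.5, 0.5]])    # p(y|x=2)

pxy = px[:, None] * pyx         # joint p(x, y)
py = pxy.sum(axis=0)            # marginal p(y)
px_given_y = pxy / py           # exact posterior p(x|y)

Hx = -np.sum(px * np.log(px))
I_exact = np.sum(pxy * np.log(pxy / (px[:, None] * py)))

def mi_bound(q):
    """I_tilde = H(x) + <log q(x|y)>_{p(x,y)}, eq. (3), for a decoder table q[x, y]."""
    return Hx + np.sum(pxy * np.log(q))

q_poor = np.full((3, 2), 1.0 / 3)          # a deliberately uninformed decoder
print(mi_bound(q_poor), "<=", I_exact)     # strict lower bound (can even be negative)
print(mi_bound(px_given_y), "==", I_exact) # tight when q(x|y) = p(x|y)
```

Note that the bound can be negative for a poor decoder, consistent with the negative values mentioned for fig(4) later in the text.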
20 The IM algorithm To maximise the MI with respect to any parameters θ of p(y|x, θ), we aim to push up the lower bound (3). [sent-40, score-0.411]
21 First one needs to choose a class of variational distributions q(x|y) ∈ Q for which the energy term is tractable. [sent-41, score-0.11]
22 Then a natural recursive procedure for maximising Ĩ(X, Y) for given p(x) is: (1) for fixed p(y|x), maximise the bound with respect to the decoder q(x|y); (2) for fixed q(x|y), maximise the bound with respect to the parameters θ of p(y|x). [sent-42, score-0.094]
23 This procedure is analogous to the (G)EM algorithm which maximises a lower bound on the likelihood[9]. [sent-46, score-0.278]
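A minimal, fully concrete sketch of the IM iteration on a toy discrete channel (all details are illustrative assumptions, not the paper's implementation): step 1 fits the decoder for a fixed encoder, here simply the exact posterior since the toy problem is tractable, whereas in practice q is restricted to a tractable family; step 2 increases the bound with respect to the encoder parameters, here by a crude finite-difference gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)
px = np.array([0.5, 0.3, 0.2])                  # fixed source distribution p(x)
theta = rng.normal(size=(3, 2))                 # encoder logits; rows define p(y|x)

def encoder(t):
    e = np.exp(t - t.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def bound(t, q):
    """I_tilde = H(x) + <log q(x|y)>_{p(x,y)} for encoder logits t and decoder table q."""
    pxy = px[:, None] * encoder(t)
    return -np.sum(px * np.log(px)) + np.sum(pxy * np.log(q))

for _ in range(200):
    # Step 1 (decoder): for this tractable toy, the optimal q(x|y) is the exact posterior.
    pxy = px[:, None] * encoder(theta)
    q = pxy / pxy.sum(axis=0, keepdims=True)
    # Step 2 (encoder): crude finite-difference ascent on the bound w.r.t. theta.
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        t = theta.copy()
        t[idx] += 1e-5
        g[idx] = (bound(t, q) - bound(theta, q)) / 1e-5
    theta += 0.5 * g

pxy = px[:, None] * encoder(theta)
q = pxy / pxy.sum(axis=0, keepdims=True)
print("final lower bound on I(x,y) (nats):", bound(theta, q))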
24 Note that if |y| is large, the posterior p(x|y) will typically be sharply peaked around its mode. [sent-48, score-0.091]
25 This would motivate a simple approximation q(x|y) to the posterior. Figure 2: The MI optimal linear projection of data x (dots) is not always given by PCA. [sent-49, score-0.071]
26 PCA projects data onto the vertical line, for which the entropy conditional on the projection H(x|y) is large. [sent-50, score-0.081]
27 A simple approximation would then be to use a Laplace approximation to p(x|y) with inverse covariance elements [Σ⁻¹]ᵢⱼ = ∂² log p(x|y)/∂xᵢ∂xⱼ. [sent-54, score-0.15]
28 The bound presented here is arguably more general and appropriate than that presented in [5] since, whilst it also tends to the exact value of the MI in the limit of a large number of responses, it is a principled bound for any response dimension. [sent-56, score-0.572]
29 Even though we do not directly maximise the MI, we also indirectly maximise the probability of a correct reconstruction – a form of autoencoder. [sent-59, score-0.38]
30 Generalisation to Mixture Decoders A straightforward application of Jensen’s inequality leads to the more general result: I(X, Y) ≥ H(X) + ⟨log q(x|y, z)⟩_{p(y|x)p(x)q(z)} ≡ Ĩ(X, Y), where q(x|y, z) and q(z) are variational distributions. [sent-60, score-0.189]
31 The aim is to choose q(x|y, z) such that the bound is tractably computable. [sent-61, score-0.276]
32 The classical solution to this problem (which also minimises the linear reconstruction error) is given by PCA. [sent-64, score-0.088]
33 To see how we might improve on the PCA approach, we consider optimising our bound with respect to linear mappings. [sent-66, score-0.283]
34 For a decoder q(x|y) = N(m(y), Σ(y)), maximising the bound on MI is equivalent to minimising ∑_{µ=1}^{P} ⟨(x^µ − m(y))ᵀΣ⁻¹(y)(x^µ − m(y)) + log det Σ(y)⟩_{p(y|x^µ)}. For constant diagonal matrices Σ(y), this reduces to minimal mean square reconstruction error autoencoder training in the limit s² → 0. [sent-70, score-1.145]
35 Linear Gaussian Decoder A simple decoder is given by q(x|y) ∼ N(Uy, σ²I), for which Ĩ(x, y) ∝ 2 tr(UWS) − tr(UMUᵀ), (4) where S = ⟨xxᵀ⟩ = ∑_µ x^µ(x^µ)ᵀ/P is the sample covariance of the data, and M = Is² + WSWᵀ (5) is the covariance of the mixture distribution p(y). [sent-72, score-0.566]
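A small numerical sketch of expressions (4) and (5), under the assumption that the encoder is linear with additive noise of variance s²; the data, dimensions, and the closed-form choice of U below are illustrative (U = SWᵀM⁻¹ simply sets the gradient of the stated objective with respect to U to zero):

```python
import numpy as np

rng = np.random.default_rng(1)
P, dx, dy, s2 = 500, 5, 2, 0.1
X = rng.normal(size=(P, dx))                  # toy data, one row per sample x^mu
S = X.T @ X / P                               # sample covariance S = <x x^T>

W = rng.normal(size=(dy, dx))                 # linear encoder: y = Wx + noise of variance s^2
M = s2 * np.eye(dy) + W @ S @ W.T             # covariance of the mixture p(y), eq. (5)

# Bound (up to constants) for q(x|y) = N(Uy, sigma^2 I): 2 tr(UWS) - tr(U M U^T), eq. (4).
U = S @ W.T @ np.linalg.inv(M)                # stationary point of the objective in U
I_tilde = 2 * np.trace(U @ W @ S) - np.trace(U @ M @ U.T)
print("bound-related objective:", I_tilde)
```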
36 For noisy channels, unconstrained optimization of (6) leads to a divergence of the matrix norm ‖WWᵀ‖ → ∞; a norm-constrained optimisation in general produces a different result to PCA. [sent-75, score-0.179]
37 The simplicity of the linear decoder in this case severely limits any potential improvement over PCA, and certainly would not resolve the issue in fig(2). [sent-76, score-0.443]
38 For this, a non-linear decoder q(x|y) is required, for which the integrals become more complex. [sent-77, score-0.482]
39 Non-linear Encoders and Kernel PCA An alternative to using non-linear decoders to improve on PCA is to use a non-linear encoder. [sent-78, score-0.123]
40 In the zero-noise limit the optimal solution for the encoder results in non-linear PCA on the covariance Φ(x)Φ(x)ᵀ of the transformed data. [sent-80, score-0.217]
41 By Mercer’s theorem, the elements of the covariance matrix may be replaced by a kernel function of the user’s choice [8]. [sent-81, score-0.1]
42 An advantage of our framework is that our bound enables the principled comparison of embedding functions/kernels. [sent-82, score-0.242]
43 4 Binary Responses (Neural Coding) In a neurobiological context, a popular issue is how to encode real-valued stimuli in a population of spiking neurons. [sent-83, score-0.087]
44 Here we look briefly at a simple case in which each neuron fires (yᵢ = 1) with increasing probability the further the membrane potential wᵢᵀx is above threshold −bᵢ. [sent-84, score-0.25]
45 Independent neural firing suggests p(y|x) = ∏ᵢ p(yᵢ|x), with p(yᵢ|x) ≡ σ(yᵢ(wᵢᵀx + bᵢ)). (7) [sent-85, score-0.14]
46 Figure 3: Top row: a subset of the original real-valued source data. [sent-86, score-0.087]
47 Middle row: after training, 20 samples from each of the 7 output units, for each of the corresponding source inputs. [sent-87, score-0.087]
48 Bottom row: Reconstruction of the source data from 50 samples of the output units. [sent-88, score-0.087]
49 This results in a visibly larger bottom loop of the 8th reconstructed pattern, which agrees with the original source data. [sent-90, score-0.116]
50 In this case, exact evaluation of the bound (3) is straightforward, since it only involves computations of the second-order moments of y over the factorized distribution. [sent-94, score-0.233]
51 A reasonable reconstruction of the source x from its representation y will be given by the mean x̃ = ⟨x⟩_{q(x|y)} of the learned approximate posterior. [sent-95, score-0.207]
52 In noisy channels we need to average over multiple possible representations, i.e. over samples of y drawn from p(y|x). [sent-96, score-0.111]
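A sketch of how sampling from the binary encoder (7) and the sample-averaged reconstruction might look, assuming the convention yᵢ ∈ {−1, +1} so that σ(a) + σ(−a) = 1; the weights, stimulus, and the linear-Gaussian decoder mentioned in the final comment are placeholders, not the trained model from the experiment:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
dx, dy = 196, 7                                  # input / output sizes used in the text
W = rng.normal(scale=0.1, size=(dy, dx))         # placeholder encoder weights
b = np.zeros(dy)                                 # placeholder thresholds

def sample_y(x, n_samples=50):
    """Samples from p(y|x) = prod_i sigma(y_i (w_i^T x + b_i)), with y_i in {-1, +1}."""
    p_fire = sigmoid(W @ x + b)                  # probability that y_i = +1
    u = rng.random(size=(n_samples, dy))
    return np.where(u < p_fire, 1.0, -1.0)

x = rng.normal(size=dx)                          # a placeholder stimulus
Y = sample_y(x)
# With a linear-Gaussian decoder q(x|y) = N(Uy + c, s^2 I), the averaged reconstruction
# described in the text would be x_tilde = mean over samples of (U @ y + c).
print(Y.shape, Y.mean(axis=0))
```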
53 We performed reconstruction of continuous source data from stochastic binary responses for |x| = 196 input and |y| = 7 output units. [sent-99, score-0.207]
54 The bound was optimized with respect to the parameters of p(y|x) and q(x|y) with isotropic norm constraints on W and b for 30 instances of digits 1 and 8 (15 of each class). [sent-100, score-0.199]
55 The source variables were reconstructed from 50 samples of the corresponding binary representations at the mean of the learned q(x|y), see fig(3). [sent-101, score-0.148]
56 Each user j = 1, . . . , M wishes to send a bit sⱼ ∈ {0, 1} of information to a base station. [sent-105, score-0.209]
57 To send sⱼ = 1, she transmits an N-dimensional real-valued vector gⱼ, which represents a time-discretised waveform (sⱼ = 0 corresponds to no transmission). [sent-106, score-0.168]
58 The simultaneous transmissions from all users result in a received signal at the base station of rᵢ = ∑ⱼ gᵢʲsⱼ + ηᵢ, i = 1, . . . , N. [sent-107, score-0.2]
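A minimal simulation of this received-signal model (dimensions, waveforms and noise level are invented; the thresholded G⁻¹r baseline corresponds to the 'inverse' decoder discussed below):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, sigma = 8, 8, 0.3                       # users, signal dimension, noise std
G = rng.normal(size=(N, M)) / np.sqrt(N)      # column j is user j's waveform g_j
s = rng.integers(0, 2, size=M).astype(float)  # bits s_j in {0, 1}
r = G @ s + sigma * rng.normal(size=N)        # r_i = sum_j G_ij s_j + eta_i

# Simple baseline: invert G and threshold at 1/2 (the 'inverse' decoder discussed below).
s_hat = (np.linalg.solve(G, r) > 0.5).astype(float)
print("bit errors:", int(np.sum(s_hat != s)))
```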
59 The task for the base station (which knows G) is to decode the received vector r so that s can be recovered reliably. [sent-112, score-0.071]
60 Using Bayes’ rule, p(s|r) ∝ p(r|s)p(s), and assuming a flat prior on s, p(s|r) ∝ exp{−(−2rᵀGs + sᵀGᵀGs)/(2σ²)}. (8) Computing either the MAP solution arg max_s p(s|r) or the MPM solution arg max_{sⱼ} p(sⱼ|r), j = 1, . . . [sent-114, score-0.08]
61 If GᵀG is diagonal, optimal decoding is easy, since the posterior factorises, with p(sⱼ|r) ∝ exp{(2∑ᵢ rᵢGᵢⱼ − Dⱼⱼ)sⱼ/(2σ²)}, where the diagonal matrix D = GᵀG (and we used sᵢ² ≡ sᵢ for sᵢ ∈ {0, 1}). [sent-118, score-0.73]
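A short numerical check of this factorised posterior, using a G constructed so that GᵀG is exactly diagonal (all values illustrative); the expression follows from (8) together with sⱼ² = sⱼ:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(4)
N, M, sigma = 16, 4, 0.2
Q, _ = np.linalg.qr(rng.normal(size=(N, M)))    # orthonormal columns
G = Q * np.array([1.0, 0.8, 1.2, 0.9])          # rescale columns; G^T G stays diagonal
s = rng.integers(0, 2, size=M).astype(float)
r = G @ s + sigma * rng.normal(size=N)

D = np.diag(G.T @ G)                            # diagonal of D = G^T G
p_sj_is_1 = sigmoid((2 * G.T @ r - D) / (2 * sigma**2))
print("p(s_j = 1 | r):", np.round(p_sj_is_1, 3), " true s:", s)
```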
62 For suitably randomly chosen matrices G, GᵀG will be approximately diagonal in the limit of large N. [sent-119, score-0.08]
63 However, ideally, one would like to construct decoders that perform near-optimal decoding without recourse to the approximate diagonality of GᵀG. [sent-120, score-0.315]
64 The MAP decoder solves the problem min_{s∈{0,1}^N} sᵀGᵀGs − 2sᵀGᵀr ≡ min_{s∈{0,1}^N} (s − G⁻¹r)ᵀGᵀG(s − G⁻¹r), and hence the MAP solution is that s which is closest to the vector G⁻¹r. [sent-121, score-0.474]
65 The difficulty lies in the meaning of ‘closest’ since the space is non-isotropically warped by the matrix GᵀG. [sent-122, score-0.065]
66 A useful guess for the decoder is the s that is closest in the Euclidean sense to the vector G⁻¹r. [sent-123, score-0.474]
67 Computing the Mutual Information Of prime interest in CDMA is the evaluation of decoders in the case of nonorthogonal matrices G[11]. [sent-125, score-0.123]
68 In this respect, a principled comparison of decoders can be obtained by evaluating the corresponding bound on the MI, I(r, s) ≡ H(s) − H(s|r) ≥ H(s) + ∑_s ∫_r p(s)p(r|s) log q(s|r), (9) where H(s) is trivially given by M (bits). [sent-126, score-0.508]
69 We make the specific assumption in the following that our decoding algorithm takes the factorised form q(s|r) = ∏ᵢ q(sᵢ|r) and, without loss of generality, we may write q(sᵢ|r) = σ((2sᵢ − 1)fᵢ(r)) (10) for some decoding function fᵢ(r). [sent-128, score-0.499]
70 We restrict interest here to the case of simple linear decoding functions fᵢ(r) = aᵢ + ∑ⱼ wᵢⱼrⱼ. [sent-129, score-0.192]
71 Since p(r|s) is Gaussian, (2sᵢ − 1)fᵢ(r) ≡ xᵢ is also Gaussian, p(xᵢ|s) = N(µᵢ(s), varᵢ), with µᵢ(s) ≡ (2sᵢ − 1)(aᵢ + wᵢᵀGs) and varᵢ ≡ σ²wᵢᵀwᵢ, where wᵢ is the ith row of the matrix [W]ᵢⱼ ≡ wᵢⱼ. [sent-130, score-1.265]
72 Hence −H(s|r) ≥ ∑ᵢ ⟨(1/√(2πσ²wᵢᵀwᵢ)) ∫_{−∞}^{∞} log σ(x) exp{−[x − (2sᵢ − 1)(aᵢ + wᵢᵀGs)]²/(2σ²wᵢᵀwᵢ)} dx⟩_{p(s)}. (11) In general, the average over the factorised distribution p(s) can be evaluated by using the Fourier Transform [1]. [sent-131, score-1.057]
73 However, to retain clarity here, we constrain the decoding matrix W so that wᵢᵀGs = bᵢsᵢ, i.e. W = diag(b)G⁻¹. [sent-132, score-0.709]
74 Figure 4: The bound given by the decoder W ∝ G⁻¹ plotted against the optimised bound (for the same G) found using 50 updates of conjugate gradients. [sent-136, score-0.904]
75 For clarity, a small number of poor results (in which the bound is negative) have been omitted. [sent-138, score-0.199]
76 (Figure 4 axis label: MI bound for Inverse Decoder.) The bound then reduces to a sum of one-dimensional integrals, each of which can be evaluated numerically. [sent-156, score-0.229]
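For example, each one-dimensional integral has the form of a Gaussian expectation of log σ(x), which can be evaluated by Gauss-Hermite quadrature; the mean and variance below are placeholders:

```python
import numpy as np

def expected_log_sigmoid(mu, var, n_points=60):
    """E_{x ~ N(mu, var)}[log sigma(x)] via Gauss-Hermite quadrature."""
    z, w = np.polynomial.hermite_e.hermegauss(n_points)   # probabilists' Hermite rule
    x = mu + np.sqrt(var) * z
    log_sig = -np.logaddexp(0.0, -x)                       # numerically stable log sigma(x)
    return np.sum(w * log_sig) / np.sqrt(2.0 * np.pi)

# One term of the sum in (11), for an illustrative mean and variance:
print(expected_log_sigmoid(mu=1.5, var=0.4))
```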
77 In the case of an orthogonal matrix GᵀG = D the decoding function is optimal and the MI bound is exact, with the parameters in (12) set to aᵢ = −[GᵀG]ᵢᵢ/(2σ²), W = Gᵀ/σ², bᵢ = [GᵀG]ᵢᵢ/σ². [sent-157, score-0.628]
78 Optimising the linear decoder In the case that GᵀG is non-diagonal, what is the optimal linear decoder? [sent-158, score-0.478]
79 A partial answer is given by numerically optimising the bound from (11). [sent-159, score-0.283]
80 Using W = diag(b)G⁻¹, σ²wᵢᵀwᵢ = σ²bᵢ²∑ⱼ([G⁻¹]ᵢⱼ)², and the bound depends only on a and b. [sent-161, score-0.699]
81 Under this constraint the bound can be numerically optimised as a function of a and b, given the fixed vector with elements ∑ⱼ([G⁻¹]ᵢⱼ)². [sent-162, score-0.262]
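A sketch of how such a numerical optimisation might look, using a Monte Carlo estimate of the bound (9) in place of the exact quadrature evaluation and a generic quasi-Newton optimiser; the matrix G, noise level and sample sizes are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
M, sigma, n_mc = 4, 0.3, 2000
G = rng.normal(size=(M, M))
Ginv = np.linalg.inv(G)

# Monte Carlo draws of (s, r) from p(s)p(r|s), reused for every objective evaluation.
S = rng.integers(0, 2, size=(n_mc, M)).astype(float)
R = S @ G.T + sigma * rng.normal(size=(n_mc, M))

def neg_bound(params):
    a, b = params[:M], params[M:]
    W = np.diag(b) @ Ginv                        # the constrained decoder family W = diag(b) G^{-1}
    f = a + R @ W.T                              # f_i(r) = a_i + w_i^T r
    # log q(s|r) = sum_i log sigma((2 s_i - 1) f_i(r)); H(s) = M log 2 is a constant.
    logq = -np.logaddexp(0.0, -(2 * S - 1) * f).sum(axis=1)
    return -(M * np.log(2) + logq.mean())

x0 = np.concatenate([np.zeros(M), np.ones(M)])   # start from a = 0, b = 1
res = minimize(neg_bound, x0, method="L-BFGS-B")
print("optimised MI bound estimate (nats):", -res.fun)
```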
82 As an alternative we can employ the decorrelation decoder, W = G⁻¹/σ², with aᵢ = −1/(2σ²). [sent-163, score-0.124]
83 In fig(4) we see that, according to our bound, the decorrelation (or ‘inverse’) decoder is suboptimal versus the linear decoder fᵢ(r) = aᵢ + wᵢᵀr with W = diag(b)G⁻¹, optimised over a and b. [sent-164, score-1.381]
84 These initial results are encouraging, and motivate further investigations, for example, using syndrome decoding for CDMA. [sent-165, score-0.228]
85 6 Posterior Approximations There is an interesting relationship between maximising the bound on the MI and computing an optimal estimate q(s|r) of an intractable posterior p(s|r). [sent-166, score-0.419]
86 The optimal bit error solution sets q(si |r) to the mean of the exact posterior marginal p(si |r). [sent-167, score-0.298]
87 Mean Field Theory approximates the posterior marginal by minimising the KL divergence KL(q||p) = ∑_s (q(s|r) log q(s|r) − q(s|r) log p(s|r)), where q(s|r) = ∏ᵢ q(sᵢ|r). [sent-168, score-0.42]
88 In this case, the KL divergence is tractably computable (up to a negligible prefactor). [sent-169, score-0.115]
89 However, this form of the KL divergence chooses q(sᵢ|r) according to any one of a very large number of local modes of the posterior distribution p(s|r). [sent-170, score-0.161]
90 Since the optimal choice is the posterior marginal mean, this is why Mean Field decoding is generally suboptimal. [sent-171, score-0.417]
91 Alternatively, consider KL(p||q) = ∑_s (p(s|r) log p(s|r) − p(s|r) log q(s|r)) = −∑_s p(s|r) log q(s|r) + const. [sent-172, score-0.33]
92 This is the correct KL divergence in the sense that, optimally, q(sᵢ|r) = p(sᵢ|r), that is, the posterior marginal is correctly calculated. [sent-173, score-0.228]
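A two-bit toy example (probabilities invented) contrasting the two divergences: matching the true marginals minimises KL(p||q) over factorised q, whereas the mean-field objective KL(q||p) can prefer locking onto a single mode:

```python
import numpy as np
from itertools import product

# A strongly correlated two-bit posterior p(s|r); the numbers are invented.
p = {(0, 0): 0.49, (1, 1): 0.49, (0, 1): 0.01, (1, 0): 0.01}

# Minimising KL(p||q) over factorised q recovers the true marginals:
m1 = sum(v for (s1, _), v in p.items() if s1 == 1)   # p(s1 = 1) = 0.5
m2 = sum(v for (_, s2), v in p.items() if s2 == 1)   # p(s2 = 1) = 0.5

def kl_q_p(q1, q2):
    """KL(q||p) for factorised q(s) = q1(s1) q2(s2), i.e. the mean-field objective."""
    total = 0.0
    for s1, s2 in product([0, 1], repeat=2):
        q = (q1 if s1 else 1 - q1) * (q2 if s2 else 1 - q2)
        total += q * np.log(q / p[(s1, s2)])
    return total

# The marginal-matching q is not the mean-field minimiser: mode-locked settings score lower.
print("KL(q||p) at true marginals:", kl_q_p(m1, m2))
print("KL(q||p) mode-locked:      ", kl_q_p(0.99, 0.99), kl_q_p(0.01, 0.01))
```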
93 The difficulty lies in performing averages with respect to p(s|r), which are generally intractable. [sent-174, score-0.067]
94 Since we will have a distribution p(r), it is reasonable to provide an averaged objective function, ∫_r ∑_s p(r)p(s|r) log q(s|r) = ∑_s ∫_r p(s)p(r|s) log q(s|r). (13) [sent-175, score-0.22]
95 Whilst, for any given r, we cannot calculate the best posterior marginal estimate, we may be able to calculate the best posterior marginal estimate on average. [sent-176, score-0.316]
96 Whilst the bound is straightforward, it appears to have attracted little previous attention as a practical tool for MI optimisation. [sent-180, score-0.199]
97 It is a more direct approach to optimal coding than using the Fisher ‘Information’ in neurobiological population encoding. [sent-182, score-0.17]
98 Our bound enables a principled comparison of different information maximisation algorithms, and may have applications in other areas of machine learning and Information Theory, such as error-correction. [sent-183, score-0.287]
99 Jaakkola, T. and Jordan, M. I., Improving the mean field approximation via the use of mixture distributions, Proceedings of the NATO ASI on Learning in Graphical Models, Kluwer, 1997. [sent-207, score-0.075]
100 Wainwright, M. J., Jaakkola, T. S. and Willsky, A. S., A new class of upper bounds on the log partition function, Uncertainty in Artificial Intelligence, 2002. [sent-248, score-0.11]
wordName wordTfidf (topN-words)
[('decoder', 0.443), ('mi', 0.284), ('gt', 0.276), ('wi', 0.25), ('gs', 0.227), ('bound', 0.199), ('decoding', 0.192), ('maximise', 0.146), ('cdma', 0.142), ('di', 0.123), ('si', 0.123), ('decoders', 0.123), ('log', 0.11), ('sj', 0.099), ('encoder', 0.099), ('pca', 0.096), ('kl', 0.094), ('maximising', 0.094), ('posterior', 0.091), ('reconstruction', 0.088), ('mutual', 0.088), ('source', 0.087), ('optimising', 0.084), ('bi', 0.081), ('variational', 0.079), ('divergence', 0.07), ('decorrelation', 0.067), ('marginal', 0.067), ('channels', 0.064), ('optimised', 0.063), ('field', 0.06), ('fisher', 0.06), ('transmission', 0.06), ('def', 0.059), ('im', 0.059), ('ij', 0.058), ('fi', 0.058), ('ai', 0.057), ('autoencoder', 0.057), ('factorised', 0.057), ('swt', 0.057), ('uy', 0.057), ('whilst', 0.054), ('diag', 0.053), ('entropy', 0.051), ('barber', 0.049), ('coding', 0.048), ('noisy', 0.047), ('erent', 0.045), ('mins', 0.045), ('neurobiological', 0.045), ('tractably', 0.045), ('maximises', 0.045), ('maximisation', 0.045), ('vari', 0.045), ('principled', 0.043), ('limit', 0.043), ('mixture', 0.043), ('population', 0.042), ('tractable', 0.042), ('minimising', 0.042), ('opper', 0.042), ('saad', 0.042), ('arg', 0.04), ('covariance', 0.04), ('send', 0.039), ('station', 0.039), ('integrals', 0.039), ('bit', 0.039), ('culty', 0.039), ('diagonal', 0.037), ('compression', 0.037), ('motivate', 0.036), ('optimal', 0.035), ('lies', 0.035), ('lower', 0.034), ('exact', 0.034), ('clarity', 0.033), ('trivially', 0.033), ('generally', 0.032), ('mean', 0.032), ('optimisation', 0.032), ('base', 0.032), ('em', 0.032), ('responses', 0.032), ('aim', 0.032), ('energy', 0.031), ('closest', 0.031), ('jensen', 0.031), ('special', 0.03), ('vertical', 0.03), ('matching', 0.03), ('row', 0.03), ('matrix', 0.03), ('dimensional', 0.03), ('advanced', 0.03), ('users', 0.03), ('blind', 0.029), ('reconstructed', 0.029), ('ring', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999917 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
Author: David Barber, Felix V. Agakov
Abstract: The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analogous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA. 1
2 0.19707127 155 nips-2003-Perspectives on Sparse Bayesian Learning
Author: Jason Palmer, Bhaskar D. Rao, David P. Wipf
Abstract: Recently, relevance vector machines (RVM) have been fashioned from a sparse Bayesian learning (SBL) framework to perform supervised learning using a weight prior that encourages sparsity of representation. The methodology incorporates an additional set of hyperparameters governing the prior, one for each weight, and then adopts a specific approximation to the full marginalization over all weights and hyperparameters. Despite its empirical success however, no rigorous motivation for this particular approximation is currently available. To address this issue, we demonstrate that SBL can be recast as the application of a rigorous variational approximation to the full model by expressing the prior in a dual form. This formulation obviates the necessity of assuming any hyperpriors and leads to natural, intuitive explanations of why sparsity is achieved in practice. 1
3 0.18096314 193 nips-2003-Variational Linear Response
Author: Manfred Opper, Ole Winther
Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.
4 0.16159476 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
Author: Ryan C. Kelly, Tai Sing Lee
Abstract: Decoding is a strategy that allows us to assess the amount of information neurons can provide about certain aspects of the visual scene. In this study, we develop a method based on Bayesian sequential updating and the particle filtering algorithm to decode the activity of V1 neurons in awake monkeys. A distinction in our method is the use of Volterra kernels to filter the particles, which live in a high dimensional space. This parametric Bayesian decoding scheme is compared to the optimal linear decoder and is shown to work consistently better than the linear optimal decoder. Interestingly, our results suggest that for decoding in real time, spike trains of as few as 10 independent but similar neurons would be sufficient for decoding a critical scene variable in a particular class of visual stimuli. The reconstructed variable can predict the neural activity about as well as the actual signal with respect to the Volterra kernels. 1
5 0.12531817 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
Author: Haifeng Li, Tao Jiang, Keshu Zhang
Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminate analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw . Such a dependence is a major drawback of LDA especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. 1
6 0.10632102 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
7 0.10556773 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
8 0.098731562 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
9 0.096144356 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
10 0.093887165 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations
11 0.090661354 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
12 0.089722022 129 nips-2003-Minimising Contrastive Divergence in Noisy, Mixed-mode VLSI Neurons
13 0.087024495 32 nips-2003-Approximate Expectation Maximization
14 0.085382111 92 nips-2003-Information Bottleneck for Gaussian Variables
15 0.082099892 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
16 0.078820556 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
17 0.077159084 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
18 0.072970279 51 nips-2003-Design of Experiments via Information Theory
19 0.072660439 55 nips-2003-Distributed Optimization in Adaptive Networks
20 0.072486199 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
topicId topicWeight
[(0, -0.253), (1, -0.048), (2, 0.034), (3, 0.08), (4, 0.154), (5, 0.046), (6, 0.208), (7, -0.113), (8, -0.063), (9, -0.116), (10, -0.077), (11, -0.183), (12, -0.047), (13, -0.025), (14, 0.024), (15, 0.085), (16, -0.04), (17, 0.033), (18, -0.024), (19, -0.029), (20, -0.054), (21, -0.109), (22, 0.139), (23, -0.057), (24, 0.141), (25, -0.01), (26, -0.112), (27, 0.104), (28, -0.14), (29, -0.091), (30, 0.001), (31, -0.086), (32, 0.112), (33, 0.172), (34, -0.023), (35, 0.005), (36, 0.06), (37, 0.042), (38, -0.054), (39, 0.066), (40, 0.051), (41, -0.042), (42, 0.018), (43, -0.041), (44, -0.16), (45, -0.042), (46, 0.034), (47, -0.036), (48, -0.026), (49, -0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.96287918 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
Author: David Barber, Felix V. Agakov
Abstract: The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analogous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA. 1
2 0.75053841 193 nips-2003-Variational Linear Response
Author: Manfred Opper, Ole Winther
Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.
3 0.72159326 155 nips-2003-Perspectives on Sparse Bayesian Learning
Author: Jason Palmer, Bhaskar D. Rao, David P. Wipf
Abstract: Recently, relevance vector machines (RVM) have been fashioned from a sparse Bayesian learning (SBL) framework to perform supervised learning using a weight prior that encourages sparsity of representation. The methodology incorporates an additional set of hyperparameters governing the prior, one for each weight, and then adopts a specific approximation to the full marginalization over all weights and hyperparameters. Despite its empirical success however, no rigorous motivation for this particular approximation is currently available. To address this issue, we demonstrate that SBL can be recast as the application of a rigorous variational approximation to the full model by expressing the prior in a dual form. This formulation obviates the necessity of assuming any hyperpriors and leads to natural, intuitive explanations of why sparsity is achieved in practice. 1
4 0.45209593 129 nips-2003-Minimising Contrastive Divergence in Noisy, Mixed-mode VLSI Neurons
Author: Hsin Chen, Patrice Fleury, Alan F. Murray
Abstract: This paper presents VLSI circuits with continuous-valued probabilistic behaviour realized by injecting noise into each computing unit(neuron). Interconnecting the noisy neurons forms a Continuous Restricted Boltzmann Machine (CRBM), which has shown promising performance in modelling and classifying noisy biomedical data. The Minimising-Contrastive-Divergence learning algorithm for CRBM is also implemented in mixed-mode VLSI, to adapt the noisy neurons’ parameters on-chip. 1
5 0.44094375 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
Author: Michael I. Jordan, Martin J. Wainwright
Abstract: We present a new method for calculating approximate marginals for probability distributions defined by graphs with cycles, based on a Gaussian entropy bound combined with a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant maximization problem that can be solved by efficient interior point methods [8]. As with the Bethe approximation and its generalizations [12], the optimizing arguments of this problem can be taken as approximations to the exact marginals. In contrast to Bethe/Kikuchi approaches, our variational problem is strictly convex and so has a unique global optimum. An additional desirable feature is that the value of the optimal solution is guaranteed to provide an upper bound on the log partition function. In experimental trials, the performance of the log-determinant relaxation is comparable to or better than the sum-product algorithm, and by a substantial margin for certain problem classes. Finally, the zero-temperature limit of our log-determinant relaxation recovers a class of well-known semidefinite relaxations for integer programming [e.g., 3]. 1
6 0.44057009 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
7 0.43330136 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
8 0.4269208 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
9 0.41303 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
10 0.40737358 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations
11 0.40035766 32 nips-2003-Approximate Expectation Maximization
12 0.39636555 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis
13 0.37495258 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
14 0.3748402 66 nips-2003-Extreme Components Analysis
15 0.35804155 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
16 0.340747 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
17 0.33790198 92 nips-2003-Information Bottleneck for Gaussian Variables
18 0.33468768 51 nips-2003-Design of Experiments via Information Theory
19 0.33338737 100 nips-2003-Laplace Propagation
20 0.3315233 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
topicId topicWeight
[(0, 0.03), (11, 0.024), (29, 0.023), (30, 0.016), (35, 0.084), (49, 0.011), (53, 0.156), (69, 0.014), (71, 0.065), (76, 0.069), (84, 0.245), (85, 0.102), (91, 0.067), (99, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.85347819 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
Author: David Barber, Felix V. Agakov
Abstract: The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analogous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA. 1
2 0.8424027 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
3 0.66405851 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
4 0.66388816 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
5 0.66351211 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
Author: Yuanqing Li, Shun-ichi Amari, Sergei Shishkin, Jianting Cao, Fanji Gu, Andrzej S. Cichocki
Abstract: In this paper, sparse representation (factorization) of a data matrix is first discussed. An overcomplete basis matrix is estimated by using the K−means method. We have proved that for the estimated overcomplete basis matrix, the sparse solution (coefficient matrix) with minimum l1 −norm is unique with probability of one, which can be obtained using a linear programming algorithm. The comparisons of the l1 −norm solution and the l0 −norm solution are also presented, which can be used in recoverability analysis of blind source separation (BSS). Next, we apply the sparse matrix factorization approach to BSS in the overcomplete case. Generally, if the sources are not sufficiently sparse, we perform blind separation in the time-frequency domain after preprocessing the observed data using the wavelet packets transformation. Third, an EEG experimental data analysis example is presented to illustrate the usefulness of the proposed approach and demonstrate its performance. Two almost independent components obtained by the sparse representation method are selected for phase synchronization analysis, and their periods of significant phase synchronization are found which are related to tasks. Finally, concluding remarks review the approach and state areas that require further study. 1
6 0.66313446 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
7 0.66256946 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
8 0.66215271 66 nips-2003-Extreme Components Analysis
9 0.65890086 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
10 0.65874815 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
11 0.65873307 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
12 0.65781879 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
13 0.65677106 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty
14 0.65643108 112 nips-2003-Learning to Find Pre-Images
15 0.65609163 189 nips-2003-Tree-structured Approximations by Expectation Propagation
16 0.65587515 78 nips-2003-Gaussian Processes in Reinforcement Learning
17 0.65389043 115 nips-2003-Linear Dependent Dimensionality Reduction
18 0.65359986 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
19 0.65261078 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
20 0.65256017 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification