nips nips2005 nips2005-65 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e. [sent-4, score-0.338]
2 the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. [sent-6, score-0.144]
3 Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. [sent-9, score-0.603]
4 The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e. [sent-10, score-0.193]
5 an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. [sent-12, score-0.311]
6 En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. [sent-13, score-0.768]
7 We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. [sent-14, score-0.217]
8 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. [sent-15, score-0.344]
9 1 Introduction Consider the problem of joint learning (parameter estimation) and prediction in a Markov random field (MRF): in the learning phase, an initial collection of data is used to estimate parameters, and the fitted model is then used to perform prediction (e.g., smoothing, denoising, interpolation). [sent-16, score-0.356]
10 by using computationally tractable methods versus exact methods (i.e. [sent-23, score-0.074]
11 Route A: computationally intractable combination of parameter estimation and prediction. [sent-27, score-0.196]
12 Route B: computationally efficient combination of approximate parameter estimation and prediction. [sent-28, score-0.218]
13 This paper focuses on the analysis of variational methods based on convex relaxations, a class that includes a broad range of extant algorithms—among them the tree-reweighted sum-product algorithm [11], reweighted forms of generalized belief propagation [13], and semidefinite relaxations [12]. [sent-30, score-0.854]
14 At a high level, the key idea of this paper is the following: given that approximate methods can lead to errors at both the estimation and prediction phases, it is natural to speculate that these sources of error might be arranged to partially cancel one another. [sent-34, score-0.27]
15 Our theoretical analysis confirms this intuition: we show that with respect to end-to-end performance, it is in fact beneficial, even in the infinite data limit, to learn the “wrong” model by using an inconsistent parameter estimator. [sent-35, score-0.125]
16 More specifically, we show how any convex variational method can be used to define a surrogate likelihood function. [sent-36, score-0.7]
17 We then investigate the asymptotic properties of parameter estimators based on maximizing such surrogate likelihoods, and establish that they are asymptotically normal but inconsistent in general. [sent-37, score-0.574]
18 We then prove that any variational method that is based on a strongly concave entropy approximation is globally Lipschitz stable. [sent-38, score-0.335]
19 Finally, focusing on prediction for a coupled mixture of Gaussians, we prove upper bounds on the increase in MSE of our computationally efficient method, relative to the unachievable Bayes optimum. [sent-39, score-0.384]
20 We provide experimental results using the tree-reweighted (TRW) sum-product algorithm that confirm the stability of our methods, and demonstrate their superior performance relative to a heuristic method based on standard sum-product. [sent-40, score-0.088]
21 2 Background We begin with necessary notation and background on multinomial Markov random fields, as well as variational representations and methods. [sent-41, score-0.168]
22 Overall, the family of MRFs (1) is an exponential family with canonical parameter θ ∈ Rd . [sent-48, score-0.059]
23 Variational representation: We now describe how the cumulant generating function can be represented as the solution of an optimization problem. [sent-50, score-0.154]
24 The constraint set is given by MARG(G; φ) := { µ ∈ Rd | µ = Σ_{x ∈ X^N} p(x) φ(x) for some p(·) }, consisting of all globally realizable singleton µs (·) and pairwise µst (· , ·) marginal distributions on the graph G. [sent-51, score-0.171]
25 With these definitions, it can be shown [12] that A has the variational representation A(θ) = max_{µ ∈ MARG(G;φ)} { θT µ − A∗ (µ) }. (2) [sent-53, score-0.168]
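As background (a standard exponential-family identity, stated here for orientation and not recovered from the extracted text): the dual function A∗ in (2) is the convex conjugate of A, and for µ in the interior of MARG(G; φ) it equals the negative entropy of the exponential-family member p_{θ(µ)} whose mean parameters are µ,

A∗(µ) = sup_{θ ∈ Rd} { θT µ − A(θ) } = −H( p_{θ(µ)} ),

so that the maximum in (2) is attained at the exact marginal vector µ(θ) = ∇A(θ).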
26 3 From convex surrogates to joint estimation/prediction In general, solving the variational problem (2) is intractable for two reasons: (i) the constraint set MARG(G; φ) is extremely difficult to characterize; and (ii) the dual function A∗ lacks a closed-form representation. [sent-54, score-0.508]
27 These challenges motivate approximations to A∗ and MARG(G; φ); the resulting relaxed optimization problem defines a convex surrogate to the cumulant generating function. [sent-55, score-0.743]
28 Convex surrogates: Let REL(G; φ) be a compact and convex outer bound to the marginal polytope MARG(G; φ), and let B ∗ be a strictly convex and twice continuously differentiable approximation to the dual function A∗ . [sent-56, score-0.45]
29 We use these approximations to define a convex surrogate B via the relaxed optimization problem B(θ) := max_{τ ∈ REL(G;φ)} { θT τ − B ∗ (τ ) }. [sent-57, score-0.589]
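As a concrete illustration, the following toy sketch (my own example; the probability simplex and the negative entropy stand in for REL(G; φ) and B∗, and all names are illustrative) evaluates such a surrogate numerically. With these choices B(θ) is exactly the log-sum-exp function and the optimizer τ(θ) = ∇B(θ) is the softmax of θ, which the script checks.

```python
# Toy convex surrogate B(theta) = max_{tau in REL} { theta^T tau - B*(tau) }.
# Here REL is the probability simplex and B*(tau) = sum_i tau_i log tau_i,
# so B(theta) should equal log-sum-exp(theta) and tau(theta) = softmax(theta).
import numpy as np
from scipy.optimize import minimize

def surrogate_value_and_pseudomarginal(theta):
    d = len(theta)

    def neg_objective(tau):
        tau = np.clip(tau, 1e-12, 1.0)         # keep the log well-defined
        b_star = np.sum(tau * np.log(tau))     # toy B*(tau): negative entropy
        return -(theta @ tau - b_star)         # minimize the negated objective

    constraints = ({'type': 'eq', 'fun': lambda tau: np.sum(tau) - 1.0},)
    bounds = [(0.0, 1.0)] * d
    tau0 = np.full(d, 1.0 / d)
    res = minimize(neg_objective, tau0, method='SLSQP',
                   bounds=bounds, constraints=constraints)
    return -res.fun, res.x                     # B(theta), tau(theta)

theta = np.array([0.5, -1.0, 2.0])
B, tau = surrogate_value_and_pseudomarginal(theta)
print(B, np.log(np.exp(theta).sum()))              # ~ equal: log-sum-exp
print(tau, np.exp(theta) / np.exp(theta).sum())    # ~ equal: softmax(theta)
```

Any strictly convex, twice differentiable B∗ over a compact convex outer bound REL(G; φ) can be slotted into the same template.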
30 First, since B is defined by the maximum of a collection of functions linear in θ, it is convex [1]. [sent-59, score-0.164]
31 Moreover, by the strict convexity of B ∗ and compactness of REL(G; φ), the optimum is uniquely attained at some τ (θ). [sent-60, score-0.084]
32 Since τ (θ) has a natural interpretation as a pseudomarginal, this last property of B is analogous to the well-known cumulant generating property of A—namely, ∇A(θ) = µ(θ). [sent-62, score-0.236]
33 One example of such a convex surrogate is the tree-reweighted Bethe free energy considered in our previous work [11]. [sent-63, score-0.501]
34 For this surrogate, the relaxed constraint set REL(G; φ) takes the form LOCAL(G; φ) := { τ ∈ Rd | τ ≥ 0, Σ_{xs} τs (xs ) = 1, Σ_{xt} τst (xs , xt ) = τs (xs ) }, whereas the entropy approximation B ∗ is of the “convexified” Bethe form −B ∗ (τ ) = Σ_{s∈V} Hs (τs ) − Σ_{(s,t)∈E} ρst Ist (τst ). (4) [sent-64, score-0.682]
35 Here Hs and Ist are the singleton entropy and edge-based mutual information, respectively, and the weights ρst are derived from the graph structure so as to ensure convexity (see [11] for more details). [sent-65, score-0.177]
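To make the objective concrete, here is a small evaluation routine for the convexified Bethe form above (my own helper code and a toy single-edge example, not code from [11]; it simply computes θT τ + Σs Hs(τs) − Σ(s,t) ρst Ist(τst) for given locally consistent pseudomarginals).

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_info(tau_st):
    """Edge-based mutual information I_st from a pairwise table tau_st(x_s, x_t)."""
    tau_st = np.asarray(tau_st, dtype=float)
    tau_s, tau_t = tau_st.sum(axis=1), tau_st.sum(axis=0)
    return entropy(tau_s) + entropy(tau_t) - entropy(tau_st.ravel())

def trw_bethe_objective(theta_s, theta_st, tau_s, tau_st, rho):
    """theta_s[s], tau_s[s]: length-m vectors; theta_st[e], tau_st[e]: m x m tables;
    rho[e]: edge appearance weights."""
    linear = sum(np.dot(theta_s[s], tau_s[s]) for s in tau_s)
    linear += sum(np.sum(theta_st[e] * tau_st[e]) for e in tau_st)
    neg_b_star = sum(entropy(tau_s[s]) for s in tau_s)
    neg_b_star -= sum(rho[e] * mutual_info(tau_st[e]) for e in tau_st)
    return linear + neg_b_star

# Toy usage: one edge (0, 1), binary variables, locally consistent pseudomarginals.
tau_s = {0: np.array([0.6, 0.4]), 1: np.array([0.3, 0.7])}
tau_st = {(0, 1): np.array([[0.25, 0.35], [0.05, 0.35]])}
theta_s = {0: np.zeros(2), 1: np.zeros(2)}
theta_st = {(0, 1): 0.5 * np.eye(2)}
print(trw_bethe_objective(theta_s, theta_st, tau_s, tau_st, {(0, 1): 0.5}))
```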
36 Analogous convex variational formulations underlie the reweighted generalized BP algorithm [13], as well as a log-determinant relaxation [12]. [sent-66, score-0.493]
37 Approximate parameter estimation using surrogate likelihoods: Consider the problem of estimating the parameter θ using i.i.d. samples. [sent-67, score-0.522]
38 For an MRF of the form (1), the maximum likelihood estimate (MLE) is specified using the vector µ of empirical marginal distributions (singleton µs and pairwise µst ). [sent-74, score-0.086]
39 For the tree-reweighted Bethe surrogate, we have shown in previous work [10] that in the absence of regularization, the optimal parameter estimates θn have a very simple closed-form solution, specified in terms of the weights ρst and the empirical marginals µ. [sent-79, score-0.153]
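The closed form itself does not survive in this extract; the sketch below follows the general shape of pseudo-moment matching (cf. the title of reference [10] listed near the end) and is an assumption on my part — the precise expression used in [10] may differ.

```python
import numpy as np

def trw_closed_form_estimate(mu_s, mu_st, rho, eps=1e-12):
    """Hypothetical pseudo-moment-matching style estimate from empirical marginals:
    theta_s(x_s)       = log mu_s(x_s),
    theta_st(x_s, x_t) = rho_st * log[ mu_st(x_s, x_t) / (mu_s(x_s) mu_t(x_t)) ]."""
    theta_s = {s: np.log(np.maximum(m, eps)) for s, m in mu_s.items()}
    theta_st = {}
    for (s, t), joint in mu_st.items():
        independent = np.outer(mu_s[s], mu_s[t])
        theta_st[(s, t)] = rho[(s, t)] * np.log(np.maximum(joint, eps) /
                                                np.maximum(independent, eps))
    return theta_s, theta_st
```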
40 Joint estimation/prediction: Using such an estimator, we now consider the joint approach to estimation and prediction illustrated in Figure 2. [sent-81, score-0.059]
41 Given i.i.d. samples, we first use the surrogate likelihood (5) to construct a parameter estimate θn . [sent-85, score-0.427]
42 Given a new noisy or incomplete observation y, we wish to perform near-optimal prediction or data fusion using the fitted model (e.g., smoothing, denoising, or interpolation). [sent-86, score-0.189]
43 In order to do so, we first incorporate the new observation into the model, and then use the message-passing algorithm associated with the convex surrogate B in order to compute approximate pseudomarginals τ . [sent-89, score-0.699]
44 These pseudomarginals can then be used to construct a prediction z(y; τ ), where the specifics of the prediction depend on the observation model. [sent-90, score-0.427]
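Schematically, the joint procedure has the shape sketched below (scaffolding of my own; the four callables are placeholders for the surrogate-likelihood fit, the incorporation of the new observation into the node potentials, the message-passing solver associated with B, and the observation-model-specific predictor — none of these names come from the paper).

```python
def joint_estimate_and_predict(samples, y,
                               fit_surrogate_params,
                               incorporate_observation,
                               run_message_passing,
                               predict_from_pseudomarginals):
    # 1. Fit parameters by maximizing the surrogate likelihood on the samples.
    theta_hat = fit_surrogate_params(samples)
    # 2. Fold the new noisy observation y into the node potentials,
    #    e.g. theta_s(. ; y_s) = theta_hat_s(.) + log p(y_s | .).
    theta_aug = incorporate_observation(theta_hat, y)
    # 3. Run the message-passing algorithm associated with the *same* convex
    #    surrogate B to obtain pseudomarginals tau.
    tau = run_message_passing(theta_aug)
    # 4. Map pseudomarginals (and y) to a prediction z(y; tau).
    return predict_from_pseudomarginals(tau, y)
```

The key design choice, emphasized below, is that the same convex surrogate B drives both the fitting step and the prediction step.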
45 4 Analysis Asymptotics of estimator: We begin by considering the asymptotic behavior of the parameter estimator θn defined by the surrogate likelihood (5). [sent-92, score-0.475]
46 Since this parameter estimator is a particular type of M -estimator, the following result follows from standard techniques [8]: Proposition 1. [sent-93, score-0.127]
47 For a general graph with cycles, θn converges in probability to some fixed θ ≠ θ∗ ; moreover, √n [θn − θ] is asymptotically normal. [sent-94, score-0.069]
48 A key property of the estimator is its inconsistency—i.e., the limiting value θ differs from the true parameter θ∗ in general. [sent-95, score-0.027]
49 Algorithmic stability: A desirable property of any algorithm—particularly one applied to statistical data—is that it exhibit an appropriate form of stability with respect to its inputs. [sent-99, score-0.088]
50 Previous experimental work has shown that methods based on convex relaxations, including reweighted belief propagation [10], Generic algorithm for joint parameter estimation and prediction: 1. [sent-102, score-0.718]
51 Given a new set of observations y, incorporate them into the model: θs ( · ; ys ) = θs ( · ) + log p(ys | · ). [sent-108, score-0.123]
52 Compute approximate marginals τ by using the message-passing algorithm associated with the convex surrogate B. [sent-110, score-0.627]
53 Use approximate marginals to construct a prediction z(y; τ ) of z based on the observation y and pseudomarginals τ . [sent-111, score-0.409]
54 Both the learning and prediction steps are approximate, but the key is that they are both based on the same underlying convex surrogate B. [sent-114, score-0.645]
55 Such a construction yields a provably beneficial cancellation of the two sources of error (learning and prediction). [sent-115, score-0.067]
56 reweighted generalized BP [13], and log-determinant relaxations [12] appear to be very stable. [sent-116, score-0.284]
57 Here we provide theoretical support for these empirical observations: in particular, we prove that, in sharp contrast to non-convex methods, any variational method based on a strongly concave entropy approximation is globally stable. [sent-117, score-0.461]
58 A function f : Rn → R is strongly convex if there exists a constant c > 0 such that f (y) ≥ f (x) + ∇f (x)T (y − x) + (c/2) ‖y − x‖² for all x, y ∈ Rn . [sent-118, score-0.203]
59 Consider any variational method based on a strongly concave entropy approximation −B ∗ ; moreover, for any parameter θ ∈ Rd , let τ (θ) denote the associated set of pseudomarginals. [sent-121, score-0.362]
60 By our construction of the convex surrogate B, we have τ (θ) = ∇B(θ), so that the statement is equivalent to the assertion that the gradient ∇B is a Lipschitz function. [sent-124, score-0.501]
61 Finally, our assumption of strong convexity of B ∗ yields that the spectral norm of ∇2 B ∗ (τ ) is uniformly bounded away from zero, which yields the claim. [sent-129, score-0.089]
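For completeness, the standard convex-duality steps behind this claim, written out here (the paper's own argument may be organized differently): with τ(θ) = argmax_{τ ∈ REL(G;φ)} { θT τ − B∗(τ) }, first-order optimality at τ(θ) and τ(θ′) gives

(θ − ∇B∗(τ(θ)))T (τ(θ′) − τ(θ)) ≤ 0   and   (θ′ − ∇B∗(τ(θ′)))T (τ(θ) − τ(θ′)) ≤ 0.

Adding the two inequalities and using strong convexity of B∗ with constant c,

c ‖τ(θ) − τ(θ′)‖² ≤ (∇B∗(τ(θ)) − ∇B∗(τ(θ′)))T (τ(θ) − τ(θ′)) ≤ (θ − θ′)T (τ(θ) − τ(θ′)) ≤ ‖θ − θ′‖ ‖τ(θ) − τ(θ′)‖,

so ‖τ(θ) − τ(θ′)‖ ≤ (1/c) ‖θ − θ′‖; that is, τ(·) = ∇B(·) is globally Lipschitz with constant 1/c.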
62 Many existing entropy approximations, including the convexified Bethe entropy (4), can be shown to be strongly concave [9]. [sent-130, score-0.193]
63 5 Bounds on performance loss We now turn to theoretical analysis of the joint method for parameter estimation and prediction illustrated in Figure 2. [sent-131, score-0.392]
64 Note that given our setting of limited computation, the Bayes optimum is unattainable for two reasons: (a) it has knowledge of the exact parameter value θ∗ ; and (b) the prediction step (7) involves computing exact marginal probabilities µ. [sent-132, score-0.389]
65 Therefore, our ultimate goal is to bound the performance loss of our method relative to the unachievable Bayes optimum. [sent-133, score-0.135]
66 So as to obtain a concrete result, we focus on the special case of joint learning/prediction for a mixture-of-Gaussians; however, the ideas and techniques described here are more generally applicable. [sent-134, score-0.068]
67 Prediction for mixture of Gaussians: Suppose that the discrete random vector is a label vector for the components in a finite mixture of Gaussians: i.e. [sent-135, score-0.206]
68 for each s ∈ V , the random variable Zs is specified by p(Zs = zs | Xs = j; θ∗ ) ∼ N (νj , σj² ), for j ∈ {0, 1, ..., m − 1}. [sent-137, score-0.165]
69 Suppose that we observe a noise-corrupted version of Zs —namely Ys = αZs + √(1 − α²) Ws , where Ws ∼ N (0, 1) is additive Gaussian noise, and the parameter α ∈ [0, 1] specifies the signal-to-noise ratio (SNR) of the observation model. [sent-142, score-0.104]
70 For this set-up, the approximate predictor zs (y; τ ) defined by our joint procedure in Figure 2 corresponds to replacing the exact marginals µ with the pseudomarginals τs (j; θ) obtained by solving the variational problem with θ. [sent-145, score-0.729]
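A node-wise sketch of such a predictor (my reconstruction of the structure just described, with illustrative names; the paper's exact estimator is its equation (7), which is not reproduced in this extract): each component j contributes its Gaussian conditional mean E[Zs | Ys = y, Xs = j], and these are mixed with weights given by the (pseudo)marginal over the label.

```python
import numpy as np

def node_predictor(y_s, tau_s, nu, sigma2, alpha):
    """tau_s[j]: (pseudo)marginal of label j at node s after incorporating y.
    Component model:  Z_s | X_s = j ~ N(nu[j], sigma2[j]),
                      Y_s = alpha * Z_s + sqrt(1 - alpha**2) * W_s,  W_s ~ N(0, 1),
    so the per-component BLSE is the usual Gaussian conditional mean."""
    tau_s = np.asarray(tau_s, dtype=float)
    nu = np.asarray(nu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    gain = alpha * sigma2 / (alpha**2 * sigma2 + (1.0 - alpha**2))
    blse_per_component = nu + gain * (y_s - alpha * nu)
    return float(np.dot(tau_s, blse_per_component))

# Example: two components, a moderately confident pseudomarginal at one node.
print(node_predictor(y_s=0.8, tau_s=[0.2, 0.8], nu=[-1.0, 1.0],
                     sigma2=[0.5, 0.5], alpha=0.7))
```

Note the sanity checks: with alpha = 1 the per-component estimate reduces to y_s itself, and with alpha = 0 it falls back to the component mean nu[j].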
71 Bounds on performance loss: We now turn to a comparison of the mean-squared error (MSE) of the Bayes optimal predictor z(Y ; µ) to the MSE of the surrogate-based predictor z(Y ; τ ). [sent-146, score-0.161]
72 More specifically, we provide an upper bound on the increase in MSE, where the bound is specified in terms of the coupling strength and the SNR parameter α. [sent-147, score-0.374]
73 Consequently, the MLE converges to the true parameter value θ∗ , whereas Proposition 1 guarantees that our approximate parameter estimate θn converges to the fixed quantity θ. [sent-149, score-0.177]
74 Let MSE(τ ) and MSE(µ) denote the mean-squared prediction errors of the surrogate-based predictor z(y; τ ), and the Bayes optimal estimate z(y; µ) respectively. [sent-156, score-0.238]
75 [Figure 3, panels (a)–(f): performance-loss surfaces for the IND, BP, and TRW methods; axes are observation SNR and edge (coupling) strength.]
79 Surface plots of the percentage increase in MSE relative to Bayes optimum for different methods as a function of observation SNR and coupling strength. [sent-171, score-0.248]
80 Top row: Gaussian mixture with components (ν0 , σ0² ) = (−1, 0. [sent-172, score-0.122]
81 Bottom row: Gaussian mixture with components (ν0 , σ0² ) = (0, 1) and (ν1 , σ1² ) = (0, 9). [sent-175, score-0.122]
82 It can be seen that I(α) → 0 as α → 0+ and as α → 1− , so that the surrogate-based method is asymptotically optimal for both low and high SNR. [sent-179, score-0.064]
83 Experimental results: In order to test our joint estimation/prediction procedure, we have applied it to coupled Gaussian mixture models on different graphs, coupling strengths, observation SNRs, and mixture distributions. [sent-181, score-0.445]
84 Although our methods are more generally applicable, here we show representative results for m = 2 components, and two different mixture types. [sent-182, score-0.084]
85 In both cases, each mixture component is equally weighted. [sent-187, score-0.084]
86 Since the mixture variables have m = 2 states, the coupling distribution can be written as p(x; θ) ∝ exp{ Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt }, [sent-189, score-1.201]
87 where x ∈ {−1, +1}^N are spin variables indexing the mixture components. [sent-190, score-0.084]
88 For each coupling strength γ ∈ [0, 1], we chose edge parameters as θst ∼ U[0, γ], and we varied the SNR parameter α controlling the observation model in [0, 1]. [sent-192, score-0.364]
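The sweep itself can be set up as below (a sketch under my own assumptions about the grid of values and the toy graph; the extract only states θst ∼ U[0, γ] and α ∈ [0, 1]).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_edge_params(edges, gamma):
    """Draw coupling parameters theta_st ~ U[0, gamma] for each edge."""
    return {e: rng.uniform(0.0, gamma) for e in edges}

# Toy 4-node cycle; sweep coupling strength gamma and observation SNR alpha.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
for gamma in np.linspace(0.0, 1.0, 5):
    theta_st = sample_edge_params(edges, gamma)
    for alpha in np.linspace(0.0, 1.0, 5):
        # Here one would fit parameters with the chosen surrogate, run the
        # corresponding message-passing algorithm on noisy observations, and
        # record the increase in MSE relative to the Bayes optimum.
        pass
```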
89 For the independence (IND) model, the prediction step reduces to performing BLSE (Bayes least-squares estimation) at each node independently. [sent-194, score-0.144]
90 (b) The standard belief propagation (BP) approach is based on estimating parameters (see step (1) of Figure 2) using ρst = 1 for all edges (s, t), and using BP to compute the pseudomarginals. [sent-195, score-0.238]
91 (c) The tree-reweighted method (TRW) is based on estimating parameters using the tree-reweighted surrogate [10] with weights ρst = 1/2 for all edges (s, t), and using the TRW sum-product algorithm to compute the pseudomarginals. [sent-196, score-0.337]
92 For weak coupling (γ ≈ 0), all three methods— including the independence model—perform quite well, as should be expected given the weak dependency. [sent-198, score-0.22]
93 As the coupling is increased, the BP method eventually deteriorates quite seriously; indeed, for large enough coupling and low/intermediate SNR, its performance can be worse than the independence model. [sent-200, score-0.287]
94 6 Conclusion We have described and analyzed joint methods for parameter estimation and prediction/smoothing using variational methods that are based on convex surrogates to the cumulant generating function. [sent-203, score-0.751]
95 Our results—both theoretical and experimental—confirm the intuition that in the computation-limited setting, in which errors arise from approximations made both during parameter estimation and subsequent prediction, it is provably beneficial to use an inconsistent parameter estimator. [sent-204, score-0.334]
96 Our experimental results on the coupled mixture of Gaussian model confirm the theory: the tree-reweighted sum-product algorithm yields prediction results close to the Bayes optimum, and substantially outperforms an analogous but heuristic method based on standard belief propagation. [sent-205, score-0.47]
97 Loopy belief propagation: Convergence and effects of message errors. [sent-222, score-0.117]
98 Joint estimation and prediction in Markov random fields: Benefits of inconsistency in the computation-limited regime. [sent-258, score-0.273]
99 Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudomoment matching. [sent-267, score-0.364]
100 Constructing free energy approximations and generalized belief propagation algorithms. [sent-297, score-0.32]
wordName wordTfidf (topN-words)
[('xs', 0.471), ('surrogate', 0.337), ('mse', 0.198), ('trw', 0.185), ('variational', 0.168), ('snr', 0.166), ('zs', 0.165), ('convex', 0.164), ('st', 0.145), ('prediction', 0.144), ('marg', 0.142), ('bp', 0.138), ('relaxations', 0.123), ('ys', 0.123), ('reweighted', 0.122), ('coupling', 0.121), ('propagation', 0.121), ('belief', 0.117), ('route', 0.105), ('cumulant', 0.105), ('rd', 0.103), ('pseudomarginals', 0.094), ('strength', 0.093), ('mixture', 0.084), ('rel', 0.075), ('lipschitz', 0.075), ('blse', 0.071), ('surrogates', 0.071), ('bethe', 0.07), ('bayes', 0.068), ('estimator', 0.068), ('joint', 0.068), ('estimation', 0.067), ('predictor', 0.067), ('marginals', 0.067), ('inconsistent', 0.066), ('inconsistency', 0.062), ('stability', 0.061), ('wainwright', 0.06), ('parameter', 0.059), ('approximate', 0.059), ('entropy', 0.058), ('ind', 0.056), ('marginal', 0.055), ('xt', 0.054), ('loss', 0.054), ('mrfs', 0.052), ('singleton', 0.052), ('optimum', 0.049), ('generating', 0.049), ('asymptotic', 0.048), ('unachievable', 0.047), ('edge', 0.046), ('relaxed', 0.045), ('independence', 0.045), ('january', 0.045), ('observation', 0.045), ('approximations', 0.043), ('markov', 0.043), ('coupled', 0.043), ('bene', 0.042), ('lb', 0.041), ('exact', 0.041), ('provably', 0.04), ('generalized', 0.039), ('cial', 0.039), ('proposition', 0.039), ('strongly', 0.039), ('concave', 0.038), ('components', 0.038), ('mle', 0.037), ('intractable', 0.037), ('asymptotically', 0.037), ('interpolation', 0.036), ('wrong', 0.036), ('ws', 0.035), ('convexity', 0.035), ('tted', 0.034), ('bound', 0.034), ('computationally', 0.033), ('differentiable', 0.033), ('hs', 0.033), ('increase', 0.033), ('berkeley', 0.032), ('globally', 0.032), ('graph', 0.032), ('smoothing', 0.031), ('likelihood', 0.031), ('column', 0.031), ('elds', 0.029), ('mrf', 0.029), ('gaussians', 0.029), ('analogous', 0.028), ('loopy', 0.028), ('yields', 0.027), ('property', 0.027), ('optimal', 0.027), ('weak', 0.027), ('establish', 0.027), ('heuristic', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1
2 0.26329279 58 nips-2005-Divergences, surrogate loss functions and experimental design
Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan
Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1
3 0.20152898 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
4 0.1575397 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
Author: Nebojsa Jojic, Vladimir Jojic, Christopher Meek, David Heckerman, Brendan J. Frey
Abstract: We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about Tcell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties. 1
5 0.14768055 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
Author: Eizaburo Doi, Doru C. Balcan, Michael S. Lewicki
Abstract: Biological sensory systems are faced with the problem of encoding a high-fidelity sensory signal with a population of noisy, low-fidelity neurons. This problem can be expressed in information theoretic terms as coding and transmitting a multi-dimensional, analog signal over a set of noisy channels. Previously, we have shown that robust, overcomplete codes can be learned by minimizing the reconstruction error with a constraint on the channel capacity. Here, we present a theoretical analysis that characterizes the optimal linear coder and decoder for one- and twodimensional data. The analysis allows for an arbitrary number of coding units, thus including both under- and over-complete representations, and provides a number of important insights into optimal coding strategies. In particular, we show how the form of the code adapts to the number of coding units and to different data and noise conditions to achieve robustness. We also report numerical solutions for robust coding of highdimensional image data and show that these codes are substantially more robust compared against other image codes such as ICA and wavelets. 1
6 0.12024548 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
7 0.11698191 47 nips-2005-Consistency of one-class SVM and related algorithms
8 0.11528821 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
9 0.1088976 50 nips-2005-Convex Neural Networks
10 0.10370304 46 nips-2005-Consensus Propagation
11 0.090956159 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes
12 0.087896042 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
13 0.081392974 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
14 0.079900309 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
15 0.076708838 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
16 0.075514503 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
17 0.070406683 52 nips-2005-Correlated Topic Models
18 0.064938791 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
19 0.064079553 184 nips-2005-Structured Prediction via the Extragradient Method
20 0.062663414 111 nips-2005-Learning Influence among Interacting Markov Chains
topicId topicWeight
[(0, 0.241), (1, 0.095), (2, 0.002), (3, -0.053), (4, 0.142), (5, -0.103), (6, -0.26), (7, -0.041), (8, -0.08), (9, -0.159), (10, 0.058), (11, -0.081), (12, -0.153), (13, 0.016), (14, 0.079), (15, -0.189), (16, 0.033), (17, 0.067), (18, -0.07), (19, -0.082), (20, 0.071), (21, -0.04), (22, -0.051), (23, 0.028), (24, -0.088), (25, -0.065), (26, -0.071), (27, -0.057), (28, 0.057), (29, -0.036), (30, -0.13), (31, 0.018), (32, -0.06), (33, 0.053), (34, 0.021), (35, -0.082), (36, 0.151), (37, 0.024), (38, -0.151), (39, -0.057), (40, -0.028), (41, -0.08), (42, -0.111), (43, -0.089), (44, 0.028), (45, -0.205), (46, 0.013), (47, 0.074), (48, -0.126), (49, 0.106)]
simIndex simValue paperId paperTitle
same-paper 1 0.94165945 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1
2 0.73983163 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
Author: Nebojsa Jojic, Vladimir Jojic, Christopher Meek, David Heckerman, Brendan J. Frey
Abstract: We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about Tcell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties. 1
3 0.64661425 58 nips-2005-Divergences, surrogate loss functions and experimental design
Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan
Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1
4 0.5613054 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
5 0.46261767 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
Author: Eizaburo Doi, Doru C. Balcan, Michael S. Lewicki
Abstract: Biological sensory systems are faced with the problem of encoding a high-fidelity sensory signal with a population of noisy, low-fidelity neurons. This problem can be expressed in information theoretic terms as coding and transmitting a multi-dimensional, analog signal over a set of noisy channels. Previously, we have shown that robust, overcomplete codes can be learned by minimizing the reconstruction error with a constraint on the channel capacity. Here, we present a theoretical analysis that characterizes the optimal linear coder and decoder for one- and twodimensional data. The analysis allows for an arbitrary number of coding units, thus including both under- and over-complete representations, and provides a number of important insights into optimal coding strategies. In particular, we show how the form of the code adapts to the number of coding units and to different data and noise conditions to achieve robustness. We also report numerical solutions for robust coding of highdimensional image data and show that these codes are substantially more robust compared against other image codes such as ICA and wavelets. 1
6 0.43047187 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes
7 0.42040563 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
8 0.39364707 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
9 0.37389672 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
10 0.36547154 46 nips-2005-Consensus Propagation
11 0.36526424 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
12 0.35717377 47 nips-2005-Consistency of one-class SVM and related algorithms
13 0.34254462 127 nips-2005-Mixture Modeling by Affinity Propagation
14 0.33318865 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
15 0.31792605 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
16 0.31785411 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
17 0.31244415 50 nips-2005-Convex Neural Networks
18 0.31114686 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
19 0.29800296 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
20 0.28641978 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
topicId topicWeight
[(3, 0.042), (10, 0.034), (11, 0.011), (18, 0.014), (27, 0.025), (31, 0.492), (34, 0.084), (55, 0.03), (69, 0.026), (73, 0.024), (88, 0.069), (91, 0.065)]
simIndex simValue paperId paperTitle
1 0.97903144 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks
Author: Viren Jain, Valentin Zhigulin, H. S. Seung
Abstract: There is little consensus about the computational function of top-down synaptic connections in the visual system. Here we explore the hypothesis that top-down connections, like bottom-up connections, reflect partwhole relationships. We analyze a recurrent network with bidirectional synaptic interactions between a layer of neurons representing parts and a layer of neurons representing wholes. Within each layer, there is lateral inhibition. When the network detects a whole, it can rigorously enforce part-whole relationships by ignoring parts that do not belong. The network can complete the whole by filling in missing parts. The network can refuse to recognize a whole, if the activated parts do not conform to a stored part-whole relationship. Parameter regimes in which these behaviors happen are identified using the theory of permitted and forbidden sets [3, 4]. The network behaviors are illustrated by recreating Rumelhart and McClelland’s “interactive activation” model [7]. In neural network models of visual object recognition [2, 6, 8], patterns of synaptic connectivity often reflect part-whole relationships between the features that are represented by neurons. For example, the connections of Figure 1 reflect the fact that feature B both contains simpler features A1, A2, and A3, and is contained in more complex features C1, C2, and C3. Such connectivity allows neurons to follow the rule that existence of the part is evidence for existence of the whole. By combining synaptic input from multiple sources of evidence for a feature, a neuron can “decide” whether that feature is present. 1 The synapses shown in Figure 1 are purely bottom-up, directed from simple to complex features. However, there are also top-down connections in the visual system, and there is little consensus about their function. One possibility is that top-down connections also reflect part-whole relationships. They allow feature detectors to make decisions using the rule that existence of the whole is evidence for existence of its parts. In this paper, we analyze the dynamics of a recurrent network in which part-whole relationships are stored as bidirectional synaptic interactions, rather than the unidirectional interactions of Figure 1. The network has a number of interesting computational capabilities. When the network detects a whole, it can rigorously enforce part-whole relationships 1 Synaptic connectivity may reflect other relationships besides part-whole. For example, invariances can be implemented by connecting detectors of several instances of the same feature to the same target, which is consequently an invariant detector of the feature. C1 C2 C3 B A1 A2 A3 Figure 1: The synaptic connections (arrows) of neuron B represent part-whole relationships. Feature B both contains simpler features and is contained in more complex features. The synaptic interactions are drawn one-way, as in most models of visual object recognition. Existence of the part is regarded as evidence for existence of the whole. This paper makes the interactions bidirectional, allowing the existence of the whole to be evidence for the existence of its parts. by ignoring parts that do not belong. The network can complete the whole by filling in missing parts. The network can refuse to recognize a whole, if the activated parts do not conform to a stored part-whole relationship. Parameter regimes in which these behaviors happen are identified using the recently developed theory of permitted and forbidden sets [3, 4]. 
Our model is closely related to the interactive activation model of word recognition, which was proposed by McClelland and Rumelhart to explain the word superiority effect studied by visual psychologists [7]. Here our concern is not to model a psychological effect, but to characterize mathematically how computations involving part-whole relationships can be carried out by a recurrent network. 1 Network model Suppose that we are given a set of part-whole relationships specified by a ξi = 1, if part i is contained in whole a 0, otherwise We assume that every whole contains at least one part, and every part is contained in at least one whole. The stimulus drives a layer of neurons that detect parts. These neurons also interact with a layer of neurons that detect wholes. We will refer to part-detectors as “P-neurons” and whole-detectors as “W-neurons.” The part-whole relationships are directly stored in the synaptic connections between P and a W neurons. If ξi = 1, the ith neuron in the P layer and the ath neuron in the W layer have a an excitatory interaction of strength γ. If ξi = 0, the neurons have an inhibitory interaction of strength σ. Furthermore, the P-neurons inhibit each other with strength β, and the Wneurons inhibit each other with strength α. All of these interactions are symmetric, and all activation functions are the rectification nonlinearity [z]+ = max{z, 0}. Then the dynamics of the network takes the form ˙ Wa + Wa a Pi ξ i − σ = γ i + a (1 − ξi )Pi − α i Wb , + ˙ Pi + Pi (1) b=a a Wa ξ i − σ = γ a a (1 − ξi )Wa − β a Pj + B i . j=i (2) where Bi is the input to the P layer from the stimulus. Figure 2 shows an example of a network with two wholes. Each whole contains two parts. One of the parts is contained in both wholes. -α Wa excitation γ -σ inhibition P1 B1 -β } W layer Wb -σ P2 -β B2 P3 } P layer B3 Figure 2: Model in example configuration: ξ = {(1, 1, 0), (0, 1, 1)}. When a stimulus is presented, it activates some of the P-neurons, which activate some of the W-neurons. The network eventually converges to a stable steady state. We will assume that α > 1. In the Appendix, we prove that this leads to unconditional winner-take-all behavior in the W layer. In other words, no more than one W-neuron can be active at a stable steady state. If a single W-neuron is active, then a whole has been detected. Potentially there are also many P-neurons active, indicating detection of parts. This representation may have different properties, depending on the choice of parameters β, γ, and σ. As discussed below, these include rigorous enforcement of part-whole relationships, completion of wholes by “filling in” missing parts, and non-recognition of parts that do not conform to a whole. 2 Enforcement of part-whole relationships Suppose that a single W-neuron is active at a stable steady state, so that a whole has been detected. Part-whole relationships are said to be enforced if the network always ignores parts that are not contained in the detected whole, despite potentially strong bottom-up evidence for them. It can be shown that enforcement follows from the inequality σ 2 + β 2 + γ 2 + 2σβγ > 1. (3) which guarantees that neuron i in the P layer is inactive, if neuron a in the W layer is a active and ξi = 0. When part-whole relations are enforced, prior knowledge about legal combinations of parts strictly constrains what may be perceived. This result is proven in the Appendix, and only an intuitive explanation is given here. 
Enforcement is easiest to understand when there is interlayer inhibition (σ > 0). In this case, the active W-neuron directly inhibits the forbidden P-neurons. The case of σ = 0 is more subtle. Then enforcement is mediated by lateral inhibition in the P layer. Excitatory feedback from the W-neuron has the effect of counteracting the lateral inhibition between the P-neurons that belong to the whole. As a result, these P-neurons become strongly activated enough to inhibit the rest of the P layer. 3 Completion of wholes by filling in missing parts If a W-neuron is active, it excites the P-neurons that belong to the whole. As a result, even if one of these P-neurons receives no bottom-up input (Bi = 0), it is still active. We call this phenomenon “completion,” and it is guaranteed to happen when (4) γ> β The network may thus “imagine” parts that are consistent with the recognized whole, but are not actually present in the stimulus. As with enforcement, this condition depends on top-down connections. √ In the special case γ = β, the interlayer excitation between a W-neuron and its P-neurons exactly cancels out the lateral inhibition between the P-neurons at a steady state. So the recurrent connections effectively vanish, letting the activity of the P-neurons be determined by their feedforward inputs. When the interlayer excitation is stronger than this, the inequality (4) holds, and completion occurs. 4 Non-recognition of a whole If there is no interlayer inhibition (σ = 0), then a single W-neuron is always active, assuming that there is some activity in the P layer. To see this, suppose for the sake of contradiction that all the W-neurons are inactive. Then they receive no inhibition to counteract the excitation from the P layer. This means some of them must be active, which contradicts our assumption. This means that the network always recognizes a whole, even if the stimulus is very different from any part-whole combination that is stored in the network. However, if interlayer inhibition is sufficiently strong (large σ), the network may refuse to recognize a whole. Neurons in the P layer are activated, but there is no activity in the W layer. Formal conditions on σ can be derived, but are not given here because of space limitations. In case of non-recognition, constraints on the P-layer are not enforced. It is possible for the network to detect a configuration of parts that is not consistent with any stored whole. 5 Example: Interactive Activation model To illustrate the computational capabilities of our network, we use it to recreate the interactive activation (IA) model of McClelland and Rumelhart. Figure 3 shows numerical simulations of a network containing three layers of neurons representing strokes, letters, and words, respectively. There are 16 possible strokes in each of four letter positions. For each stroke, there are two neurons, one signaling the presence of the stroke and the other signaling its absence. Letter neurons represent each letter of the alphabet in each of four positions. Word neurons represent each of 1200 common four letter words. The letter and word layers correspond to the P and W layers that were introduced previously. There are bidirectional interactions between the letter and word layers, and lateral inhibition within the layers. The letter neurons also receive input from the stroke neurons, but this interaction is unidirectional. Our network differs in two ways from the original IA model. First, all interactions involving letter and word neurons are symmetric. 
In the original model, the interactions between the letter and word layers were asymmetric. In particular, inhibitory connections only ran from letter neurons to word neurons, and not vice versa. Second, the only nonlinearity in our model is rectification. These two aspects allow us to apply the full machinery of the theory of permitted and forbidden sets. Figure 3 shows the result of presenting the stimulus “MO M” for four different settings of parameters. In each of the four cases, the word layer of the network converges to the same result, detecting the word “MOON”, which is the closest stored word to the stimulus. However, the activity in the letter layer is different in the four cases. input: P layer reconstruction W layer P layer reconstruction W layer completion noncompletion enforcement non-enforcement Figure 3: Simulation of 4 different parameter regimes in a letter- word recognition network. Within each panel, the middle column presents a feature- layer reconstruction based on the letter activity shown in the left column. W layer activity is shown in the right column. The top row shows the network state after 10 iterations of the dynamics. The bottom row shows the steady state. In the left column, the parameters obey the inequality (3), so that part- whole relationships are enforced. The activity of the letter layer is visualized by activating the strokes corresponding to each active letter neuron. The activated letters are part of the word “MOON”. In the top left, the inequality (4) is satisfied, so that the missing “O” in the stimulus is filled in. In the bottom left, completion does not occur. In the simulations of the right column, parameters are such that part- whole relationships are not enforced. Consequently, the word layer is much more active. Bottom- up input provides evidence for several other letters, which is not suppressed. In the top right, the inequality (4) is satisfied, so that the missing “O” in the stimulus is filled in. In the bottom right, the “O” neuron is not activated in the third position, so there is no completion. However, some letter neurons for the third position are activated, due to the input from neurons that indicate the absence of strokes. input: non-recognition event multi-stability Figure 4: Simulation of a non- recognition event and example of multistability. Figure 4 shows simulations for large σ, deep in the enforcement regime where non- recognition is a possibility. From one initial condition, the network converges to a state in which no W neurons are active, a non- recognition. From another initial condition, the network detects the word “NORM”. Deep in the enforcement regime, the top- down feedback can be so strong that the network has multiple stable states, many of which bear little resemblance to the stimulus at all. This is a problematic aspect of this network. It can be prevented by setting parameters at the edge of the enforcement regime. 6 Discussion We have analyzed a recurrent network that performs computations involving part- whole relationships. The network can fill in missing parts and suppress parts that do not belong. These two computations are distinct and can be dissociated from each other, as shown in Figure 3. While these two computations can also be performed by associative memory models, they are not typically dissociable in these models. For example, in the Hopfield model pattern completion and noise suppression are both the result of recall of one of a finite number of stereotyped activity patterns. 
We believe that our model is more appropriate for perceptual systems, because its behavior is piecewise linear, due its reliance on rectification nonlinearity. Therefore, analog aspects of computation are able to coexist with the part-whole relationships. Furthermore, in our model the stimulus is encoded in maintained synaptic input to the network, rather than as an initial condition of the dynamics. A Appendix: Permitted and forbidden sets Our mathematical results depend on the theory of permitted and forbidden sets [3, 4], which is summarized briefly here. The theory is applicable to neural networks with rectification nonlinearity, of the form xi + xi = [bi + j Wij xj ]+ . Neuron i is said to be active when ˙ xi > 0. For a network of N neurons, there are 2N possible sets of active neurons. For each active set, consider the submatrix of Wij corresponding to the synapses between active neurons. If all eigenvalues of this submatrix have real parts less than or equal to unity, then the active set is said to be permitted. Otherwise the active set is said to be forbidden. A set is permitted if and only if there exists an input vector b such that those neurons are active at a stable steady state. Permitted sets can be regarded as memories stored in the synaptic connections Wij . If Wij is a symmetric matrix, the nesting property holds: every subset of a permitted set is permitted, and every superset of a forbidden set is forbidden. The present model can be seen as a general method for storing permitted sets in a recurrent network. This method introduces a neuron for each permitted set, relying on a unary or “grandmother cell” representation. In contrast, Xie et al.[9] used lateral inhibition in a single layer of neurons to store permitted sets. By introducing extra neurons, the present model achieves superior storage capacity, much as unary models of associative memory [1] surpass distributed models [5]. A.1 Unconditional winner-take-all in the W layer The synapses between two W-neurons have strengths 0 −α −α 0 The eigenvalues of this matrix are ±α. Therefore two W-neurons constitute a forbidden set if α > 1. By the nesting property, it follows more than two W-neurons is also a forbidden set, and that the W layer has the unconditional winner-take-all property. A.2 Part-whole combinations as permitted sets Theorem 1. Suppose that β < 1. If γ 2 < β + (1 − β)/k then any combination of k ≥ 1 parts consistent with a whole corresponds to a permitted set. Proof. Consider k parts belonging to a whole. They are represented by one W-neuron and k P-neurons, with synaptic connections given by the (k + 1) × (k + 1) matrix M= −β(11T − I) γ1 , γ1T 0 (5) where 1 is the k- dimensional vector whose elements are all equal to one. Two eigenvectors of M are of the form (1T c), and have the same eigenvalues as the 2 × 2 matrix −β(k − 1) γk γ 0 This matrix has eigenvalues less than one when γ 2 < β + (1 − β)/k and β(k − 1) + 2 > 0. The other k − 1 eigenvectors are of the form (dT , 0), where dT 1 = 0. These have eigenvalues β. Therefore all eigenvalues of W are less than one if the condition of the theorem is satisfied. A.3 Constraints on combining parts Here, we derive conditions under which the network can enforce the constraint that steady state activity be confined to parts that constitute a whole. Theorem 2. Suppose that β > 0 and σ 2 +β 2 +γ 2 +2σβγ > 1 If a W- neuron is active, then only P- neurons corresponding to parts contained in the relevant whole can be active at a stable steady state. Proof. 
Consider P- neurons Pi , Pj , and W- neuron Wa . Supa a pose that ξi = 1 but ξj = 0. As shown in Figure 5, the matrix of connections is given by: W = 0 −β γ −β 0 −σ γ −σ 0 (6) Wa γ Pi -σ -β Pj Figure 5: A set of one W- neuron and two P- neurons is forbidden if one part belongs to the whole and the other does not. This set is permitted if all eigenvalues of W − I have negative real parts. The characteristic equation of I − W is λ3 + b1 λ2 + b2 λ + b3 = 0, where b1 = 3, b2 = 3 − σ 2 − β 2 − γ 2 and b3 = 1−2σβγ−σ 2 −β 2 −γ 2 . According to the Routh- Hurwitz theorem, all the eigenvalues have negative real parts if and only if b1 > 0, b3 > 0 and b1 b2 > b3 . Clearly, the first condition is always satisfied. The second condition is more restrictive than the third. It is satisfied only when σ 2 + β 2 + γ 2 + 2σβγ < 1. Hence, one of the eigenvalues has a positive real part when this condition is broken, i.e., when σ 2 +β 2 +γ 2 +2σβγ > 1. By the nesting property, any larger set of P- neurons inconsistent with the W- neuron is also forbidden. A.4 Completion of wholes √ Theorem 3. If γ > β and a single W- neuron a is active at a steady state, then Pi > 0 a for all i such that ξi = 1. Proof. Suppose that the detected whole has k parts. At the steady state Pi = a ξi Bi − (β − γ 2 )Ptot 1−β + where Ptot = Pi = i 1 1 − β + (β − γ 2 )k k a B i ξi i=1 (7) A.5 Preventing runaway If feedback loops cause the network activity to diverge, then the preceding analyses are not relevant. Here we give a sufficient condition guaranteeing that runaway instability does not happen. It is not a necessary condition. Interestingly, the condition implies the condition of Theorem 1. Theorem 4. Suppose that P and W obey the dynamics of Eqs. (1) and (2), and define the objective function E 1−α 2 = − 2 Wa a α + 2 Wa a 1−β + 2 a Pi Wa ξi + σ Bi Pi − γ i 2 ia Pi2 i 2 β + 2 Pi i a (1 − ξi )Pi Wa . (8) ia Then E is a Lyapunov like function that, given β > γ 2 − dynamics to a stable steady state. 1−γ 2 N −1 , ensures convergence of the Proof. (sketch) Differentiation of E with respect to time shows that that E is nonincreasing in the nonnegative orthant and constant only at steady states of the network dynamics. We must also show that E is radially unbounded, which is true if the quadratic part of E is copositive definite. Note that the last term of E is lower-bounded by zero and the previous term is upper bounded by γ ia Pi Wa . We assume α > 1. Thus, we can use Cauchy’s 2 2 inequality, i Pi2 ≥ ( i Pi ) /N , and the fact that a Wa ≤ ( a Wa )2 for Wa ≥ 0, to derive E≥ 1 2 Wa )2 + ( If β > γ 2 − unbounded. a 1−γ 2 N −1 , 1 − β + βN ( N Pi )2 − 2γ( i Wa a Pi ) i − Bi Pi . (9) i the quadratic form in the inequality is positive definite and E is radially References [1] E. B. Baum, J. Moody, and F. Wilczek. Internal representations for associative memory. Biol. Cybern., 59:217–228, 1988. [2] K. Fukushima. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern, 36(4):193–202, 1980. [3] R.H. Hahnloser, R. Sarpeshkar, M.A. Mahowald, R.J. Douglas, and H.S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947– 51, Jun 22 2000. [4] R.H. Hahnloser, H.S. Seung, and J.-J. Slotine. Permitted and forbidden sets in symmetric threshold-linear networks. Neural Computation, 15:621–638, 2003. [5] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. 
References

[1] E. B. Baum, J. Moody, and F. Wilczek. Internal representations for associative memory. Biological Cybernetics, 59:217–228, 1988.
[2] K. Fukushima. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
[3] R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947–951, 2000.
[4] R. H. Hahnloser, H. S. Seung, and J.-J. Slotine. Permitted and forbidden sets in symmetric threshold-linear networks. Neural Computation, 15:621–638, 2003.
[5] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79(8):2554–2558, 1982.
[6] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.
[7] J. L. McClelland and D. E. Rumelhart. An interactive activation model of context effects in letter perception: Part I. An account of basic findings. Psychological Review, 88(5):375–407, 1981.
[8] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
[9] X. Xie, R. H. Hahnloser, and H. S. Seung. Selectively grouping neurons in recurrent networks of lateral inhibition. Neural Computation, 14:2627–2646, 2002.
2 0.96634984 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation
Author: Dmitry Malioutov, Alan S. Willsky, Jason K. Johnson
Abstract: This paper presents a new framework based on walks in a graph for analysis and inference in Gaussian graphical models. The key idea is to decompose correlations between variables as a sum over all walks between those variables in the graph. The weight of each walk is given by a product of edgewise partial correlations. We provide a walk-sum interpretation of Gaussian belief propagation in trees and of the approximate method of loopy belief propagation in graphs with cycles. This perspective leads to a better understanding of Gaussian belief propagation and of its convergence in loopy graphs. 1
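A small numerical illustration of the walk-sum idea (our own sketch in numpy, not code from that paper; the 3-node edge-weight matrix is an assumption chosen to be walk-summable): with a normalized information matrix J = I − R, the covariance J⁻¹ equals the series Σ_k R^k, and the (i, j) entry of R^k accumulates the weights of all length-k walks between i and j.

import numpy as np

# Partial-correlation (edge-weight) matrix R for a 3-node cycle; J = I - R is the
# normalized information matrix of a Gaussian graphical model.
R = np.array([[0.0, 0.3, 0.2],
              [0.3, 0.0, 0.4],
              [0.2, 0.4, 0.0]])
assert np.max(np.abs(np.linalg.eigvals(np.abs(R)))) < 1           # walk-summable model

J = np.eye(3) - R
walk_sum = sum(np.linalg.matrix_power(R, k) for k in range(200))  # sum over walk lengths
print(np.allclose(walk_sum, np.linalg.inv(J)))                    # True: covariance = walk-sum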
same-paper 3 0.93393546 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1
4 0.9312464 108 nips-2005-Layered Dynamic Textures
Author: Antoni B. Chan, Nuno Vasconcelos
Abstract: A dynamic texture is a video model that treats a video as a sample from a spatio-temporal stochastic process, specifically a linear dynamical system. One problem associated with the dynamic texture is that it cannot model video where there are multiple regions of distinct motion. In this work, we introduce the layered dynamic texture model, which addresses this problem. We also introduce a variant of the model, and present the EM algorithm for learning each of the models. Finally, we demonstrate the efficacy of the proposed model for the tasks of segmentation and synthesis of video.
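For context on the model class (a minimal sketch we add for illustration; the dimensions, noise scales, and numpy usage are assumptions, and the layered variant described in the abstract further assigns different image regions to different such processes): a dynamic texture generates each frame from a low-dimensional hidden state evolving as a linear dynamical system.

import numpy as np

rng = np.random.default_rng(0)
n, d, T = 8, 256, 50          # state dimension, pixels per frame, number of frames
A = 0.9 * np.linalg.qr(rng.standard_normal((n, n)))[0]    # stable state-transition matrix
C = rng.standard_normal((d, n))                           # state-to-pixel observation map

x = rng.standard_normal(n)
frames = []
for _ in range(T):
    x = A @ x + 0.1 * rng.standard_normal(n)              # state evolution x_t = A x_{t-1} + v_t
    frames.append(C @ x + 0.01 * rng.standard_normal(d))  # frame y_t = C x_t + w_t
video = np.stack(frames)                                  # (T, d) array, one row per frame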
5 0.92511594 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning
Author: Drew Bagnell, Andrew Y. Ng
Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the number of agents n. We prove a worst-case lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: they require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1
6 0.69535071 78 nips-2005-From Weighted Classification to Policy Search
7 0.63937682 153 nips-2005-Policy-Gradient Methods for Planning
8 0.63193721 46 nips-2005-Consensus Propagation
9 0.60926229 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
10 0.59030473 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
11 0.58869016 111 nips-2005-Learning Influence among Interacting Markov Chains
12 0.57455814 187 nips-2005-Temporal Abstraction in Temporal-difference Networks
13 0.56525403 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
14 0.56231469 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
15 0.54943061 36 nips-2005-Bayesian models of human action understanding
16 0.54905796 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes
17 0.54168898 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
18 0.53917027 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
19 0.52671987 181 nips-2005-Spiking Inputs to a Winner-take-all Network
20 0.52437592 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions