nips nips2005 nips2005-43 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bhaskar D. Rao, David P. Wipf
Abstract: Given a redundant dictionary of basis vectors (or atoms), our goal is to find maximally sparse representations of signals. Previously, we have argued that a sparse Bayesian learning (SBL) framework is particularly well-suited for this task, showing that it has far fewer local minima than other Bayesian-inspired strategies. In this paper, we provide further evidence for this claim by proving a restricted equivalence condition, based on the distribution of the nonzero generating model weights, whereby the SBL solution will equal the maximally sparse representation. We also prove that if these nonzero weights are drawn from an approximate Jeffreys prior, then with probability approaching one, our equivalence condition is satisfied. Finally, we motivate the worst-case scenario for SBL and demonstrate that it is still better than the most widely used sparse representation algorithms. These include Basis Pursuit (BP), which is based on a convex relaxation of the ℓ0 (quasi)-norm, and Orthogonal Matching Pursuit (OMP), a simple greedy strategy that iteratively selects basis vectors most aligned with the current residual. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Given a redundant dictionary of basis vectors (or atoms), our goal is to find maximally sparse representations of signals. [sent-4, score-0.485]
2 Previously, we have argued that a sparse Bayesian learning (SBL) framework is particularly well-suited for this task, showing that it has far fewer local minima than other Bayesian-inspired strategies. [sent-5, score-0.145]
3 In this paper, we provide further evidence for this claim by proving a restricted equivalence condition, based on the distribution of the nonzero generating model weights, whereby the SBL solution will equal the maximally sparse representation. [sent-6, score-0.557]
4 We also prove that if these nonzero weights are drawn from an approximate Jeffreys prior, then with probability approaching one, our equivalence condition is satisfied. [sent-7, score-0.43]
5 Finally, we motivate the worst-case scenario for SBL and demonstrate that it is still better than the most widely used sparse representation algorithms. [sent-8, score-0.181]
6 These include Basis Pursuit (BP), which is based on a convex relaxation of the ℓ0 (quasi)-norm, and Orthogonal Matching Pursuit (OMP), a simple greedy strategy that iteratively selects basis vectors most aligned with the current residual. [sent-9, score-0.094]
7 1 Introduction In recent years, there has been considerable interest in finding sparse signal representations from redundant dictionaries [1, 2, 3, 4, 5]. [sent-10, score-0.355]
8 t = Φw, (1) w where Φ ∈ RN ×M is a matrix whose columns represent an overcomplete or redundant basis (i. [sent-13, score-0.154]
9 , rank(Φ) = N and M > N ), w ∈ RM is the vector of weights to be learned, and t is the signal vector. [sent-15, score-0.158]
10 vector most aligned with the current signal residual. [sent-26, score-0.081]
11 At each step, a new approximant is formed by projecting t onto the range of all the selected dictionary atoms. [sent-27, score-0.162]
12 Previously [9], we have demonstrated an alternative algorithm for solving (1) using a sparse Bayesian learning (SBL) framework [6] that maintains several significant advantages over other, Bayesian-inspired strategies for finding sparse solutions [7, 8]. [sent-28, score-0.265]
13 The most basic formulation begins with an assumed likelihood model of the signal t given weights w, p(t|w) = (2πσ 2 )−N/2 exp − 1 t − Φw 2σ 2 2 2 . [sent-29, score-0.158]
14 (2) To provide a regularizing mechanism, SBL uses the parameterized weight prior M p(w; γ) = (2πγi ) i=1 −1/2 exp − 2 wi 2γi , (3) where γ = [γ1 , . [sent-30, score-0.174]
15 , γM ]T is a vector of M hyperparameters controlling the prior variance of each weight. [sent-33, score-0.073]
16 These hyperparameters can be estimated from the data by marginalizing over the weights and then performing ML optimization. [sent-34, score-0.141]
17 Given that t ∈ range(Φ) and assuming γ is initialized with all nonzero elements, then feasibility is enforced at every iteraˆ tion, i. [sent-41, score-0.172]
18 In comparing BP, OMP, and SBL, we would ultimately like to know in what situations a particular algorithm is likely to find the maximally sparse solution. [sent-46, score-0.254]
19 A variety of results stipulate rigorous conditions whereby BP and OMP are guaranteed to solve (1) [1, 4, 5]. [sent-47, score-0.129]
20 All of these conditions depend explicitly on the number of nonzero elements contained in the optimal solution. [sent-48, score-0.239]
21 Essentially, if this number is less than some Φ-dependent constant κ, the BP/OMP solution is proven to be equivalent to the minimum ℓ0 -norm solution. [sent-49, score-0.09]
22 But in practice, both approaches still perform well even when these equivalence conditions have been grossly violated. [sent-51, score-0.15]
23 This bound holds for “most” dictionaries in the limit as N becomes large [3], where “most” 1 Based on EM convergence properties, the algorithm will converge monotonically to a fixed point. [sent-53, score-0.107]
24 is with respect to dictionaries composed of columns drawn uniformly from the surface of an N -dimensional unit hypersphere. [sent-54, score-0.199]
25 For example, with M/N = 2, it is argued that BP is capable of resolving sparse solutions with roughly 0. [sent-55, score-0.239]
26 3N nonzero elements with probability approaching one as N → ∞. [sent-56, score-0.24]
27 This condition is dependent on the relative magnitudes of the nonzero elements embedded in optimal solutions to (1). [sent-58, score-0.352]
28 Additionally, we can leverage these ideas to motivate which sparse solutions are the most difficult to find. [sent-59, score-0.173]
29 2 Equivalence Conditions for SBL In this section, we establish conditions whereby wSBL will minimize (1). [sent-61, score-0.103]
30 First, we formally define a dictionary Φ = [φ1 , . [sent-63, score-0.162]
31 We say that a dictionary satisfies the unique representation property (URP) if every subset of ¯ N atoms forms a basis in RN . [sent-67, score-0.291]
32 We define w(i) as the i-th largest weight magnitude and w as the w 0 -dimensional vector containing all the nonzero weight magnitudes of w. [sent-68, score-0.36]
33 The diversity (or anti-sparsity) of each w∗ ∈ W ∗ is defined as D∗ w∗ 0 . [sent-70, score-0.079]
34 For a fixed dictionary Φ that satisfies the URP, there exists a set of M − 1 scaling constants νi ∈ (0, 1] (i. [sent-72, score-0.191]
35 The basic idea is that, as the magnitude differences between weights increase, at any given scale, the covariance Σt embedded in the SBL cost function is dominated by a single dictionary atom such that problematic local minimum are removed. [sent-79, score-0.334]
36 , SBL will find the unique, maximally sparse representation of the signal t. [sent-84, score-0.231]
37 These results are restrictive in the sense that the dictionary dependent constants νi significantly confine the class of signals t that we may represent. [sent-86, score-0.191]
38 But we have nonetheless solidified the notion that SBL is most capable of recovering weights of different scales (and it must still find all D∗ nonzero weights no matter how small some of them may be). [sent-88, score-0.544]
39 Additionally, we have specified conditions whereby we will find the unique w∗ even when the diversity is as large as D∗ = N − 1. [sent-89, score-0.215]
40 The tighter BP/OMP bound from [1, 4, 5] scales as O N −1/2 , although this latter bound is much more general in that it is independent of the magnitudes of the nonzero weights. [sent-90, score-0.283]
41 3 While these examples might seem slightly nuanced, the situations being illustrated can occur frequently in practice and the requisite column normalization introduces some complexity. [sent-95, score-0.076]
42 01 0 t = Φw∗ = ǫ √ 2 , 0 ǫ 1 + √2 (7) where Φ satisfies the URP and has columns φi of unit ℓ2 norm. [sent-99, score-0.065]
43 We now give an analogous example for BP, where we present a feasible smaller ℓ1 norm than the maximally sparse solution. [sent-108, score-0.231]
44 1), if we form a feasible solution using only φ1 , φ3 , and φ4 , we obtain the alternate solution w = √ √ T with w 1 ≈ 1 + 0. [sent-121, score-0.097]
45 02ǫ ℓ1 norm for all ǫ in the specified range, BP will necessarily fail and so again, we cannot reproduce the result for a similar reason as before. [sent-125, score-0.071]
46 At this point, it remains unclear what probability distributions are likely to produce weights that satisfy the conditions of Result 1. [sent-126, score-0.194]
47 This distribution has the unique property that the probability mass assigned to any given scaling is equal. [sent-128, score-0.062]
48 For a fixed Φ that satisfies the URP, let t be generated by t = Φw′ , where w′ has magnitudes drawn iid from J(a). [sent-137, score-0.142]
49 Then as a approaches zero, the probability that we obtain a w′ such that the conditions of Result 1 are satisfied approaches unity. [sent-138, score-0.085]
50 Then as a approaches zero, the probability that we satisfy the conditions of Corollary 1 approaches unity. [sent-145, score-0.11]
51 In conclusion, we have shown that a simple, (approximate) noninformative Jeffreys prior leads to sparse inverse problems that are optimally solved via SBL with high probability. [sent-146, score-0.149]
52 Interestingly, it is this same Jeffreys prior that forms the implicit weight prior of SBL (see [6], Section 5. [sent-147, score-0.121]
53 This is especially true if the weights are highly scaled, and no nontrivial equivalence results are known to exist for these procedures. [sent-155, score-0.191]
54 3 Worst-Case Scenario If the best-case scenario occurs when the nonzero weights are all of very different scales, it seems reasonable that the most difficult sparse inverse problem may involve weights of the same or even identical scale, e. [sent-156, score-0.56]
55 First, somewhat by considering the w we note that both the SBL cost function and update rules are independent of the overall ¯ ¯ scaling of the generating weights, meaning αw∗ is functionally equivalent to w∗ provided α is nonzero. [sent-163, score-0.066]
56 Therefore, we assume the weights are rescaled such that i wi = 1. [sent-165, score-0.191]
57 Given this restriction, we will find ¯∗ the distribution of weight magnitudes that is most different from the Jeffreys prior. [sent-166, score-0.137]
58 , wD∗ ) ¯∗ ¯∗ ∝ D∗ 1 D∗ i=1 wi ¯∗ for i=1 w1 = w2 ¯∗ ¯∗ wi = 1, wi ≥ 0, ∀i. [sent-170, score-0.264]
59 Consequently, equal weights are the absolute least likely to occur from the Jeffreys prior. [sent-175, score-0.152]
60 Hence, we may argue that the distribution that assigns wi = 1/D∗ with ¯∗ probability one is furthest from the constrained Jeffreys prior. [sent-176, score-0.122]
61 Nevertheless, because of the complexity of the SBL framework, it is difficult to prove ax¯ iomatically that w∗ ∼ 1 is overall the most problematic distribution with respect to sparse recovery. [sent-177, score-0.114]
62 As proven in [9], the global minimum of the SBL cost function is guaranteed to produce some w∗ ∈ W ∗ . [sent-179, score-0.121]
63 This minimum is achieved with the hyperparameters ∗ ∗ γi = (wi )2 , ∀i. [sent-180, score-0.07]
64 We can think of this solution as forming a collapsed, or degenerate covariance Σ∗ = ΦΓ∗ ΦT that occupies a proper D∗ -dimensional subspace of N -dimensional t signal space. [sent-181, score-0.118]
65 Moreover, this subspace must necessarily contain the signal vector t. [sent-182, score-0.113]
66 t Now consider an alternative covariance Σ⋄ that, although still full rank, is nonetheless illt conditioned (flattened), containing t within its high density region. [sent-184, score-0.099]
67 Thus, as we transition from Σ⋄ to Σ∗ , we t t t t necessarily reduce the density at t, thereby increasing the cost function L(γ). [sent-187, score-0.092]
68 The volume of the covariance within this subt ¯¯ ¯ ¯ ¯ space is given by ΦΓ∗ Φ∗T , where Φ∗ and Γ∗ are the basis vectors and hyperparameters ∗ ¯ associated with w . [sent-191, score-0.106]
69 The larger this volume, the higher the probability that other basis vectors will be suitably positioned so as to both (i), contain t within the high density portion and (ii), maintain a sufficient component that is misaligned with the optimal covariance. [sent-192, score-0.13]
70 Consequently, geometric considera¯∗ ¯∗ tions support the notion that deviance from the Jeffreys prior leads to difficulty recovering w∗ . [sent-196, score-0.082]
71 As previously mentioned, others have established deterministic equivalence conditions, dependent on D∗ , whereby BP and OMP are guaranteed to find the unique w∗ . [sent-199, score-0.24]
72 This is because, in the cases we have tested where BP/OMP equivalence is provably known to hold (e. [sent-201, score-0.088]
73 Given a fixed distribution for the nonzero elements of w∗ , we will assess which algorithm is best (at least empirically) for most dictionaries relative to a uniform measure on the unit sphere as discussed. [sent-205, score-0.342]
74 To this effect, a number of monte-carlo simulations were conducted, each consisting of the following: First, a random, overcomplete N × M dictionary Φ is created whose entries are each drawn uniformly from the surface of an N -dimensional hypersphere. [sent-206, score-0.216]
75 Next, sparse weight vectors w∗ are randomly generated with D∗ nonzero entries. [sent-207, score-0.359]
76 Nonzero amplitudes ¯ w∗ are drawn iid from an experiment-dependent distribution. [sent-208, score-0.105]
77 Under the specified conditions for the generation of Φ and t, all other feasible solutions w almost surely have a diversity greater than D∗ , so our synthetically generated w∗ must be maximally sparse. [sent-212, score-0.272]
78 With regard to particulars, there are essentially four variables with which to experiment: (i) ¯ the distribution of w∗ , (ii) the diversity D∗ , (iii) N , and (iv) M . [sent-214, score-0.079]
79 In each row of the figure, wi is drawn iid from ¯∗ a fixed distribution for all i; the first row uses wi = 1, the second has wi ∼ J(a = 0. [sent-216, score-0.38]
80 001), ¯∗ ¯∗ ∗ and the third uses wi ∼ N (0, 1), i. [sent-217, score-0.088]
81 In all cases, the signs of the nonzero ¯ weights are irrelevant due to the randomness inherent in the basis vectors. [sent-220, score-0.321]
82 The columns of Figure 1 are organized as follows: The first column is based on the values N = 50, D∗ = 16, while M is varied from N to 5N , testing the effects of an increasing level of dictionary redundancy, M/N . [sent-221, score-0.217]
83 The second fixes N = 50 and M = 100 while D∗ is varied from 10 to 30, exploring the ability of each algorithm to resolve an increasing number of nonzero weights. [sent-222, score-0.198]
84 The distribution of the nonzero weight amplitudes is labeled on the far left for each row, while the values for N , M , and D∗ are included on the top of each column. [sent-265, score-0.272]
85 The first row of plots essentially represents the worst-case scenario for SBL per our previous analysis, and yet performance is still consistently better than both BP and OMP. [sent-267, score-0.075]
86 The handful of failure events that do occur are because a is not sufficiently small and therefore, J(a) was not sufficiently close to a true Jeffreys prior to achieve perfect equivalence (see center plot). [sent-269, score-0.145]
87 Finally, the last row of plots, based on Gaussian distributed weight amplitudes, reflects a balance between these two extremes. [sent-271, score-0.081]
88 In general, we observe that SBL is capable of handling more redundant dictionaries (column one) and resolving a larger number of nonzero weights (column two). [sent-273, score-0.49]
89 Also, column three illustrates that both BP and SBL are able to resolve a number of weights that grows linearly in the signal dimension (≈ 0. [sent-274, score-0.209]
90 Finally, by comparing row one, two and three, we observe that the performance of BP is roughly independent of the weight distribution, with performance slightly below the worst- case SBL performance. [sent-278, score-0.103]
91 Like SBL, OMP results are highly dependent on the distribution; however, as the weight distribution approaches unity, performance is unsatisfactory. [sent-279, score-0.103]
92 5 Conclusions In this paper, we have related the ability to find maximally sparse solutions to the particular distribution of amplitudes that compose the nonzero elements. [sent-283, score-0.434]
93 At first glance, it may seem reasonable that the most difficult sparse inverse problems occur when some of the nonzero weights are extremely small, making them difficult to estimate. [sent-284, score-0.411]
94 In contrast, unit weights offer the most challenging task for SBL. [sent-286, score-0.138]
95 For a fixed dictionary and diversity D∗ , successful recovery of unit weights does not absolutely guarantee that any alternative weighting scheme will necessarily be recovered as well. [sent-288, score-0.406]
96 Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization,” Proc. [sent-294, score-0.221]
97 Fuchs, “On sparse representations in arbitrary redundant bases,” IEEE Transactions on Information Theory, vol. [sent-312, score-0.193]
98 Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information Theory, vol. [sent-318, score-0.114]
99 Rao, “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm,” IEEE Transactions on Signal Processing, vol. [sent-331, score-0.109]
100 Rao, “ℓ0 -norm minimization for basis selection,” Advances in Neural Information Processing Systems 17, pp. [sent-344, score-0.073]
wordName wordTfidf (topN-words)
[('sbl', 0.714), ('omp', 0.285), ('jeffreys', 0.235), ('bp', 0.216), ('nonzero', 0.172), ('dictionary', 0.162), ('wsbl', 0.116), ('sparse', 0.114), ('dictionaries', 0.107), ('weights', 0.103), ('equivalence', 0.088), ('wi', 0.088), ('magnitudes', 0.086), ('diversity', 0.079), ('urp', 0.077), ('nonetheless', 0.071), ('wipf', 0.067), ('whereby', 0.064), ('maximally', 0.062), ('signal', 0.055), ('old', 0.051), ('redundant', 0.051), ('weight', 0.051), ('atoms', 0.05), ('amplitudes', 0.049), ('recovering', 0.047), ('basis', 0.046), ('pursuit', 0.045), ('redundancy', 0.045), ('scenario', 0.045), ('wd', 0.043), ('rao', 0.041), ('approaching', 0.04), ('particulars', 0.039), ('conditions', 0.039), ('hyperparameters', 0.038), ('cost', 0.037), ('solutions', 0.037), ('unit', 0.035), ('prior', 0.035), ('furthest', 0.034), ('focuss', 0.034), ('misaligned', 0.034), ('quasi', 0.034), ('resolving', 0.034), ('unique', 0.033), ('feasible', 0.033), ('solution', 0.032), ('minimum', 0.032), ('subspace', 0.031), ('argued', 0.031), ('satis', 0.031), ('row', 0.03), ('columns', 0.03), ('residual', 0.03), ('situations', 0.029), ('iid', 0.029), ('scaling', 0.029), ('transactions', 0.029), ('dependent', 0.029), ('tt', 0.029), ('donoho', 0.029), ('density', 0.028), ('representations', 0.028), ('elements', 0.028), ('likely', 0.027), ('drawn', 0.027), ('necessarily', 0.027), ('minimization', 0.027), ('overcomplete', 0.027), ('guaranteed', 0.026), ('aligned', 0.026), ('proven', 0.026), ('maxi', 0.026), ('resolve', 0.026), ('scales', 0.025), ('column', 0.025), ('satisfy', 0.025), ('evidence', 0.025), ('unimodal', 0.024), ('capable', 0.023), ('neither', 0.023), ('corollary', 0.023), ('approaches', 0.023), ('occurs', 0.023), ('march', 0.023), ('unity', 0.023), ('diego', 0.023), ('fail', 0.022), ('comparing', 0.022), ('vectors', 0.022), ('considerations', 0.022), ('motivate', 0.022), ('weaker', 0.022), ('surely', 0.022), ('xes', 0.022), ('occur', 0.022), ('norm', 0.022), ('rank', 0.022), ('nd', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
Author: Bhaskar D. Rao, David P. Wipf
Abstract: Given a redundant dictionary of basis vectors (or atoms), our goal is to find maximally sparse representations of signals. Previously, we have argued that a sparse Bayesian learning (SBL) framework is particularly well-suited for this task, showing that it has far fewer local minima than other Bayesian-inspired strategies. In this paper, we provide further evidence for this claim by proving a restricted equivalence condition, based on the distribution of the nonzero generating model weights, whereby the SBL solution will equal the maximally sparse representation. We also prove that if these nonzero weights are drawn from an approximate Jeffreys prior, then with probability approaching one, our equivalence condition is satisfied. Finally, we motivate the worst-case scenario for SBL and demonstrate that it is still better than the most widely used sparse representation algorithms. These include Basis Pursuit (BP), which is based on a convex relaxation of the ℓ0 (quasi)-norm, and Orthogonal Matching Pursuit (OMP), a simple greedy strategy that iteratively selects basis vectors most aligned with the current residual. 1
2 0.095341146 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections
Author: Michael B. Wakin, Marco F. Duarte, Shriram Sarvotham, Dror Baron, Richard G. Baraniuk
Abstract: Compressed sensing is an emerging field based on the revelation that a small group of linear projections of a sparse signal contains enough information for reconstruction. In this paper we introduce a new theory for distributed compressed sensing (DCS) that enables new distributed coding algorithms for multi-signal ensembles that exploit both intra- and inter-signal correlation structures. The DCS theory rests on a new concept that we term the joint sparsity of a signal ensemble. We study three simple models for jointly sparse signals, propose algorithms for joint recovery of multiple signals from incoherent projections, and characterize theoretically and empirically the number of measurements per sensor required for accurate reconstruction. In some sense DCS is a framework for distributed compression of sources with memory, which has remained a challenging problem in information theory for some time. DCS is immediately applicable to a range of problems in sensor networks and arrays. 1
3 0.079636462 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
4 0.060539059 50 nips-2005-Convex Neural Networks
Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1
5 0.059184596 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1
6 0.054824114 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?
7 0.051402114 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
8 0.04898579 121 nips-2005-Location-based activity recognition
9 0.04840235 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
10 0.046148833 188 nips-2005-Temporally changing synaptic plasticity
11 0.044083714 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
12 0.043700591 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
13 0.04284092 79 nips-2005-Fusion of Similarity Data in Clustering
14 0.042340524 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
15 0.041136179 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
16 0.040969156 136 nips-2005-Noise and the two-thirds power Law
17 0.040669937 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
18 0.040314324 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation
19 0.040272992 167 nips-2005-Robust design of biological experiments
20 0.039963525 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
topicId topicWeight
[(0, 0.147), (1, 0.021), (2, -0.005), (3, 0.004), (4, 0.041), (5, -0.036), (6, -0.112), (7, -0.028), (8, 0.011), (9, -0.022), (10, 0.003), (11, -0.014), (12, 0.005), (13, 0.016), (14, 0.041), (15, 0.013), (16, -0.04), (17, 0.025), (18, -0.043), (19, -0.055), (20, -0.079), (21, 0.011), (22, -0.004), (23, -0.011), (24, -0.036), (25, -0.063), (26, -0.057), (27, -0.016), (28, 0.077), (29, -0.074), (30, -0.025), (31, -0.007), (32, -0.033), (33, 0.063), (34, 0.065), (35, 0.014), (36, -0.096), (37, 0.015), (38, -0.04), (39, -0.011), (40, -0.044), (41, -0.11), (42, -0.099), (43, 0.067), (44, -0.005), (45, 0.015), (46, -0.043), (47, -0.16), (48, 0.059), (49, 0.099)]
simIndex simValue paperId paperTitle
same-paper 1 0.91043484 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
Author: Bhaskar D. Rao, David P. Wipf
Abstract: Given a redundant dictionary of basis vectors (or atoms), our goal is to find maximally sparse representations of signals. Previously, we have argued that a sparse Bayesian learning (SBL) framework is particularly well-suited for this task, showing that it has far fewer local minima than other Bayesian-inspired strategies. In this paper, we provide further evidence for this claim by proving a restricted equivalence condition, based on the distribution of the nonzero generating model weights, whereby the SBL solution will equal the maximally sparse representation. We also prove that if these nonzero weights are drawn from an approximate Jeffreys prior, then with probability approaching one, our equivalence condition is satisfied. Finally, we motivate the worst-case scenario for SBL and demonstrate that it is still better than the most widely used sparse representation algorithms. These include Basis Pursuit (BP), which is based on a convex relaxation of the ℓ0 (quasi)-norm, and Orthogonal Matching Pursuit (OMP), a simple greedy strategy that iteratively selects basis vectors most aligned with the current residual. 1
2 0.73077869 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections
Author: Michael B. Wakin, Marco F. Duarte, Shriram Sarvotham, Dror Baron, Richard G. Baraniuk
Abstract: Compressed sensing is an emerging field based on the revelation that a small group of linear projections of a sparse signal contains enough information for reconstruction. In this paper we introduce a new theory for distributed compressed sensing (DCS) that enables new distributed coding algorithms for multi-signal ensembles that exploit both intra- and inter-signal correlation structures. The DCS theory rests on a new concept that we term the joint sparsity of a signal ensemble. We study three simple models for jointly sparse signals, propose algorithms for joint recovery of multiple signals from incoherent projections, and characterize theoretically and empirically the number of measurements per sensor required for accurate reconstruction. In some sense DCS is a framework for distributed compression of sources with memory, which has remained a challenging problem in information theory for some time. DCS is immediately applicable to a range of problems in sensor networks and arrays. 1
3 0.55110729 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
Author: Baback Moghaddam, Yair Weiss, Shai Avidan
Abstract: Sparse PCA seeks approximate sparse “eigenvectors” whose projections capture the maximal variance of data. As a cardinality-constrained and non-convex optimization problem, it is NP-hard and is encountered in a wide range of applied fields, from bio-informatics to finance. Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. The resulting performance gain of discrete algorithms is demonstrated on real-world benchmark data and in extensive Monte Carlo evaluation trials. 1
4 0.51888663 62 nips-2005-Efficient Estimation of OOMs
Author: Herbert Jaeger, Mingjie Zhao, Andreas Kolling
Abstract: A standard method to obtain stochastic models for symbolic time series is to train state-emitting hidden Markov models (SE-HMMs) with the Baum-Welch algorithm. Based on observable operator models (OOMs), in the last few months a number of novel learning algorithms for similar purposes have been developed: (1,2) two versions of an ”efficiency sharpening” (ES) algorithm, which iteratively improves the statistical efficiency of a sequence of OOM estimators, (3) a constrained gradient descent ML estimator for transition-emitting HMMs (TE-HMMs). We give an overview on these algorithms and compare them with SE-HMM/EM learning on synthetic and real-life data. 1
5 0.43768328 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?
Author: Yan Karklin, Michael S. Lewicki
Abstract: Linear implementations of the efficient coding hypothesis, such as independent component analysis (ICA) and sparse coding models, have provided functional explanations for properties of simple cells in V1 [1, 2]. These models, however, ignore the non-linear behavior of neurons and fail to match individual and population properties of neural receptive fields in subtle but important ways. Hierarchical models, including Gaussian Scale Mixtures [3, 4] and other generative statistical models [5, 6], can capture higher-order regularities in natural images and explain nonlinear aspects of neural processing such as normalization and context effects [6,7]. Previously, it had been assumed that the lower level representation is independent of the hierarchy, and had been fixed when training these models. Here we examine the optimal lower-level representations derived in the context of a hierarchical model and find that the resulting representations are strikingly different from those based on linear models. Unlike the the basis functions and filters learned by ICA or sparse coding, these functions individually more closely resemble simple cell receptive fields and collectively span a broad range of spatial scales. Our work unifies several related approaches and observations about natural image structure and suggests that hierarchical models might yield better representations of image structure throughout the hierarchy.
6 0.43031737 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
7 0.41944861 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
8 0.41723767 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
9 0.3953419 158 nips-2005-Products of ``Edge-perts
10 0.39072055 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
11 0.3878572 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
12 0.37901074 126 nips-2005-Metric Learning by Collapsing Classes
13 0.36314124 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
14 0.35917065 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences
15 0.35300761 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
16 0.34702742 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
17 0.34427595 79 nips-2005-Fusion of Similarity Data in Clustering
18 0.34180921 162 nips-2005-Rate Distortion Codes in Sensor Networks: A System-level Analysis
19 0.33564967 35 nips-2005-Bayesian model learning in human visual perception
20 0.33555746 174 nips-2005-Separation of Music Signals by Harmonic Structure Modeling
topicId topicWeight
[(3, 0.06), (10, 0.046), (27, 0.038), (31, 0.071), (34, 0.095), (39, 0.011), (41, 0.018), (50, 0.021), (52, 0.236), (55, 0.033), (65, 0.012), (69, 0.066), (73, 0.03), (77, 0.011), (88, 0.073), (91, 0.09)]
simIndex simValue paperId paperTitle
1 0.87605584 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
Author: Anna Levina, Michael Herrmann
Abstract: There is experimental evidence that cortical neurons show avalanche activity with the intensity of firing events being distributed as a power-law. We present a biologically plausible extension of a neural network which exhibits a power-law avalanche distribution for a wide range of connectivity parameters. 1
same-paper 2 0.83382893 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
Author: Bhaskar D. Rao, David P. Wipf
Abstract: Given a redundant dictionary of basis vectors (or atoms), our goal is to find maximally sparse representations of signals. Previously, we have argued that a sparse Bayesian learning (SBL) framework is particularly well-suited for this task, showing that it has far fewer local minima than other Bayesian-inspired strategies. In this paper, we provide further evidence for this claim by proving a restricted equivalence condition, based on the distribution of the nonzero generating model weights, whereby the SBL solution will equal the maximally sparse representation. We also prove that if these nonzero weights are drawn from an approximate Jeffreys prior, then with probability approaching one, our equivalence condition is satisfied. Finally, we motivate the worst-case scenario for SBL and demonstrate that it is still better than the most widely used sparse representation algorithms. These include Basis Pursuit (BP), which is based on a convex relaxation of the ℓ0 (quasi)-norm, and Orthogonal Matching Pursuit (OMP), a simple greedy strategy that iteratively selects basis vectors most aligned with the current residual. 1
3 0.76884532 77 nips-2005-From Lasso regression to Feature vector machine
Author: Fan Li, Yiming Yang, Eric P. Xing
Abstract: Lasso regression tends to assign zero weights to most irrelevant or redundant features, and hence is a promising technique for feature selection. Its limitation, however, is that it only offers solutions to linear models. Kernel machines with feature scaling techniques have been studied for feature selection with non-linear models. However, such approaches require to solve hard non-convex optimization problems. This paper proposes a new approach named the Feature Vector Machine (FVM). It reformulates the standard Lasso regression into a form isomorphic to SVM, and this form can be easily extended for feature selection with non-linear models by introducing kernels defined on feature vectors. FVM generates sparse solutions in the nonlinear feature space and it is much more tractable compared to feature scaling kernel machines. Our experiments with FVM on simulated data show encouraging results in identifying the small number of dominating features that are non-linearly correlated to the response, a task the standard Lasso fails to complete.
4 0.75707954 45 nips-2005-Conditional Visual Tracking in Kernel Space
Author: Cristian Sminchisescu, Atul Kanujia, Zhiguo Li, Dimitris Metaxas
Abstract: We present a conditional temporal probabilistic framework for reconstructing 3D human motion in monocular video based on descriptors encoding image silhouette observations. For computational efÄ?Ĺš ciency we restrict visual inference to low-dimensional kernel induced non-linear state spaces. Our methodology (kBME) combines kernel PCA-based non-linear dimensionality reduction (kPCA) and Conditional Bayesian Mixture of Experts (BME) in order to learn complex multivalued predictors between observations and model hidden states. This is necessary for accurate, inverse, visual perception inferences, where several probable, distant 3D solutions exist due to noise or the uncertainty of monocular perspective projection. Low-dimensional models are appropriate because many visual processes exhibit strong non-linear correlations in both the image observations and the target, hidden state variables. The learned predictors are temporally combined within a conditional graphical model in order to allow a principled propagation of uncertainty. We study several predictors and empirically show that the proposed algorithm positively compares with techniques based on regression, Kernel Dependency Estimation (KDE) or PCA alone, and gives results competitive to those of high-dimensional mixture predictors at a fraction of their computational cost. We show that the method successfully reconstructs the complex 3D motion of humans in real monocular video sequences. 1 Introduction and Related Work We consider the problem of inferring 3D articulated human motion from monocular video. This research topic has applications for scene understanding including human-computer interfaces, markerless human motion capture, entertainment and surveillance. A monocular approach is relevant because in real-world settings the human body parts are rarely completely observed even when using multiple cameras. This is due to occlusions form other people or objects in the scene. A robust system has to necessarily deal with incomplete, ambiguous and uncertain measurements. Methods for 3D human motion reconstruction can be classiÄ?Ĺš ed as generative and discriminative. They both require a state representation, namely a 3D human model with kinematics (joint angles) or shape (surfaces or joint positions) and they both use a set of image features as observations for state inference. The computational goal in both cases is the conditional distribution for the model state given image observations. Generative model-based approaches [6, 16, 14, 13] have been demonstrated to Ä?Ĺš‚exibly reconstruct complex unknown human motions and to naturally handle problem constraints. However it is difÄ?Ĺš cult to construct reliable observation likelihoods due to the complexity of modeling human appearance. This varies widely due to different clothing and deformation, body proportions or lighting conditions. Besides being somewhat indirect, the generative approach further imposes strict conditional independence assumptions on the temporal observations given the states in order to ensure computational tractability. Due to these factors inference is expensive and produces highly multimodal state distributions [6, 16, 13]. Generative inference algorithms require complex annealing schedules [6, 13] or systematic non-linear search for local optima [16] in order to ensure continuing tracking. These difÄ?Ĺš culties motivate the advent of a complementary class of discriminative algorithms [10, 12, 18, 2], that approximate the state conditional directly, in order to simplify inference. However, inverse, observation-to-state multivalued mappings are difÄ?Ĺš cult to learn (see e.g. Ä?Ĺš g. 1a) and a probabilistic temporal setting is necessary. In an earlier paper [15] we introduced a probabilistic discriminative framework for human motion reconstruction. Because the method operates in the originally selected state and observation spaces that can be task generic, therefore redundant and often high-dimensional, inference is more expensive and can be less robust. To summarize, reconstructing 3D human motion in a Figure 1: (a, Left) Example of 180o ambiguity in predicting 3D human poses from silhouette image features (center). It is essential that multiple plausible solutions (e.g. F 1 and F2 ) are correctly represented and tracked over time. A single state predictor will either average the distant solutions or zig-zag between them, see also tables 1 and 2. (b, Right) A conditional chain model. The local distributions p(yt |yt−1 , zt ) or p(yt |zt ) are learned as in Ä?Ĺš g. 2. For inference, the predicted local state conditional is recursively combined with the Ä?Ĺš ltered prior c.f . (1). conditional temporal framework poses the following difÄ?Ĺš culties: (i) The mapping between temporal observations and states is multivalued (i.e. the local conditional distributions to be learned are multimodal), therefore it cannot be accurately represented using global function approximations. (ii) Human models have multivariate, high-dimensional continuous states of 50 or more human joint angles. The temporal state conditionals are multimodal which makes efÄ?Ĺš cient Kalman Ä?Ĺš ltering algorithms inapplicable. General inference methods (particle Ä?Ĺš lters, mixtures) have to be used instead, but these are expensive for high-dimensional models (e.g. when reconstructing the motion of several people that operate in a joint state space). (iii) The components of the human state and of the silhouette observation vector exhibit strong correlations, because many repetitive human activities like walking or running have low intrinsic dimensionality. It appears wasteful to work with high-dimensional states of 50+ joint angles. Even if the space were truly high-dimensional, predicting correlated state dimensions independently may still be suboptimal. In this paper we present a conditional temporal estimation algorithm that restricts visual inference to low-dimensional, kernel induced state spaces. To exploit correlations among observations and among state variables, we model the local, temporal conditional distributions using ideas from Kernel PCA [11, 19] and conditional mixture modeling [7, 5], here adapted to produce multiple probabilistic predictions. The corresponding predictor is referred to as a Conditional Bayesian Mixture of Low-dimensional Kernel-Induced Experts (kBME). By integrating it within a conditional graphical model framework (Ä?Ĺš g. 1b), we can exploit temporal constraints probabilistically. We demonstrate that this methodology is effective for reconstructing the 3D motion of multiple people in monocular video. Our contribution w.r.t. [15] is a probabilistic conditional inference framework that operates over a non-linear, kernel-induced low-dimensional state spaces, and a set of experiments (on both real and artiÄ?Ĺš cial image sequences) that show how the proposed framework positively compares with powerful predictors based on KDE, PCA, or with the high-dimensional models of [15] at a fraction of their cost. 2 Probabilistic Inference in a Kernel Induced State Space We work with conditional graphical models with a chain structure [9], as shown in Ä?Ĺš g. 1b, These have continuous temporal states yt , t = 1 . . . T , observations zt . For compactness, we denote joint states Yt = (y1 , y2 , . . . , yt ) or joint observations Zt = (z1 , . . . , zt ). Learning and inference are based on local conditionals: p(yt |zt ) and p(yt |yt−1 , zt ), with yt and zt being low-dimensional, kernel induced representations of some initial model having state xt and observation rt . We obtain zt , yt from rt , xt using kernel PCA [11, 19]. Inference is performed in a low-dimensional, non-linear, kernel induced latent state space (see Ä?Ĺš g. 1b and Ä?Ĺš g. 2 and (1)). For display or error reporting, we compute the original conditional p(x|r), or a temporally Ä?Ĺš ltered version p(xt |Rt ), Rt = (r1 , r2 , . . . , rt ), using a learned pre-image state map [3]. 2.1 Density Propagation for Continuous Conditional Chains For online Ä?Ĺš ltering, we compute the optimal distribution p(yt |Zt ) for the state yt , conditioned by observations Zt up to time t. The Ä?Ĺš ltered density can be recursively derived as: p(yt |Zt ) = p(yt |yt−1 , zt )p(yt−1 |Zt−1 ) (1) yt−1 We compute using a conditional mixture for p(yt |yt−1 , zt ) (a Bayesian mixture of experts c.f . §2.2) and the prior p(yt−1 |Zt−1 ), each having, say M components. We integrate M 2 pairwise products of Gaussians analytically. The means of the expanded posterior are clustered and the centers are used to initialize a reduced M -component Kullback-Leibler approximation that is reÄ?Ĺš ned using gradient descent [15]. The propagation rule (1) is similar to the one used for discrete state labels [9], but here we work with multivariate continuous state spaces and represent the local multimodal state conditionals using kBME (Ä?Ĺš g. 2), and not log-linear models [9] (these would require intractable normalization). This complex continuous model rules out inference based on Kalman Ä?Ĺš ltering or dynamic programming [9]. 2.2 Learning Bayesian Mixtures over Kernel Induced State Spaces (kBME) In order to model conditional mappings between low-dimensional non-linear spaces we rely on kernel dimensionality reduction and conditional mixture predictors. The authors of KDE [19] propose a powerful structured unimodal predictor. This works by decorrelating the output using kernel PCA and learning a ridge regressor between the input and each decorrelated output dimension. Our procedure is also based on kernel PCA but takes into account the structure of the studied visual problem where both inputs and outputs are likely to be low-dimensional and the mapping between them multivalued. The output variables xi are projected onto the column vectors of the principal space in order to obtain their principal coordinates y i . A z ∈ P(Fr ) O p(y|z) kP CA ĂŽĹšr (r) ⊂ Fr O / y ∈ P(Fx ) O QQQ QQQ QQQ kP CA QQQ Q( ĂŽĹšx (x) ⊂ Fx x ≈ PreImage(y) O ĂŽĹšr ĂŽĹšx r ∈ R ⊂ Rr x ∈ X ⊂ Rx p(x|r) ≈ p(x|y) Figure 2: The learned low-dimensional predictor, kBME, for computing p(x|r) â‰Ä„ p(xt |rt ), ∀t. (We similarly learn p(xt |xt−1 , rt ), with input (x, r) instead of r – here we illustrate only p(x|r) for clarity.) The input r and the output x are decorrelated using Kernel PCA to obtain z and y respectively. The kernels used for the input and output are ĂŽĹš r and ĂŽĹšx , with induced feature spaces Fr and Fx , respectively. Their principal subspaces obtained by kernel PCA are denoted by P(Fr ) and P(Fx ), respectively. A conditional Bayesian mixture of experts p(y|z) is learned using the low-dimensional representation (z, y). Using learned local conditionals of the form p(yt |zt ) or p(yt |yt−1 , zt ), temporal inference can be efÄ?Ĺš ciently performed in a low-dimensional kernel induced state space (see e.g. (1) and Ä?Ĺš g. 1b). For visualization and error measurement, the Ä?Ĺš ltered density, e.g. p(yt |Zt ), can be mapped back to p(xt |Rt ) using the pre-image c.f . (3). similar procedure is performed on the inputs ri to obtain zi . In order to relate the reduced feature spaces of z and y (P(Fr ) and P(Fx )), we estimate a probability distribution over mappings from training pairs (zi , yi ). We use a conditional Bayesian mixture of experts (BME) [7, 5] in order to account for ambiguity when mapping similar, possibly identical reduced feature inputs to very different feature outputs, as common in our problem (Ä?Ĺš g. 1a). This gives a model that is a conditional mixture of low-dimensional kernel-induced experts (kBME): M g(z|δ j )N (y|Wj z, ĂŽĹ j ) p(y|z) = (2) j=1 where g(z|δ j ) is a softmax function parameterized by δ j and (Wj , ĂŽĹ j ) are the parameters and the output covariance of expert j, here a linear regressor. As in many Bayesian settings [17, 5], the weights of the experts and of the gates, Wj and δ j , are controlled by hierarchical priors, typically Gaussians with 0 mean, and having inverse variance hyperparameters controlled by a second level of Gamma distributions. We learn this model using a double-loop EM and employ ML-II type approximations [8, 17] with greedy (weight) subset selection [17, 15]. Finally, the kBME algorithm requires the computation of pre-images in order to recover the state distribution x from it’s image y ∈ P(Fx ). This is a closed form computation for polynomial kernels of odd degree. For more general kernels optimization or learning (regression based) methods are necessary [3]. Following [3, 19], we use a sparse Bayesian kernel regressor to learn the pre-image. This is based on training data (xi , yi ): p(x|y) = N (x|AĂŽĹšy (y), â„Ĺš) (3) with parameters and covariances (A, â„Ĺš). Since temporal inference is performed in the low-dimensional kernel induced state space, the pre-image function needs to be calculated only for visualizing results or for the purpose of error reporting. Propagating the result from the reduced feature space P(Fx ) to the output space X pro- duces a Gaussian mixture with M elements, having coefÄ?Ĺš cients g(z|δ j ) and components N (x|AĂŽĹšy (Wj z), AJĂŽĹšy ĂŽĹ j JĂŽĹšy A + â„Ĺš), where JĂŽĹšy is the Jacobian of the mapping ĂŽĹšy . 3 Experiments We run experiments on both real image sequences (Ä?Ĺš g. 5 and Ä?Ĺš g. 6) and on sequences where silhouettes were artiÄ?Ĺš cially rendered. The prediction error is reported in degrees (for mixture of experts, this is w.r.t. the most probable one, but see also Ä?Ĺš g. 4a), and normalized per joint angle, per frame. The models are learned using standard cross-validation. Pre-images are learned using kernel regressors and have average error 1.7o . Training Set and Model State Representation: For training we gather pairs of 3D human poses together with their image projections, here silhouettes, using the graphics package Maya. We use realistically rendered computer graphics human surface models which we animate using human motion capture [1]. Our original human representation (x) is based on articulated skeletons with spherical joints and has 56 skeletal d.o.f. including global translation. The database consists of 8000 samples of human activities including walking, running, turns, jumps, gestures in conversations, quarreling and pantomime. Image Descriptors: We work with image silhouettes obtained using statistical background subtraction (with foreground and background models). Silhouettes are informative for pose estimation although prone to ambiguities (e.g. the left / right limb assignment in side views) or occasional lack of observability of some of the d.o.f. (e.g. 180o ambiguities in the global azimuthal orientation for frontal views, e.g. Ä?Ĺš g. 1a). These are multiplied by intrinsic forward / backward monocular ambiguities [16]. As observations r, we use shape contexts extracted on the silhouette [4] (5 radial, 12 angular bins, size range 1/8 to 3 on log scale). The features are computed at different scales and sizes for points sampled on the silhouette. To work in a common coordinate system, we cluster all features in the training set into K = 50 clusters. To compute the representation of a new shape feature (a point on the silhouette), we ‘project’ onto the common basis by (inverse distance) weighted voting into the cluster centers. To obtain the representation (r) for a new silhouette we regularly sample 200 points on it and add all their feature vectors into a feature histogram. Because the representation uses overlapping features of the observation the elements of the descriptor are not independent. However, a conditional temporal framework (Ä?Ĺš g. 1b) Ä?Ĺš‚exibly accommodates this. For experiments, we use Gaussian kernels for the joint angle feature space and dot product kernels for the observation feature space. We learn state conditionals for p(yt |zt ) and p(yt |yt−1 , zt ) using 6 dimensions for the joint angle kernel induced state space and 25 dimensions for the observation induced feature space, respectively. In Ä?Ĺš g. 3b) we show an evaluation of the efÄ?Ĺš cacy of our kBME predictor for different dimensions in the joint angle kernel induced state space (the observation feature space dimension is here 50). On the analyzed dancing sequence, that involves complex motions of the arms and the legs, the non-linear model signiÄ?Ĺš cantly outperforms alternative PCA methods and gives good predictions for compact, low-dimensional models.1 In tables 1 and 2, as well as Ä?Ĺš g. 4, we perform quantitative experiments on artiÄ?Ĺš cially rendered silhouettes. 3D ground truth joint angles are available and this allows a more 1 Running times: On a Pentium 4 PC (3 GHz, 2 GB RAM), a full dimensional BME model with 5 experts takes 802s to train p(xt |xt−1 , rt ), whereas a kBME (including the pre-image) takes 95s to train p(yt |yt−1 , zt ). The prediction time is 13.7s for BME and 8.7s (including the pre-image cost 1.04s) for kBME. The integration in (1) takes 2.67s for BME and 0.31s for kBME. The speed-up for kBME is signiÄ?Ĺš cant and likely to increase with original models having higher dimensionality. Prediction Error Number of Clusters 100 1000 100 10 1 1 2 3 4 5 6 7 8 Degree of Multimodality kBME KDE_RVM PCA_BME PCA_RVM 10 1 0 20 40 Number of Dimensions 60 Figure 3: (a, Left) Analysis of ‘multimodality’ for a training set. The input zt dimension is 25, the output yt dimension is 6, both reduced using kPCA. We cluster independently in (yt−1 , zt ) and yt using many clusters (2100) to simulate small input perturbations and we histogram the yt clusters falling within each cluster in (yt−1 , zt ). This gives intuition on the degree of ambiguity in modeling p(yt |yt−1 , zt ), for small perturbations in the input. (b, Right) Evaluation of dimensionality reduction methods for an artiÄ?Ĺš cial dancing sequence (models trained on 300 samples). The kBME is our model §2.2, whereas the KDE-RVM is a KDE model learned with a Relevance Vector Machine (RVM) [17] feature space map. PCA-BME and PCA-RVM are models where the mappings between feature spaces (obtained using PCA) is learned using a BME and a RVM. The non-linearity is signiÄ?Ĺš cant. Kernel-based methods outperform PCA and give low prediction error for 5-6d models. systematic evaluation. Notice that the kernelized low-dimensional models generally outperform the PCA ones. At the same time, they give results competitive to the ones of high-dimensional BME predictors, while being lower-dimensional and therefore signiÄ?Ĺš cantly less expensive for inference, e.g. the integral in (1). In Ä?Ĺš g. 5 and Ä?Ĺš g. 6 we show human motion reconstruction results for two real image sequences. Fig. 5 shows the good quality reconstruction of a person performing an agile jump. (Given the missing observations in a side view, 3D inference for the occluded body parts would not be possible without using prior knowledge!) For this sequence we do inference using conditionals having 5 modes and reduced 6d states. We initialize tracking using p(yt |zt ), whereas for inference we use p(yt |yt−1 , zt ) within (1). In the second sequence in Ä?Ĺš g. 6, we simultaneously reconstruct the motion of two people mimicking domestic activities, namely washing a window and picking an object. Here we do inference over a product, 12-dimensional state space consisting of the joint 6d state of each person. We obtain good 3D reconstruction results, using only 5 hypotheses. Notice however, that the results are not perfect, there are small errors in the elbow and the bending of the knee for the subject at the l.h.s., and in the different wrist orientations for the subject at the r.h.s. This reÄ?Ĺš‚ects the bias of our training set. Walk and turn Conversation Run and turn left KDE-RR 10.46 7.95 5.22 RVM 4.95 4.96 5.02 KDE-RVM 7.57 6.31 6.25 BME 4.27 4.15 5.01 kBME 4.69 4.79 4.92 Table 1: Comparison of average joint angle prediction error for different models. All kPCA-based models use 6 output dimensions. Testing is done on 100 video frames for each sequence, the inputs are artiÄ?Ĺš cially generated silhouettes, not in the training set. 3D joint angle ground truth is used for evaluation. KDE-RR is a KDE model with ridge regression (RR) for the feature space mapping, KDE-RVM uses an RVM. BME uses a Bayesian mixture of experts with no dimensionality reduction. kBME is our proposed model. kPCAbased methods use kernel regressors to compute pre-images. Expert Prediction Frequency − Closest to Ground truth Frequency − Close to ground truth 30 25 20 15 10 5 0 1 2 3 4 Expert Number 14 10 8 6 4 2 0 5 1st Probable Prev Output 2nd Probable Prev Output 3rd Probable Prev Output 4th Probable Prev Output 5th Probable Prev Output 12 1 2 3 4 Current Expert 5 Figure 4: (a, Left) Histogram showing the accuracy of various expert predictors: how many times the expert ranked as the k-th most probable by the model (horizontal axis) is closest to the ground truth. The model is consistent (the most probable expert indeed is the most accurate most frequently), but occasionally less probable experts are better. (b, Right) Histograms show the dynamics of p(yt |yt−1 , zt ), i.e. how the probability mass is redistributed among experts between two successive time steps, in a conversation sequence. Walk and turn back Run and turn KDE-RR 7.59 17.7 RVM 6.9 16.8 KDE-RVM 7.15 16.08 BME 3.6 8.2 kBME 3.72 8.01 Table 2: Joint angle prediction error computed for two complex sequences with walks, runs and turns, thus more ambiguity (100 frames). Models have 6 state dimensions. Unimodal predictors average competing solutions. kBME has signiÄ?Ĺš cantly lower error. Figure 5: Reconstruction of a jump (selected frames). Top: original image sequence. Middle: extracted silhouettes. Bottom: 3D reconstruction seen from a synthetic viewpoint. 4 Conclusion We have presented a probabilistic framework for conditional inference in latent kernelinduced low-dimensional state spaces. Our approach has the following properties: (a) Figure 6: Reconstructing the activities of 2 people operating in an 12-d state space (each person has its own 6d state). Top: original image sequence. Bottom: 3D reconstruction seen from a synthetic viewpoint. Accounts for non-linear correlations among input or output variables, by using kernel nonlinear dimensionality reduction (kPCA); (b) Learns probability distributions over mappings between low-dimensional state spaces using conditional Bayesian mixture of experts, as required for accurate prediction. In the resulting low-dimensional kBME predictor ambiguities and multiple solutions common in visual, inverse perception problems are accurately represented. (c) Works in a continuous, conditional temporal probabilistic setting and offers a formal management of uncertainty. We show comparisons that demonstrate how the proposed approach outperforms regression, PCA or KDE alone for reconstructing the 3D human motion in monocular video. Future work we will investigate scaling aspects for large training sets and alternative structured prediction methods. References [1] CMU Human Motion DataBase. Online at http://mocap.cs.cmu.edu/search.html, 2003. [2] A. Agarwal and B. Triggs. 3d human pose from silhouettes by Relevance Vector Regression. In CVPR, 2004. [3] G. Bakir, J. Weston, and B. Scholkopf. Learning to Ä?Ĺš nd pre-images. In NIPS, 2004. [4] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24, 2002. [5] C. Bishop and M. Svensen. Bayesian mixtures of experts. In UAI, 2003. [6] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In CVPR, 2000. [7] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, (6):181–214, 1994. [8] D. Mackay. Bayesian interpolation. Neural Computation, 4(5):720–736, 1992. [9] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, 2000. [10] R. Rosales and S. Sclaroff. Learning Body Pose Via Specialized Maps. In NIPS, 2002. [11] B. Sch¨ lkopf, A. Smola, and K. M¨ ller. Nonlinear component analysis as a kernel eigenvalue o u problem. Neural Computation, 10:1299–1319, 1998. [12] G. Shakhnarovich, P. Viola, and T. Darrell. Fast Pose Estimation with Parameter Sensitive Hashing. In ICCV, 2003. [13] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard. Tracking Loose-limbed People. In CVPR, 2004. [14] C. Sminchisescu and A. Jepson. Generative Modeling for Continuous Non-Linearly Embedded Visual Inference. In ICML, pages 759–766, Banff, 2004. [15] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative Density Propagation for 3D Human Motion Estimation. In CVPR, 2005. [16] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D Human Tracking. In CVPR, volume 1, pages 69–76, Madison, 2003. [17] M. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. JMLR, 2001. [18] C. Tomasi, S. Petrov, and A. Sastry. 3d tracking = classiÄ?Ĺš cation + interpolation. In ICCV, 2003. [19] J. Weston, O. Chapelle, A. Elisseeff, B. Scholkopf, and V. Vapnik. Kernel dependency estimation. In NIPS, 2002.
5 0.6215905 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
Author: O. P. Kreidl, Alan S. Willsky
Abstract: Given a directed graphical model with binary-valued hidden nodes and real-valued noisy observations, consider deciding upon the maximum a-posteriori (MAP) or the maximum posterior-marginal (MPM) assignment under the restriction that each node broadcasts only to its children exactly one single-bit message. We present a variational formulation, viewing the processing rules local to all nodes as degrees-of-freedom, that minimizes the loss in expected (MAP or MPM) performance subject to such online communication constraints. The approach leads to a novel message-passing algorithm to be executed offline, or before observations are realized, which mitigates the performance loss by iteratively coupling all rules in a manner implicitly driven by global statistics. We also provide (i) illustrative examples, (ii) assumptions that guarantee convergence and efficiency and (iii) connections to active research areas. 1
6 0.60926014 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
7 0.60738987 78 nips-2005-From Weighted Classification to Policy Search
8 0.60555798 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
9 0.59768105 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
10 0.59310097 30 nips-2005-Assessing Approximations for Gaussian Process Classification
11 0.59007102 46 nips-2005-Consensus Propagation
12 0.58748996 184 nips-2005-Structured Prediction via the Extragradient Method
13 0.58696449 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
14 0.58613604 144 nips-2005-Off-policy Learning with Options and Recognizers
15 0.58473587 47 nips-2005-Consistency of one-class SVM and related algorithms
16 0.58464217 58 nips-2005-Divergences, surrogate loss functions and experimental design
17 0.58454496 105 nips-2005-Large-Scale Multiclass Transduction
18 0.58394009 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
19 0.58168024 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
20 0.58057481 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction