nips2008-198: Resolution Limits of Sparse Coding in High Dimensions (NIPS 2008, knowledge-graph by maker-knowledge-mining)
Source: pdf
Author: Sundeep Rangan, Vivek Goyal, Alyson K. Fletcher
Abstract: This paper addresses the problem of sparsity pattern detection for unknown k-sparse n-dimensional signals observed through m noisy, random linear measurements. Sparsity pattern recovery arises in a number of settings including statistical model selection, pattern detection, and image acquisition. The main results in this paper are necessary and sufficient conditions for asymptotically-reliable sparsity pattern recovery in terms of the dimensions m, n and k as well as the signal-to-noise ratio (SNR) and the minimum-to-average ratio (MAR) of the nonzero entries of the signal. We show that m > 2k log(n − k)/(SNR · MAR) is necessary for any algorithm to succeed, regardless of complexity; this matches a previous sufficient condition for maximum likelihood estimation within a constant factor under certain scalings of k, SNR and MAR with n. We also show a sufficient condition for a computationally-trivial thresholding algorithm that is larger than the previous expression by only a factor of 4(1 + SNR) and larger than the requirement for lasso by only a factor of 4/MAR. This provides insight on the precise value and limitations of convex programming-based algorithms.
1 Introduction

Sparse signal models have been used successfully in a variety of applications including wavelet-based image processing and pattern recognition. Recent research has shown that certain naturally-occurring neurological processes may exploit sparsity as well [1–3]. Due to the nonlinear nature of sparse signal models, developing and analyzing algorithms for sparse signal processing has been a major research challenge.
To keep the analysis general, we consider the following abstract estimation problem: an unknown sparse signal x is modeled as an n-dimensional real vector with k nonzero components. The locations of the nonzero components are called the sparsity pattern. We consider the problem of detecting the sparsity pattern of x from an m-dimensional measurement vector y = Ax + d, where A ∈ R^(m×n) is a known measurement matrix and d ∈ R^m is an additive noise vector with a known distribution.
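To make the setup concrete, the following sketch (assuming NumPy; the dimensions and noise level are illustrative choices, while the N(0, 1/m) column scaling follows the normalization used later in the text) generates a k-sparse signal and noisy random measurements y = Ax + d.

```python
import numpy as np

def generate_instance(n=256, k=8, m=64, noise_std=0.1, seed=0):
    """Draw a k-sparse signal x, a Gaussian measurement matrix A, and y = A x + d."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)        # the true sparsity pattern
    x[support] = rng.normal(size=k)                        # nonzero entries
    A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))    # i.i.d. N(0, 1/m) entries
    d = rng.normal(scale=noise_std, size=m)                # additive Gaussian noise
    y = A @ x + d
    return A, x, y, set(support.tolist())
```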
While optimal sparsity pattern detection is NP-hard [4], greedy heuristics (matching pursuit [5] and its variants) and convex relaxations (basis pursuit [6], lasso [7], and others) have been widely used since at least the mid 1990s. Some remarkable recent results are sets of conditions that can guarantee exact sparsity recovery based on certain simple “incoherence” conditions on the measurement matrix A [8–10].
The main theoretical results are conditions that guarantee sparsity pattern detection with convex programming methods. In particular, the analysis of [14] shows that the scaling m > 2k log(n − k) + k + 1 (1) is necessary and sufficient for lasso to detect the sparsity pattern when A has Gaussian entries, provided the SNR scales to infinity. This paper presents new necessary and sufficient conditions, summarized in Table 1 along with Wainwright's lasso scaling (1).
The necessary condition applies to all algorithms, regardless of complexity. As described in Section 3, our new necessary condition is stronger than previous bounds in certain important regimes. The sufficient condition is derived for a computationally-trivial thresholding estimator. By comparing with the lasso scaling, we argue that the main benefit of more sophisticated methods, such as lasso, is generally not in the scaling with respect to k and n but rather in the dependence on the minimum-to-average ratio.
Denote the sparsity pattern of x (the positions of its nonzero entries) by the set Itrue, which is a k-element subset of the set of indices {1, 2, . . . , n}. Estimates of the sparsity pattern will be denoted by Î with subscripts indicating the type of estimator. In addition to the signal dimensions m, n and k, we will show that there are two variables that dictate the ability to detect the sparsity pattern reliably: the signal-to-noise ratio (SNR), and what we will call the minimum-to-average ratio (MAR).
The entries of the measurement matrix A are i.i.d. N(0, 1/m), so the columns ai ∈ R^m and aj ∈ R^m of A satisfy E[ai′ aj] = δij. The SNR is defined as SNR = E‖Ax‖² / E‖d‖². Therefore, the signal energy is given by

E‖Ax‖² = Σ_{i,j∈Itrue} E[ai′ aj xi xj] = Σ_{i,j∈Itrue} xi xj δij = ‖x‖².

The minimum-to-average ratio is

MAR = min_{j∈Itrue} |xj|² / (‖x‖²/k).  (5)

Since ‖x‖²/k is the average of {|xj|² : j ∈ Itrue}, MAR ∈ (0, 1], with the upper limit occurring when all the nonzero entries of x have the same magnitude. One final value that will be important is the minimum component SNR, defined as

SNRmin = (1/E‖d‖²) min_{j∈Itrue} E‖aj xj‖² = (1/m) min_{j∈Itrue} |xj|².  (6)

The quantity SNRmin has a natural interpretation: the numerator, min_{j∈Itrue} E‖aj xj‖², is the signal power due to the smallest nonzero component of x, while the denominator, E‖d‖², is the total noise power. The ratio SNRmin thus represents the contribution to the SNR from the smallest nonzero component of the unknown vector x.
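As a small sketch of these definitions (assuming NumPy; noise_energy stands in for E‖d‖², which the analysis treats as known), the three quantities can be computed from a sparse vector directly, using the fact that the expected signal energy E‖Ax‖² equals ‖x‖² under the column scaling above.

```python
import numpy as np

def signal_stats(x, noise_energy):
    """Compute SNR, MAR and SNRmin for a sparse vector x.

    noise_energy plays the role of E||d||^2; with E[a_i' a_j] = delta_ij,
    the expected signal energy E||Ax||^2 equals ||x||^2.
    """
    nz = x[x != 0]
    k = nz.size
    snr = np.sum(nz ** 2) / noise_energy            # SNR = ||x||^2 / E||d||^2
    mar = np.min(nz ** 2) / (np.sum(nz ** 2) / k)   # minimum-to-average ratio, in (0, 1]
    snr_min = np.min(nz ** 2) / noise_energy        # SNRmin = min_j E||a_j x_j||^2 / E||d||^2
    assert np.isclose(snr * mar, k * snr_min)       # the identity SNR * MAR = k * SNRmin
    return snr, mar, snr_min
```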
Other works use different scalings: the entries of A have variance 1/n in [13, 19]; the entries of A have unit variance and the variance of d is a variable σ² in [14, 17, 20, 21]; and our scaling of A and a noise variance of σ² are used in [22].
3 Necessary Condition for Sparsity Recovery

We first consider sparsity recovery without being concerned with the computational complexity of the estimation algorithm. Each candidate sparsity pattern corresponds to the subspace spanned by the associated k columns of A. Estimation of the sparsity pattern is the selection of one of these subspaces, and since the noise d is Gaussian, the probability of error is minimized by choosing the subspace closest to the observed vector y.
The ML estimate of the sparsity pattern is

ÎML = arg max_{J : |J|=k} ‖PJ y‖²,

where |J| denotes the cardinality of J and PJ denotes the orthogonal projection onto the span of the columns of A indexed by J. That is, the ML estimate is the set of k indices such that the subspace spanned by the corresponding columns of A contains the maximum signal energy of y.
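The exhaustive search over all k-element supports can be written down directly; the sketch below (NumPy, illustrative only, and practical only for small n given the combinatorial search noted earlier) computes ‖PJ y‖² through a least-squares fit onto each candidate column subset.

```python
from itertools import combinations
import numpy as np

def ml_sparsity_pattern(A, y, k):
    """Exhaustive ML estimate: the k columns whose span captures the most energy of y."""
    best_support, best_energy = None, -np.inf
    for J in combinations(range(A.shape[1]), k):
        AJ = A[:, list(J)]
        coef, *_ = np.linalg.lstsq(AJ, y, rcond=None)   # least-squares fit on columns J
        energy = np.sum((AJ @ coef) ** 2)               # ||P_J y||^2
        if energy > best_energy:
            best_support, best_energy = set(J), energy
    return best_support
```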
However, the performance of ML estimation provides a lower bound on the number of measurements needed by any algorithm that cannot exploit a priori information on x other than it being k-sparse. ML estimation for sparsity recovery was first examined in [17].
There, it was shown that there exists a constant C > 0 such that the condition

m > C max{ log(n − k)/SNRmin , k log(n/k) } = C max{ k log(n − k)/(SNR · MAR) , k log(n/k) }  (8)

is sufficient for ML to asymptotically reliably recover the sparsity pattern.
Our Theorem 1 gives a matching necessary condition: if

m < 2k log(n − k)/(SNR · MAR) + k − 1,  (9)

then even the ML estimator asymptotically cannot detect the sparsity pattern, i.e., lim_{n→∞} Pr(ÎML = Itrue) = 0.
The theorem provides a simple lower bound on the minimum number of measurements required to recover the sparsity pattern in terms of k, n and the minimum component SNR, SNRmin.
The theorem strengthens an earlier necessary condition in [18], which showed that there exists a C > 0 such that m = C k log(n − k)/SNR is necessary for asymptotically reliable recovery. In particular, under linear sparsity (k = αn for some constant α), the theorem shows that m ≍ 2α n log(n)/(MAR · SNR) measurements are necessary for sparsity recovery.
Similarly, if m/n is bounded above by a constant, then sparsity recovery will certainly fail unless k = O(SNR · MAR · n / log n). In particular, when SNR · MAR is bounded, the sparsity ratio k/n must approach zero. In the case where SNR · MAR and the sparsity ratio k/n are both constant, the sufficient condition (8) reduces to m = (C/(SNR · MAR)) k log(n − k), which matches the necessary condition in (9) within a constant factor.
In the case of MAR · SNR < 1, the bound (9) improves upon the necessary condition of [14] for the asymptotic success of lasso by the factor 1/(MAR · SNR). When the sparsity ratio k/n and SNR are both fixed, m can satisfy the capacity-based bound (10) of previous information-theoretic analyses while growing only linearly with k.
In contrast, Theorem 1 shows that with the sparsity ratio and SNR · MAR fixed, m = Ω(k log(n − k)) is necessary for reliable sparsity recovery. That is, the number of measurements must grow superlinearly in k in the linear sparsity regime with bounded SNR. In the sublinear regime where k = o(n), the capacity-based bound (10) may be stronger than (9) depending on the scaling of SNR, MAR and other terms. The previous results showed that with fixed SNR as defined here, sparsity recovery with m = Θ(k) must fail.
That sufficient condition holds for scalings that give linear sparsity and MAR · SNR = Ω(√(n log n)). For MAR · SNR = √(n log n), Theorem 1 shows that fewer than m ≍ 2√(k log k) measurements will cause ML decoding to fail; the corresponding sufficient condition is given in [21].
The necessary condition (9) shows a dependence on the minimum-to-average ratio MAR instead of just the average power through SNR. Signals with MAR < 1 are created by having one small nonzero component and k − 1 equal, larger nonzero components. Empirically, for the (small) value of n = 20, it seems that with MAR · SNR held fixed, sparsity recovery becomes easier as SNR increases (and MAR decreases).
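A Monte Carlo sketch along these lines (hypothetical parameter values; it reuses ml_sparsity_pattern from the earlier sketch and assumes unit-variance noise entries, so that E‖d‖² = m and SNR = ‖x‖²/m) constructs signals with one small and k − 1 equal larger entries, sweeps SNR with MAR · SNR held fixed, and records how often exhaustive ML recovers the true support.

```python
import numpy as np
# Reuses ml_sparsity_pattern from the exhaustive-ML sketch above.

def sparse_signal(n, k, snr, mar, m, rng):
    """One small and k-1 equal larger nonzero entries, scaled to the target SNR."""
    energy = snr * m                           # ||x||^2, assuming E||d||^2 = m
    small = np.sqrt(mar * energy / k)          # MAR = min|x_j|^2 / (||x||^2 / k)
    big = np.sqrt((energy - small ** 2) / (k - 1))
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support[0]], x[support[1:]] = small, big
    return x, set(support.tolist())

def success_rate(n=20, k=3, m=12, mar_snr=2.0, snrs=(2.0, 8.0, 32.0), trials=100):
    """Estimate Pr(exact support recovery) for several SNRs with MAR * SNR held fixed."""
    rng = np.random.default_rng(0)
    rates = {}
    for snr in snrs:
        mar = min(1.0, mar_snr / snr)          # hold MAR * SNR fixed (MAR capped at 1)
        hits = 0
        for _ in range(trials):
            x, support = sparse_signal(n, k, snr, mar, m, rng)
            A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
            d = rng.normal(size=m)             # unit-variance noise entries, E||d||^2 = m
            y = A @ x + d
            hits += ml_sparsity_pattern(A, y, k) == support
        rates[snr] = hits / trials
    return rates
```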
Define the thresholding estimate as

Îthresh = { j : |aj′ y|² > µ },  (11)

where µ > 0 represents a threshold level. This algorithm simply correlates the observed signal y with all the frame vectors aj and selects the indices j where the correlation energy exceeds a certain level µ.
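A sketch of the correlation-thresholding estimator in (11) follows (NumPy; the keep-k-largest fallback is a practical convenience added here, not the rule analyzed in the text, and the theorem below only asserts that a suitable threshold sequence µ(n) exists).

```python
import numpy as np

def thresholding_sparsity_pattern(A, y, mu=None, k=None):
    """Select indices whose correlation energy |a_j' y|^2 exceeds the threshold mu."""
    corr = (A.T @ y) ** 2                      # correlation energies |a_j' y|^2 for all j
    if mu is not None:
        return set(np.flatnonzero(corr > mu).tolist())
    # Fallback: keep the k largest correlation energies (a common practical variant).
    return set(np.argsort(corr)[::-1][:k].tolist())
```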
It is significantly simpler than both lasso and matching pursuit and is not meant to be proposed as a competitive alternative. Rather, we consider thresholding simply to illustrate what precise benefits lasso and more sophisticated methods bring. Sparsity pattern recovery by thresholding was studied in [24], which proves a sufficient condition when there is no noise.
Our Theorem 2 gives a sufficient condition in the noisy setting: if

m > 8 (1 + SNR) k log(n − k) / (SNR · MAR),  (12)

then there exists a sequence of threshold levels µ = µ(n) such that thresholding asymptotically detects the sparsity pattern, i.e., lim_{n→∞} Pr(Îthresh = Itrue) = 1. Comparing (9) and (12), we see that thresholding requires a factor of at most 4(1 + SNR) more measurements than ML estimation. Thus, for a fixed SNR, the optimal scaling not only does not require ML estimation, it does not even require lasso or matching pursuit; it can be achieved with a remarkably simple method.
Nevertheless, the gap between thresholding and ML of 4(1 + SNR) measurements can be large. For ML estimation, the lower bound on the number of measurements required by ML decreases to k − 1 as SNR → ∞.¹
In contrast, with thresholding, increasing the SNR has diminishing returns: as SNR → ∞, the bound on the number of measurements in (12) approaches

m > (8/MAR) k log(n − k).  (13)

Thus, even with SNR → ∞, the minimum number of measurements is not improved from m = Ω(k log(n − k)).
This diminishing return for improved SNR exhibited by thresholding is also a problem for more sophisticated methods such as lasso. For example, as discussed earlier, the analysis of [14] shows that when SNR · MAR → ∞, lasso requires m > 2k log(n − k) + k + 1 (14) for reliable recovery.
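To see how these scalings compare numerically, the following sketch evaluates the expressions discussed above for illustrative values of n, k, SNR and MAR; the closed forms for (9) and (12) are the reconstructions used in this text (leading terms only), so the outputs should be read as rough orders of magnitude rather than exact thresholds.

```python
import numpy as np

def measurement_bounds(n, k, snr, mar):
    """Sample-size expressions from the text, evaluated for one parameter setting."""
    log_term = np.log(n - k)
    m_any = 2 * k * log_term / (snr * mar) + k - 1          # (9): necessary for any algorithm
    m_thresh = 8 * (1 + snr) * k * log_term / (snr * mar)   # (12): sufficient for thresholding
    m_thresh_inf = 8 * k * log_term / mar                   # (13): limit of (12) as SNR -> inf
    m_lasso_inf = 2 * k * log_term + k + 1                  # (14): lasso scaling, SNR -> inf
    return dict(necessary=m_any, thresholding=m_thresh,
                thresholding_high_snr=m_thresh_inf, lasso_high_snr=m_lasso_inf)

# Example (illustrative): n = 1000, k = 20, SNR = 10, MAR = 0.5
print(measurement_bounds(1000, 20, 10.0, 0.5))
```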
Therefore, like thresholding, lasso does not achieve a scaling better than m = O(k log(n − k)), even at infinite SNR. There is also a gap between thresholding and lasso. Comparing (13) and (14), we see that, at high SNR, thresholding requires a factor of up to 4/MAR more measurements than lasso. This factor is largest when MAR is small, which occurs when there are relatively small nonzero components. Thus, in the high-SNR regime, the main benefit of lasso is its ability to detect small coefficients, even when they are much below the average power. However, if the range of component magnitudes is not large, so MAR is close to one, lasso and thresholding have equal performance within a constant factor.

¹ Of course, at least k + 1 measurements are necessary.
5 Conclusions

We have considered the problem of detecting the sparsity pattern of a sparse vector from noisy random linear measurements. Necessary and sufficient scaling laws for the number of measurements to recover the sparsity pattern for different detection algorithms were derived. The analysis reveals the effect of two key factors: the total signal-to-noise ratio (SNR), as well as the minimum-to-average ratio (MAR), which is a measure of the spread of component magnitudes. Our main conclusions are:

• Tight scaling laws for constant SNR and MAR. In the regime where SNR = Θ(1) and MAR = Θ(1), our results show that the scaling of the number of measurements m = O(k log(n − k)) is both necessary and sufficient for asymptotically reliable sparsity pattern detection. Moreover, the scaling can be achieved with a thresholding algorithm, which is computationally simpler than even lasso or basis pursuit. Under the additional assumption of linear sparsity (k/n fixed), this scaling is a larger number of measurements than predicted by previous information-theoretic bounds.
• While the number of measurements required for exhaustive ML estimation and simple thresholding has the same dependence on n and k with the SNR fixed, the dependence on SNR differs significantly. Specifically, thresholding requires a factor of up to 4(1 + SNR) more measurements than ML. Moreover, as SNR → ∞, the number of measurements required by ML may be as low as m = k + 1. In contrast, even letting SNR → ∞, thresholding and lasso still require m = O(k log(n − k)) measurements.

• There is a potential gap between thresholding and lasso, but the gap is smaller than the gap to ML. Specifically, in the high-SNR regime, thresholding requires at most a factor of 4/MAR more measurements than lasso. Thus, the benefit of lasso over simple thresholding is its ability to detect the sparsity pattern even in the presence of relatively small nonzero coefficients (i.e., when MAR is small). However, when the components of the unknown vector have similar magnitudes (MAR close to one), the gap between lasso and simple thresholding is reduced.
While our results provide both necessary and sufficient scaling laws, there is clearly a gap in terms of the scaling with the SNR. A second open issue is to determine conditions for partial sparsity recovery. The above results define conditions for recovering all the positions in the sparsity pattern. Neither the limits of partial sparsity recovery nor the performance of practical algorithms are completely understood, though some results have been reported in [19–21, 25].
References (partial; only these titles survive in the extraction)
Sparse coding via thresholding and local competition in neural circuits.
Stable recovery of sparse overcomplete representations in the presence of noise.
Near-optimal signal recovery from random projections: Universal encoding strategies?
Sharp thresholds for high-dimensional and noisy recovery of sparsity.
Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting.
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions
Author: Sundeep Rangan, Vivek Goyal, Alyson K. Fletcher
Abstract: This paper addresses the problem of sparsity pattern detection for unknown ksparse n-dimensional signals observed through m noisy, random linear measurements. Sparsity pattern recovery arises in a number of settings including statistical model selection, pattern detection, and image acquisition. The main results in this paper are necessary and sufficient conditions for asymptotically-reliable sparsity pattern recovery in terms of the dimensions m, n and k as well as the signal-tonoise ratio (SNR) and the minimum-to-average ratio (MAR) of the nonzero entries of the signal. We show that m > 2k log(n − k)/(SNR · MAR) is necessary for any algorithm to succeed, regardless of complexity; this matches a previous sufficient condition for maximum likelihood estimation within a constant factor under certain scalings of k, SNR and MAR with n. We also show a sufficient condition for a computationally-trivial thresholding algorithm that is larger than the previous expression by only a factor of 4(1 + SNR) and larger than the requirement for lasso by only a factor of 4/MAR. This provides insight on the precise value and limitations of convex programming-based algorithms.
2 0.46463087 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
Author: John W. Roberts, Russ Tedrake
Abstract: Policy gradient (PG) reinforcement learning algorithms have strong (local) convergence guarantees, but their learning performance is typically limited by a large variance in the estimate of the gradient. In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Perturbation (WP) algorithm. We confirm that SNR is a good predictor of long-term learning performance, and that in our episodic formulation, the cost-to-go function is indeed the optimal baseline. We then propose two modifications to traditional model-free policy gradient algorithms in order to optimize the SNR. First, we examine WP using anisotropic sampling distributions, which introduces a bias into the update but increases the SNR; this bias can be interpreted as following the natural gradient of the cost function. Second, we show that non-Gaussian distributions can also increase the SNR, and argue that the optimal isotropic distribution is a ‘shell’ distribution with a constant magnitude and uniform distribution in direction. We demonstrate that both modifications produce substantial improvements in learning performance in challenging policy gradient experiments. 1
3 0.13956702 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
Author: Volkan Cevher, Marco F. Duarte, Chinmay Hegde, Richard Baraniuk
Abstract: Compressive Sensing (CS) combines sampling and compression into a single subNyquist linear measurement process for sparse and compressible signals. In this paper, we extend the theory of CS to include signals that are concisely represented in terms of a graphical model. In particular, we use Markov Random Fields (MRFs) to represent sparse signals whose nonzero coefficients are clustered. Our new model-based recovery algorithm, dubbed Lattice Matching Pursuit (LaMP), stably recovers MRF-modeled signals using many fewer measurements and computations than the current state-of-the-art algorithms.
4 0.13219996 179 nips-2008-Phase transitions for high-dimensional joint support recovery
Author: Sahand Negahban, Martin J. Wainwright
Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of 1 / ∞ -regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of 1 / ∞ -regularized quadratic programming, considering both consistency rates in ∞ -norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ∞ error as well sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of 1 / ∞ -regularization is qualitatively similar to that of ordinary 1 -regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the 1 / ∞ -regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of 1,∞ -regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 1
5 0.13157827 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1
6 0.11052889 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
8 0.094873235 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
9 0.092963181 99 nips-2008-High-dimensional support union recovery in multivariate regression
10 0.076588728 214 nips-2008-Sparse Online Learning via Truncated Gradient
11 0.0645006 106 nips-2008-Inferring rankings under constrained sensing
12 0.061650116 75 nips-2008-Estimating vector fields using sparse basis field expansions
13 0.060636591 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
14 0.059667539 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
15 0.056730054 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
16 0.052728333 216 nips-2008-Sparse probabilistic projections
17 0.048269842 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
18 0.048188977 62 nips-2008-Differentiable Sparse Coding
19 0.043284953 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
20 0.041203368 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
topicId topicWeight
[(0, -0.133), (1, 0.011), (2, -0.067), (3, 0.128), (4, 0.134), (5, 0.045), (6, -0.164), (7, -0.034), (8, -0.121), (9, 0.149), (10, -0.218), (11, -0.135), (12, -0.004), (13, 0.028), (14, -0.09), (15, 0.0), (16, -0.034), (17, 0.01), (18, 0.037), (19, -0.058), (20, 0.247), (21, 0.07), (22, 0.214), (23, 0.017), (24, -0.07), (25, 0.07), (26, -0.318), (27, -0.207), (28, 0.139), (29, -0.125), (30, -0.291), (31, 0.142), (32, 0.039), (33, 0.176), (34, -0.02), (35, -0.057), (36, 0.05), (37, -0.002), (38, -0.042), (39, 0.043), (40, 0.128), (41, -0.005), (42, 0.037), (43, 0.023), (44, 0.058), (45, 0.097), (46, -0.0), (47, -0.029), (48, -0.011), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.97191012 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions
Author: Sundeep Rangan, Vivek Goyal, Alyson K. Fletcher
Abstract: This paper addresses the problem of sparsity pattern detection for unknown ksparse n-dimensional signals observed through m noisy, random linear measurements. Sparsity pattern recovery arises in a number of settings including statistical model selection, pattern detection, and image acquisition. The main results in this paper are necessary and sufficient conditions for asymptotically-reliable sparsity pattern recovery in terms of the dimensions m, n and k as well as the signal-tonoise ratio (SNR) and the minimum-to-average ratio (MAR) of the nonzero entries of the signal. We show that m > 2k log(n − k)/(SNR · MAR) is necessary for any algorithm to succeed, regardless of complexity; this matches a previous sufficient condition for maximum likelihood estimation within a constant factor under certain scalings of k, SNR and MAR with n. We also show a sufficient condition for a computationally-trivial thresholding algorithm that is larger than the previous expression by only a factor of 4(1 + SNR) and larger than the requirement for lasso by only a factor of 4/MAR. This provides insight on the precise value and limitations of convex programming-based algorithms.
2 0.85493588 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
Author: John W. Roberts, Russ Tedrake
Abstract: Policy gradient (PG) reinforcement learning algorithms have strong (local) convergence guarantees, but their learning performance is typically limited by a large variance in the estimate of the gradient. In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Perturbation (WP) algorithm. We confirm that SNR is a good predictor of long-term learning performance, and that in our episodic formulation, the cost-to-go function is indeed the optimal baseline. We then propose two modifications to traditional model-free policy gradient algorithms in order to optimize the SNR. First, we examine WP using anisotropic sampling distributions, which introduces a bias into the update but increases the SNR; this bias can be interpreted as following the natural gradient of the cost function. Second, we show that non-Gaussian distributions can also increase the SNR, and argue that the optimal isotropic distribution is a ‘shell’ distribution with a constant magnitude and uniform distribution in direction. We demonstrate that both modifications produce substantial improvements in learning performance in challenging policy gradient experiments. 1
3 0.41356459 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
Author: Volkan Cevher, Marco F. Duarte, Chinmay Hegde, Richard Baraniuk
Abstract: Compressive Sensing (CS) combines sampling and compression into a single subNyquist linear measurement process for sparse and compressible signals. In this paper, we extend the theory of CS to include signals that are concisely represented in terms of a graphical model. In particular, we use Markov Random Fields (MRFs) to represent sparse signals whose nonzero coefficients are clustered. Our new model-based recovery algorithm, dubbed Lattice Matching Pursuit (LaMP), stably recovers MRF-modeled signals using many fewer measurements and computations than the current state-of-the-art algorithms.
4 0.27653009 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
5 0.27303725 179 nips-2008-Phase transitions for high-dimensional joint support recovery
Author: Sahand Negahban, Martin J. Wainwright
Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of 1 / ∞ -regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of 1 / ∞ -regularized quadratic programming, considering both consistency rates in ∞ -norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ∞ error as well sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of 1 / ∞ -regularization is qualitatively similar to that of ordinary 1 -regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the 1 / ∞ -regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of 1,∞ -regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 1
6 0.24912925 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
7 0.24832378 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
8 0.23955582 202 nips-2008-Robust Regression and Lasso
10 0.2277679 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
11 0.22542806 99 nips-2008-High-dimensional support union recovery in multivariate regression
12 0.21239434 75 nips-2008-Estimating vector fields using sparse basis field expansions
13 0.19828831 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
14 0.19801606 106 nips-2008-Inferring rankings under constrained sensing
15 0.19653788 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
16 0.17754026 62 nips-2008-Differentiable Sparse Coding
17 0.17462043 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
18 0.16314499 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
19 0.16243583 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
20 0.15731871 134 nips-2008-Mixed Membership Stochastic Blockmodels
topicId topicWeight
[(6, 0.091), (7, 0.098), (12, 0.02), (16, 0.025), (24, 0.02), (28, 0.112), (34, 0.013), (53, 0.011), (57, 0.043), (59, 0.358), (63, 0.025), (77, 0.04), (83, 0.027)]
simIndex simValue paperId paperTitle
1 0.88714617 23 nips-2008-An ideal observer model of infant object perception
Author: Charles Kemp, Fei Xu
Abstract: Before the age of 4 months, infants make inductive inferences about the motions of physical objects. Developmental psychologists have provided verbal accounts of the knowledge that supports these inferences, but often these accounts focus on categorical rather than probabilistic principles. We propose that infant object perception is guided in part by probabilistic principles like persistence: things tend to remain the same, and when they change they do so gradually. To illustrate this idea we develop an ideal observer model that incorporates probabilistic principles of rigidity and inertia. Like previous researchers, we suggest that rigid motions are expected from an early age, but we challenge the previous claim that the inertia principle is relatively slow to develop [1]. We support these arguments by modeling several experiments from the developmental literature. Over the past few decades, ingenious experiments [1, 2] have suggested that infants rely on systematic expectations about physical objects when interpreting visual scenes. Looking time studies suggest, for example, that infants expect objects to follow continuous trajectories through time and space, and understand that two objects cannot simultaneously occupy the same location. Many of these studies have been replicated several times, but there is still no consensus about the best way to characterize the knowledge that gives rise to these findings. Two main approaches can be found in the literature. The verbal approach uses natural language to characterize principles of object perception [1, 3]: for example, Spelke [4] proposes that object perception is consistent with principles including continuity (“a moving object traces exactly one connected path over space and time”) and cohesion (“a moving object maintains its connectedness and boundaries”). The mechanistic approach proposes that physical knowledge is better characterized by describing the mechanisms that give rise to behavior, and researchers working in this tradition often develop computational models that support their theoretical proposals [5]. We pursue a third approach—the ideal observer approach [6, 7, 8]—that combines aspects of both previous traditions. Like the verbal approach, our primary goal is to characterize principles that account for infant behavior, and we will not attempt to characterize the mechanisms that produce this behavior. Like the mechanistic approach, we emphasize the importance of formal models, and suggest that these models can capture forms of knowledge that are difficult for verbal accounts to handle. Ideal observer models [6, 9] specify the conclusions that normatively follow given a certain source of information and a body of background knowledge. These models can therefore address questions about the information and the knowledge that support perception. Approaches to the information question characterize the kinds of perceptual information that human observers use. For example, Geisler [9] discusses which components of the information available at the retina contribute to visual perception, and Banks and Shannon [10] use ideal observer models to study the perceptual consequences of immaturities in the retina. Approaches to the knowledge question characterize the background assumptions that are combined with the available input in order to make inductive inferences. For example, Weiss and Adelson [7] describe several empirical phenomena that are consistent with the a priori assumption that motions tend to be slow and smooth. 
There are few previous attempts to develop ideal observer models of infant perception, and most of them focus only on the information question [10]. This paper addresses the knowledge question, and proposes that the ideal observer approach can help to identify the minimal set of principles needed to account for the visual competence of young infants. Most verbal theories of object perception focus on categorical principles [4], or principles that make a single distinction between possible and impossible scenes. We propose that physical knowledge in infancy is also characterized by probabilistic principles, or expectations that make some possible scenes more surprising than others. We demonstrate the importance of probabilistic principles by focusing on two examples: the rigidity principle states that objects usually maintain their shape and size when they move, and the inertia principle states that objects tend to maintain the same pattern of motion over time. Both principles capture important regularities, but exceptions to these regularities are relatively common. Focusing on rigidity and inertia allows us to demonstrate two contributions that probabilistic approaches can make. First, probabilistic approaches can reinforce current proposals about infant perception. Spelke [3] suggests that rigidity is a core principle that guides object perception from a very early age, and we demonstrate how this idea can be captured by a model that also tolerates exceptions, such as non-rigid biological motion. Second, probabilistic approaches can identify places where existing proposals may need to be revised. Spelke [3] argues that the principle of inertia is slow to develop, but we suggest that a probabilistic version of this principle can help to account for inferences made early in development. 1 An ideal observer approach An ideal observer approach to object perception can be formulated in terms of a generative model for scenes. Scenes can be generated in three steps. First we choose the number n of objects that will appear in the scene, and generate the shape, visual appearance, and initial location of each object. We then choose a velocity field for each object which specifies how the object moves and changes shape over time. Finally, we create a visual scene by taking a two-dimensional projection of the moving objects generated in the two previous steps. An ideal observer approach explores the idea that the inferences made by infants approximate the optimal inferences with respect to this generative model. We work within this general framework but make two simplifications. We will not discuss how the shapes and visual appearances of objects are generated, and we make the projection step simple by working with a two-dimensional world. These simplifications allow us to focus on the expectations about velocity fields that guide motion perception in infants. The next two sections present two prior distributions that can be used to generate velocity fields. The first is a baseline prior that does not incorporate probabilistic principles, and the second incorporates probabilistic versions of rigidity and inertia. The two priors capture different kinds of knowledge, and we argue that the second provides the more accurate characterization of the knowledge that infants bring to object perception. 1.1 A baseline prior on velocity fields Our baseline prior is founded on five categorical principles that are closely related to principles discussed by Spelke [3, 4]. 
The principles we consider rely on three basic notions: space, time, and matter. We also refer to particles, which are small pieces of matter that occupy space-time points. Particles satisfy several principles: C1. Temporal continuity. Particles are not created or destroyed. In other words, every particle that exists at time t1 must also exist at time t2 . C2. Spatial continuity. Each particle traces a continuous trajectory through space. C3. Exclusion. No two particles may occupy the same space-time point. An object is a collection of particles, and these collections satisfy two principles: C4. Discreteness. Each particle belongs to exactly one object. C5. Cohesion. At each point in time, the particles belonging to an object occupy a single connected region of space. Suppose that we are interested in a space-time window specified by a bounded region of space and a bounded interval of time. For simplicity, we will assume that space is two-dimensional, and that the space-time window corresponds to the unit cube. Suppose that a velocity field v assigns a velocity (vx , vy ) to each particle in the space-time window, and let vi be the field created by considering only particles that belong to object i. We develop a theory of object perception by defining a prior distribution p(v) on velocity fields. Consider first the distribution p(v1 ) on fields for a single object. Any field that violates one or more of principles C1–C5 is assigned zero probability. For instance, fields where part of an object winks out of existence violate the principle of temporal continuity, and fields where an object splits into two distinct pieces violate the principle of cohesion. Many fields, however, remain, including fields that specify non-rigid motions and jagged trajectories. For now, assume that we are working with a space of fields that is bounded but very large, and that the prior distribution over this space is uniform for all fields consistent with principles C1–C5: p(v1 ) ∝ f (v1 ) = 0 1 if v1 violates C1–C5 otherwise. (1) Consider now the distribution p(v1 , v2 ) on fields for pairs of objects. Principles C1 through C5 rule out some of these fields, but again we must specify a prior distribution on those that remain. Our prior is induced by the following principle: C6. Independence. Velocity fields for multiple objects are independently generated subject to principles C1 through C5. More formally, the independence principle specifies how the prior for the multiple object case is related to the prior p(v1 ) on velocity fields for a single object (Equation 1): p(v1 , . . . , vn ) ∝ f (v1 , . . . , vn ) = 0 if {vi } collectively violate C1–C5 f (v1 ) . . . f (vn ) otherwise. (2) 1.2 A smoothness prior on velocity fields We now develop a prior p(v) that incorporates probabilistic expectations about the motion of physical objects. Consider again the prior p(v1 ) on the velocity field v1 of a single object. Principles C1–C5 make a single cut that distinguishes possible from impossible fields, but we need to consider whether infants have additional knowledge that makes some of the possible fields less surprising than others. One informal idea that seems relevant is the notion of persistence[11]: things tend to remain the same, and when they change they do so gradually. We focus on two versions of this idea that may guide expectations about velocity fields: S1. Spatial smoothness. Velocity fields tend to be smooth in space. S2. Temporal smoothness. Velocity fields tend to be smooth in time. 
A field is “smooth in space” if neighboring particles tend to have similar velocities at any instant of time. The smoothest possible field will be one where all particles have the same velocity at any instant—in other words, where an object moves rigidly. The principle of spatial smoothness therefore captures the idea that objects tend to maintain the same shape and size. A field is “smooth in time” if any particle tends to have similar velocities at nearby instants of time. The smoothest possible field will be one where each particle maintains the same velocity throughout the entire interval of interest. The principle of temporal smoothness therefore captures the idea that objects tend to maintain their initial pattern of motion. For instance, stationary objects tend to remain stationary, moving objects tend to keep moving, and a moving object following a given trajectory tends to continue along that trajectory. Principles S1 and S2 are related to two principles— rigidity and inertia—that have been discussed in the developmental literature. The rigidity principle states that objects “tend to maintain their size and shape over motion”[3], and the inertia principle states that objects move smoothly in the absence of obstacles [4]. Some authors treat these principles rather differently: for instance, Spelke suggests that rigidity is one of the core principles that guides object perception from a very early age [3], but that the principle of inertia is slow to develop and is weak or fragile once acquired. Since principles S1 and S2 seem closely related, the suggestion that one develops much later than the other seems counterintuitive. The rest of this paper explores the idea that both of these principles are needed to characterize infant perception. Our arguments will be supported by formal analyses, and we therefore need formal versions of S1 and S2. There may be different ways to formalize these principles, but we present a simple L1 L2 U b) 200 L1 L2 U 0 log “ p(H1 |v) p(H2 |v) ” a) −200 baseline smoothness Figure 1: (a) Three scenes inspired by the experiments of Spelke and colleagues [12, 13]. Each scene can be interpreted as a single object, or as a small object on top of a larger object. (b) Relative preferences for the one-object and two-object interpretations according to two models. The baseline model prefers the one-object interpretation in all three cases, but the smoothness model prefers the one-object interpretation only for scenes L1 and L2. approach that builds on existing models of motion perception in adults [7, 8]. We define measures of instantaneous roughness that capture how rapidly a velocity field v varies in space and time: Rspace (v, t) = ∂v(x, y, t) ∂x 2 ∂v(x, y, t) ∂t 1 vol(O(t)) 2 + ∂v(x, y, t) ∂y 2 dxdy (3) O(t) Rtime (v, t) = 1 vol(O(t)) dxdy (4) O(t) where O(t) is the set of all points that are occupied by the object at time t, and vol(O(t)) is the volume of the object at time t. Rspace (v, t) will be large if neighboring particles at time t tend to have different velocities, and Rtime (v, t) will be large if many particles are accelerating at time t. We combine our two roughness measures to create a single smoothness function S(·) that measures the smoothness of a velocity field: S(v) = −λspace Rspace (v, t)dt − λtime Rtime (v, t)dt (5) where λspace and λtime are positive weights that capture the importance of spatial smoothness and temporal smoothness. 
For all analyses in this paper we set λspace = 10000 and λtime = 250, which implies that violations of spatial smoothness are penalized more harshly than violations of temporal smoothness. We now replace Equation 1 with a prior on velocity fields that takes smoothness into account: 0 if v1 violates C1–C5 p(v1 ) ∝ f (v1 ) = (6) exp (S(v1 )) otherwise. Combining Equation 6 with Equation 2 specifies a model of object perception that incorporates probabilistic principles of rigidity and inertia. 2 Empirical findings: spatial smoothness There are many experiments where infants aged 4 months and younger appear to make inferences that are consistent with the principle of rigidity. This section suggests that the principle of spatial smoothness can account for these results. We therefore propose that a probabilistic principle (spatial smoothness) can explain all of the findings previously presented in support of a categorical principle (rigidity), and can help in addition to explain how infants perceive non-rigid motion. One set of studies explores inferences about the number of objects in a scene. When a smaller block is resting on top of a larger block (L1 in Figure 1a), 3-month-olds infer that the scene includes a single object [12]. The same result holds when the small and large blocks are both moving in the same direction (L2 in Figure 1a) [13]. When these blocks are moving in opposite directions (U in Figure 1a), however, infants appear to infer that the scene contains two objects [13]. Results like these suggest that infants may have a default expectation that objects tend to move rigidly. We compared the predictions made by two models about the scenes in Figure 1a. The smoothness model uses a prior p(v1 ) that incorporates principles S1 and S2 (Equation 6), and the baseline model is identical except that it sets λspace = λtime = 0. Both models therefore incorporate principles C1– C6, but only the smoothness model captures the principle of spatial smoothness. Given any of the scenes in Figure 1a, an infant must solve two problems: she must compute the velocity field v for the scene and must decide whether this field specifies the motion of one or two objects. Here we focus on the second problem, and assume that the infant’s perceptual system has already computed a veridical velocity field for each scene that we consider. In principle, however, the smoothness prior in Equation 6 can address both problems. Previous authors have shown how smoothness priors can be used to compute velocity fields given raw image data [7, 8]. Let H1 be the hypothesis that a given velocity field corresponds to a single object, and let H2 be the hypothesis that the field specifies the motions of two objects. We assume that the prior probabilities of these hypotheses are equal, and that P (H1 ) = P (H2 ) = 0.5. An ideal observer can use the posterior odds ratio to choose between these hypotheses: P (H1 |v) P (v|H1 ) P (H1 ) = ≈ P (H2 |v) P (v|H2 ) P (H2 ) f (v) f (v1 )dv1 f (v1 , v2 )dv1 dv2 f (vA , vB ) (7) Equation 7 follows from Equations 2 and 6, and from approximating P (v|H2 ) by considering only the two object interpretation (vA , vB ) with maximum posterior probability. For each scene in Figure 1a, the best two object interpretation will specify a field vA for the small upper block, and a field vB for the large lower block. To approximate the posterior odds ratio in Equation 7 we compute rough approximations of f (v1 )dv1 and f (v1 , v2 )dv1 dv2 by summing over a finite space of velocity fields. 
As described in the supporting material, we consider all fields that can be built from objects with 5 possible shapes, 900 possible starting locations, and 10 possible trajectories. For computational tractability, we convert each continuous velocity field to a discrete field defined over a space-time grid with 45 cells along each spatial dimension and 21 cells along the temporal dimension. Our results show that both models prefer the one-object hypothesis H1 when presented with scenes L1 and L2 (Figure 1b). Since there are many more two-object scenes than one-object scenes, any typical two-object interpretation is assigned lower prior probability than a typical one-object interpretation. This preference for simpler interpretations is a consequence of the Bayesian Occam’s razor. The baseline model makes the same kind of inference about scene U, and again prefers the one-object interpretation. Like infants, however, the smoothness model prefers the two-object interpretation of scene U. This model assigns low probability to a one-object interpretation where adjacent points on the object have very different velocities, and this preference for smooth motion is strong enough to overcome the simplicity preference that makes the difference when interpreting the other two scenes. Other experiments from the developmental literature have produced results consistent with the principle of spatial smoothness. For example, 3.5-month olds are surprised when a tall object is fully hidden behind a short screen, 4 month olds are surprised when a large object appears to pass through a small slot, and 4.5-month olds expect a swinging screen to be interrupted when an object is placed in its path [1, 2]. All three inferences appear to rely on the expectation that objects tend not to shrink or to compress like foam rubber. Many of these experiments are consistent with an account that simply rules out non-rigid motion instead of introducing a graded preference for spatial smoothness. Biological motions, however, are typically non-rigid, and experiments suggest that infants can track and make inferences about objects that follow non-rigid trajectories [14]. Findings like these call for a theory like ours that incorporates a preference for rigid motion, but recognizes that non-rigid motions are possible. 3 Empirical findings: temporal smoothness We now turn to the principle of temporal smoothness (S2) and discuss some of the experimental evidence that bears on this principle. Some researchers suggest that a closely related principle (inertia) is slow to develop, but we argue that expectations about temporal smoothness are needed to capture inferences made before the age of 4 months. Baillargeon and DeVos [15] describe one relevant experiment that explores inferences about moving objects and obstacles. During habituation, 3.5-month-old infants saw a car pass behind an occluder and emerge from the other side (habituation stimulus H in Figure 2a). An obstacle was then placed in the direct path of the car (unlikely scenes U1 and U2) or beside this direct path (likely scene L), and the infants again saw the car pass behind the occluder and emerge from the other side. Looking L U1 U2 p(L) p(U 1) ” log “ p(L) p(U 2) ” 400 H “ 600 a) log log “ pH (L) pH (U 1) ” log “ pH (L) pH (U 2) ” b) 200 X X X baseline 0 smoothness Figure 2: (a) Stimuli inspired by the experiments of [15]. The habituation stimulus H shows a block passing behind a barrier and emerging on the other side. 
After habituation, a new block is added either out of the direct path of the first block (L) or directly in the path of the first block (U1 and U2). In U1, the first block leaps over the second block, and in U2 the second block hops so that the first block can pass underneath. (b) Relative probabilities of scenes L, U1 and U2 according to two models. The baseline model finds all three scenes equally likely a priori, and considers L and U2 equally likely after habituation. The smoothness model considers L more likely than the other scenes both before and after habituation.

Figure 3: (a) Stimuli inspired by the experiments of Spelke et al. [16]. (b) Model predictions. After habituation to H1, the smoothness model assigns roughly equal probabilities to L and U. After habituation to H2, the model considers L more likely. (c) A stronger test of the inertia principle. Now the best interpretation of stimulus U involves multiple changes of direction.

Looking time measurements suggested that the infants were more surprised to see the car emerge when the obstacle lay within the direct path of the car. This result is consistent with the principle of temporal smoothness, which suggests that infants expected the car to maintain a straight-line trajectory and the obstacle to remain stationary. We compared the smoothness model and the baseline model on a schematic version of this task. To model this experiment, we again assume that the infant’s perceptual system has recovered a veridical velocity field, but now we must allow for occlusion. An ideal observer approach that treats a two-dimensional scene as a projection of a three-dimensional world can represent the occluder as an object in its own right. Here, however, we continue to work with a two-dimensional world, and treat the occluded parts of the scene as missing data. An ideal observer approach should integrate over all possible values of the missing data, but for computational simplicity we approximate this approach by considering only one or two high-probability interpretations of each occluded scene. We also need to account for habituation, and for cases where the habituation stimulus includes occlusion. We assume that an ideal observer computes a habituation field vH, that is, the velocity field with maximum posterior probability given the habituation stimulus. In Figure 2a, the inferred habituation field vH specifies a trajectory where the block moves smoothly from the left to the right of the scene. We now assume that the observer expects subsequent velocity fields to be similar to vH. Formally, we use a product-of-experts approach to define a post-habituation distribution on velocity fields:

pH(v) ∝ p(v) p(v | vH).   (8)

The first expert p(v) uses the prior distribution in Equation 6, and the second expert p(v | vH) assumes that field v is drawn from a Gaussian distribution centered on vH. Intuitively, after habituation to vH the second expert expects that subsequent velocity fields will be similar to vH. More information about this model of habituation is provided in the supporting material. Given these assumptions, the black and dark gray bars in Figure 2 indicate relative a priori probabilities for scenes L, U1 and U2. The baseline model considers all three scenes equally probable, but the smoothness model prefers L.
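As a concrete counterpart to the habituation model in Equation 8, the sketch below combines the smoothness prior with a Gaussian expert centered on the habituation field vH and returns relative post-habituation probabilities for a set of named test scenes. It reuses S and logsumexp from the earlier sketch; the isotropic Gaussian with bandwidth sigma is an assumption, since the exact form of the second expert is specified in the supporting material, and the treatment of occluded regions is omitted entirely.

def log_post_habituation(v, v_h, sigma=1.0):
    # Unnormalized log of Equation 8: p_H(v) ∝ p(v) p(v | vH).
    # First expert: the smoothness prior exp(S(v)), with S from the sketch above.
    # Second expert: isotropic Gaussian centered on the habituation field vH;
    # sigma is an assumed bandwidth parameter.
    log_prior = S(v)
    log_similarity = -np.sum((v - v_h) ** 2) / (2.0 * sigma ** 2)
    return log_prior + log_similarity

def relative_probabilities(scenes, v_h, sigma=1.0):
    # scenes: dict mapping names to velocity fields, e.g. {'L': vL, 'U1': vU1, 'U2': vU2}.
    # Returns probabilities normalized over the scenes under consideration.
    logs = {name: log_post_habituation(v, v_h, sigma) for name, v in scenes.items()}
    z = logsumexp(list(logs.values()))
    return {name: float(np.exp(val - z)) for name, val in logs.items()}

Setting LAMBDA_SPACE = LAMBDA_TIME = 0 removes the prior term and leaves only similarity to the habituation field, which is the post-habituation behavior of the baseline model.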
After habituation, the baseline model is still unable to account for the behavioral data, since it considers scenes L and U2 to be equally probable. The smoothness model, however, continues to prefer L. We previously mentioned three consequences of the principle of temporal smoothness: stationary objects tend to remain stationary, moving objects tend to keep moving, and moving objects tend to maintain a steady trajectory. The “car and obstacle” task addresses the first and third of these proposals, but other tasks provide support for the second. Many authors have studied settings where one moving object comes to a stop and a second object starts to move [17]. Compared to the case where the first object collides with the second, infants appear to be surprised by the “no-contact” case where the two objects never touch. This finding is consistent with the temporal smoothness principle, which predicts that infants expect the first object to continue moving until forced to stop, and expect the second object to remain stationary until forced to start. Other experiments [18] provide support for the principle of temporal smoothness, but there are also studies that appear inconsistent with this principle. In one of these studies [16], infants are initially habituated to a block that moves from one corner of an enclosure to another (H1 in Figure 3a). After habituation, infants see a block that begins from a different corner; the occluder is then removed to reveal the block either in a location consistent with a straight-line trajectory (L) or in a location that matches the final resting place during the habituation phase (U). Looking times suggest that infants aged 4–12 months are no more surprised by the inertia-violating outcome (U) than by the inertia-consistent outcome (L). The smoothness model, however, can account for this finding. The outcome in U is contrary to temporal smoothness but consistent with habituation, and the tradeoff between these factors leads the model to assign roughly the same probability to scenes L and U (Figure 3b). Only one of the inertia experiments described by Spelke et al. [16] and Spelke et al. [1] avoids this tradeoff between habituation and smoothness. This experiment considers a case where the habituation stimulus (H2 in Figure 3a) is equally similar to the two test stimuli. The results suggest that 8-month-olds are now surprised by the inertia-violating outcome, and the predictions of our model are consistent with this finding (Figure 3b). Infants aged 4 and 6 months, however, continue to look equally at the two outcomes. Note, however, that the trajectories in Figure 3 include at most one inflection point. Experiments that consider trajectories with many inflection points can provide a more powerful way of exploring whether 4-month-olds have expectations about temporal smoothness. One possible experiment is sketched in Figure 3c. The task is very similar to the task in Figure 3a, except that a barrier is added after habituation. In order for the block to end up in the same location as before, it must now follow a tortuous path around the barrier (U). Based on the principle of temporal smoothness, we predict that 4-month-olds will be more surprised to see the outcome in stimulus U than the outcome in stimulus L. This experimental design is appealing in part because previous work shows that infants are surprised by a case similar to U where the barrier extends all the way from one wall to the other [16], and our proposed experiment is a minor variant of this task.
Although there is room for debate about the status of temporal smoothness, we presented two reasons to revisit the conclusion that this principle develops relatively late. First, some version of this principle seems necessary to account for experiments like the car and obstacle experiment in Figure 2. Second, most of the inertia experiments that produced null results used a habituation stimulus that may have prevented infants from revealing their default expectations, and the one experiment that escapes this objection considers a relatively minor violation of temporal smoothness. Additional experiments are needed to explore this principle, but we predict that the inertia principle will turn out to be yet another example of knowledge that is available earlier than researchers once thought.

4 Discussion and Conclusion

We argued that characterizations of infant knowledge should include room for probabilistic expectations, and that probabilistic expectations about spatial and temporal smoothness appear to play a role in infant object perception. To support these claims we described an ideal observer model that includes both categorical principles (C1 through C5) and probabilistic principles (S1 and S2), and demonstrated that the categorical principles alone are insufficient to account for several experimental findings. Our two probabilistic principles are related to principles (rigidity and inertia) that have previously been described as categorical principles. Although rigidity and inertia appear to play a role in some early inferences, formulating these principles as probabilistic expectations helps to explain how infants deal with non-rigid motion and violations of inertia. Our analysis focused on some of the many existing experiments in the developmental literature, but new experiments will be needed to explore our probabilistic approach in depth. Categorical versions of a given principle (e.g., rigidity) allow room for only two kinds of behavior, depending on whether the principle is violated or not. Probabilistic principles can be violated to a greater or lesser extent, and our approach predicts that violations of different magnitudes may lead to different behaviors. Future studies of rigidity and inertia can consider violations of these principles that range from mild (Figure 3a) to severe (Figure 3c), and can explore whether infants respond to these violations differently. Future work should also consider whether the categorical principles we described (C1 through C5) are better characterized as probabilistic expectations. In particular, future studies can explore whether young infants consider large violations of cohesion (C5) or spatial continuity (C2) more surprising than smaller violations of these principles. Although we did not focus on learning, our approach allows us to begin thinking formally about how principles of object perception might be acquired. First, we can explore how parameters like the smoothness parameters in our model (λspace and λtime) might be tuned by experience. Second, we can use statistical model selection to explore transitions between different sets of principles. For instance, if a learner begins with the baseline model we considered (principles C1–C6), we can explore which subsequent observations provide the strongest statistical evidence for smoothness principles S1 and S2, and how much of this evidence is required before an ideal learner would prefer our smoothness model over the baseline model.
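The model-selection idea in the previous paragraph can be sketched in the same style. The fragment below compares the baseline model (λspace = λtime = 0) and the smoothness model by the log probability they assign to a collection of observed velocity fields, with each model's prior normalized over a finite candidate space. The candidate space, the reuse of logsumexp from the earlier sketch, and the quadratic smoothness score are illustrative assumptions rather than a committed account of learning.

def S_with(v, lam_space, lam_time):
    # Smoothness score with explicit penalty weights; lam_space = lam_time = 0
    # recovers the baseline model (a uniform prior over the candidate space).
    d_space = np.diff(v, axis=1)
    d_time = np.diff(v, axis=0)
    return -(lam_space * np.sum(d_space ** 2) + lam_time * np.sum(d_time ** 2))

def log_evidence(observed_fields, candidate_fields, lam_space, lam_time):
    # Sum of log probabilities of the observed fields under a prior proportional
    # to exp(S_with(.)), normalized over the finite candidate space.
    log_z = logsumexp([S_with(u, lam_space, lam_time) for u in candidate_fields])
    return sum(S_with(v, lam_space, lam_time) - log_z for v in observed_fields)

def prefers_smoothness(observed_fields, candidate_fields,
                       lam_space=10000.0, lam_time=250.0):
    # True once the observations provide more evidence for the smoothness
    # model than for the baseline model.
    smooth = log_evidence(observed_fields, candidate_fields, lam_space, lam_time)
    baseline = log_evidence(observed_fields, candidate_fields, 0.0, 0.0)
    return smooth > baseline

Tracking how many smoothly moving scenes must be observed before prefers_smoothness returns True gives one way to quantify how much evidence an ideal learner would require before adopting principles S1 and S2.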
It is not yet clear which principles of object perception could be learned, but the ideal observer approach can help to resolve this question.

References

[1] E. S. Spelke, K. Breinlinger, J. Macomber, and K. Jacobson. Origins of knowledge. Psychological Review, 99:605–632, 1992.
[2] R. Baillargeon, L. Kotovsky, and A. Needham. The acquisition of physical knowledge in infancy. In D. Sperber, D. Premack, and A. J. Premack, editors, Causal Cognition: A multidisciplinary debate, pages 79–116. Clarendon Press, Oxford, 1995.
[3] E. S. Spelke. Principles of object perception. Cognitive Science, 14:29–56, 1990.
[4] E. Spelke. Initial knowledge: six suggestions. Cognition, 50:431–445, 1994.
[5] D. Mareschal and S. P. Johnson. Learning to perceive object unity: a connectionist account. Developmental Science, 5:151–172, 2002.
[6] D. Kersten and A. Yuille. Bayesian models of object perception. Current Opinion in Neurobiology, 13:150–158, 2003.
[7] Y. Weiss and E. H. Adelson. Slow and smooth: a Bayesian theory for the combination of local motion signals in human vision. Technical Report A.I. Memo No. 1624, MIT, 1998.
[8] A. L. Yuille and N. M. Grzywacz. A mathematical analysis of the motion coherence theory. International Journal of Computer Vision, 3:155–175, 1989.
[9] W. S. Geisler. Physical limits of acuity and hyperacuity. Journal of the Optical Society of America, 1(7):775–782, 1984.
[10] M. S. Banks and E. Shannon. Spatial and chromatic visual efficiency in human neonates. In Visual perception and cognition in infancy, pages 1–46. Lawrence Erlbaum Associates, Hillsdale, NJ, 1993.
[11] R. Baillargeon. Innate ideas revisited: for a principle of persistence in infants’ physical reasoning. Perspectives on Psychological Science, 3(3):2–13, 2008.
[12] R. Kestenbaum, N. Termine, and E. S. Spelke. Perception of objects and object boundaries by three-month-old infants. British Journal of Developmental Psychology, 5:367–383, 1987.
[13] E. S. Spelke, C. von Hofsten, and R. Kestenbaum. Object perception and object-directed reaching in infancy: interaction of spatial and kinetic information for object boundaries. Developmental Psychology, 25:185–196, 1989.
[14] G. Huntley-Fenner, S. Carey, and A. Solimando. Objects are individuals but stuff doesn’t count: perceived rigidity and cohesiveness influence infants’ representations of small groups of discrete entities. Cognition, 85:203–221, 2002.
[15] R. Baillargeon and J. DeVos. Object permanence in young infants: further evidence. Child Development, 61(6):1227–1246, 1991.
[16] E. S. Spelke, G. Katz, S. E. Purcell, S. M. Ehrlich, and K. Breinlinger. Early knowledge of object motion: continuity and inertia. Cognition, 51:131–176, 1994.
[17] L. Kotovsky and R. Baillargeon. Reasoning about collisions involving inert objects in 7.5-month-old infants. Developmental Science, 3(3):344–359, 2000.
[18] T. Wilcox and A. Schweinle. Infants’ use of speed information to individuate objects in occlusion events. Infant Behavior and Development, 26:253–282, 2003.
same-paper 2 0.84632236 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions
Author: Sundeep Rangan, Vivek Goyal, Alyson K. Fletcher
Abstract: This paper addresses the problem of sparsity pattern detection for unknown ksparse n-dimensional signals observed through m noisy, random linear measurements. Sparsity pattern recovery arises in a number of settings including statistical model selection, pattern detection, and image acquisition. The main results in this paper are necessary and sufficient conditions for asymptotically-reliable sparsity pattern recovery in terms of the dimensions m, n and k as well as the signal-tonoise ratio (SNR) and the minimum-to-average ratio (MAR) of the nonzero entries of the signal. We show that m > 2k log(n − k)/(SNR · MAR) is necessary for any algorithm to succeed, regardless of complexity; this matches a previous sufficient condition for maximum likelihood estimation within a constant factor under certain scalings of k, SNR and MAR with n. We also show a sufficient condition for a computationally-trivial thresholding algorithm that is larger than the previous expression by only a factor of 4(1 + SNR) and larger than the requirement for lasso by only a factor of 4/MAR. This provides insight on the precise value and limitations of convex programming-based algorithms.
3 0.7670269 171 nips-2008-Online Prediction on Large Diameter Graphs
Author: Mark Herbster, Guy Lever, Massimiliano Pontil
Abstract: We continue our study of online prediction of the labelling of a graph. We show a fundamental limitation of Laplacian-based algorithms: if the graph has a large diameter then the number of mistakes made by such algorithms may be proportional to the square root of the number of vertices, even when tackling simple problems. We overcome this drawback by means of an efficient algorithm which achieves a logarithmic mistake bound. It is based on the notion of a spine, a path graph which provides a linear embedding of the original graph. In practice, graphs may exhibit cluster structure; thus in the last part, we present a modified algorithm which achieves the “best of both worlds”: it performs well locally in the presence of cluster structure, and globally on large diameter graphs. 1
4 0.76416159 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
Author: Dotan D. Castro, Dmitry Volkinshtein, Ron Meir
Abstract: Actor-critic algorithms for reinforcement learning are achieving renewed popularity due to their good convergence properties in situations where other approaches often fail (e.g., when function approximation is involved). Interestingly, there is growing evidence that actor-critic approaches based on phasic dopamine signals play a key role in biological learning through cortical and basal ganglia loops. We derive a temporal difference based actor critic learning algorithm, for which convergence can be proved without assuming widely separated time scales for the actor and the critic. The approach is demonstrated by applying it to networks of spiking neurons. The established relation between phasic dopamine and the temporal difference signal lends support to the biological relevance of such algorithms. 1
5 0.63688523 132 nips-2008-Measures of Clustering Quality: A Working Set of Axioms for Clustering
Author: Shai Ben-David, Margareta Ackerman
Abstract: Aiming towards the development of a general clustering theory, we discuss abstract axiomatization for clustering. In this respect, we follow up on the work of Kleinberg, ([1]) that showed an impossibility result for such axiomatization. We argue that an impossibility result is not an inherent feature of clustering, but rather, to a large extent, it is an artifact of the specific formalism used in [1]. As opposed to previous work focusing on clustering functions, we propose to address clustering quality measures as the object to be axiomatized. We show that principles like those formulated in Kleinberg’s axioms can be readily expressed in the latter framework without leading to inconsistency. A clustering-quality measure (CQM) is a function that, given a data set and its partition into clusters, returns a non-negative real number representing how strong or conclusive the clustering is. We analyze what clustering-quality measures should look like and introduce a set of requirements (axioms) for such measures. Our axioms capture the principles expressed by Kleinberg’s axioms while retaining consistency. We propose several natural clustering quality measures, all satisfying the proposed axioms. In addition, we analyze the computational complexity of evaluating the quality of a given clustering and show that, for the proposed CQMs, it can be computed in polynomial time. 1
6 0.53825635 179 nips-2008-Phase transitions for high-dimensional joint support recovery
7 0.51892316 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
8 0.50899595 84 nips-2008-Fast Prediction on a Tree
9 0.50073183 202 nips-2008-Robust Regression and Lasso
10 0.49970788 99 nips-2008-High-dimensional support union recovery in multivariate regression
11 0.48920035 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
12 0.48917696 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
13 0.48700157 75 nips-2008-Estimating vector fields using sparse basis field expansions
14 0.48574746 136 nips-2008-Model selection and velocity estimation using novel priors for motion patterns
15 0.48077101 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
16 0.47854072 10 nips-2008-A rational model of preference learning and choice prediction by children
17 0.4778856 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
18 0.47781551 66 nips-2008-Dynamic visual attention: searching for coding length increments
19 0.47685111 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
20 0.47457361 245 nips-2008-Unlabeled data: Now it helps, now it doesn't