nips nips2007 nips2007-8 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David P. Wipf, Srikantan S. Nagarajan
Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted ℓ1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation.
Reference: text
sentIndex sentText sentNum sentScore
1 Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. [sent-4, score-0.484]
2 Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). [sent-6, score-0.213]
3 This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. [sent-9, score-0.263]
4 The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). [sent-11, score-0.391]
5 We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. [sent-13, score-0.284]
6 Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. [sent-14, score-0.492]
7 In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. [sent-15, score-0.321]
8 When large numbers of features are present relative to the signal dimension, the estimation problem is fundamentally ill-posed. [sent-17, score-0.113]
9 Automatic relevance determination (ARD) addresses this problem by regularizing the solution space using a parameterized, data-dependent prior distribution that effectively prunes away redundant or superfluous features [10]. [sent-18, score-0.307]
10 Here we will describe a special case of ARD called sparse Bayesian learning (SBL) that has been very successful in a variety of applications [15]. [sent-19, score-0.229]
11 The basic ARD prior incorporated by SBL is p(x; γ) = N(x; 0, diag[γ]), where γ ∈ R^m_+ is a vector of m non-negative hyperparameters governing the prior variance of each unknown coefficient. [sent-21, score-0.208]
12 Mathematically, this is equivalent to minimizing L(γ) ≜ − log ∫ p(y|x) p(x; γ) dx = − log p(y; γ) ≡ log|Σ_y| + y^T Σ_y^{-1} y. (This research was supported by NIH grants R01DC04855 and R01DC006435.) [sent-23, score-0.15]
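As a quick numerical illustration (a sketch of my own, not code from the paper): assuming the standard SBL form Σ_y = λI + Φ diag[γ] Φ^T, the type-II cost above can be evaluated directly. All function and variable names below are illustrative.

```python
import numpy as np

def ard_cost(y, Phi, gamma, lam):
    """Type-II ARD cost L(gamma) = log|Sigma_y| + y^T Sigma_y^{-1} y,
    assuming Sigma_y = lam*I + Phi diag(gamma) Phi^T (standard SBL form)."""
    n = Phi.shape[0]
    Sigma_y = lam * np.eye(n) + Phi @ (gamma[:, None] * Phi.T)  # Phi diag(gamma) Phi^T
    _, logdet = np.linalg.slogdet(Sigma_y)                      # stable log-determinant
    return logdet + y @ np.linalg.solve(Sigma_y, y)
```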
13 However, neither the EM nor MacKay updates are guaranteed to converge to a local minimum or even a saddle point of L(γ); both have fixed points whenever a γ_i = 0, whether at a minimizing solution or not. [sent-40, score-0.4]
14 Finally, a third algorithm has recently been proposed that optimally updates a single hyperparameter γi at a time, which can be done very efficiently in closed form [16]. [sent-41, score-0.126]
15 While extremely fast to implement, as a greedy-like method it can sometimes be more prone to becoming trapped in local minima when the number of features is large. [sent-42, score-0.359]
16 Additionally, none of these methods are easily extended to more general problems such as non-negative sparse coding, covariance component estimation, and classification without introducing additional approximations. [sent-45, score-0.183]
17 A second issue pertaining to the ARD model involves its connection with more traditional maximum a posteriori (MAP) estimation methods for extracting sparse, relevant features using fixed, sparsity-promoting prior distributions. [sent-46, score-0.324]
18 Presently, it is unclear how ARD, which invokes a parameterized prior and transfers the estimation problem to hyperparameter space, relates to MAP approaches which operate directly in x space. [sent-49, score-0.174]
19 This paper introduces an alternative formulation of the ARD cost function using auxiliary functions that naturally addresses the above issues. [sent-51, score-0.263]
20 The result is an efficient algorithm that can be implemented using standard convex programming methods and is guaranteed to converge to a local minimum (or saddle point) of L(γ). [sent-53, score-0.342]
21 We then show that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. [sent-55, score-0.284]
22 Additionally, these results suggest modifications of ARD for selecting relevant features and promoting sparse solutions in a variety of general situations. [sent-56, score-0.37]
23 In particular, the methodology readily extends to handle problems involving non-negative sparse coding, covariance component estimation, and classification as discussed in Section 4. [sent-57, score-0.321]
24 2 ARD/SBL Optimization via Iterative Re-Weighted Minimum ℓ1: In this section we re-express L(γ) using auxiliary functions, which leads to an alternative update procedure that circumvents the limitations of current approaches. [sent-58, score-0.182]
25 In fact, a wide variety of alternative update rules can be derived by decoupling L(γ) using upper bounding functions that are more conveniently optimized. [sent-59, score-0.151]
26 The utility of this selection is that many powerful convex programming toolboxes have already been developed for solving these types of problems, especially when structured dictionaries Φ are being used. [sent-61, score-0.115]
27 This leads to the following upper-bounding auxiliary cost function: L(γ, z) ≜ z^T γ − g^*(z) + y^T Σ_y^{-1} y ≥ L(γ). (9) [sent-66, score-0.145]
28 The objective function in (11) can be minimized by solving the weighted convex ℓ1-regularized cost function x^* = arg min_x ||y − Φx||_2^2 + 2λ ∑_i z_i^{1/2} |x_i| (12) and then setting γ_i → z_i^{-1/2} |x^*_i| for all i (note that each z_i will always be positive). [sent-85, score-0.422]
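A hedged sketch of this inner step (not the authors' implementation): the weighted ℓ1 subproblem (12) is a standard convex program, and one simple way to solve it is plain proximal-gradient (ISTA) iteration, as below. Here `w` stands in for the weights z_i^{1/2}; the function name, step-size choice, and iteration count are my own.

```python
import numpy as np

def weighted_lasso_ista(y, Phi, w, lam, n_iter=3000):
    """Minimize ||y - Phi x||_2^2 + 2*lam * sum_i w_i*|x_i| by proximal gradient (ISTA)."""
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2 + 1e-12)   # 1 / Lipschitz constant of the smooth part
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        x = x - step * (2.0 * Phi.T @ (Phi @ x - y))                          # gradient step
        x = np.sign(x) * np.maximum(np.abs(x) - 2.0 * lam * step * w, 0.0)    # soft-threshold (prox of weighted l1)
    return x
```

Any off-the-shelf convex toolbox could replace this inner solver; ISTA is used here only to keep the sketch dependency-free.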
29 Combined with the identity y^T Σ_y^{-1} y = min_x [ (1/λ)||y − Φx||_2^2 + ∑_i x_i^2/γ_i ] (13), this leads to an upper-bounding auxiliary function for L_z(γ) given by L_z(γ, x) ≜ ∑_i z_i γ_i + ∑_i x_i^2/γ_i + (1/λ)||y − Φx||_2^2 ≥ L_z(γ), (14) which is jointly convex in x and γ (see Example 3. [sent-87, score-0.182]
30 When solved for x, the global minimum of (14) yields the global minimum of (11) via the stated transformation. [sent-91, score-0.354]
31 We allow A(·) to be a point-to-set mapping to handle the case where the global minimum of (11) is not unique, which could occur, for example, if two columns of Φ are identical. [sent-100, score-0.219]
32 From any initialization point γ^(0) ∈ R^m_+, the sequence of hyperparameter estimates {γ^(k)} generated via γ^(k+1) ∈ A(γ^(k)) is guaranteed to converge monotonically to a local minimum (or saddle point) of (2). [sent-102, score-0.341]
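Putting the pieces together, here is a sketch of the outer iteration as I read it from the description above (an assumption-laden reconstruction, not the paper's reference code): each pass uses the tangent-point choice z_i = φ_i^T Σ_y^{-1} φ_i (the gradient of log|Σ_y| with respect to γ_i — an assumption here, since that step is not reproduced in this summary), solves (12) with the weighted_lasso_ista sketch given earlier, and maps x^* back to γ via γ_i = z_i^{-1/2}|x^*_i|.

```python
import numpy as np

def ard_reweighted_l1(y, Phi, lam, n_outer=20):
    """Iterative re-weighted l1 minimization of L(gamma); returns (gamma, x)."""
    n, m = Phi.shape
    gamma = np.ones(m)
    x = np.zeros(m)
    for _ in range(n_outer):
        Sigma_y = lam * np.eye(n) + Phi @ (gamma[:, None] * Phi.T)
        # z_i = phi_i^T Sigma_y^{-1} phi_i (assumed tangent-point weights; always positive)
        z = np.einsum('ij,ji->i', Phi.T, np.linalg.solve(Sigma_y, Phi))
        x = weighted_lasso_ista(y, Phi, np.sqrt(z), lam)   # inner step (12), sketch above
        gamma = np.abs(x) / np.sqrt(z)                     # gamma_i = z_i^{-1/2} |x*_i|
    return gamma, x

# illustrative usage: gamma, x = ard_reweighted_l1(y, Phi, lam=1e-2)
```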
33 At any non-minimizing γ′, the auxiliary cost function L_z(γ) obtained from Step 3 will be strictly tangent to L(γ) at γ′. [sent-108, score-0.186]
34 It will therefore necessarily have a minimum elsewhere, since the slope at γ′ is nonzero by definition. [sent-109, score-0.186]
35 Moreover, because the log | · | function is strictly concave, at this minimum the actual cost function will be reduced still further. [sent-110, score-0.254]
36 The algorithm could of course theoretically converge to a saddle point, but this is rare and any minimal perturbation leads to escape. [sent-114, score-0.12]
37 Both EM and MacKay updates provably fail to satisfy one or more of the above criteria and so global convergence cannot be guaranteed. [sent-115, score-0.196]
38 In contrast, the MacKay updates do not even guarantee cost function decrease. [sent-118, score-0.119]
39 Consequently, both methods can become trapped at a solution such as γ = 0; a fixed point of the updates but not a stationary point or local minimum of L(γ). [sent-119, score-0.313]
40 Finally, the fast Tipping updates could potentially satisfy the conditions for global convergence, although this matter is not discussed in [16]. [sent-122, score-0.133]
41 Then the ARD coefficients from (3) solve the MAP problem x_ARD = arg min_x ||y − Φx||_2^2 + λ h^*(x^2), (15) where h^*(x^2) is the concave conjugate of h(γ^{-1}), expressed as a function of x. [sent-136, score-0.154]
42 Omitting some details for the sake of brevity, using (13) we can create a strict upper-bounding auxiliary function on L(γ): L(γ, x) = (1/λ)||y − Φx||_2^2 + ∑_i x_i^2/γ_i + log|Σ_y|. [sent-138, score-0.135]
43 The regularization term in (15), and hence the implicit prior distribution on x given by p(x) ∝ exp[−(1/2) h^*(x^2)], is not generally factorable, meaning p(x) ≠ ∏_i p_i(x_i). [sent-143, score-0.248]
44 This prior is also explicitly dependent on both the dictionary Φ and the regularization term λ. [sent-147, score-0.146]
45 The only exception occurs when Φ^T Φ = I; here h^*(x^2) factors and can be expressed in closed form independently of Φ, although the λ dependency remains. [sent-149, score-0.127]
46 3.1 Properties of the implicit ARD prior: To begin at the most superficial level, the Φ dependency of the ARD prior leads to scale-invariant solutions, meaning the value of x_ARD is not affected if we rescale Φ (i.e., replace Φ by ΦD for some rescaling D). [sent-151, score-0.37]
47 Rather, any rescaling D only affects the implicit initialization of the algorithm, not the shape of the cost function. [sent-154, score-0.162]
48 More significantly, the ARD prior is particularly well-designed for finding sparse solutions. [sent-155, score-0.287]
49 We should note that concave, non-decreasing regularization functions are well-known to encourage sparse representations. [sent-156, score-0.225]
50 However, when selecting highly sparse subsets of features, the factorial ℓ0 quasi-norm is often invoked as the ideal regularization term given unlimited computational resources. [sent-158, score-0.4]
51 By applying an exp[−(1/2)(·)] transformation, we obtain the implicit (improper) prior distribution. [sent-160, score-0.206]
52 The difficulty here is that (17) is nearly impossible to solve in general; it is NP-hard owing to a combinatorial number of local minima, and so the traditional idea is to replace ||·||_0 with a tractable approximation. [sent-162, score-0.286]
53 For this purpose, the ℓ1 norm is the optimal or tightest convex relaxation of the ℓ0 quasi-norm, and therefore it is commonly used, leading to the Lasso algorithm [14]. [sent-163, score-0.112]
54 We demonstrate that the non-factorable, λ-dependent h^*(x^2) provides a tighter, albeit non-convex, approximation that promotes greater sparsity than ||x||_1 while conveniently producing many fewer local minima than using ||x||_0 directly. [sent-167, score-0.384]
55 We also show that, in certain settings, no λ-independent, factorial regularization term can achieve similar results. [sent-168, score-0.179]
56 For example, the ℓp quasi-norm ||x||_p^p ≜ ∑_i |x_i|^p with p < 1 [2], or the Gaussian entropy measure ∑_i log|x_i| based on the Jeffreys prior [4], provably fail in this regard. [sent-171, score-0.186]
57 3.2 Benefits of λ dependency: To explore the properties of h^*(x^2) regarding λ dependency alone, we adopt the simplifying assumption Φ^T Φ = I. [sent-173, score-0.152]
58 In this special case, h^*(x^2) is factorable and can be expressed in closed form via h^*(x^2) ∝ ∑_i [ 2|x_i| / ( |x_i| + √(x_i^2 + 4λ) ) + log( 2λ + x_i^2 + |x_i| √(x_i^2 + 4λ) ) ], (18) which is independent of Φ. [sent-175, score-0.19]
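A small sketch based on the reconstruction of (18) above (so the exact constants should be treated as an assumption, and the function name is mine):

```python
import numpy as np

def ard_penalty_orthonormal(x, lam):
    """Reconstructed closed-form penalty h*(x^2) for Phi^T Phi = I (up to an additive constant)."""
    r = np.sqrt(x**2 + 4.0 * lam)
    return np.sum(2.0 * np.abs(x) / (np.abs(x) + r) + np.log(2.0 * lam + x**2 + np.abs(x) * r))

# evaluate the penalty on a grid of scalar coefficients
xs = np.linspace(-3.0, 3.0, 7)
print([round(float(ard_penalty_orthonormal(np.array([v]), lam=0.1)), 3) for v in xs])
```

For small λ the values behave like a scaled 0/1 indicator of x_i ≠ 0, consistent with the λ → 0 limit discussed next.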
59 In the limit as λ → 0, h^*(x^2) converges to a scaled version of the ℓ0 quasi-norm, yet no local minima exist because the likelihood in this case only permits a single feasible solution with x = Φ^T y. [sent-183, score-0.318]
60 In contrast, when λ is large, the likelihood is less constrained and a looser prior is required to avoid local minima troubles, which will arise whenever the now relatively diffuse likelihood intersects the sharp spines of a highly sparse prior. [sent-184, score-0.717]
61 The implicit ARD prior naturally handles this transition, becoming sparser as λ decreases and vice versa. [sent-186, score-0.271]
62 When Φ^T Φ = I, (15) has no local minima whereas (17) has 2^m local minima. [sent-188, score-0.342]
63 Use of the ℓ1 norm in place of h^*(x^2) also yields no local minima; however, it is a much looser approximation of ℓ0 and penalizes coefficients linearly, unlike h^*(x^2). [sent-189, score-0.162]
64 As a final point of comparison, the actual weight estimate obtained from solving (15) when Φ^T Φ = I is equivalent to the non-negative garrote estimator that has been advocated for wavelet shrinkage [5, 18]. [sent-191, score-0.177]
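For reference, a sketch of the non-negative garrote in the orthonormal-design case as it is usually stated in the wavelet-shrinkage literature (the exact form used in [5, 18] is not reproduced in this summary, so treat this as an assumption):

```python
import numpy as np

def garrote_orthonormal(y, Phi, lam):
    """Non-negative garrote for Phi^T Phi = I: shrink each least-squares
    coefficient x_hat_i by the factor max(0, 1 - lam / x_hat_i^2)."""
    x_hat = Phi.T @ y                                            # least-squares coefficients when Phi^T Phi = I
    return x_hat * np.maximum(0.0, 1.0 - lam / (x_hat**2 + 1e-12))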
65 [Figure 1 residue: left-panel legend entries x_i, − log p(x_i), and the indicator penalty I[x_i ≠ 0]; axis ticks omitted.] [sent-194, score-0.19]
66 Figure 1: Left: 1D example of the implicit ARD prior (the maximally sparse solution is marked on the α axis of the right panel). [sent-215, score-0.48]
67 Right: Plot of the ARD prior across the feasible region as parameterized by α. [sent-217, score-0.164]
68 A factorial prior given by − log p(x) ∝ ∑_i |x_i|^p, with p close to zero, is also shown. [sent-218, score-0.291]
69 Both approximations to the 0 norm retain the correct global minimum, but only ARD smooths out local minima. [sent-220, score-0.239]
70 3.3 Benefits of a non-factorial prior: In contrast, the benefits of the typically non-factorial nature of h^*(x^2) are most pronounced when m > n, meaning there are more features than the dimension of the signal y. [sent-222, score-0.179]
71 In this limiting situation, the canonical sparse MAP estimation problem (17) reduces to finding x_0 ≜ arg min_x ||x||_0 subject to y = Φx. (19) [sent-224, score-0.292]
72 By simple extension of results in [18], the global minimum of (15) in the limit as λ → 0 will equal x_0, assuming the latter is unique. [sent-227, score-0.177]
73 The real distinction then is regarding the number of local minima. [sent-228, score-0.118]
74 In this capacity the ARD MAP problem is superior to any possible factorial variant (Theorem 3): [sent-229, score-0.137]
75 In the limit as λ → 0 and assuming m > n, no factorial prior p(x) = ∏_i exp[−(1/2) f_i(x_i)] exists such that the corresponding MAP problem min_x ||y − Φx||_2^2 + λ ∑_i f_i(x_i) is: (i) always globally minimized by a maximally sparse solution x_0 and, (ii) has fewer local minima than when solving (15). [sent-230, score-0.959]
76 First, for any factorial prior and associated regularization term ∑_i f_i(x_i), the only way to satisfy (i) is if ∂f_i(x_i)/∂x_i → ∞ as x_i → 0. [sent-232, score-0.43]
77 It is then straightforward to show that any f_i(x_i) with this property will necessarily have between (m−1 choose n) + 1 and (m choose n) local minima. [sent-234, score-0.126]
78 Using results from [18], this is provably an upper bound on the number of local minima of (15). [sent-235, score-0.221]
79 Moreover, with the exception of very contrived situations, the number of ARD local minima will be considerably less. [sent-236, score-0.256]
80 In general, this result speaks directly to the potential limitations of restricting oneself to factorial priors when maximal feature pruning is paramount. [sent-237, score-0.218]
81 While generally difficult to visualize, in restricted situations it is possible to explicitly illustrate the type of smoothing over local minima that is possible using non-factorial priors. [sent-238, score-0.256]
82 Consequently, any feasible solution to y = Φx can be expressed as x = x̄ + αv, where v ∈ Null(Φ), α is any real-valued scalar, and x̄ is any fixed, feasible solution. [sent-240, score-0.216]
83 We can now plot any prior distribution p(x), or equivalently − log p(x), over the 1D feasible region of x space as a function of α to view the local minima profile. [sent-243, score-0.47]
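A sketch of how such a profile could be generated (the toy data and all names below are my own; the implicit ARD penalty is evaluated by numerically minimizing the variational bound ∑_i x_i^2/γ_i + log|Σ_y| over γ, following the upper-bounding function quoted earlier, which is an assumption about how one would reproduce the plot):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import null_space, lstsq

def ard_implicit_penalty(x, Phi, lam):
    """h*(x^2) evaluated as min_gamma sum_i x_i^2/gamma_i + log|lam*I + Phi diag(gamma) Phi^T|."""
    n, m = Phi.shape
    def bound(log_gamma):                       # optimize in log-space to keep gamma positive
        gamma = np.exp(log_gamma)
        Sigma_y = lam * np.eye(n) + Phi @ (gamma[:, None] * Phi.T)
        return np.sum(x**2 / gamma) + np.linalg.slogdet(Sigma_y)[1]
    return minimize(bound, np.zeros(m), method='L-BFGS-B').fun

rng = np.random.default_rng(0)
Phi = rng.standard_normal((3, 4))               # m = n + 1 gives a 1-D null space
x_sparse = np.array([1.0, 0.0, 0.0, 0.0])       # maximally sparse generating solution
y = Phi @ x_sparse
v = null_space(Phi)[:, 0]                       # spans Null(Phi)
x_bar = lstsq(Phi, y)[0]                        # one fixed feasible solution
alphas = np.linspace(-3.0, 3.0, 61)
profile = [ard_implicit_penalty(x_bar + a * v, Phi, lam=1e-3) for a in alphas]
```

Plotting `profile` against `alphas` gives a 1D view of − log p(x) over the feasible set, analogous to the right panel of Figure 1.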
84 Figure 1 (right) displays the plots of two example priors in the feasible region of y = Φx: (i) the non-factorial implicit ARD prior, and (ii) the prior p(x) ∝ exp(−(1/2) ∑_i |x_i|^p) with p close to zero. [sent-246, score-0.312]
85 The latter is a factorial prior which converges to the ideal sparsity penalty when p → 0. [sent-248, score-0.29]
86 From the figure, we observe that, while both priors peak at x_0, the ARD prior has substantially smoothed away local minima. [sent-249, score-0.236]
87 But much of the analysis can be extended to handle a variety of alternative data likelihoods and priors. [sent-252, score-0.119]
88 The assumed likelihood model and prior are p(Y|X) ∝ exp[ −(1/(2λ)) ||Y − ΦX||_F^2 ] and p(X) ∝ exp[ −(1/2) trace(X^T Σ_x^{-1} X) ], with Σ_x ≜ ∑_{i=1}^{d_γ} γ_i C_i. (20) [sent-255, score-0.141]
89 Here each of the d_γ matrices C_i is a known covariance component, and the irrelevant ones are pruned by minimizing the analogous type-II likelihood function L(γ) = log|λI + ΦΣ_x Φ^T| + (1/t) trace[ Y Y^T (λI + ΦΣ_x Φ^T)^{-1} ]. [sent-256, score-0.118]
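A hedged sketch of this cost under the reconstruction above (in particular, using the empirical data covariance (1/t) Y Y^T as the data term is an assumption here, as is every name below):

```python
import numpy as np

def covariance_component_cost(Y, Phi, gamma, C_list, lam):
    """Type-II cost log|Sigma_y| + (1/t) trace(Y Y^T Sigma_y^{-1}),
    with Sigma_y = lam*I + Phi (sum_i gamma_i C_i) Phi^T."""
    n, t = Y.shape
    Sigma_x = sum(g * C for g, C in zip(gamma, C_list))     # weighted sum of known covariance components
    Sigma_y = lam * np.eye(n) + Phi @ Sigma_x @ Phi.T
    _, logdet = np.linalg.slogdet(Sigma_y)
    return logdet + np.trace(np.linalg.solve(Sigma_y, Y @ Y.T)) / t
```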
90 5 Discussion: While ARD-based approaches have enjoyed remarkable success in a number of disparate fields, they remain hampered to some degree by implementational limitations and a lack of clarity regarding the nature of the cost function and existing update rules. [sent-265, score-0.194]
91 This paper addresses these issues by presenting a principled alternative algorithm based on auxiliary functions and a dual representation of the ARD objective. [sent-266, score-0.169]
92 This produces a highly efficient, global competition among features that is potentially superior to the sequential (greedy) updates of [16] in terms of local minima avoidance in certain cases when Φ is highly overcomplete (i.e., m ≫ n). [sent-270, score-0.428]
93 The analysis used in deriving this algorithm reveals that ARD is exactly equivalent to performing MAP estimation in x space using a principled, sparsity-inducing prior that is non-factorable and dependent on both the feature set and noise parameter. [sent-277, score-0.142]
94 We have shown that these qualities allow it to promote maximally sparse solutions at the global minimum while retaining drastically fewer local minima than competing priors. [sent-278, score-0.672]
95 This might possibly explain the superior performance of ARD/SBL over Lasso in a variety of disparate disciplines where sparsity is crucial [11, 12, 18]. [sent-279, score-0.131]
96 These ideas raise a key question: If we do not limit ourselves to factorable, Φ- and λ-independent regularization terms/priors as is commonly done, then what is the optimal prior p(x) in the context of feature selection? [sent-280, score-0.146]
97 Asgharzadeh, “Wavelet footprints and sparse Bayesian learning for DNA copy number change analysis,” Int. [sent-355, score-0.183]
98 Lemos, “Selecting landmark points for sparse manifold learning,” Advances in Neural Information Processing Systems 18, pp. [sent-367, score-0.183]
99 Faul, “Fast marginal likelihood maximisation for sparse Bayesian models,” Ninth Int. [sent-383, score-0.22]
100 Baraniuk, “Recovery of jointly sparse signals from a few random projections,” Advances in Neural Information Processing Systems 18, pp. [sent-394, score-0.183]
wordName wordTfidf (topN-words)
[('ard', 0.684), ('sparse', 0.183), ('minima', 0.17), ('lz', 0.147), ('factorial', 0.137), ('mackay', 0.136), ('lasso', 0.135), ('xard', 0.122), ('xi', 0.107), ('prior', 0.104), ('minimum', 0.103), ('implicit', 0.102), ('sbl', 0.098), ('local', 0.086), ('auxiliary', 0.085), ('concave', 0.083), ('saddle', 0.081), ('map', 0.075), ('global', 0.074), ('factorable', 0.073), ('promoting', 0.064), ('spines', 0.064), ('zi', 0.064), ('methodology', 0.062), ('cost', 0.06), ('dependency', 0.06), ('feasible', 0.06), ('updates', 0.059), ('shrinkage', 0.056), ('maximally', 0.056), ('addresses', 0.053), ('coef', 0.052), ('tipping', 0.051), ('log', 0.05), ('nonzero', 0.05), ('sparsity', 0.049), ('jeffreys', 0.049), ('toolboxes', 0.049), ('wipf', 0.049), ('em', 0.049), ('cients', 0.048), ('priors', 0.046), ('variety', 0.046), ('globally', 0.045), ('wavelet', 0.045), ('conveniently', 0.043), ('garrote', 0.043), ('hyperprior', 0.043), ('neuroimaging', 0.043), ('psfrag', 0.043), ('regularization', 0.042), ('handle', 0.042), ('strictly', 0.041), ('relevance', 0.04), ('fi', 0.04), ('norm', 0.04), ('min', 0.04), ('determination', 0.039), ('smooths', 0.039), ('tightest', 0.039), ('converge', 0.039), ('features', 0.039), ('selecting', 0.038), ('additionally', 0.038), ('estimation', 0.038), ('likelihood', 0.037), ('lemma', 0.037), ('disparate', 0.036), ('looser', 0.036), ('promotes', 0.036), ('signal', 0.036), ('limitations', 0.035), ('closed', 0.035), ('ts', 0.035), ('readily', 0.034), ('naturally', 0.034), ('bayesian', 0.033), ('convex', 0.033), ('minimized', 0.033), ('solving', 0.033), ('ii', 0.033), ('reformulation', 0.033), ('replacements', 0.033), ('slope', 0.033), ('trapped', 0.033), ('regarding', 0.032), ('desirable', 0.032), ('hyperparameter', 0.032), ('hyperparameters', 0.032), ('provably', 0.032), ('expressed', 0.032), ('solution', 0.032), ('arg', 0.031), ('convergence', 0.031), ('becoming', 0.031), ('pruned', 0.031), ('consequently', 0.031), ('alternative', 0.031), ('update', 0.031), ('traditional', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 8 nips-2007-A New View of Automatic Relevance Determination
Author: David P. Wipf, Srikantan S. Nagarajan
Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted 1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. 1
2 0.18283783 96 nips-2007-Heterogeneous Component Analysis
Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii
Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1
3 0.18254393 145 nips-2007-On Sparsity and Overcompleteness in Image Models
Author: Pietro Berkes, Richard Turner, Maneesh Sahani
Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1
4 0.12701955 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.
5 0.10150715 53 nips-2007-Compressed Regression
Author: Shuheng Zhou, Larry Wasserman, John D. Lafferty
Abstract: Recent research has studied the role of sparsity in high dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. In this paper we study a variant of this problem where the original n input variables are compressed by a random linear transformation to m n examples in p dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for 1 -regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence.” In addition, we show that 1 -regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence.” Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the rate of information communicated between the compressed and uncompressed data that decay to zero. 1
6 0.096251667 43 nips-2007-Catching Change-points with Lasso
7 0.088655353 63 nips-2007-Convex Relaxations of Latent Variable Training
8 0.086140968 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
9 0.080821335 99 nips-2007-Hierarchical Penalization
10 0.075919583 61 nips-2007-Convex Clustering with Exemplar-Based Models
11 0.071430095 179 nips-2007-SpAM: Sparse Additive Models
12 0.069797359 182 nips-2007-Sparse deep belief net model for visual area V2
13 0.065605611 195 nips-2007-The Generalized FITC Approximation
14 0.064317286 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
15 0.063038252 156 nips-2007-Predictive Matrix-Variate t Models
16 0.062999569 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
17 0.062843978 135 nips-2007-Multi-task Gaussian Process Prediction
18 0.060167979 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
19 0.060011461 162 nips-2007-Random Sampling of States in Dynamic Programming
20 0.05776396 32 nips-2007-Bayesian Co-Training
topicId topicWeight
[(0, -0.229), (1, 0.043), (2, -0.056), (3, 0.018), (4, -0.04), (5, -0.032), (6, -0.087), (7, 0.039), (8, -0.047), (9, 0.019), (10, 0.088), (11, -0.027), (12, 0.109), (13, -0.009), (14, 0.115), (15, 0.015), (16, 0.01), (17, 0.141), (18, -0.052), (19, -0.211), (20, -0.185), (21, -0.017), (22, 0.035), (23, 0.044), (24, -0.013), (25, 0.148), (26, 0.014), (27, 0.04), (28, -0.166), (29, 0.049), (30, 0.098), (31, -0.101), (32, -0.02), (33, -0.032), (34, -0.04), (35, -0.153), (36, -0.041), (37, -0.033), (38, 0.013), (39, 0.104), (40, 0.028), (41, 0.065), (42, 0.081), (43, 0.036), (44, -0.022), (45, 0.122), (46, -0.057), (47, -0.085), (48, -0.02), (49, 0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.94437969 8 nips-2007-A New View of Automatic Relevance Determination
Author: David P. Wipf, Srikantan S. Nagarajan
Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted 1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. 1
2 0.78086412 96 nips-2007-Heterogeneous Component Analysis
Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii
Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1
3 0.65653288 145 nips-2007-On Sparsity and Overcompleteness in Image Models
Author: Pietro Berkes, Richard Turner, Maneesh Sahani
Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1
4 0.62171274 53 nips-2007-Compressed Regression
Author: Shuheng Zhou, Larry Wasserman, John D. Lafferty
Abstract: Recent research has studied the role of sparsity in high dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. In this paper we study a variant of this problem where the original n input variables are compressed by a random linear transformation to m n examples in p dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for 1 -regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence.” In addition, we show that 1 -regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence.” Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the rate of information communicated between the compressed and uncompressed data that decay to zero. 1
5 0.57918924 179 nips-2007-SpAM: Sparse Additive Models
Author: Han Liu, Larry Wasserman, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data. 1
6 0.57381105 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
7 0.56470084 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
8 0.56138241 99 nips-2007-Hierarchical Penalization
9 0.50593716 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
10 0.48467764 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
11 0.47868651 43 nips-2007-Catching Change-points with Lasso
12 0.46887535 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta
13 0.43794459 156 nips-2007-Predictive Matrix-Variate t Models
14 0.4345299 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
15 0.41661283 195 nips-2007-The Generalized FITC Approximation
16 0.39106461 61 nips-2007-Convex Clustering with Exemplar-Based Models
17 0.3733376 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
18 0.37003687 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
19 0.35263297 63 nips-2007-Convex Relaxations of Latent Variable Training
20 0.35192639 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
topicId topicWeight
[(5, 0.038), (13, 0.022), (16, 0.018), (19, 0.011), (21, 0.061), (31, 0.016), (34, 0.021), (35, 0.028), (47, 0.077), (49, 0.01), (83, 0.112), (85, 0.013), (90, 0.5)]
simIndex simValue paperId paperTitle
same-paper 1 0.94794524 8 nips-2007-A New View of Automatic Relevance Determination
Author: David P. Wipf, Srikantan S. Nagarajan
Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted 1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. 1
2 0.94399846 85 nips-2007-Experience-Guided Search: A Theory of Attentional Control
Author: David Baldwin, Michael C. Mozer
Abstract: People perform a remarkable range of tasks that require search of the visual environment for a target item among distractors. The Guided Search model (Wolfe, 1994, 2007), or GS, is perhaps the best developed psychological account of human visual search. To prioritize search, GS assigns saliency to locations in the visual field. Saliency is a linear combination of activations from retinotopic maps representing primitive visual features. GS includes heuristics for setting the gain coefficient associated with each map. Variants of GS have formalized the notion of optimization as a principle of attentional control (e.g., Baldwin & Mozer, 2006; Cave, 1999; Navalpakkam & Itti, 2006; Rao et al., 2002), but every GS-like model must be ’dumbed down’ to match human data, e.g., by corrupting the saliency map with noise and by imposing arbitrary restrictions on gain modulation. We propose a principled probabilistic formulation of GS, called Experience-Guided Search (EGS), based on a generative model of the environment that makes three claims: (1) Feature detectors produce Poisson spike trains whose rates are conditioned on feature type and whether the feature belongs to a target or distractor; (2) the environment and/or task is nonstationary and can change over a sequence of trials; and (3) a prior specifies that features are more likely to be present for target than for distractors. Through experience, EGS infers latent environment variables that determine the gains for guiding search. Control is thus cast as probabilistic inference, not optimization. We show that EGS can replicate a range of human data from visual search, including data that GS does not address. 1
3 0.92019153 119 nips-2007-Learning with Tree-Averaged Densities and Distributions
Author: Sergey Kirshner
Abstract: We utilize the ensemble of trees framework, a tractable mixture over superexponential number of tree-structured distributions [1], to develop a new model for multivariate density estimation. The model is based on a construction of treestructured copulas – multivariate distributions with uniform on [0, 1] marginals. By averaging over all possible tree structures, the new model can approximate distributions with complex variable dependencies. We propose an EM algorithm to estimate the parameters for these tree-averaged models for both the real-valued and the categorical case. Based on the tree-averaged framework, we propose a new model for joint precipitation amounts data on networks of rain stations. 1
4 0.90973949 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weaken over time. It proves novel stability-based generalization bounds that hold even with this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.
5 0.83736724 182 nips-2007-Sparse deep belief net model for visual area V2
Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng
Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from the Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1
6 0.60815114 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
7 0.60670197 202 nips-2007-The discriminant center-surround hypothesis for bottom-up saliency
8 0.58768022 156 nips-2007-Predictive Matrix-Variate t Models
9 0.5603829 128 nips-2007-Message Passing for Max-weight Independent Set
10 0.54828286 96 nips-2007-Heterogeneous Component Analysis
11 0.54428935 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
12 0.54372126 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
13 0.54282397 185 nips-2007-Stable Dual Dynamic Programming
14 0.54041362 113 nips-2007-Learning Visual Attributes
15 0.53829038 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
16 0.53828812 63 nips-2007-Convex Relaxations of Latent Variable Training
17 0.53243113 155 nips-2007-Predicting human gaze using low-level saliency combined with face detection
18 0.52764142 49 nips-2007-Colored Maximum Variance Unfolding
19 0.5268355 174 nips-2007-Selecting Observations against Adversarial Objectives
20 0.52572072 7 nips-2007-A Kernel Statistical Test of Independence