nips nips2012 nips2012-104 knowledge-graph by maker-knowledge-mining

104 nips-2012-Dual-Space Analysis of the Sparse Linear Model


Source: pdf

Author: Yi Wu, David P. Wipf

Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I and Type II objectives to be expressed in either coefficient or hyperparameter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can successfully recover sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. [sent-3, score-0.327]

2 These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. [sent-4, score-0.103]

3 Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. [sent-5, score-0.126]

4 Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. [sent-8, score-0.147]

5 In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. [sent-10, score-0.298]

6 For example, this allows us to prove that Type II-inspired techniques can be successful recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. [sent-11, score-0.295]

7 It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. [sent-12, score-0.168]

8 Recently, there has been a growing interest in models that employ sparse priors p(x) to encourage solutions x with mostly small or zero-valued coefficients and a few large or unrestricted values, i. [sent-16, score-0.261]

9 , we are assuming the generative x is a sparse vector. [sent-18, score-0.162]

10 Such solutions can be favored by using p(x) ∝ exp(−½ Σᵢ g(xᵢ)) = exp(−½ Σᵢ h(xᵢ²)), (2) with h concave and non-decreasing on [0, ∞) [15, 16]. [sent-19, score-0.197]

11 Virtually all sparse priors of interest can be expressed in this manner, including the popular Laplacian, Jeffreys, Student’s t, and generalized 1 Gaussian distributions. [sent-20, score-0.265]

12 Roughly speaking, the ‘more concave’ h, the more sparse we expect x to be. [sent-21, score-0.162]

13 For example, with h(z) = z, we recover a Gaussian, which is not sparse at all, while h(z) = √z gives a Laplacian distribution, with characteristic heavy tails and a sharp peak at zero. [sent-22, score-0.162]
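The following toy snippet (not from the paper; the variable names and test vectors are purely illustrative) evaluates the penalty g(x) = Σᵢ h(xᵢ²) implied by (2) for the two choices of h mentioned above, plus a more concave option, to show how increasing concavity shifts preference toward sparse vectors:

```python
import numpy as np

def penalty(x, h):
    """Evaluate g(x) = sum_i h(x_i^2) for a given concave, non-decreasing h."""
    return np.sum(h(x ** 2))

h_gauss = lambda z: z              # h(z) = z       -> g(x) = ||x||_2^2 (Gaussian prior, not sparse)
h_laplace = lambda z: np.sqrt(z)   # h(z) = sqrt(z) -> g(x) = ||x||_1   (Laplacian prior)
h_lp = lambda z: z ** 0.25         # even more concave -> g(x) = sum_i |x_i|^0.5

x_dense = np.full(10, 0.1)                      # many small coefficients
x_sparse = np.zeros(10); x_sparse[0] = 1.0      # one large coefficient

for name, h in [("gauss", h_gauss), ("laplace", h_laplace), ("lp(0.5)", h_lp)]:
    # As h becomes more concave, the dense vector is penalized more relative to the sparse one:
    # gauss: 0.1 vs 1.0, laplace: 1.0 vs 1.0, lp(0.5): ~3.16 vs 1.0
    print(name, penalty(x_dense, h), penalty(x_sparse, h))
```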

14 All sparse priors of the form (2) can be conveniently framed in terms of a collection of non-negative latent variables or hyperparameters γ ≜ [γ₁, …, γₘ]ᵀ. [sent-23, score-0.313]

15 For the purpose of obtaining sparse point estimates of x, which will be our primary focus herein, models with latent variable sparse priors are frequently handled in one of two ways. [sent-28, score-0.389]

16 Examples include minimum ℓp -norm approaches [4, 11, 16], Jeffreys prior-based methods sometimes called FOCUSS [7, 6, 9], algorithms for computing the basis pursuit (BP) or Lasso solution [6, 16, 18], and iterative reweighted ℓ1 methods [3]. [sent-30, score-0.304]

17 Once γ(II) is obtained, the conditional distribution p(x|y; γ(II)) is Gaussian, and a point estimate for x naturally emerges as the posterior mean x(II) = E[x|y; γ(II)] = Γ(II)Φᵀ(λI + ΦΓ(II)Φᵀ)⁻¹y (6). [sent-32, score-0.111]
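As a quick numerical illustration of the posterior-mean formula (6) above, here is a small self-contained sketch; the problem sizes, λ, and the hyperparameter values are arbitrary placeholders rather than anything taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 20, 40, 0.1
Phi = rng.standard_normal((n, m))
y = rng.standard_normal(n)
gamma = rng.uniform(0.1, 1.0, size=m)      # stand-in for gamma^(II) from a Type II optimizer

Gamma = np.diag(gamma)
Sigma_y = lam * np.eye(n) + Phi @ Gamma @ Phi.T
# Posterior mean: x^(II) = Gamma Phi^T (lam*I + Phi Gamma Phi^T)^{-1} y
x_ii = Gamma @ Phi.T @ np.linalg.solve(Sigma_y, y)
print(x_ii.shape)   # (40,)
```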

18 Pertinent examples include sparse Bayesian learning and the relevance vector machine (RVM) [19], automatic relevance determination (ARD) [14], methods for learning overcomplete dictionaries [8], and large-scale experimental design [17]. [sent-33, score-0.442]

19 While initially these two approaches may seem vastly different, both can be directly compared using a dual-space view [22] of the underlying cost functions. [sent-34, score-0.141]

20 The dual-space view is advantageous for several reasons, such as establishing connections between algorithms, developing efficient update rules, or handling more general (non-Gaussian) likelihood functions. [sent-36, score-0.164]

21 In Section 3, we utilize γ-space cost functions to develop a principled method for choosing the tradeoff parameter λ (which accompanies the Gaussian likelihood model and essentially balances sparsity and data fit) and demonstrate its effectiveness via simulations. [sent-37, score-0.301]

22 Section 4 then derives a new Type II-inspired algorithm in x-space that can compute maximally sparse (minimal ℓ0 norm) solutions even with highly coherent dictionaries, proving a result for clustered dictionaries that previously has only been shown empirically [21]. [sent-38, score-0.516]

23 Finally, Section 5 leverages duality to address Type II methods with generalized likelihood functions that previously were rendered untenable because of intractable integrals. [sent-39, score-0.193]

24 In general, some tasks and analyses are easier to undertake in γ-space (Section 3), while others are more transparent in x-space (Sections 4 and 5). [sent-40, score-0.118]

25 Here we consider both with the goal of advancing the proper understanding and full utilization of the sparse linear model. [sent-41, score-0.162]

26 The dual-space view defines a corresponding γ-space cost function for Type I and an x-space cost function for Type II to complete the symmetry. [sent-43, score-0.159]

27 For many applications of the sparse linear model, the primary goal is simply a point estimate x̂ that exhibits some degree of sparsity, meaning many elements of x̂ near zero and a few relatively large coefficients. [sent-46, score-0.193]

28 This requires a penalty function g(x) that is concave and non-decreasing in x² ≜ [x₁², …, xₘ²]ᵀ. [sent-47, score-0.241]

29 In the context of Type I, any prior p(x) expressible via (2) will satisfy this condition by definition; such priors are said to be strongly super-Gaussian and will always have positive kurtosis [15]. [sent-51, score-0.102]

30 Regarding Type II, because the associated x-space penalty (9) is represented as a minimum of upper-bounding hyperplanes with respect to x² (and the slopes are all non-negative given γ ≥ 0), it must therefore be concave and non-decreasing in x² [1]. [sent-52, score-0.274]
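The exact expression for the x-space Type II penalty (9) is not reproduced in this excerpt; the sketch below assumes the variational form commonly used in this line of work, g(II)(x) = min over γ ≥ 0 of Σᵢ xᵢ²/γᵢ + log|λI + ΦΓΦᵀ| (with f ≡ 0), and evaluates it numerically so its behavior can be inspected on toy inputs:

```python
import numpy as np
from scipy.optimize import minimize

def g_type2(x, Phi, lam=0.1):
    """Numerically evaluate the assumed penalty
    g2(x) = min_{gamma >= 0} sum_i x_i^2/gamma_i + log|lam*I + Phi Gamma Phi^T|."""
    n, m = Phi.shape

    def objective(log_gamma):
        gamma = np.exp(log_gamma)                       # log-parameterization keeps gamma > 0
        Sigma_y = lam * np.eye(n) + (Phi * gamma) @ Phi.T
        _, logdet = np.linalg.slogdet(Sigma_y)
        return np.sum(x ** 2 / gamma) + logdet

    res = minimize(objective, x0=np.zeros(m), method="L-BFGS-B",
                   bounds=[(-12.0, 12.0)] * m)          # keep gamma in a safe numerical range
    return res.fun

rng = np.random.default_rng(1)
Phi = rng.standard_normal((10, 25))
x_sparse = np.zeros(25); x_sparse[:2] = 1.0
x_dense = np.full(25, np.sqrt(2.0 / 25))                # same l2 norm as x_sparse
print(g_type2(x_sparse, Phi), g_type2(x_dense, Phi))    # the sparse input typically scores lower
```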

31 For compression, interpretability, or other practical reasons, it is sometimes desirable to have exactly sparse point estimates, with many (or most) elements of x equal to exactly zero. [sent-53, score-0.162]

32 This then necessitates a penalty function g(x) that is concave and non-decreasing in |x| ≜ [|x₁|, …, |xₘ|]ᵀ. [sent-54, score-0.287]

33 In the case of Type I, if log γ + f(γ) is concave and non-decreasing in γ, then g(x) = Σᵢ g(xᵢ) satisfies this condition. [sent-58, score-0.231]

34 The Type II analog, which emerges by further inspection of (9), stipulates that if log|Σy| + Σᵢ f(γᵢ) = log|λ⁻¹ΦᵀΦ + Γ⁻¹| + log|Γ| + Σᵢ f(γᵢ) (10) is a concave and non-decreasing function of γ, then g(II)(x) will be a concave, non-decreasing function of |x|. [sent-59, score-0.408]

35 For this purpose it is sufficient, but not necessary, that f be a concave and nondecreasing function. [sent-60, score-0.163]

36 Regardless, it is now very transparent how Type II may promote sparsity akin to Type I. [sent-62, score-0.157]

37 The dual-space view also leads to efficient, convergent algorithms such as iterative reweighted ℓ1 minimization and its variants as discussed in [22]. [sent-63, score-0.366]

38 1 Although a more recent, step-wise variant of the RVM has been shown to be substantially faster [20], the original version is still germane since it can easily be extended to handle more general structured sparsity problems. [sent-69, score-0.162]

39 Let γ(I) ≜ arg min_{γ≥0} L(I)(γ), with L(I)(γ) ≜ yᵀΣy⁻¹y + log|Γ| + Σ_{i=1}^m f(γᵢ). [sent-72, score-0.151]

40 This motivates a penalty on λ with similar correspondence, leading to the objective L(I)(γ, λ) ≜ yᵀΣy⁻¹y + Σ_{i=1}^m [log γᵢ + f(γᵢ)] + Σ_{j=1}^n [log λ + f(λ)] = yᵀΣy⁻¹y + Σ_{i=1}^m [log γᵢ + f(γᵢ)] + n log λ + n f(λ). [sent-88, score-0.196]

41 This is important because the optimal λ will be highly dependent on both the true noise level and, crucially, on the particular sparse prior p(x) assumed (as reflected by f). [sent-90, score-0.231]

42 It can then be shown that solving (4), with λ fixed to the value that minimizes (14), is equivalent to solving min_{x,u} Σᵢ g(xᵢ) + n g((1/√n)‖u‖₂), s.t. y = Φx + u. [sent-92, score-0.116]

43 Note that if we were just performing maximum likelihood estimation of λ given x∗, the optimal value would reduce to simply λ∗ = (1/n)‖u∗‖₂², with no influence from the prior on x. [sent-96, score-0.167]

44 2 Solving (15), or equivalently (14), can be accomplished using simple iterative reweighted least squares, or if g is concave in |xi |, an iterative reweighted second-order-cone (SOC) minimization. [sent-98, score-0.705]
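Since neither (14) nor (15) is reproduced in full in this excerpt, the snippet below is only a generic iterative-reweighted-least-squares sketch for a related Type I style objective with λ held fixed, (1/λ)‖y − Φx‖² + Σᵢ h(xᵢ²) with the assumed choice h(z) = √(z + ε); each iteration majorizes h by a tangent line in xᵢ² and solves the resulting weighted ridge problem in closed form:

```python
import numpy as np

def irls_type1(Phi, y, lam=0.1, eps=1e-6, iters=50):
    """IRLS for min_x (1/lam)*||y - Phi x||^2 + sum_i h(x_i^2), with h(z) = sqrt(z + eps).
    Majorize h(x_i^2) <= w_i*x_i^2 + const, w_i = h'(x_i^2) = 1/(2*sqrt(x_i^2 + eps)),
    then solve the weighted ridge system (Phi^T Phi + lam*diag(w)) x = Phi^T y."""
    n, m = Phi.shape
    x = np.zeros(m)
    for _ in range(iters):
        w = 0.5 / np.sqrt(x ** 2 + eps)
        x = np.linalg.solve(Phi.T @ Phi + lam * np.diag(w), Phi.T @ y)
    return x

rng = np.random.default_rng(2)
n, m = 100, 50
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)
x_true = np.zeros(m); x_true[rng.choice(m, 10, replace=False)] = rng.standard_normal(10)
y = Phi @ x_true + 0.01 * rng.standard_normal(n)
x_hat = irls_type1(Phi, y)
print(np.sum(np.abs(x_hat) > 1e-3))    # a roughly sparse estimate
```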

45 In this context, ϕ(γᵢ) is interpreted as a hyperprior on γᵢ, and an equivalent distribution is assumed on the noise variance λ. [sent-101, score-0.119]

46 As in many applications of sparse reconstruction, here we are only concerned with accurately estimating x, whose nonzero entries may have physical significance (e. [sent-104, score-0.254]

47 2 Simulations are helpful for evaluation purposes since we then have access to the true sparse generating vector. [sent-109, score-0.201]

48 Figure 1 compares the estimation performance obtained by minimizing (15) with two different selections for g: g(x) = ‖x‖ₚᵖ = Σᵢ |xᵢ|ᵖ, with p = 0.01 and with p = 1 (the ℓ1 norm). [sent-110, score-0.171]

49 Data generation proceeds as follows: we create a random 100 × 50 dictionary Φ with ℓ2-normalized, iid Gaussian columns. [sent-113, score-0.164]

50 Results using other noise levels, problem dimensions n and m, sparsity levels ‖x‖₀, and sparsity penalties g are similar. [sent-122, score-0.246]
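For concreteness, the data-generation step described above can be instantiated as follows; the noise level and the distribution of the nonzero amplitudes are assumptions, since the excerpt does not specify them (it only gives the 100 × 50 dictionary and, per the discussion of Figure 1, ten nonzeros in the true vector):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 100, 50, 10                                   # dictionary size from the text; k nonzeros
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)       # l2-normalized iid Gaussian columns

x0 = np.zeros(m)
x0[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)   # assumed amplitude distribution
sigma = 0.01                                            # assumed noise standard deviation
y = Phi @ x0 + sigma * rng.standard_normal(n)
print(np.count_nonzero(x0))                             # 10, i.e. ||x0||_0
```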

51 Figure 1 (Bottom) shows the average sparsity of estimates x̂, as quantified by the ℓ0 norm ‖x̂‖₀, across λ values (‖x̂‖₀ returns a count of the number of nonzero elements in x̂). [sent-124, score-0.23]

52 The ‘+’ indicates the average sparsity of each x̂ for the learned λ as before. [sent-125, score-0.107]

53 The ℓp (p = 0.01) penalty produces a much sparser estimate, very near the true value of ‖x‖₀ = 10, at the optimal λ. [sent-127, score-0.122]

54 The ℓ1 penalty, which is substantially less concave/sparsity-inducing, still sets some elements to exactly zero, but also substantially shrinks nonzero coefficients in achieving a similar overall reconstruction error. [sent-128, score-0.245]

55 This highlights the importance of learning a λ via a penalty that is properly matched to the prior on x: if we instead tried to force a particular sparsity value (in this case 10), then the ℓ1 solution would be very suboptimal. [sent-129, score-0.261]

56 Finally we note that maximum likelihood (ML) estimation of λ performs very poorly (not shown), except in the special case where the ML estimate is equivalent to solving (14) as occurs when f (γ) = 0 (see [6]). [sent-130, score-0.197]

57 The proposed method can be viewed as adding a principled hyperprior on λ, properly matched to p(x), that compensates for this shortcoming of standard ML. [sent-131, score-0.171]

58 Type II λ estimation has been explored elsewhere for the special case where f (γ) = 0 [19], which renders the factor of n in (16) irrelevant; however, for other selections we have found this factor to improve performance (not shown). [sent-132, score-0.171]

59 4 Maximally Sparse Estimation. With the advent of compressive sensing and other related applications, there has been growing interest in finding maximally sparse signal representations from redundant dictionaries (m ≫ n) [3, 5]. [sent-135, score-0.591]

60 The canonical form of this problem involves solving x₀ ≜ arg min_x ‖x‖₀, s.t. y = Φx. [sent-136, score-0.201]

61 The ℓp (p = 0.01) penalty produces a much sparser estimate at the optimal λ value. [sent-156, score-0.153]

62 While (17) is NP-hard, whenever the dictionary Φ satisfies a restricted isometry property (RIP) [2] or a related structural assumption, meaning that every set of ‖x₀‖₀ columns of Φ is sufficiently close to orthonormal (i. [sent-157, score-0.206]

63 In this regime the optimization problem reduces to x(II) = lim_{λ→0} arg min_x g(II)(x), s.t. y = Φx. (18) [sent-165, score-0.127]

64 If log|Σy| + Σᵢ f(γᵢ) is concave, then (18) can be minimized using reweighted ℓ1 minimization. [sent-168, score-0.296]

65 With initial weight vector w(0) = 1, the (k + 1)-th iteration involves computing x(k+1) ← arg min_{x: y=Φx} Σᵢ wᵢ(k)|xᵢ|, followed by wᵢ(k+1) ← ∂g(II)(x)/∂|xᵢ|. [sent-169, score-0.165]
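A runnable sketch of the reweighted ℓ1 loop above is given next. The specific weighting functions ηᵢ(x; α, q) referred to in this excerpt are not reproduced here, so the generic update wᵢ ← 1/(|xᵢ| + ε) from the iterative reweighted ℓ1 literature [3] is used as a stand-in, and each weighted ℓ1 subproblem is solved as a linear program:

```python
import numpy as np
from scipy.optimize import linprog

def reweighted_l1(Phi, y, iters=5, eps=1e-3):
    """Iteratively solve min_x sum_i w_i|x_i|  s.t.  y = Phi x.
    Each subproblem is an LP in (x+, x-) with x = x+ - x- and x+, x- >= 0."""
    n, m = Phi.shape
    w = np.ones(m)
    x = np.zeros(m)
    A_eq = np.hstack([Phi, -Phi])                   # Phi x+ - Phi x- = y
    for _ in range(iters):
        c = np.concatenate([w, w])                  # objective w^T x+ + w^T x-
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
        x = res.x[:m] - res.x[m:]
        w = 1.0 / (np.abs(x) + eps)                 # generic reweighting (stand-in for eta_i)
    return x

rng = np.random.default_rng(4)
n, m = 30, 80
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)
x0 = np.zeros(m); x0[rng.choice(m, 5, replace=False)] = rng.standard_normal(5)
y = Phi @ x0
x_hat = reweighted_l1(Phi, y)
print(np.max(np.abs(x_hat - x0)))                   # small when recovery succeeds
```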

66 It remains unclear, however, in what circumstances this type of update can lead to guaranteed improvement, or whether the functions ηᵢ(x; 0, 1/2) are even the optimal choice. [sent-174, score-0.432]

67 We will now demonstrate that, for certain selections of α and q, reweighted ℓ1 using ηᵢ(x; α, q) is guaranteed to recover x₀ exactly if Φ is drawn from what we call a clustered dictionary model. [sent-175, score-0.607]

68 Clustered Dictionary Model: Let Φuncorr(d,ǫ) denote any dictionary such that ℓ1 minimization succeeds in solving (17) for all ‖x₀‖₀ ≤ d. [sent-177, score-0.2]

69 Let Φcorr(d) denote any dictionary obtained by replacing each column of Φuncorr with a “cluster” of mᵢ basis vectors such that the angle between any two vectors within a cluster is less than some ǫ > 0. [sent-178, score-0.209]

70 … ⊂ {1, …, m} as the set of cluster indices whereby x₀ has at least one nonzero element. [sent-182, score-0.137]

71 Theorem 2 implies that even though ℓ1 may fail to find the maximally sparse x0 because of severe RIP violations (high correlations between groups of dictionary columns as dictated by ǫ lead directly to a poor RIP), a Type II-inspired method can still be successful. [sent-186, score-0.469]

72 Moreover, because Type II will always succeed whenever ℓ1 does (assuming a reweighted ℓ1 implementation), the converse situation (an RIP violation leading to Type II failure but not ℓ1 failure) can never happen. [sent-187, score-0.312]

73 Recent work from [21] has argued that Type II may be useful for addressing the sparse recovery problem with correlated dictionaries, and empirical evidence is provided showing vastly superior performance on clustered dictionaries. [sent-188, score-0.287]

74 However, we stress that no results proving global convergence to the correct, maximally sparse solution have been shown before in the case of structured dictionaries (except in special cases with strong, unverifiable constraints on coefficient magnitudes [21]). [sent-189, score-0.481]

75 Moreover, the proposed weighting strategy ηi (x; λ, q) accomplishes this without any particular tuning to the clustered dictionary model under consideration and thus likely holds in many other cases as well. [sent-190, score-0.248]

76 5 Generalized Likelihood functions Type I methods naturally accommodate alternative likelihood functions. [sent-191, score-0.129]

77 Fortunately, the dual x-space view provides a natural mechanism for generalizing the basic Type II methodology to address alternative likelihood functions in a more principled manner. [sent-196, score-0.176]

78 In the case of classification problems, we might want to replace the Gaussian likelihood p(y|x) implied by (1) with a multivariate Bernoulli distribution p(y|x) ∝ exp[−ψ(y, x)], where ψ(y, x) is the function ψ(y, x) ≜ −Σⱼ (yⱼ log[σⱼ(x)] + (1 − yⱼ) log[1 − σⱼ(x)]). [sent-197, score-0.275]

79 This function may be naturally substituted into the x-space Type II cost function (8), giving us the candidate penalized logistic regression objective min_x ψ(y, x) + λ g(II)(x) (22). [sent-199, score-0.191]

80 Thus we may conclude that (22) provides a principled approximation to (5) when a Bernoulli likelihood function is used for classification purposes. [sent-210, score-0.135]

81 Additionally, for other sparse priors, or equivalently other selections for f , we can still perform optimization and analyze cost functions without any conjugacy requirements on the implicit p(x). [sent-213, score-0.352]

82 If log|Σy| + Σᵢ f(γᵢ) is a concave, non-decreasing function of γ (as will be the case if f is concave and non-decreasing), then every local optimum of (24) is achieved at a solution with at most n nonzero elements in γ, and therefore in x(II). [sent-215, score-0.323]

83 In contrast, if − log p(x) is convex, then (24) can be globally solved via a convex program. [sent-216, score-0.1]

84 Despite the practical success of the RVM and related Bayesian techniques, and empirical evidence of sparse solutions, there is currently no proof that the standard variants of these classification methods will always produce exactly sparse estimates. [sent-217, score-0.324]

85 Finally, if we take (22) as our starting point, we may naturally consider modifications tailored to specific sparse classification tasks (that may or may not retain an explicit connection with the original Type II probabilistic model). [sent-219, score-0.234]

86 For example, suppose we would like to obtain a maximally sparse classifier, where regularization is provided by an ‖x‖₀ penalty. [sent-220, score-0.262]

87 Direct optimization is combinatorial because of what we call the global zero attraction property: Whenever any individual coefficient xi goes to zero, we are necessarily at a local minimum with respect to this coefficient because of the infinite slope (discontinuity) of the ℓ0 norm at zero. [sent-221, score-0.181]

88 Consider the Type II-inspired minimization problem (x̂, γ̂) = arg min_{x; γ≥0} ψ(y, x) + α₁ Σᵢ xᵢ²/γᵢ + log|α₂I + ΦΓΦᵀ| (25), which is equivalent to (22) with f(γ) = 0 when α₁ = α₂ = λ. [sent-224, score-0.249]

89 For some α₁ and α₂ sufficiently small (but not necessarily equal), the support of x̂ will match the support of arg min_x ψ(y, x) + λ‖x‖₀. [sent-225, score-0.122]

90 Thus Type II affords the possibility of mimicking the ℓ0 norm in the presence of generalized likelihoods but with the advantageous potential for drastically fewer local minima. [sent-227, score-0.102]
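The statement above suggests a simple way to experiment with (25). The sketch below is a rough alternating scheme and an assumption of this write-up, not the authors' algorithm: the log-det term, being concave in γ, is replaced by its tangent-plane upper bound (with slopes zᵢ = φᵢᵀ(α₂I + ΦΓΦᵀ)⁻¹φᵢ), giving the closed-form update γᵢ = |xᵢ|√(α₁/zᵢ); the x-step is a weighted ridge logistic regression solved with a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def psi(x, Phi, y):
    """Negative Bernoulli log-likelihood with sigma_j(x) = sigmoid(phi_j^T x),
    written in the numerically stable form sum_j [log(1 + exp(t_j)) - y_j t_j]."""
    t = Phi @ x
    return np.sum(np.logaddexp(0.0, t) - y * t)

def type2_logistic(Phi, y, a1=0.1, a2=0.1, outer=15, eps=1e-6):
    """Alternating sketch for min_{x, gamma>=0} psi(y,x) + a1*sum_i x_i^2/gamma_i
    + log|a2*I + Phi Gamma Phi^T|, using a tangent bound on the log-det term."""
    n, m = Phi.shape
    x, gamma = np.zeros(m), np.ones(m)
    for _ in range(outer):
        # x-step: weighted ridge logistic regression for the current gamma.
        def obj(v):
            return psi(v, Phi, y) + a1 * np.sum(v ** 2 / gamma)
        def grad(v):
            return Phi.T @ (sigmoid(Phi @ v) - y) + 2.0 * a1 * v / gamma
        x = minimize(obj, x, jac=grad, method="L-BFGS-B").x
        # gamma-step: bound log|a2*I + Phi Gamma Phi^T| <= sum_i z_i*gamma_i + const,
        # then minimize a1*x_i^2/gamma_i + z_i*gamma_i, giving gamma_i = |x_i|*sqrt(a1/z_i).
        Sigma_y = a2 * np.eye(n) + (Phi * gamma) @ Phi.T
        z = np.einsum("ij,ji->i", Phi.T, np.linalg.solve(Sigma_y, Phi))
        gamma = np.abs(x) * np.sqrt(a1 / (z + eps)) + eps
    return x, gamma

rng = np.random.default_rng(5)
n, m = 80, 40
Phi = rng.standard_normal((n, m))
x_true = np.zeros(m); x_true[:4] = 2.0
y = (rng.uniform(size=n) < sigmoid(Phi @ x_true)).astype(float)
x_hat, _ = type2_logistic(Phi, y)
print(np.sum(np.abs(x_hat) > 1e-2))                 # typically only a few coefficients survive
```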

91 Additionally, while here we have focused our attention on classification via logistic regression, these ideas can presumably be extended to other likelihood functions provided certain conditions are met. [sent-229, score-0.127]

92 To the best of our knowledge, while already demonstrably successful in an empirical setting, Type II classifiers and other related Bayesian generalized likelihood models have never been analyzed in the context of sparse estimation as we have done in this section. [sent-230, score-0.33]

93 6 Conclusion. The dual-space view of sparse linear or generalized linear models naturally allows us to transition x-space ideas originally developed for Type I and apply them to Type II, and conversely, apply γ-space techniques from Type II to Type I. [sent-231, score-0.317]

94 Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization,” Proc. [sent-266, score-0.298]

95 Godsill, “Blind separation of sparse sources using Jeffreys inverse prior and the EM algorithm,” Proc. [sent-279, score-0.199]

96 Girolami, “A variational method for learning sparse and overcomplete representations,” Neural Computation, vol. [sent-285, score-0.194]

97 Rao, “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm,” IEEE Transactions on Signal Processing, vol. [sent-293, score-0.164]

98 Sejnowski, “Dictionary learning algorithms for sparse representation,” Neural Computation, vol. [sent-315, score-0.162]

99 Nickisch, “Large scale Bayesian inference and experimental design for sparse linear models,” SIAM J. [sent-363, score-0.162]

100 Faul, “Fast marginal likelihood maximisation for sparse Bayesian models,” Ninth Int. [sent-382, score-0.252]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('type', 0.389), ('ii', 0.374), ('rvm', 0.231), ('reweighted', 0.228), ('dictionary', 0.164), ('concave', 0.163), ('sparse', 0.162), ('dictionaries', 0.136), ('rip', 0.132), ('selections', 0.131), ('sparsity', 0.107), ('maximally', 0.1), ('wipf', 0.1), ('jeffreys', 0.1), ('coef', 0.099), ('nonzero', 0.092), ('likelihood', 0.09), ('hyperprior', 0.087), ('clustered', 0.084), ('mse', 0.083), ('arg', 0.083), ('rao', 0.081), ('penalty', 0.078), ('compressive', 0.076), ('cients', 0.075), ('analyses', 0.068), ('bayesian', 0.068), ('log', 0.068), ('attendant', 0.065), ('rendered', 0.065), ('uncorr', 0.065), ('priors', 0.065), ('sensing', 0.06), ('cost', 0.059), ('dxi', 0.058), ('focuss', 0.058), ('signal', 0.057), ('relevance', 0.056), ('substantially', 0.055), ('minimization', 0.054), ('engan', 0.053), ('transparent', 0.05), ('nf', 0.05), ('stress', 0.05), ('localization', 0.049), ('penalized', 0.049), ('yj', 0.049), ('failure', 0.048), ('hyperparameters', 0.048), ('palmer', 0.046), ('afforded', 0.046), ('necessitates', 0.046), ('cluster', 0.045), ('corr', 0.045), ('principled', 0.045), ('min', 0.044), ('sparser', 0.044), ('ej', 0.044), ('attraction', 0.044), ('herein', 0.044), ('reasons', 0.044), ('lead', 0.043), ('iterative', 0.043), ('reconstruction', 0.043), ('march', 0.043), ('tipping', 0.042), ('isometry', 0.042), ('emerges', 0.041), ('vastly', 0.041), ('view', 0.041), ('xi', 0.04), ('estimation', 0.04), ('properly', 0.039), ('naturally', 0.039), ('purposes', 0.039), ('minx', 0.039), ('classi', 0.038), ('generalized', 0.038), ('conveniently', 0.038), ('involves', 0.038), ('source', 0.038), ('ideas', 0.037), ('prior', 0.037), ('solving', 0.036), ('succeed', 0.036), ('blind', 0.035), ('facilitates', 0.035), ('trials', 0.034), ('solutions', 0.034), ('global', 0.033), ('minimum', 0.033), ('advantageous', 0.033), ('retain', 0.033), ('noise', 0.032), ('convex', 0.032), ('overcomplete', 0.032), ('cand', 0.032), ('marginalization', 0.032), ('estimate', 0.031), ('norm', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

Author: Yi Wu, David P. Wipf

Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I and Type II objectives to be expressed in either coefficient or hyperparameter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can successfully recover sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1

2 0.14809704 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

Author: Xaq Pitkow

Abstract: This paper shows how sparse, high-dimensional probability distributions could be represented by neurons with exponential compression. The representation is a novel application of compressive sensing to sparse probability distributions rather than to the usual sparse signals. The compressive measurements correspond to expected values of nonlinear functions of the probabilistically distributed variables. When these expected values are estimated by sampling, the quality of the compressed representation is limited only by the quality of sampling. Since the compression preserves the geometric structure of the space of sparse probability distributions, probabilistic computation can be performed in the compressed domain. Interestingly, functions satisfying the requirements of compressive sensing can be implemented as simple perceptrons. If we use perceptrons as a simple model of feedforward computation by neurons, these results show that the mean activity of a relatively small number of neurons can accurately represent a highdimensional joint distribution implicitly, even without accounting for any noise correlations. This comprises a novel hypothesis for how neurons could encode probabilities in the brain. 1

3 0.13133298 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

Author: Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville

Abstract: Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online ℓ1-dictionary learning where, unlike traditional dictionary learning, which uses squared loss, the ℓ1-penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on the alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data show that this online ℓ1-dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. 1

4 0.12967102 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

Author: Aaron Defazio, Tibério S. Caetano

Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinformatics dataset.

5 0.11229987 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

Author: Piyush Rai, Abhishek Kumar, Hal Daume

Abstract: Multiple-output regression models require estimating multiple parameters, one for each output. Structural regularization is usually employed to improve parameter estimation in such models. In this paper, we present a multiple-output regression model that leverages the covariance structure of the latent model parameters as well as the conditional covariance structure of the observed outputs. This is in contrast with existing methods that usually take into account only one of these structures. More importantly, unlike some of the other existing methods, none of these structures need be known a priori in our model, and are learned from the data. Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). Experimental results on both synthetic and real datasets demonstrate the effectiveness of our method. 1

6 0.10983168 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

7 0.1082448 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

8 0.10538452 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

9 0.099020451 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

10 0.096281417 244 nips-2012-Nonconvex Penalization Using Laplace Exponents and Concave Conjugates

11 0.095643692 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

12 0.090999231 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

13 0.090233289 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

14 0.088704288 69 nips-2012-Clustering Sparse Graphs

15 0.087165855 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

16 0.086181298 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

17 0.086008631 78 nips-2012-Compressive Sensing MRI with Wavelet Tree Sparsity

18 0.083907902 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

19 0.082671165 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

20 0.081111282 16 nips-2012-A Polynomial-time Form of Robust Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.261), (1, 0.063), (2, 0.047), (3, -0.053), (4, -0.004), (5, 0.079), (6, -0.017), (7, -0.018), (8, 0.023), (9, -0.087), (10, -0.019), (11, 0.001), (12, -0.017), (13, -0.013), (14, 0.053), (15, 0.009), (16, 0.044), (17, -0.005), (18, -0.031), (19, -0.107), (20, -0.022), (21, 0.128), (22, 0.005), (23, 0.09), (24, 0.037), (25, 0.027), (26, 0.006), (27, 0.049), (28, 0.028), (29, 0.092), (30, -0.063), (31, -0.056), (32, 0.054), (33, -0.021), (34, -0.088), (35, -0.028), (36, 0.045), (37, 0.074), (38, 0.015), (39, -0.059), (40, 0.018), (41, 0.058), (42, -0.061), (43, 0.019), (44, -0.144), (45, -0.011), (46, -0.059), (47, -0.024), (48, -0.034), (49, 0.113)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96132052 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

Author: Yi Wu, David P. Wipf

Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I and Type II objectives to be expressed in either coefficient or hyperparameter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can successfully recover sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1

2 0.80091178 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

Author: Henrik Ohlsson, Allen Yang, Roy Dong, Shankar Sastry

Abstract: While compressive sensing (CS) has been one of the most vibrant research fields in the past few years, most development only applies to linear models. This limits its application in many areas where CS could make a difference. This paper presents a novel extension of CS to the phase retrieval problem, where intensity measurements of a linear system are used to recover a complex sparse signal. We propose a novel solution using a lifting technique – CPRL, which relaxes the NP-hard problem to a nonsmooth semidefinite program. Our analysis shows that CPRL inherits many desirable properties from CS, such as guarantees for exact recovery. We further provide scalable numerical solvers to accelerate its implementation. 1

3 0.77941346 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

Author: Xaq Pitkow

Abstract: This paper shows how sparse, high-dimensional probability distributions could be represented by neurons with exponential compression. The representation is a novel application of compressive sensing to sparse probability distributions rather than to the usual sparse signals. The compressive measurements correspond to expected values of nonlinear functions of the probabilistically distributed variables. When these expected values are estimated by sampling, the quality of the compressed representation is limited only by the quality of sampling. Since the compression preserves the geometric structure of the space of sparse probability distributions, probabilistic computation can be performed in the compressed domain. Interestingly, functions satisfying the requirements of compressive sensing can be implemented as simple perceptrons. If we use perceptrons as a simple model of feedforward computation by neurons, these results show that the mean activity of a relatively small number of neurons can accurately represent a highdimensional joint distribution implicitly, even without accounting for any noise correlations. This comprises a novel hypothesis for how neurons could encode probabilities in the brain. 1

4 0.69933236 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

Author: Ulugbek Kamilov, Sundeep Rangan, Michael Unser, Alyson K. Fletcher

Abstract: We consider the estimation of an i.i.d. vector x ∈ Rn from measurements y ∈ Rm obtained by a general cascade model consisting of a known linear transform followed by a probabilistic componentwise (possibly nonlinear) measurement channel. We present a method, called adaptive generalized approximate message passing (Adaptive GAMP), that enables joint learning of the statistics of the prior and measurement channel along with estimation of the unknown vector x. Our method can be applied to a large class of learning problems including the learning of sparse priors in compressed sensing or identification of linear-nonlinear cascade models in dynamical systems and neural spiking processes. We prove that for large i.i.d. Gaussian transform matrices the asymptotic componentwise behavior of the adaptive GAMP algorithm is predicted by a simple set of scalar state evolution equations. This analysis shows that the adaptive GAMP method can yield asymptotically consistent parameter estimates, which implies that the algorithm achieves a reconstruction quality equivalent to the oracle algorithm that knows the correct parameter values. The adaptive GAMP methodology thus provides a systematic, general and computationally efficient method applicable to a large range of complex linear-nonlinear models with provable guarantees. 1

5 0.65062767 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

Author: Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville

Abstract: Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online ℓ1-dictionary learning where, unlike traditional dictionary learning, which uses squared loss, the ℓ1-penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on the alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data show that this online ℓ1-dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. 1

6 0.62728995 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

7 0.61337614 34 nips-2012-Active Learning of Multi-Index Function Models

8 0.61138546 78 nips-2012-Compressive Sensing MRI with Wavelet Tree Sparsity

9 0.60150588 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

10 0.59789079 365 nips-2012-Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding

11 0.58401829 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

12 0.58383662 358 nips-2012-Value Pursuit Iteration

13 0.58229989 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

14 0.57953036 221 nips-2012-Multi-Stage Multi-Task Feature Learning

15 0.57751983 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

16 0.56753534 254 nips-2012-On the Sample Complexity of Robust PCA

17 0.56610829 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

18 0.55619884 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

19 0.55479413 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

20 0.55062813 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.088), (14, 0.016), (17, 0.011), (21, 0.044), (36, 0.01), (38, 0.153), (39, 0.018), (42, 0.026), (54, 0.015), (55, 0.046), (65, 0.077), (68, 0.011), (74, 0.041), (76, 0.17), (80, 0.13), (92, 0.044), (94, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96095723 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

Author: Yi Wu, David P. Wipf

Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I and Type II objectives to be expressed in either coefficient or hyperparameter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can successfully recover sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1

2 0.94630867 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

Author: Simon Lyons, Amos J. Storkey, Simo Särkkä

Abstract: Stochastic differential equations (SDE) are a natural tool for modelling systems that are inherently noisy or contain uncertainties that can be modelled as stochastic processes. Crucial to the process of using SDE to build mathematical models is the ability to estimate parameters of those models from observed data. Over the past few decades, significant progress has been made on this problem, but we are still far from having a definitive solution. We describe a novel method of approximating a diffusion process that we show to be useful in Markov chain Monte-Carlo (MCMC) inference algorithms. We take the ‘white’ noise that drives a diffusion process and decompose it into two terms. The first is a ‘coloured noise’ term that can be deterministically controlled by a set of auxilliary variables. The second term is small and enables us to form a linear Gaussian ‘small noise’ approximation. The decomposition allows us to take a diffusion process of interest and cast it in a form that is amenable to sampling by MCMC methods. We explain why many state-of-the-art inference methods fail on highly nonlinear inference problems, and we demonstrate experimentally that our method performs well in such situations. Our results show that this method is a promising new tool for use in inference and parameter estimation problems. 1

3 0.93999588 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

Author: Chong Wang, David M. Blei

Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1

4 0.93899029 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

Author: Anima Anandkumar, Ragupathyraj Valluvan

Abstract: Graphical model selection refers to the problem of estimating the unknown graph structure given observations at the nodes in the model. We consider a challenging instance of this problem when some of the nodes are latent or hidden. We characterize conditions for tractable graph estimation and develop efficient methods with provable guarantees. We consider the class of Ising models Markov on locally tree-like graphs, which are in the regime of correlation decay. We propose an efficient method for graph estimation, and establish its structural consistency when the number of samples n scales as n = Ω(θmin^(−δη(η+1)−2) log p), where θmin is the minimum edge potential, δ is the depth (i.e., distance from a hidden node to the nearest observed nodes), and η is a parameter which depends on the minimum and maximum node and edge potentials in the Ising model. The proposed method is practical to implement and provides flexibility to control the number of latent variables and the cycle lengths in the output graph. We also present necessary conditions for graph estimation by any method and show that our method nearly matches the lower bound on sample requirements. Keywords: Graphical model selection, latent variables, quartet methods, locally tree-like graphs. 1

5 0.93874151 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

6 0.93476504 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

7 0.93298006 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

8 0.93208301 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

9 0.93182075 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

10 0.92766678 200 nips-2012-Local Supervised Learning through Space Partitioning

11 0.92650419 65 nips-2012-Cardinality Restricted Boltzmann Machines

12 0.92504674 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

13 0.92406422 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

14 0.92393744 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

15 0.92393523 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

16 0.92327219 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

17 0.92312527 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

18 0.92274314 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

19 0.92267245 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

20 0.92231202 197 nips-2012-Learning with Recursive Perceptual Representations