nips nips2009 nips2009-222 knowledge-graph by maker-knowledge-mining

222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors


Source: pdf

Author: David P. Wipf, Srikantan S. Nagarajan

Abstract: Finding maximally sparse representations from overcomplete feature dictionaries frequently involves minimizing a cost function composed of a likelihood (or data fit) term and a prior (or penalty function) that favors sparsity. While typically the prior is factorial, here we examine non-factorial alternatives that have a number of desirable properties relevant to sparse estimation and are easily implemented using an efficient and globally-convergent, reweighted 1 -norm minimization procedure. The first method under consideration arises from the sparse Bayesian learning (SBL) framework. Although based on a highly non-convex underlying cost function, in the context of canonical sparse estimation problems, we prove uniform superiority of this method over the Lasso in that, (i) it can never do worse, and (ii) for any dictionary and sparsity profile, there will always exist cases where it does better. These results challenge the prevailing reliance on strictly convex penalty functions for finding sparse solutions. We then derive a new non-factorial variant with similar properties that exhibits further performance improvements in some empirical tests. For both of these methods, as well as traditional factorial analogs, we demonstrate the effectiveness of reweighted 1 -norm algorithms in handling more general sparse estimation problems involving classification, group feature selection, and non-negativity constraints. As a byproduct of this development, a rigorous reformulation of sparse Bayesian classification (e.g., the relevance vector machine) is derived that, unlike the original, involves no approximation steps and descends a well-defined objective function. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Finding maximally sparse representations from overcomplete feature dictionaries frequently involves minimizing a cost function composed of a likelihood (or data fit) term and a prior (or penalty function) that favors sparsity. [sent-4, score-0.544]

2 While typically the prior is factorial, here we examine non-factorial alternatives that have a number of desirable properties relevant to sparse estimation and are easily implemented using an efficient and globally-convergent, reweighted 1 -norm minimization procedure. [sent-5, score-0.538]

3 The first method under consideration arises from the sparse Bayesian learning (SBL) framework. [sent-6, score-0.135]

4 These results challenge the prevailing reliance on strictly convex penalty functions for finding sparse solutions. [sent-8, score-0.468]

5 For both of these methods, as well as traditional factorial analogs, we demonstrate the effectiveness of reweighted 1 -norm algorithms in handling more general sparse estimation problems involving classification, group feature selection, and non-negativity constraints. [sent-10, score-0.631]

6 As a byproduct of this development, a rigorous reformulation of sparse Bayesian classification (e. [sent-11, score-0.165]

7 1 Introduction With the advent of compressive sensing and other related applications, there has been growing interest in finding sparse signal representations from redundant dictionaries [3, 5]. [sent-14, score-0.256]

8 Moreover, we would ideally like these methods to generalize to other likelihood functions and priors for applications such as non-negative sparse coding, classification, and group variable selection. [sent-28, score-0.245]

9 One common strategy is to replace x 0 with a more manageable penalty function g(x) (or prior) that still favors sparsity. [sent-29, score-0.226]

10 Given this selection, a recent, very successful optimization technique involves iterative reweighted 1 minimization, a process that produces more focal estimates with each passing iteration [3, 19]. [sent-35, score-0.448]

11 To implement this procedure, at the (k + 1)-th iteration we compute x(k+1) → arg min y − Φx x 2 2 (k) wi |xi |, +λ (3) i where wi ∂g(xi )/∂|xi |. [sent-36, score-0.405]

12 As discussed in [6], these updates are guaranteed to converge to a local minimum of the underlying cost function by satisfying the conditions of the Global Convergence Theorem (see for example [24]). [sent-37, score-0.133]

13 (k) (k) (k) While certainly successful in practice, there remain fundamental limitations as to what can be achieved using factorial penalties to approximate x 0 . [sent-43, score-0.227]

14 In this paper we consider two non-factorial methods that rely on the same basic iterative reweighted 1 minimization procedure outlined above. [sent-45, score-0.428]

15 In Section 2, we briefly introduce the non-factorial penalty function first proposed in [19] (based on a dual-form interpretation of sparse Bayesian learning) and then derive a new iterative reweighted 1 implementation that builds upon these ideas. [sent-46, score-0.667]

16 Together, these results imply that this reweighting (0) scheme can never do worse than Lasso (assuming wi = 1, ∀i), and that there will always be cases where improvement over Lasso is achieved. [sent-48, score-0.359]

17 Later in Section 3, we derive a second promising non-factorial variant by starting with a plausible 1 reweighting scheme and then working backwards to determine the form and properties of the underlying penalty function. [sent-50, score-0.38]

18 In general, iterative reweighted 1 procedures of any kind are attractive for our purposes because they can easily be augmented to handle other likelihoods and priors, provided convexity of the update (3) is preserved (of course the overall cost function being minimized will be non-convex). [sent-51, score-0.527]

19 For example, to address the extensions mentioned above, in Section 4 we explore adding constraints such as xi ≥ 0, replacing |xi | with a norm on groups of variables, and using a logistic instead of quadratic likelihood term for classification. [sent-52, score-0.196]

20 The latter extension leads to a rigorous reformulation of sparse Bayesian classification (e. [sent-53, score-0.165]

21 2 Non-Factorial Methods Based on Sparse Bayesian Learning A particularly useful non-factorial penalty emerges from a dual-space view [19] of sparse Bayesian learning (SBL) [15], which is based on the notion of automatic relevance determination (ARD) [10]. [sent-57, score-0.446]

22 Mathematically, this is equivalent to minimizing L(γ) − log p(y|x)p(x; γ)dx = − log p(y; γ) ≡ log |Σy | + y T Σ−1 y, y (4) where Σy λI + ΦΓΦT and Γ diag[γ]. [sent-61, score-0.143]

23 It is not immediately apparent how the SBL procedure, which requires optimizing a cost function in γ-space and is based on a factorial prior p(x; γ), relates to solving/approximating (1) and/or (2) via a non-factorial penalty in x-space. [sent-65, score-0.385]

24 However, it has been shown in [19] that x SBL satisfies xSBL = arg min y − Φx x where 2 2 (6) + λgSBL (x), (7) min xT Γ−1 x + log |αI + ΦΓΦT |, gSBL (x) γ≥0 assuming α = λ and |x| [|x1 |, . [sent-66, score-0.157]

25 While not discussed in [19], gSBL (x) is a general penalty function that only need have α = λ to obtain equivalence with SBL; other selections may lead to better performance (more on this in Section 4 below). [sent-70, score-0.267]

26 The analysis in [19] reveals that replacing x 0 with gSBL (x) and α → 0 leaves the globally minimizing solution to (1) unchanged but drastically reduces the number of local minima (more so than any possible factorial penalty function). [sent-71, score-0.411]

27 It can also be shown that gSBL (x) is a non-decreasing, concave function of |x| (see Appendix), a desirable property of sparsity-promoting penalties. [sent-74, score-0.129]

28 Importantly, as a direct consequence of this concavity, (6) can be optimized using a reweighted 1 algorithm (in an analogous fashion to the factorial case) using ∂gSBL (x) (k+1) wi = . [sent-75, score-0.584]

29 In fact, it can be shown that a more rudimentary form of reweighted 1 applied to this model in [19] amounts to performing exactly one such iteration. [sent-79, score-0.306]

30 From a theoretical standpoint, 1 reweighting applied to gSBL (x) is guaranteed to aid performance in the sense described by the following two results, which apply in the case where λ → 0, α → 0. [sent-81, score-0.214]

31 When applying iterative reweighted 1 using (9) and wi = 0, ∀i, the solution sparsity satisfies x(k+1) 0 ≤ x(k) 0 (i. [sent-85, score-0.593]

32 Then there exists 2 a set of signals y (with non-zero measure) generated from S such that non-factorial reweighted 1 , with W (k+1) updated using (9), always succeeds but standard 1 always fails. [sent-90, score-0.336]

33 Note that Theorem 2 does not in any way indicate what is the best non-factorial reweighting scheme in practice (for example, in our limited experience with empirical simulations, the selection α → 0 is not necessarily always optimal). [sent-91, score-0.24]

34 However, it does suggest that reweighting with non-convex, nonfactorial penalties is potentially very effective, motivating other selections as discussed next. [sent-92, score-0.345]

35 3 Bottom-Up Construction of Non-Factorial Penalty In the previous section, we described what amounts to a top-down formulation of a non-factorial penalty function that emerges from a particular hierarchical Bayesian model. [sent-94, score-0.203]

36 Based on the insights gleaned from this procedure (and its distinction from factorial penalties), it is possible to stipulate alternative penalty functions from the bottom up by creating plausible, non-factorial reweighting schemes. [sent-95, score-0.509]

37 The Achilles heel of standard, factorial penalties is that if we want to retain a global minimum similar to that of (1), we require a highly concave penalty on each xi [21]. [sent-98, score-0.576]

38 However, this implies that almost all basic feasible solutions (BFS) to y = Φx, defined as a solution with x 0 ≤ n, will form local minima of the penalty function constrained to the feasible region. [sent-99, score-0.245]

39 Consequently we would like to utilize a non-factorial, yet highly concave penalty that explicitly favors degenerate BFS. [sent-103, score-0.368]

40 We can accomplish this by constructing a reweighting scheme designed to avoid non-degenerate BFS whenever possible. [sent-104, score-0.208]

41 Now consider the covariance-like quantity αI + Φ(X (k+1) )2 ΦT , where α may be small, and then construct weights using the projection of each basis vector φi as defined via (k+1) wi → φT αI + Φ(X (k+1) )2 ΦT i −1 φi . [sent-105, score-0.15]

42 (10) Ideally, if at iteration k + 1 we are at a bad or non-degenerate BFS, we do not want the newly (k+1) computed wi to favor the present position at the next iteration of (3) by assigning overly large weights to the zero-valued xi . [sent-106, score-0.374]

43 In some sense, the distinction between (10) and its factorial counterparts, such as the method of (k+1) (k+1) Cand` s et al. [sent-109, score-0.157]

44 [3] which uses wi e → 1/(|xi |+α), can be summarized as follows: the factorial methods assign the largest weight whenever the associated coefficient goes to zero; with (10) the largest weight is only assigned when the associated coefficient goes to zero and x (k+1) 0 < n. [sent-110, score-0.336]

45 The reweighting option (10), which bears some resemblance to (9), also has some very desirable properties beyond the intuitive justification given above. [sent-111, score-0.209]

46 First, since we are utilizing (10) in the context of reweighted 1 minimization, it would productive to know what cost function, if any, we are minimizing when we compute each iteration. [sent-112, score-0.4]

47 Using the fundamental theorem of calculus for line integrals (or the gradient theorem), it follows that the bottom-up (BU) penalty function associated with (10) is 1 gBU (x) trace XΦT αI + Φ(ν X)2 ΦT −1 Φ dν. [sent-113, score-0.241]

48 (11) 0 Moreover, because each weight wi is a non-increasing function of each xj , ∀j, from Kachurovskii’s theorem [12] it directly follows that (11) is concave and non-decreasing in |x|, and thus naturally promotes sparsity. [sent-114, score-0.29]

49 4 Extensions One of the motivating factors for using iterative reweighted 1 optimization is that it is very easy to incorporate alternative likelihoods and priors. [sent-117, score-0.399]

50 Non-Negative Sparse Coding: Numerous applications require sparse solutions where all coefficients xi are constrained to be non-negative [2]. [sent-119, score-0.241]

51 By adding the contraint x ≥ 0 to (3) at each iteration, we can easily compute such solutions using gSBL (x), gBU (x), or any other appropriate penalty function. [sent-120, score-0.201]

52 Note that in the original SBL formulation, this is not a possibility since the integrals required to compute the associated cost function or update rules no longer have closed-form expressions. [sent-121, score-0.128]

53 The simultaneous sparse approximation problem [17] is a particularly useful adaptation of this idea relevant to compressive sensing [18], manifold learning [13], and neuroimaging [22]. [sent-125, score-0.24]

54 , x·r ] characterized by the same sparsity profile or support, meaning that the coefficient matrix X is row sparse. [sent-132, score-0.148]

55 The sparse recovery problems (1) and (2) then become min d(X), s. [sent-134, score-0.182]

56 d(X) favors row sparsity and is a natural extension of the 0 norm to the simultaneous approximation problem. [sent-137, score-0.279]

57 All of the algorithms discussed herein can naturally be expanded to this domain essentially by substituting the scalar coefficient magnitudes from a given (k) iteration |xi | with some row-vector penalty, such as a norm. [sent-139, score-0.198]

58 If we utilize xi· 2 , then the coefficient matrix update analogous to (3) requires the solution of the more complicated weighted second-order cone (SOC) program X (k+1) → arg min Y − ΦX X Other selections such as the ∞ 2 F (k) +λ wi xi· 2 . [sent-140, score-0.291]

59 Sparse Classifier Design: At a high level, sparse classifiers can be trained by simply substituting a (preferrably) convex likelihood function for the quadratic term in (2). [sent-142, score-0.247]

60 For example, to perform sparse logistic regression we would solve yj log(φT x) + (1 − yj ) log(1 − φT x) + λg(x), j· j· min x (14) j where now yj ∈ {0, 1} and g(x) is an arbitrary, concave-in-|x| penalty. [sent-143, score-0.352]

61 Note that cost function descent does not require that we compute the full reweighted 1 solution; the iterations from [7] naturally lend themselves to an efficient partial (or greedy) update before recomputing the weights. [sent-145, score-0.435]

62 Of course we can always substitute the reweighted 1 scheme discussed above to avoid this issue, since the underlying cost function in x-space is the same. [sent-159, score-0.462]

63 For example, with α small, the penalty gSBL (x) is more highly concave favoring sparsity, while in the limit at α becomes large, it acts like a standard 1 norm, still favoring sparsity but not exceedingly so (the same phenomena occurs when using the penalty (11)). [sent-161, score-0.596]

64 In the standard regression case where marginalization was possible, the optimized quantity − log p(y; γ) represents an approximation to − log p(y) that can be used, among other things, for model comparison. [sent-165, score-0.165]

65 While space precludes the details, if we are willing to substitute a probit likelihood function for the logistic, it is possible to revert (14) back to the original hierarchical, γ-dependent Bayesian model and obtain a rigorous upper bound on − log p(y; γ). [sent-167, score-0.164]

66 In the first experiment, each trial consisted of generating a 100 × 256 dictionary Φ with iid Gaussian entries and a sparse vector x∗ with 60 nonzero, non-negative (truncated Gaussian) coefficients. [sent-170, score-0.208]

67 , which represents the current state-of-the-art in reweighted 1 algorithms. [sent-173, score-0.306]

68 Additionally, since we are working with a noise-free signal, we assume λ → 0 and so the requisite coefficient update (3) with xi ≥ 0 reduces to a standard linear program. [sent-175, score-0.12]

69 Given wi = 1, ∀i for each algorithm, the first iteration amounts to the non-negative minimum 1 -norm solution (i. [sent-176, score-0.209]

70 Average results from 1000 random trials are displayed in Figure 1 (left), which plots the empirical probability of success in recovering x∗ versus the iteration number. [sent-179, score-0.153]

71 We observe that standard non-negative 1 never succeeds (see first iteration results); however, (0) with only a few reweighted iterations drastic improvement is possible, especially for the bottom-up approach. [sent-180, score-0.484]

72 ) This shows both the efficacy of non-factorial reweighting and the ability to handle constraints on x. [sent-183, score-0.18]

73 , x∗ ] with matching ·5 ·1 sparsity profile and iid Gaussian nonzero coefficients. [sent-187, score-0.201]

74 We then generate the signal matrix Y = ΦX ∗ and attempt to learn X ∗ using various group-level reweighting schemes. [sent-188, score-0.224]

75 In this experiment we varied the row sparsity of X ∗ from d(X ∗ ) = 30 to d(X ∗ ) = 40; in general, the more nonzero rows, the harder the recovery problem becomes. [sent-189, score-0.202]

76 Results are presented in Figure 1 (right), where the performance gap between the factorial and non-factorial approaches is very significant. [sent-191, score-0.157]

77 1 0 30 10 32 34 36 row sparsity, d(X ∗ ) 38 40 Figure 1: Left: Probability of success recovering sparse non-negative coefficients as a function of reweighted 1 iterations. [sent-213, score-0.542]

78 Right: Iterative reweighted results using 5 simultaneous signal vectors. [sent-214, score-0.397]

79 Probability of success recovering sparse coefficients for different row sparsity values, i. [sent-215, score-0.348]

80 6 Conclusion In this paper we have examined concave, non-factorial priors (which previously have received little attention) for the purpose of estimating sparse coefficients. [sent-218, score-0.172]

81 When coupled with general likelihood models and minimized using efficient iterative reweighted 1 methods, these priors offer a powerful alternative to existing state-of-the-art sparse estimation techniques. [sent-219, score-0.601]

82 We can then gSBL (x) = min xT Γ−1 x + log |αI + ΦΓΦT | = min γ≥0 γ,z≥0 Minimizing over γ for fixed x and z, we get −1/2 γi = z i |xi |, ∀i. [sent-222, score-0.129]

83 To derive the weight update (9), we only need the optimal value of each zi , which from basic convex analysis will satisfy ∂gSBL (x) 1/2 zi = . [sent-226, score-0.2]

84 We start by initializing zi → wi , ∀i and then minimize over γ using (17). [sent-228, score-0.166]

85 By substituting (17) into (20) and defining wi zi , we obtain the weight update (9). [sent-230, score-0.272]

86 This procedure is guaranteed to converge to a solution satisfying (19) [20] although, as mentioned previously, only one iteration is actually required for the overall algorithm. [sent-231, score-0.122]

87 If φi is (k+1) not in the span of ΦW (k+1) X (k+1) ΦT , then wi → ∞ and the corresponding coefficient xi (k+1) can be set to zero for all future iterations. [sent-233, score-0.198]

88 Otherwise wi can be computed efficiently using the Moore-Penrose pseudoinverse and will be strictly nonzero. [sent-234, score-0.165]

89 , x∗ 0 − 1, the minimization problem ˆ x arg min gSBL (x), s. [sent-248, score-0.143]

90 This result follows from [21] and the dual-space characterization of the penalty gSBL (x) from [19]. [sent-251, score-0.172]

91 Note that (21) is equivalent to (6) with λ → 0, so the reweighted non-factorial update (9) can be applied. [sent-252, score-0.349]

92 Zibulevsky, “A non-negative and sparse enough solution of an underdetermined linear system of equations is unique,” IEEE Trans. [sent-267, score-0.135]

93 Talbot, “Gene selection in cancer classification using sparse logistic regression with Bayesian regularization,” Bioinformatics, vol. [sent-285, score-0.247]

94 Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via 1 minimization,” Proc. [sent-291, score-0.184]

95 Willsky, “Optimal sparse representations in general overcom¸ plete bases,” IEEE Int. [sent-324, score-0.135]

96 Lemos, “Selecting landmark points for sparse manifold learning,” Advances in Neural Information Processing Systems 18, pp. [sent-348, score-0.135]

97 Tipping, “Sparse bayesian learning and the relevance vector machine,” J. [sent-356, score-0.12]

98 Faul, “Fast marginal likelihood maximisation for sparse Bayesian models,” Ninth Int. [sent-362, score-0.175]

99 Baraniuk, “Recovery of jointly sparse signals from a few random projections,” Advances in Neural Information Processing Systems 18, pp. [sent-376, score-0.135]

100 Nagarajan, “Iterative reweighted 1 and 2 methods for finding sparse solutions,” Submitted, 2009. [sent-384, score-0.441]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gsbl', 0.46), ('sbl', 0.423), ('reweighted', 0.306), ('reweighting', 0.18), ('penalty', 0.172), ('coef', 0.161), ('factorial', 0.157), ('bfs', 0.141), ('spark', 0.138), ('sparse', 0.135), ('wi', 0.121), ('lasso', 0.118), ('nagarajan', 0.115), ('sparsity', 0.112), ('concave', 0.1), ('iteration', 0.088), ('wipf', 0.086), ('xi', 0.077), ('cients', 0.07), ('penalties', 0.07), ('gbu', 0.069), ('xsbl', 0.069), ('minimization', 0.068), ('relevance', 0.063), ('bayesian', 0.057), ('cost', 0.056), ('iterative', 0.054), ('favors', 0.054), ('nonzero', 0.054), ('selections', 0.052), ('cand', 0.05), ('dictionaries', 0.049), ('logistic', 0.049), ('simultaneous', 0.047), ('min', 0.047), ('prevailing', 0.046), ('psfrag', 0.046), ('replacements', 0.046), ('zi', 0.045), ('determination', 0.045), ('minima', 0.044), ('strictly', 0.044), ('signal', 0.044), ('discussed', 0.043), ('diag', 0.043), ('classi', 0.043), ('update', 0.043), ('degenerate', 0.042), ('pro', 0.041), ('descends', 0.04), ('exceedingly', 0.04), ('likelihood', 0.04), ('theorem', 0.04), ('likelihoods', 0.039), ('ii', 0.039), ('minimizing', 0.038), ('dictionary', 0.038), ('convex', 0.038), ('mackay', 0.037), ('priors', 0.037), ('row', 0.036), ('success', 0.035), ('log', 0.035), ('iid', 0.035), ('wakin', 0.035), ('marginalization', 0.035), ('rank', 0.034), ('guaranteed', 0.034), ('boyd', 0.034), ('substituting', 0.034), ('group', 0.033), ('reliance', 0.033), ('herein', 0.033), ('ard', 0.033), ('tipping', 0.033), ('concavity', 0.033), ('selection', 0.032), ('le', 0.032), ('emerges', 0.031), ('regression', 0.031), ('iterations', 0.03), ('recovering', 0.03), ('yj', 0.03), ('precludes', 0.03), ('succeeds', 0.03), ('neuroimaging', 0.03), ('never', 0.03), ('norm', 0.03), ('rigorous', 0.03), ('minimized', 0.029), ('weight', 0.029), ('substitute', 0.029), ('integrals', 0.029), ('solutions', 0.029), ('quantity', 0.029), ('desirable', 0.029), ('scheme', 0.028), ('arg', 0.028), ('compressive', 0.028), ('modest', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

Author: David P. Wipf, Srikantan S. Nagarajan

Abstract: Finding maximally sparse representations from overcomplete feature dictionaries frequently involves minimizing a cost function composed of a likelihood (or data fit) term and a prior (or penalty function) that favors sparsity. While typically the prior is factorial, here we examine non-factorial alternatives that have a number of desirable properties relevant to sparse estimation and are easily implemented using an efficient and globally-convergent, reweighted 1 -norm minimization procedure. The first method under consideration arises from the sparse Bayesian learning (SBL) framework. Although based on a highly non-convex underlying cost function, in the context of canonical sparse estimation problems, we prove uniform superiority of this method over the Lasso in that, (i) it can never do worse, and (ii) for any dictionary and sparsity profile, there will always exist cases where it does better. These results challenge the prevailing reliance on strictly convex penalty functions for finding sparse solutions. We then derive a new non-factorial variant with similar properties that exhibits further performance improvements in some empirical tests. For both of these methods, as well as traditional factorial analogs, we demonstrate the effectiveness of reweighted 1 -norm algorithms in handling more general sparse estimation problems involving classification, group feature selection, and non-negativity constraints. As a byproduct of this development, a rigorous reformulation of sparse Bayesian classification (e.g., the relevance vector machine) is derived that, unlike the original, involves no approximation steps and descends a well-defined objective function. 1

2 0.11681068 138 nips-2009-Learning with Compressible Priors

Author: Volkan Cevher

Abstract: We describe a set of probability distributions, dubbed compressible priors, whose independent and identically distributed (iid) realizations result in p-compressible signals. A signal x ∈ RN is called p-compressible with magnitude R if its sorted coefficients exhibit a power-law decay as |x|(i) R · i−d , where the decay rate d is equal to 1/p. p-compressible signals live close to K-sparse signals (K N) in the r -norm (r > p) since their best K-sparse approximation error decreases with O R · K 1/r−1/p . We show that the membership of generalized Pareto, Student’s t, log-normal, Fr´ chet, and log-logistic distributions to the set of compresse ible priors depends only on the distribution parameters and is independent of N . In contrast, we demonstrate that the membership of the generalized Gaussian distribution (GGD) depends both on the signal dimension and the GGD parameters: the expected decay rate of N -sample iid realizations from the GGD with the shape parameter q is given by 1/ [q log (N/q)]. As stylized examples, we show via experiments that the wavelet coefficients of natural images are 1.67-compressible whereas their pixel gradients are 0.95 log (N/0.95)-compressible, on the average. We also leverage the connections between compressible priors and sparse signals to develop new iterative re-weighted sparse signal recovery algorithms that outperform the standard 1 -norm minimization. Finally, we describe how to learn the hyperparameters of compressible priors in underdetermined regression problems by exploiting the geometry of their order statistics during signal recovery. 1

3 0.11534911 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

Author: Shuheng Zhou

Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Cand` se Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model. 1

4 0.11321653 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: High-dimensional statistical inference deals with models in which the the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M -estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M -estimators have fast convergence rates. 1

5 0.10708622 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

Author: Mladen Kolar, Le Song, Eric P. Xing

Abstract: To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a sub-family of this model — piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models. 1

6 0.090076581 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

7 0.088152789 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations

8 0.085919388 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

9 0.084626064 157 nips-2009-Multi-Label Prediction via Compressed Sensing

10 0.082748324 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

11 0.080335826 246 nips-2009-Time-Varying Dynamic Bayesian Networks

12 0.078896321 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis

13 0.073295943 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

14 0.071363889 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases

15 0.070927486 147 nips-2009-Matrix Completion from Noisy Entries

16 0.070697114 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

17 0.070574619 67 nips-2009-Directed Regression

18 0.070143141 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

19 0.069414318 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

20 0.068296388 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.222), (1, 0.092), (2, 0.007), (3, 0.074), (4, -0.003), (5, -0.061), (6, 0.127), (7, -0.105), (8, 0.119), (9, 0.083), (10, -0.026), (11, -0.013), (12, -0.068), (13, 0.049), (14, 0.003), (15, 0.069), (16, 0.018), (17, -0.001), (18, -0.082), (19, -0.007), (20, 0.013), (21, -0.049), (22, -0.05), (23, 0.139), (24, 0.026), (25, 0.07), (26, -0.01), (27, 0.05), (28, -0.001), (29, -0.033), (30, -0.075), (31, -0.043), (32, -0.081), (33, 0.016), (34, 0.019), (35, 0.026), (36, 0.024), (37, -0.068), (38, -0.019), (39, -0.028), (40, 0.066), (41, -0.089), (42, -0.017), (43, -0.04), (44, 0.051), (45, -0.055), (46, -0.05), (47, -0.051), (48, -0.077), (49, -0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94522732 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

Author: David P. Wipf, Srikantan S. Nagarajan

Abstract: Finding maximally sparse representations from overcomplete feature dictionaries frequently involves minimizing a cost function composed of a likelihood (or data fit) term and a prior (or penalty function) that favors sparsity. While typically the prior is factorial, here we examine non-factorial alternatives that have a number of desirable properties relevant to sparse estimation and are easily implemented using an efficient and globally-convergent, reweighted 1 -norm minimization procedure. The first method under consideration arises from the sparse Bayesian learning (SBL) framework. Although based on a highly non-convex underlying cost function, in the context of canonical sparse estimation problems, we prove uniform superiority of this method over the Lasso in that, (i) it can never do worse, and (ii) for any dictionary and sparsity profile, there will always exist cases where it does better. These results challenge the prevailing reliance on strictly convex penalty functions for finding sparse solutions. We then derive a new non-factorial variant with similar properties that exhibits further performance improvements in some empirical tests. For both of these methods, as well as traditional factorial analogs, we demonstrate the effectiveness of reweighted 1 -norm algorithms in handling more general sparse estimation problems involving classification, group feature selection, and non-negativity constraints. As a byproduct of this development, a rigorous reformulation of sparse Bayesian classification (e.g., the relevance vector machine) is derived that, unlike the original, involves no approximation steps and descends a well-defined objective function. 1

2 0.79729795 138 nips-2009-Learning with Compressible Priors

Author: Volkan Cevher

Abstract: We describe a set of probability distributions, dubbed compressible priors, whose independent and identically distributed (iid) realizations result in p-compressible signals. A signal x ∈ RN is called p-compressible with magnitude R if its sorted coefficients exhibit a power-law decay as |x|(i) R · i−d , where the decay rate d is equal to 1/p. p-compressible signals live close to K-sparse signals (K N) in the r -norm (r > p) since their best K-sparse approximation error decreases with O R · K 1/r−1/p . We show that the membership of generalized Pareto, Student’s t, log-normal, Fr´ chet, and log-logistic distributions to the set of compresse ible priors depends only on the distribution parameters and is independent of N . In contrast, we demonstrate that the membership of the generalized Gaussian distribution (GGD) depends both on the signal dimension and the GGD parameters: the expected decay rate of N -sample iid realizations from the GGD with the shape parameter q is given by 1/ [q log (N/q)]. As stylized examples, we show via experiments that the wavelet coefficients of natural images are 1.67-compressible whereas their pixel gradients are 0.95 log (N/0.95)-compressible, on the average. We also leverage the connections between compressible priors and sparse signals to develop new iterative re-weighted sparse signal recovery algorithms that outperform the standard 1 -norm minimization. Finally, we describe how to learn the hyperparameters of compressible priors in underdetermined regression problems by exploiting the geometry of their order statistics during signal recovery. 1

3 0.74774343 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis

Author: Sundeep Rangan, Alyson K. Fletcher

Abstract: A well-known analysis of Tropp and Gilbert shows that orthogonal matching pursuit (OMP) can recover a k-sparse n-dimensional real vector from m = 4k log(n) noise-free linear measurements obtained through a random Gaussian measurement matrix with a probability that approaches one as n → ∞. This work strengthens this result by showing that a lower number of measurements, m = 2k log(n − k), is in fact sufficient for asymptotic recovery. More generally, when the sparsity level satisfies kmin ≤ k ≤ kmax but is unknown, m = 2kmax log(n − kmin ) measurements is sufficient. Furthermore, this number of measurements is also sufficient for detection of the sparsity pattern (support) of the vector with measurement errors provided the signal-to-noise ratio (SNR) scales to infinity. The scaling m = 2k log(n − k) exactly matches the number of measurements required by the more complex lasso method for signal recovery in a similar SNR scaling.

4 0.73735338 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

Author: Liang Sun, Jun Liu, Jianhui Chen, Jieping Ye

Abstract: We consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). Recent theoretical studies focus on the convex relaxation of the MMV problem based on the (2, 1)-norm minimization, which is an extension of the well-known 1-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly much more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP) problem, which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals the close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm.

5 0.716591 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

Author: Shuheng Zhou

Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Cand` se Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model. 1

6 0.71582139 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction

7 0.66273844 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

8 0.63131475 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

9 0.6100949 157 nips-2009-Multi-Label Prediction via Compressed Sensing

10 0.60598415 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

11 0.6025421 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

12 0.59669822 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

13 0.5777058 76 nips-2009-Efficient Learning using Forward-Backward Splitting

14 0.57682049 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

15 0.55438471 67 nips-2009-Directed Regression

16 0.54247677 180 nips-2009-On the Convergence of the Concave-Convex Procedure

17 0.53707814 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

18 0.51568335 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

19 0.50433224 93 nips-2009-Fast Image Deconvolution using Hyper-Laplacian Priors

20 0.49964836 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.036), (25, 0.064), (35, 0.044), (36, 0.549), (39, 0.022), (58, 0.096), (71, 0.038), (81, 0.014), (86, 0.051), (91, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98998624 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu

Abstract: Dependencies among neighbouring labels in a sequence is an important source of information for sequence labeling problems. However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we show that it is possible to design efficient inference algorithms for a conditional random field using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective. 1

2 0.98901272 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein

Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning. 1

3 0.98815614 47 nips-2009-Boosting with Spatial Regularization

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

4 0.98311877 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

Author: Saketha N. Jagarlapudi, Dinesh G, Raman S, Chiranjib Bhattacharyya, Aharon Ben-tal, Ramakrishnan K.r.

Abstract: Motivated from real world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels are grouped into distinct components where each component is crucial for the learning task at hand. The formulation hence employs l∞ regularization for promoting combinations at the component level and l1 regularization for promoting sparsity among kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. The MD procedure optimizes over product of simplexes, which is not a well-studied case in literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD based algorithm outperforms stateof-the-art MKL solvers like simpleMKL in terms of computational effort. 1

5 0.98191446 238 nips-2009-Submanifold density estimation

Author: Arkadas Ozakin, Alexander G. Gray

Abstract: Kernel density estimation is the most widely-used practical method for accurate nonparametric density estimation. However, long-standing worst-case theoretical results showing that its performance worsens exponentially with the dimension of the data have quashed its application to modern high-dimensional datasets for decades. In practice, it has been recognized that often such data have a much lower-dimensional intrinsic structure. We propose a small modification to kernel density estimation for estimating probability density functions on Riemannian submanifolds of Euclidean space. Using ideas from Riemannian geometry, we prove the consistency of this modified estimator and show that the convergence rate is determined by the intrinsic dimension of the submanifold. We conclude with empirical results demonstrating the behavior predicted by our theory. 1 Introduction: Density estimation and the curse of dimensionality Kernel density estimation (KDE) [8] is one of the most popular methods for estimating the underlying probability density function (PDF) of a dataset. Roughly speaking, KDE consists of having the data points “contribute” to the estimate at a given point according to their distances from the ˆ point. In the simplest multi-dimensional KDE [3], the estimate fm (y0 ) of the PDF f (y0 ) at a point N y0 ∈ R is given in terms of a sample {y1 , . . . , ym } as, 1 ˆ fm (y0 ) = m m i=1 1 K hN m yi − y0 hm , (1) where hm > 0, the bandwidth, is chosen to approach to zero at a suitable rate as the number m of data points increases, and K : [0.∞) → [0, ∞) is a kernel function that satisfies certain properties such as boundedness. Various theorems exist on the different types of convergence of the estimator to the correct result and the rates of convergence. The earliest result on the pointwise convergence rate in the multivariable case seems to be given in [3], where it is stated that under certain conditions for f and K, assuming hm → 0 and mhm → ∞ as m → ∞, the mean squared ˆ ˆ error in the estimate f (y0 ) of the density at a point goes to zero with the rate, MSE[fm (y0 )] = E ˆ fm (y0 ) − f (y0 ) 2 = O h4 + m 1 mhN m as m → ∞. If hm is chosen to be proportional to m−1/(N +4) , one gets, ˆ MSE[fm (p)] = O 1 m4/(N +4) , (2) as m → ∞. This is an example of a curse of dimensionality; the convergence rate slows as the dimensionality N of the data set increases. In Table 4.2 of [12], Silverman demonstrates how the sample size required for a given mean square error for the estimate of a multivariable normal distribution increases with the dimensionality. The numbers look as discouraging as the formula 2. 1 One source of optimism towards various curses of dimensionality is the fact that although the data for a given problem may have many features, in reality the intrinsic dimensionality of the “data subspace” of the full feature space may be low. This may result in there being no curse at all, if the performance of the method/algorithm under consideration can be shown to depend only on the intrinsic dimensionality of the data. Alternatively, one may be able to avoid the curse by devising ways to work with the low-dimensional data subspace by using dimensional reduction techniques on the data. One example of the former case is the results on nearest neighbor search [6, 2] which indicate that the performance of certain nearest-neighbor search algortihms is determined not by the full dimensionality of the feature space, but only on the intrinsic dimensionality of the data subspace. Riemannian manifolds. In this paper, we will assume that the data subspace is a Riemannian manifold. Riemannian manifolds provide a generalization of the notion of a smooth surface in R3 to higher dimensions. As first clarified by Gauss in the two-dimensional case (and by Riemann in the general case) it turns out that intrinsic features of the geometry of a surface such as lengths of its curves or intrinsic distances between its points, etc., can be given in terms of the so-called metric tensor1 g without referring to the particular way the the surface is embedded in R3 . A space whose geometry is defined in terms of a metric tensor is called a Riemannian manifold (for a rigorous definition, see, e.g., [5, 7, 1]). Previous work. In [9], Pelletier defines an estimator of a PDF on a Riemannian manifold M by using the distances measured on M via its metric tensor, and obtains the same convergence rate as in (2), with N being replaced by the dimensionality of the Riemannian manifold. Thus, if we know that the data lives on a Riemannian manifold M , the convergence rate of this estimator will be determined by the dimensionality of M , instead of the full dimensionality of the feature space on which the data may have been originally sampled. While an interesting generalization of the usual KDE, this approach assumes that the data manifold M is known in advance, and that we have access to certain geometric quantities related to this manifold such as intrinsic distances between its points and the so-called volume density function. Thus, this Riemannian KDE cannot be used directly in a case where the data lives on an unknown Riemannian submanifold of RN . Certain tools from existing nonlinear dimensionality reduction methods could perhaps be utilized to estimate the quantities needed in the estimator of [9], however, a more straightforward method that directly estimates the density of the data as measured in the subspace is desirable. Other related works include [13], where the authors propose a submanifold density estimation method that uses a kernel function with a variable covariance but do not present theorerical results, [4] where the author proposes a method for doing density estimation on a Riemannian manifold by using the eigenfunctions of the Laplace-Beltrami operator, which, as in [9], assumes that the manifold is known in advance, together with intricate geometric information pertaining to it, and [10, 11], which discuss various issues related to statistics on a Riemannian manifold. This paper. In this paper, we propose a direct way to estimate the density of Euclidean data that lives on a Riemannian submanifold of RN with known dimension n < N . We prove the pointwise consistency of the estimator, and prove bounds on its convergence rates given in terms of the intrinsic dimension of the submanifold the data lives in. This is an example of the avoidance of the curse of dimensionality in the manner mentioned above, by a method whose performance depends on the intrinsic dimensionality of the data instead of the full dimensionality of the feature space. Our method is practical in that it works with Euclidean distances on RN . In particular, we do not assume any knowledge of the quantities pertaining to the intrinsic geometry of the underlying submanifold such as its metric tensor, geodesic distances between its points, its volume form, etc. 2 The estimator and its convergence rate Motivation. In this paper, we are concerned with the estimation of a PDF that lives on an (unknown) n-dimensional Riemannian submanifold M of RN , where N > n. Usual, N -dimensional kernel density estimation would not work for this problem, since if interpreted as living on RN , the 1 The metric tensor can be thought of as giving the “infinitesimal distance” ds between two points whose P coordinates differ by the infinitesimal amounts (dy 1 , . . . , dy N ) as ds2 = ij gij dy i dy j . 2 underlying PDF would involve a “delta function” that vanishes when one moves away from M , and “becomes infinite” on M in order to have proper normalization. More formally, the N -dimensional probability measure for such an n-dimensional PDF on M will have support only on M , will not be absolutely continuous with respect to the Lebesgue measure on RN , and will not have a probability density function on RN . If one attempts to use the usual, N -dimensional KDE for data drawn from such a probability measure, the estimator will “try to converge” to a singular PDF, one that is infinite on M , zero outside. In order to estimate the probability density function on M by using data given in RN , we propose a simple modification of usual KDE on RN , namely, to use a kernel that is normalized for n-dimensions instead of N , while still using the Euclidean distances in RN . The intuition behind this approach is based on three facts: 1) For small distances, an n-dimensional Riemannian manifold “looks like” Rn , and densities in Rn should be estimated by an n-dimensional kernel, 2) For points of M that are close enough to each other, the intrinsic distances as measured on M are close to Euclidean distances as measured in RN , and, 3) For small bandwidths, the main contribution to the estimate at a point comes from data points that are nearby. Thus, as the number of data points increases and the bandwidth is taken to be smaller and smaller, estimating the density by using a kernel normalized for n-dimensions and distances as measured in RN should give a result closer and closer to the correct value. We will next give the formal definition of the estimator motivated by these considerations, and state our theorem on its asymptotics. As in the original work of Parzen [8], the proof that the estimator is asymptotically unbiased consists of proving that as the bandwidth converges to zero, the kernel function becomes a “delta function”. This result is also used in showing that with an appropriate choice of vanishing rate for the bandwidth, the variance also vanishes asymptotically, hence the estimator is pointwise consistent. Statement of the theorem Let M be an n-dimensional, embedded, complete Riemannian submanifold of RN (n < N ) with an induced metric g and injectivity radius rinj > 0.2 Let d(p, q) be the length of a length-minimizing geodesic in M between p, q ∈ M , and let u(p, q) be the geodesic (linear) distance between p and q as measured in RN . Note that u(p, q) ≤ d(p, q). We will use the notation up (q) = u(p, q) and dp (q) = d(p, q). We will denote the Riemannian volume measure on M by V , and the volume form by dV . Theorem 2.1. Let f : M → [0, ∞) be a probability density function defined on M (so that the related probability measure is f V ), and K : [0, ∞) → [0, ∞) be a continous function that satisfies vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisfies, n z ≤1 K( z )d z = 1. Assume f is differentiable to second order in a neighborhood of p ∈ M , ˆ and for a sample q1 , . . . , qm of size m drawn from the density f , define an estimator fm (p) of f (p) as, m 1 1 up (qj ) ˆ fm (p) = (3) K n m j=1 hm hm where hm > 0. If hm satisfies limm→∞ hm = 0 and limm→∞ mhn = ∞, then, there exists m non-negative numbers m∗ , Cb , and CV such that for all m > m∗ we have, ˆ MSE fm (p) = E ˆ fm (p) − f (p) 2 < Cb h4 + m CV . mhn m If hm is chosen to be proportional to m−1/(n+4) , this gives, E (fm (p) − f (p))2 = O as m → ∞. (4) 1 m4/(n+4) Thus, the convergence rate of the estimator is given as in [3, 9], with the dimensionality replaced by the intrinsic dimension n of M . The proof will follow from the two lemmas below on the convergence rates of the bias and the variance. 2 The injectivity radius rinj of a Riemannian manifold is a distance such that all geodesic pieces (i.e., curves with zero intrinsic acceleration) of length less than rinj minimize the length between their endpoints. On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair of points, however, an arbitrary geodesic need not be distance minimizing. For example, any two non-antipodal points on the sphere can be connected with two geodesics with different lengths, namely, the two pieces of the great circle passing throught the points. For a detailed discussion of these issues, see, e.g., [1]. 3 3 Preliminary results The following theorem, which is analogous to Theorem 1A in [8], tells that up to a constant, the kernel becomes a “delta function” as the bandwidth gets smaller. Theorem 3.1. Let K : [0, ∞) → [0, ∞) be a continuous function that vanishes outside [0, 1) and is differentiable with a bounded derivative in [0, 1), and let ξ : M → R be a function that is differentiable to second order in a neighborhood of p ∈ M . Let ξh (p) = 1 hn K M up (q) h ξ(q) dV (q) , (5) where h > 0 and dV (q) denotes the Riemannian volume form on M at point q. Then, as h → 0, K( z )dn z = O(h2 ) , ξh (p) − ξ(p) (6) Rn where z = (z 1 , . . . , z n ) denotes the Cartesian coordinates on Rn and dn z = dz 1 . . . dz n denotes the volume form on Rn . In particular, limh→0 ξh (p) = ξ(p) Rn K( z )dn z. Before proving this theorem, we prove some results on the relation between up (q) and dp (q). Lemma 3.1. There exist δup > 0 and Mup > 0 such that for all q with dp (q) ≤ δup , we have, 3 dp (q) ≥ up (q) ≥ dp (q) − Mup [dp (q)] . In particular, limq→p up (q) dp (q) (7) = 1. Proof. Let cv0 (s) be a geodesic in M parametrized by arclength s, with c(0) = p and initial vedcv locity ds0 s=0 = v0 . When s < rinj , s is equal to dp (cv0 (s)) [7, 1]. Now let xv0 (s) be the representation of cv0 (s) in RN in terms of Cartesian coordinates with the origin at p. We have up (cv0 (s)) = xv0 (s) and x′ 0 (s) = 1, which gives3 x′ 0 (s) · x′′0 (s) = 0. Using these v v v we get, dup (cv0 (s)) ds s=0 the absolute value of the third d3 up (cv0 (s)) ds3 d2 up (cv0 (s)) = ds2 s=0 derivative of up (cv0 (s)) = 1 , and 0. Let M3 ≥ 0 be an upper bound on for all s ≤ rinj and all unit length v0 : 3 ≤ M3 . Taylor’s theorem gives up (cv0 (s)) = s + Rv0 (s) where |Rv0 (s)| ≤ M3 s . 3! Thus, (7) holds with Mup = M3 , for all r < rinj . For later convenience, instead of δu = rinj , 3! we will pick δup as follows. The polynomial r − Mup r3 is monotonically increasing in the interval 0 ≤ r ≤ 1/ 3Mup . We let δup = min{rinj , 1/ Mup }, so that r − Mup r3 is ensured to be monotonic for 0 ≤ r ≤ δup . Definition 3.2. For 0 ≤ r1 < r2 , let, Hp (r1 , r2 ) = Hp (r) = inf{up (q) : r1 ≤ dp (q) < r2 } , Hp (r, ∞) = inf{up (q) : r1 ≤ dp (q)} , (8) (9) i.e., Hp (r1 , r2 ) is the smallest u-distance from p among all points that have a d-distance between r1 and r2 . Since M is assumed to be an embedded submanifold, we have Hp (r) > 0 for all r > 0. In the below, we will assume that all radii are smaller than rinj , in particular, a set of the form {q : r1 ≤ dp (q) < r2 } will be assumed to be non-empty and so, due to the completeness of M , to contain a point q ∈ M such that dp (q) = r1 . Note that, Hp (r1 ) = min{H(r1 , r2 ), H(r2 )} . (10) Lemma 3.2. Hp (r) is a non-decreasing, non-negative function, and there exist δHp > 0 and MHp ≥ H (r) 0 such that, r ≥ Hp (r) ≥ r − MHp r3 , for all r < δHp . In particular, limr→0 pr = 1. 3 Primes denote differentiation with respect to s. 4 Proof. Hp (r) is clearly non-decreasing and Hp (r) ≤ r follows from up (q) ≤ dp (q) and the fact that there exists at least one point q with dp (q) = r in the set {q : r ≤ dp (q)} Let δHp = Hp (δup ) where δup is as in the proof of Lemma 3.1 and let r < δHp . Since r < δHp = Hp (δup ) ≤ δup , by Lemma 3.1 we have, r ≥ up (r) ≥ r − Mup r3 , (11) for some Mup > 0. Now, since r and r − Mup r3 are both monotonic for 0 ≤ r ≤ δup , we have (see figure) (12) r ≥ Hp (r, δup ) ≥ r − Mup r3 . In particular, H(r, δup ) ≤ r < δHp = Hp (δup ), i.e, H(r, δup ) < Hp (δup ). Using (10) this gives, Hp (r) = Hp (r, δup ). Combining this with (12), we get r ≥ Hp (r) ≥ r − Mup r3 for all r < δHp . Next we show that for all small enough h, there exists some radius Rp (h) such that for all points q with a dp (q) ≥ Rp (h), we have up (q) ≥ h. Rp (h) will roughly be the inverse function of Hp (r). Lemma 3.3. For any h < Hp (rinj ), let Rp (h) = sup{r : Hp (r) ≤ h}. Then, up (q) ≥ h for all q with dp (q) ≥ Rp (h) and there exist δRp > 0 and MRp > 0 such that for all h ≤ δRp , Rp (h) satisfies, (13) h ≤ Rp (h) ≤ h + MRp h3 . In particular, limh→0 Rp (h) h = 1. Proof. That up (q) ≥ h when dq (q) ≥ Rp (h) follows from the definitions. In order to show (13), we will use Lemma 3.2. Let α(r) = r − MHp r3 , where MHp is as in Lemma 3.2. Then, α(r) is oneto-one and continuous in the interval 0 ≤ r ≤ δHp ≤ δup . Let β = α−1 be the inverse function of α in this interval. From the definition of Rp (h) and Lemma 3.2, it follows that h ≤ Rp (h) ≤ β(h) for all h ≤ α(δHp ). Now, β(0) = 0, β ′ (0) = 1, β ′′ (0) = 0, so by Taylor’s theorem and the fact that the third derivative of β is bounded in a neighborhood of 0, there exists δg and MRp such that β(h) ≤ h + MRp h3 for all h ≤ δg . Thus, h ≤ Rp (h) ≤ h + MRp h3 , (14) for all h ≤ δR where δR = min{α(δHp ), δg }. Proof of Theorem 3.1. We will begin by proving that for small enough h, there is no contribution to the integral in the definition of ξh (p) (see (5)) from outside the coordinate patch covered by normal coordinates.4 Let h0 > 0 be such that Rp (h0 ) < rinj (such an h0 exists since limh→ 0 Rp (h) = 0). For any h ≤ h0 , all points q with dp (q) > rinj will satisfy up (q) > h. This means if h is small enough, u (q) K( ph ) = 0 for all points outside the injectivity radius and we can perform the integral in (5) solely in the patch of normal coordinates at p. For normal coordinates y = (y 1 , . . . , y n ) around the point p with y(p) = 0, we have dp (q) = y(q) [7, 1]. With slight abuse of notation, we will write up (y(q)) = up (q), ξ(y(q)) = ξ(q) and g(q) = g(y(q)), where g is the metric tensor of M . Since K( up (q) h ) = 0 for all q with dp (q) > Rp (h), we have, ξh (p) = 1 hn K y ≤Rp (h) up (y) h ξ(y) g(y)dy 1 . . . dy n , (15) 4 Normal coordinates at a point p in a Riemannian manifold are a close approximation to Cartesian coordinates, in the sense that the components of the metric have vanishing first derivatives at p, and gij (p) = δij [1]. Normal coordinates can be defined in a “geodesic ball” of radius less than rinj . 5 where g denotes the determinant of g as calculated in normal coordinates. Changing the variable of integration to z = y/h, we get, K( z )dn z = ξh (p) − ξ(p) up (zh) h K z ≤Rp (h)/h up (zh) h K = z ≤1 ξ(zh) K z ≤1 z ≤1 g(zh) − 1 dn z + ξ(zh) up (zh) h K( z )dn z g(zh)dn z − ξ(0) ξ(zh) − K( z ) dn z + K( z ) (ξ(zh) − ξ(0)) dn z + z ≤1 K 1< z ≤Rp (h)/h up (zh) h ξ(zh) g(zh)dn z . Thus, K ( z ) dn z ≤ ξh (p) − ξ(p) (16) t∈R z ≤1 sup |ξ(zh)| . sup K( z ≤1 z ≤1 dn z + g(zh) − 1 . sup K(t) . sup |ξ(zh)| . sup z ≤1 (17) z ≤1 up (zh) ) − K( z ) . h dn z + (18) z ≤1 K( z )(ξ(zh) − ξ(0))dn z + (19) z ≤1 sup K(t) . t∈R sup g(zh) . 1< z ≤Rp (h)/h sup dn z . (20) |ξ(zh)| . 1< z ≤Rp (h)/h 1< z ≤Rp (h)/h Letting h → 0, the terms (17)-(20) approach zero at the following rates: (17): K(t) is bounded and ξ(y) is continuous at y = 0, so the first two terms can be bounded by constants as h → 0. In normal coordinates y, gij (y) = δij + O( y 2 ) as y → 0, so, sup z ≤1 g(zh) − 1 = O(h2 ) as h → 0. (18): Since K is assumed to be differentiable with a bounded derivative in [0, 1), we get K(b) − u (zh) K(a) = O(b − a) as b → a. By Lemma 3.1 we have p h − z = O(h2 ) as h → 0. Thus, K up (zh) h − K( z ) = O(h2 ) as h → 0. (19): Since ξ(y) is assumed to have partial derivatives up to second order in a neighborhood of y(p) = 0, for z ≤ 1, Taylor’s theorem gives, n zi ξ(zh) = ξ(0) + h i=1 as h → 0. Since h → 0. z ≤1 zK( z )dn z = 0, we get ∂ξ(y) ∂y i z ≤1 + O(h2 ) (21) y=0 K( z )(ξ(zh) − ξ(0))dn z = O(h2 ) as (20): The first three terms can be bounded by constants. By Lemma 3.3, Rp (h) = h + O(h3 ) as h → 0. A spherical shell 1 < z ≤ 1 + ǫ has volume O(ǫ) as ǫ → 0+ . Thus, the volume of 1 < z ≤ Rp (h)/h is O(Rp (h)/h − 1) = O(h2 ) as h → 0. Thus, the sum of the terms (17-20), is O(h2 ) as h → 0, as claimed in Theorem 3.1. 6 4 Bias, variance and mean squared error ˆ Let M , f , fm , K, p be as in Theorem 2.1 and assume hm → 0 as m → ∞. ˆ Lemma 4.1. Bias fm (p) = O(h2 ), as m → ∞. m u (q) Proof. We have Bias[fm (p)] = Bias h1 K ph m follows from Theorem 3.1 with ξ replaced with f . , so recalling Rn K( z )dn z = 1, the lemma Lemma 4.2. If in addition to hm → 0, we have mhn → ∞ as m → ∞, then, Var[fm (p)] = m O 1 mhn m , as m → ∞. Proof. Var[fm (p)] = 1 1 Var n K m hm up (q) hm (22) (23) Now, Var 1 K hn m up (q) hm =E 1 K2 h2n m up (q) hm 1 hn m f (q) − E 1 K hn m 2 up (q) hm , (24) and, E 1 K2 h2n m up (q) hm = M 1 2 K hn m up (q) hm dV (q) . (25) By Theorem 3.1, the integral in (25) converges to f (p) K 2 ( z )dn z, so, the right hand side of (25) is O 1 hn m ˆ Var[fm (p)] = O as m → ∞. By Lemma 4.1 we have, E 1 mhn m 1 hn K m up (q) hm 2 → f 2 (p). Thus, as m → ∞. ˆ ˆ ˆ Proof of Theorem 2.1 Finally, since MSE fm (p) = Bias2 [fm (p)] + Var[fm (p)], the theorem follows from Lemma 4.1 and 4.2. 5 Experiments and discussion We have empirically tested the estimator (3) on two datasets: A unit normal distribution mapped onto a piece of a spiral in the plane, so that n = 1 and N = 2, and a uniform distribution on the unit disc x2 + y 2 ≤ 1 mapped onto the unit hemisphere by (x, y) → (x, y, 1 − x2 + y 2 ), so that n = 2 and N = 3. We picked the bandwidths to be proportional to m−1/(n+4) where m is the number of data points. We performed live-one-out estimates of the density on the data points, and obtained the MSE for a range of ms. See Figure 5. 6 Conclusion and future work We have proposed a small modification of the usual KDE in order to estimate the density of data that lives on an n-dimensional submanifold of RN , and proved that the rate of convergence of the estimator is determined by the intrinsic dimension n. This shows that the curse of dimensionality in KDE can be overcome for data with low intrinsic dimension. Our method assumes that the intrinsic dimensionality n is given, so it has to be supplemented with an estimator of the dimension. We have assumed various smoothness properties for the submanifold M , the density f , and the kernel K. We find it likely that our estimator or slight modifications of it will be consistent under weaker requirements. Such a relaxation of requirements would have practical consequences, since it is unlikely that a generic data set lives on a smooth Riemannian manifold. 7 MSE MSE Mean squared error for the hemisphere data Mean squared error for the spiral data 0.000175 0.0008 0.00015 0.000125 0.0006 0.0001 0.000075 0.0004 0.00005 0.0002 0.000025 # of data points 50000 100000 150000 200000 # of data points 50000 100000 150000 200000 Figure 1: Mean squared error as a function of the number of data points for the spiral data (left) and the hemisphere data. In each case, we fit a curve of the form M SE(m) = amb , which gave b = −0.80 for the spiral and b = −0.69 for the hemisphere. Theorem 2.1 bounds the MSE by Cm−4/(n+4) , which gives the exponent as −0.80 for the spiral and −0.67 for the hemisphere. References [1] M. Berger and N. Hitchin. A panoramic view of Riemannian geometry. The Mathematical Intelligencer, 28(2):73–74, 2006. [2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning, pages 97–104. ACM New York, NY, USA, 2006. [3] T. Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18(1):179–189, 1966. [4] H. Hendriks. Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. The Annals of Statistics, 18(2):832–849, 1990. [5] J. Jost. Riemannian geometry and geometric analysis. Springer, 2008. [6] F. Korn, B. Pagel, and C. Faloutsos. On dimensionality and self-similarity . IEEE Transactions on Knowledge and Data Engineering, 13(1):96–111, 2001. [7] J. Lee. Riemannian manifolds: an introduction to curvature. Springer Verlag, 1997. [8] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, pages 1065–1076, 1962. [9] B. Pelletier. Kernel density estimation on Riemannian manifolds. Statistics and Probability Letters, 73(3):297–304, 2005. [10] X. Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In IEEE Workshop on Nonlinear Signal and Image Processing, volume 4. Citeseer, 1999. [11] X. Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, 2006. [12] B. Silverman. Density estimation for statistics and data analysis. Chapman & Hall/CRC, 1986. [13] P. Vincent and Y. Bengio. Manifold Parzen Windows. Advances in Neural Information Processing Systems, pages 849–856, 2003. 8

same-paper 6 0.97995073 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

7 0.82608682 76 nips-2009-Efficient Learning using Forward-Backward Splitting

8 0.82205665 128 nips-2009-Learning Non-Linear Combinations of Kernels

9 0.81253576 193 nips-2009-Potential-Based Agnostic Boosting

10 0.81079745 27 nips-2009-Adaptive Regularization of Weight Vectors

11 0.80956864 129 nips-2009-Learning a Small Mixture of Trees

12 0.80445188 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

13 0.79790574 72 nips-2009-Distribution Matching for Transduction

14 0.79584479 166 nips-2009-Noisy Generalized Binary Search

15 0.79236907 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

16 0.79185492 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

17 0.7911182 180 nips-2009-On the Convergence of the Concave-Convex Procedure

18 0.79014027 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

19 0.78899544 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

20 0.78699315 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm