nips nips2008 nips2008-149 knowledge-graph by maker-knowledge-mining

149 nips-2008-Near-minimax recursive density estimation on the binary hypercube


Source: pdf

Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva

Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2^d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. [sent-8, score-0.397]

2 For d covariates, there are $2^d$ basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. [sent-9, score-0.069]

3 1 Introduction Multivariate binary data arise in a variety of fields, such as biostatistics [1], econometrics [2] or artificial intelligence [3]. [sent-12, score-0.08]

4 In these and other settings, it is often necessary to estimate a probability density from a number of independent observations. [sent-13, score-0.117]

5 samples from a probability density f (with respect to the counting measure) on the d-dimensional binary hypercube $\mathbb{B}^d$, $\mathbb{B} = \{0, 1\}$, and seek an estimate $\hat f$ of $f$ with a small mean-squared error $\mathrm{MSE}(f, \hat f) = \mathbb{E} \sum_{x \in \mathbb{B}^d} (\hat f(x) - f(x))^2$. [sent-17, score-0.179]

6 In many cases of practical interest, the number of covariates d is much larger than log n, so direct estimation of f as a multinomial density with 2d parameters is both unreliable and impractical. [sent-18, score-0.214]

7 Thus, one has to resort to “nonparametric” methods and search for good estimators in a suitably defined class whose complexity grows with n. [sent-19, score-0.139]

8 Some nonparametric methods proposed in the literature, such as kernels [4] and orthogonal expansions [5, 6], either have very slow rates of MSE convergence or are computationally prohibitive for large d. [sent-20, score-0.107]

9 For example, the kernel method [4] requires $O(n^2 d)$ operations to compute the estimate at any $x \in \mathbb{B}^d$, yet its MSE decays as $O(n^{-4/(4+d)})$ [7], which is extremely slow when d is large. [sent-21, score-0.05]

10 In contrast, orthogonal function methods generally have much better MSE decay rates, but rely on estimating $2^d$ coefficients in a fixed basis, which requires enormous computational resources for large d. [sent-22, score-0.106]

11 For instance, using the Fast Hadamard Transform to estimate the coefficients in the so-called Walsh basis using n samples requires $O(nd2^d)$ operations [5]. [sent-23, score-0.05]

12 In this paper we take up the problem of accurate, computationally tractable estimation of a density on the binary hypercube. [sent-24, score-0.16]

13 We take the minimax point of view, where we assume that f comes from a particular function class $\mathcal{F}$ and seek an estimator that approximately attains the minimax MSE $R^*_n(\mathcal{F}) = \inf_{\hat f} \sup_{f \in \mathcal{F}} \mathrm{MSE}(f, \hat f)$, where the infimum is over all estimators based on n i.i.d. samples. [sent-25, score-0.361]

14 We will define our function class to reflect another feature often encountered in situations involving multivariate binary data: namely, that the shape of the underlying density is strongly influenced by small constellations of the d covariates. [sent-29, score-0.188]

15 To model such “constellation effects” mathematically, we will consider classes of densities that satisfy a particular sparsity condition. [sent-31, score-0.15]

16 The algorithm entails recursively examining empirical estimates of whole blocks of the $2^d$ basis coefficients. [sent-33, score-0.049]

17 An additional attractive feature of our approach is that it gives us a principled way of trading off MSE against computational complexity by controlling the decay of the threshold as a function of the recursion depth. [sent-36, score-0.186]

18 Let $\mu_d$ denote the counting measure on the d-dimensional binary hypercube $\mathbb{B}^d$. [sent-41, score-0.108]

19 When η = 1/2, we get the standard Walsh system used in [5, 6]; in that case, we shall omit the index η = 1/2 for simplicity. [sent-57, score-0.086]

20 The product structure of the biased Walsh bases makes them especially convenient for statistical applications as it allows for a computationally efficient recursive method for computing accurate estimates of squared coefficients in certain hierarchically structured sets. [sent-58, score-0.273]
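To make the basis concrete, here is a minimal NumPy sketch of the standard (η = 1/2) Walsh functions, assuming the counting-measure normalization $\varphi_s(x) = 2^{-d/2}(-1)^{\langle s, x\rangle}$ (chosen to be consistent with the bound $R(1/2) = 2^{-d/2}$ quoted below); the function names are illustrative, not the paper's.

```python
import numpy as np

def walsh(s, x):
    """Standard Walsh function phi_s(x) = 2^{-d/2} (-1)^{<s, x>} on {0,1}^d.

    With this normalization the 2^d functions are orthonormal in L^2 of the
    counting measure mu_d on the hypercube.
    """
    s, x = np.asarray(s), np.asarray(x)
    return 2.0 ** (-s.size / 2) * (-1.0) ** int(np.dot(s, x))

# Orthonormality check on a tiny cube (d = 3).
d = 3
cube = np.array([[(i >> j) & 1 for j in range(d)] for i in range(2 ** d)])
gram = np.array([[sum(walsh(s, x) * walsh(t, x) for x in cube) for t in cube]
                 for s in cube])
assert np.allclose(gram, np.eye(2 ** d))
```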

21 We are interested in densities whose representations in some biased Walsh basis satisfy a certain sparsity constraint. [sent-60, score-0.207]

22 It is not hard to show that the coefficients of any probability density on $\mathbb{B}^d$ in $\Phi_{d,\eta}$ are bounded by $R(\eta) = [\eta \vee (1 - \eta)]^{d/2}$. [sent-71, score-0.09]

23 When $\eta = 1/2$, with $R(\eta) = 2^{-d/2}$, we shall write simply $\mathcal{F}_d(p)$. [sent-74, score-0.057]

24 Given any $f \in \mathcal{F}_d(p, \eta)$ and denoting by $f_k$ the function obtained from it by retaining only the k largest coefficients, we get from Parseval's identity that $\|f - f_k\|_{L^2(\mu_d)} \le C R k^{-r}$. [sent-84, score-0.099]
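As a small illustration of this Parseval argument, the sketch below computes the exact $L^2(\mu_d)$ error of the best k-term approximation directly from a coefficient vector; the helper name is an assumption made here for illustration.

```python
import numpy as np

def k_term_error(theta, k):
    """L2(mu_d) error of keeping only the k largest-magnitude coefficients.

    By Parseval's identity the squared error is the sum of squares of the
    dropped coefficients, so no evaluation on the hypercube is required.
    """
    theta = np.asarray(theta, dtype=float).ravel()
    order = np.argsort(np.abs(theta))[::-1]      # indices by decreasing |theta_s|
    return float(np.sqrt(np.sum(theta[order[k:]] ** 2)))

# Example: a sparse coefficient vector on a d = 10 cube; the error shrinks with k.
rng = np.random.default_rng(0)
theta = np.zeros(2 ** 10)
theta[rng.choice(theta.size, size=20, replace=False)] = rng.normal(size=20)
print(k_term_error(theta, 5), k_term_error(theta, 20))
```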

25 Other densities in {Φd (p, η) : 0 < p < 2} include, for example, mixtures of components that, up to a permutation of {1, . [sent-89, score-0.094]

26 , d}, can be written as a tensor product of a large number of Bernoulli(η ∗ ) densities and some other density. [sent-92, score-0.147]

27 3 Density estimation via recursive Walsh thresholding We now turn to our problem of estimating a density f on $\mathbb{B}^d$ from a sample $\{X_i\}_{i=1}^n$ when $f \in \mathcal{F}_d(p)$ for some unknown $0 < p < 2$. [sent-105, score-0.467]

28 The minimax theory for weak-$\ell_p$ balls [10] says that $R^*_n(\mathcal{F}_d(p)) \ge C M^{-p/2} n^{-2r/(2r+1)}$, $r = 1/p - 1/2$, where $M = 2^d$. [sent-106, score-0.111]

29 We shall construct an estimator that adapts to unknown sparsity of f in the sense that it achieves this minimax rate up to a logarithmic factor without prior knowledge of p and that its computational complexity improves as p → 0. [sent-107, score-0.502]

30 Our method is based on the thresholding of empirical Walsh coefficients. [sent-108, score-0.166]

31 A thresholding estimator is any estimator of the form $\hat f = \sum_{s \in \mathbb{B}^d} I_{\{T(\hat\theta_s) \ge \lambda_n\}}\, \hat\theta_s \varphi_s$, where $\hat\theta_s = (1/n)\sum_{i=1}^{n} \varphi_s(X_i)$ are empirical estimates of the Walsh coefficients of f, $T(\cdot)$ is some statistic, and $I_{\{\cdot\}}$ is an indicator function. [sent-109, score-0.378]
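A brute-force sketch of this term-by-term form is given below, using the squared-coefficient statistic mentioned in the next sentence; it loops over all $2^d$ coefficients (exactly the cost the recursive procedure is designed to avoid), so it is only a small-d reference implementation with names chosen here.

```python
import numpy as np

def empirical_coeffs(X):
    """Empirical Walsh coefficients theta_hat_s = (1/n) sum_i phi_s(X_i), all s.

    X : (n, d) array of 0/1 samples.  Returns a length-2^d vector indexed by the
    integer encoding of s.  Brute force, O(n 2^d): suitable for small d only.
    """
    X = np.asarray(X)
    n, d = X.shape
    S = np.array([[(i >> j) & 1 for j in range(d)] for i in range(2 ** d)])
    signs = (-1.0) ** (X @ S.T)                  # (n, 2^d) entries (-1)^{<s, X_i>}
    return 2.0 ** (-d / 2) * signs.mean(axis=0)

def threshold_estimate(X, lam):
    """Keep theta_hat_s whenever theta_hat_s^2 >= lam; zero out the rest."""
    theta_hat = empirical_coeffs(X)
    return np.where(theta_hat ** 2 >= lam, theta_hat, 0.0)
```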

32 For example, in [5, 6] the statistic $T(\hat\theta_s) = \hat\theta_s^2$ was used with the threshold $\lambda_n = 1/[M(n + 1)]$. [sent-111, score-0.072]

33 While this is not an issue when d ≍ log n, it is clearly impractical when d ≫ log n. [sent-114, score-0.134]

34 To deal with this issue, we will consider a recursive thresholding approach that will allow us to reject whole groups of coefficients based on efficiently computable statistics. [sent-115, score-0.398]

35 For any $1 \le k \le d$, we can write any $f \in L^2(\mu_d)$ with the Walsh coefficients $\theta(f)$ as $f = \sum_{u \in \mathbb{B}^k} \sum_{v \in \mathbb{B}^{d-k}} \theta_{uv} \varphi_{uv} = \sum_{u \in \mathbb{B}^k} f_u \otimes \varphi_u$, where $uv$ denotes the concatenation of $u \in \mathbb{B}^k$ and $v \in \mathbb{B}^{d-k}$ and, for each $u \in \mathbb{B}^k$, $f_u = \sum_{v \in \mathbb{B}^{d-k}} \theta_{uv} \varphi_v$ lies in $L^2(\mu_{d-k})$. [sent-117, score-0.509]

36 By Parseval's identity, $W_u = \|f_u\|^2_{L^2(\mu_{d-k})} = \sum_{v \in \mathbb{B}^{d-k}} \theta_{uv}^2$. [sent-118, score-0.162]

37 We will use the following fact, easily proved using the definitions (1) and (2) of the Walsh functions: for any density f on $\mathbb{B}^d$, any k and $u \in \mathbb{B}^k$, we have $f_u(y) = \mathbb{E}_f\{\varphi_u(\pi_k(X))\, I_{\{\sigma_k(X) = y\}}\}$ for all $y \in \mathbb{B}^{d-k}$, and $W_u = \mathbb{E}_f\{\varphi_u(\pi_k(X))\, f_u(\sigma_k(X))\}$, where $\pi_k(x) = (x(1), \ldots, x(k))$ and $\sigma_k(x) = (x(k+1), \ldots, x(d))$. [sent-127, score-0.252]
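The following sketch turns this fact into a plug-in estimate of $W_u$; it reuses the same sample for the inner empirical version of $f_u$, which is a simplification of the paper's statistic, and the function name is ours.

```python
import numpy as np

def empirical_block_weight(X, u):
    """Plug-in estimate of W_u = E_f[ phi_u(pi_k(X)) f_u(sigma_k(X)) ].

    X : (n, d) array of 0/1 samples;  u : 0/1 array of length k (a prefix).
    pi_k keeps the first k coordinates, sigma_k the remaining d - k; f_u is
    replaced by its empirical version built from the same sample.
    """
    X, u = np.asarray(X), np.asarray(u)
    n, d = X.shape
    k = u.size
    prefix, suffix = X[:, :k], X[:, k:]
    phi_u = 2.0 ** (-k / 2) * (-1.0) ** (prefix @ u)   # phi_u(pi_k(X_i))
    keys = suffix @ (2 ** np.arange(d - k))            # integer code of sigma_k(X_i)
    f_u_hat = np.zeros(2 ** (d - k))
    np.add.at(f_u_hat, keys, phi_u)                    # accumulate empirical f_u(y)
    f_u_hat /= n
    return float(np.mean(phi_u * f_u_hat[keys]))
```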

38 Now we can define our density estimation procedure. [sent-140, score-0.09]

39 Instead of using a single threshold for all 1 ≤ k ≤ d, we consider a more flexible strategy: for every k, we shall compare each Wu to a threshold that depends not only on n, but also on k. [sent-141, score-0.127]

40 Specifically, we will let $\lambda_{k,n} = \alpha_k \frac{\log n}{n}$, $1 \le k \le d$ (7), where $\alpha = \{\alpha_k\}_{k=1}^d$ satisfies $\alpha_1 \ge \alpha_k \ge \alpha_d > 0$. [sent-142, score-0.067]

41 Given $\lambda = \{\lambda_{k,n}\}_{k=1}^d$, define the set $A(\lambda) = \{s \in \mathbb{B}^d : \hat W_{\pi_k(s)} \ge \lambda_{k,n},\ \forall 1 \le k \le d\}$ and the corresponding estimator $\hat f_{\mathrm{RWT}} = \sum_{s \in \mathbb{B}^d} I_{\{s \in A(\lambda)\}}\, \hat\theta_s \varphi_s$ (8), where RWT stands for "recursive Walsh thresholding." [sent-144, score-0.106]
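A compact sketch of the recursion behind $\hat f_{\mathrm{RWT}}$ is shown below; it reuses the brute-force empirical_block_weight helper sketched earlier (the paper's per-call statistic is cheaper), takes a user-supplied threshold sequence lam indexed by depth, and the names are illustrative rather than the paper's pseudocode.

```python
import numpy as np

def recursive_walsh(X, lam, u=()):
    """Recursive Walsh thresholding: return the surviving coefficient indices.

    Starting from the empty prefix, descend into a child prefix only if its
    estimated block weight clears the depth-k threshold lam[k-1]; at depth d the
    check reduces to theta_hat_s^2 >= lam[d-1], and the full index s is kept.
    X : (n, d) array of 0/1 samples;  lam : sequence of d thresholds, with
    lam[k-1] playing the role of lambda_{k,n}.
    """
    d = X.shape[1]
    if len(u) == d:
        return [u]                                   # s survived every level
    kept = []
    for bit in (0, 1):
        child = u + (bit,)
        if empirical_block_weight(X, np.array(child)) >= lam[len(child) - 1]:
            kept.extend(recursive_walsh(X, lam, child))
    return kept

# Assembling the estimate: f_RWT keeps the empirical coefficients of the
# surviving indices and sets all others to zero, e.g.
#   theta_hat = empirical_coeffs(X); A = recursive_walsh(X, lam)
```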

42 We now turn to the asymptotic analysis of the MSE and the computational complexity of $\hat f_{\mathrm{RWT}}$. [sent-147, score-0.104]

43 We first prove that $\hat f_{\mathrm{RWT}}$ adapts to unknown sparsity of f: Theorem 3. [sent-148, score-0.142]

44 1 Suppose the threshold sequence $\lambda = \{\lambda_{k,n}\}_{k=1}^d$ is such that $\alpha_d \ge (20d + 25)^2 / 2^d$. [sent-149, score-0.046]

45 Then for all $0 < p < 2$ the estimator (8) satisfies $\sup_{f \in \mathcal{F}_d(p)} \mathrm{MSE}(f, \hat f_{\mathrm{RWT}}) = \sup_{f \in \mathcal{F}_d(p)} \mathbb{E}_f \|f - \hat f_{\mathrm{RWT}}\|^2_{L^2(\mu_d)} \le \frac{C}{2^d} \left(\frac{2^d \alpha_1 \log n}{n}\right)^{2r/(2r+1)}$ (9), where the constant C depends only on p. [sent-150, score-0.242]

46 Defining the sets $A_1 = \{s \in \mathbb{B}^d : \hat\theta_s^2 \ge \lambda_{d,n}\}$ and $A_2 = \{s \in \mathbb{B}^d : \hat\theta_s^2 < \lambda_{1,n}\}$, we get $T_1 \le \sum_s I_{\{s \in A_1\}} (\hat\theta_s - \theta_s)^2$ and $T_2 \le \sum_s I_{\{s \in A_2\}} \theta_s^2$. [sent-153, score-0.051]

47 Applying (4), (5) and a bit of algebra, we get $\mathbb{E}\, T_{12} \le \frac{1}{Mn} \bigl|\{s : \theta_s^2 \ge \lambda_{d,n}/2\}\bigr| \le \frac{1}{Mn} \Bigl(\frac{2}{M \lambda_{d,n}}\Bigr)^{p/2} \le \frac{C}{M}\, n^{-2r/(2r+1)}$ (10) and $\mathbb{E}\, T_2^2 \le \sum_{s \in \mathbb{B}^d} I_{\{\theta_s^2 < (3\alpha_1/2) \log n / n\}}\, \theta_s^2 \le \frac{C}{M} \Bigl(\frac{M \alpha_1 \log n}{n}\Bigr)^{2r/(2r+1)}$ (11). [sent-156, score-0.185]

48 Using Cauchy–Schwarz, we get $\mathbb{E}\, T_{11} \le \sum_{s} \bigl[\mathbb{E}(\hat\theta_s - \theta_s)^4 \cdot \mathbb{P}(s \in A_1 \cap B)\bigr]^{1/2}$ (12). [sent-158, score-0.051]

49 To estimate the fourth moment in (12), we use Rosenthal's inequality [14] to get $\mathbb{E}(\hat\theta_s - \theta_s)^4 \le c/(M^2 n^2)$. [sent-159, score-0.105]

50 To bound the probability that $s \in A_1 \cap B$, we observe that $s \in A_1 \cap B$ implies that $|\hat\theta_s - \theta_s| \ge (1/5)\sqrt{\lambda_{d,n}}$, and then use Bernstein's inequality [14] to get $\mathbb{P}\bigl(|\hat\theta_s - \theta_s| \ge (1/5)\sqrt{\lambda_{d,n}}\bigr) \le 2 \exp\Bigl(-\frac{\beta^2 \log n}{2(1 + 2\beta/3)}\Bigr) = 2 n^{-\beta^2/[2(1 + 2\beta/3)]}$ with $\beta = (1/5)\sqrt{M \alpha_d} \ge 4d + 5$. [sent-160, score-0.145]

51 Using the same argument as above, we get $\mathbb{P}(s \in A_2 \cap S) \le 2 n^{-(\gamma^2 - 1)/2}$, where $\gamma = (1/5)\sqrt{M \alpha_1}$. [sent-163, score-0.051]

52 (10), (11), (13), and (14), we get (9), and the theorem is proved. [sent-166, score-0.051]

53 2 Given any $\delta \in (0, 1)$, provided each $\alpha_k$ is chosen so that $\sqrt{2^k \alpha_k n \log n} \ge 5\bigl[C_2 \sqrt{n} + (\log(d/\delta) + k)/\log e\bigr]$, Algorithm 1 runs in $O\bigl(n^2 d\, (n/(M \log n))^{p/2} K(\alpha, p)\bigr)$ time with probability at least $1 - \delta$. [sent-169, score-0.23]

54 Proof: The complexity is determined by the number of calls to RecursiveWalsh. [sent-171, score-0.216]

55 Let us say that a call to RecursiveWalsh(u, λ) is correct if $W_u \ge \lambda_{k,n}/2$. [sent-173, score-0.06]

56 We will show that, with probability at least 1 − δ, only the correct calls are made. [sent-174, score-0.161]

57 For a given $u \in \mathbb{B}^k$, $\hat W_u \ge \lambda_{k,n}$ and $W_u < \lambda_{k,n}/2$ together imply that $\|\hat f_u - f_u\|_{L^2(\mu_{d-k})} \ge (1/5)\sqrt{\lambda_{k,n}}$, where $\hat f_u = \sum_{v \in \mathbb{B}^{d-k}} \hat\theta_{uv} \varphi_v$. [sent-176, score-0.486]

58 Now, it can be shown that, for every $u \in \mathbb{B}^k$, the norm $\|\hat f_u - f_u\|_{L^2(\mu_{d-k})}$ can be expressed as a supremum of an empirical process [15] over a certain function class that depends on k (details are omitted for lack of space). [sent-177, score-0.35]

59 We can then use Talagrand's concentration-of-measure inequality for empirical processes [16] to get $\mathbb{P}\bigl(\hat W_u \ge \lambda_{k,n},\, W_u < \lambda_{k,n}/2\bigr) \le \exp\bigl(-n C_1 (2^k a_{k,n}^2 \wedge 2^{k/2} a_{k,n})\bigr)$, where $a_{k,n} = (1/5)\sqrt{\alpha_k \log n / n} - C_2/\sqrt{2^k n}$, and $C_1, C_2$ are the absolute constants in Talagrand's bound. [sent-178, score-0.193]

60 Summing over k and $u \in \mathbb{B}^k$, we see that, with probability $\ge 1 - \delta$, only the correct calls will be made. [sent-180, score-0.161]

61 Hence, the number of correct recursive calls is bounded by $N = \sum_{k=1}^{d} \bigl(2/(M \lambda_{k,n})\bigr)^{p/2} = \bigl(2n/(M \log n)\bigr)^{p/2} K(\alpha, p)$. [sent-184, score-0.394]

62 Therefore, with probability at least 1 − δ, the time complexity will be as stated in the theorem. [sent-186, score-0.079]

63 By controlling the rate at which the sequence αk decays with k, we can trade off MSE against complexity. [sent-189, score-0.084]

64 However, it has $K(\alpha, p) = O(M^{p/2} d)$, resulting in $O(d^2 n^2 (n/\log n)^{p/2})$ complexity. [sent-195, score-0.067]

65 The second case, which leads to a very severe estimator that will tend to reject a lot of coefficients, has MSE of $O\bigl((\log n/n)^{2r/(2r+1)} M^{-1/(2r+1)}\bigr)$, but $K(\alpha, p) = O(M^{p/2})$, leading to a considerably better $O(d n^2 (n/\log n)^{p/2})$ complexity. [sent-196, score-0.244]

66 However, this reduction in complexity will be offset by a corresponding increase in MSE. [sent-198, score-0.079]

67 In fact, using exponentially decaying αk ’s in practice is not advisable as its low complexity is mainly due to the fact that it will tend to reject even the big coefficients very early on, especially when d is large. [sent-199, score-0.189]

68 To achieve a good balance between complexity and MSE, a moderately decaying threshold sequence might be best, e. [sent-200, score-0.164]

69 As p → 0, the effect of λ on complexity becomes negligible, and the complexity tends to $O(n^2 d)$. [sent-203, score-0.158]

70 In practice renormalization may be computationally expensive when d is very large. [sent-208, score-0.07]

71 If the estimate is suitably sparse, however, the renormalization can be carried out approximately using Monte-Carlo methods. [sent-209, score-0.104]
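One way such a Monte-Carlo renormalization could look, under the assumption that the clipped estimate $\max(\hat f, 0)$ can be evaluated pointwise, is sketched below; the function name and the uniform proposal are choices made here, not the paper's procedure.

```python
import numpy as np

def mc_normalizer(f_hat, d, num_draws=100_000, rng=None):
    """Monte-Carlo estimate of Z = sum_{x in B^d} max(f_hat(x), 0).

    f_hat : callable mapping an (m, d) 0/1 array to estimated density values.
    Uniform draws on the hypercube give Z ~ 2^d * mean of the clipped values;
    for very sparse estimates an importance-sampling proposal would be better.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.integers(0, 2, size=(num_draws, d))
    return 2.0 ** d * np.maximum(f_hat(x), 0.0).mean()

# The clipped-and-renormalized density is then max(f_hat(x), 0) / Z.
```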

72 4 Simulations The focus of our work is theoretical, consisting in the derivation of a recursive thresholding procedure for estimating multivariate binary densities (Algorithm 1), with a proof of its near-minimaxity and an asymptotic analysis of its complexity. [sent-210, score-0.551]

73 Although an extensive empirical evaluation is outside the scope of this paper, we have implemented the proposed estimator, and now present some simulation results to demonstrate its small-sample performance. [sent-211, score-0.052]

74 We generated synthetic observations from a mixture density f on a 15-dimensional binary hypercube. [sent-212, score-0.136]

75 The mixture has 10 components, where each component is a product density with 12 randomly chosen covariates having Bernoulli(1/2) distributions, and the other three having Bernoulli(0. [sent-213, score-0.172]
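A sketch of how such synthetic data could be generated is given below; the bias value 0.9 and the uniformly random mixture weights are illustrative placeholders (the sentence above is truncated before the paper's actual Bernoulli parameter), and the function name is ours.

```python
import numpy as np

def sample_mixture(n, d=15, n_components=10, n_uniform=12, bias=0.9, rng=None):
    """Draw n samples from a mixture of product densities on {0,1}^d.

    Each component keeps `n_uniform` randomly chosen coordinates Bernoulli(1/2)
    and makes the remaining d - n_uniform coordinates Bernoulli(`bias`).
    bias=0.9 and the random mixture weights are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = rng.random(n_components)
    weights /= weights.sum()
    biased_sets = [rng.choice(d, size=d - n_uniform, replace=False)
                   for _ in range(n_components)]
    comps = rng.choice(n_components, size=n, p=weights)
    X = (rng.random((n, d)) < 0.5).astype(int)       # Bernoulli(1/2) everywhere
    for i, c in enumerate(comps):
        X[i, biased_sets[c]] = (rng.random(d - n_uniform) < bias).astype(int)
    return X
```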

76 As can be seen from the coefficient profile in the bottom of the figure, this density is clearly sparse. [sent-218, score-0.09]

77 1 also shows the estimated probabilities and the Walsh coefficients for sample sizes n = 5000 (middle) and n = 10000 (right). [sent-220, score-0.046]

78 Figure 1: Ground truth (left) and estimated density $\hat f_{\mathrm{RWT}}$ for n = 5000 (middle) and n = 10000 (right) with constant thresholding. [sent-221, score-0.187]

79 Top: true and estimated probabilities (clipped at zero and renormalized) arranged in lexicographic order. [sent-222, score-0.104]

80 Bottom: absolute values of true and estimated Walsh coefficients arranged in lexicographic order. [sent-223, score-0.126]

81 For the estimated densities, the coefficient plots also show the threshold level (dotted line) and absolute values of the rejected coefficients (lighter color). [sent-224, score-0.092]

82 [Figure 2 residue: axis labels Time (s) and MSE (× 2^d); curve legend: constant, log, linear.]

83 All results are averaged over five independent runs for each sample size (the error bars show the standard deviations). [sent-233, score-0.051]

84 To study the trade-off between MSE and complexity, we implemented three different thresholding schemes: (1) constant, $\lambda_{k,n} = 2 \log n/(2^d n)$, (2) logarithmic, $\lambda_{k,n} = 2 \log(d - k + 2) \log n/(2^d n)$, and (3) linear, $\lambda_{k,n} = 2(d - k + 1) \log n/(2^d n)$. [sent-234, score-0.367]
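The three schedules quoted above translate directly into code; this sketch just evaluates the formulas for k = 1, ..., d (the function name and interface are ours).

```python
import numpy as np

def thresholds(n, d, scheme="constant"):
    """Threshold sequence lambda_{k,n}, k = 1, ..., d, for the three schemes."""
    k = np.arange(1, d + 1)
    base = np.log(n) / (2.0 ** d * n)
    if scheme == "constant":
        return 2.0 * base * np.ones(d)
    if scheme == "logarithmic":
        return 2.0 * np.log(d - k + 2) * base
    if scheme == "linear":
        return 2.0 * (d - k + 1) * base
    raise ValueError(f"unknown scheme: {scheme!r}")
```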

85 Up to the log n factor (dictated by the theory), the thresholds at k = d are set to twice the variance of the empirical estimate of any coefficient whose value is zero; this forces the estimator to reject empirical coefficients whose values cannot be reliably distinguished from zero. [sent-235, score-0.323]

86 Occasionally, spurious coefficients get retained, as can be seen in Fig. [sent-236, score-0.051]

87 In agreement with the theory, MSE is the smallest for the constant thresholding scheme [which is simply an efficient recursive implementation of a term-by-term thresholding estimator with λn ∼ log n/(M n)], and then it increases for the logarithmic and for the linear schemes. [sent-242, score-0.723]

88 2(b,c) shows the running time (in seconds) and the number of recursive calls made to RecursiveWalsh vs. [sent-244, score-0.33]

89 The number of recursive calls is a platform-independent way of gauging the computational complexity of the algorithm, although it should be kept in mind that each recursive call has $O(n^2 d)$ overhead. [sent-246, score-0.584]

90 The running time increases polynomially with n, and is the largest for the constant scheme, followed by the logarithmic and the linear schemes. [sent-247, score-0.103]

91 We see that, while the MSE of the logarithmic scheme is fairly close to that of the constant scheme, its complexity is considerably lower, in terms of both the number of recursive calls and the running time. [sent-248, score-0.513]

92 In all three cases, the number of recursive calls decreases with n due to the fact that weight estimates become increasingly accurate with n, which causes the expected number of false discoveries (i. [sent-249, score-0.303]

93 , making a recursive call at an internal node of the tree only to reject its descendants later) to decrease. [sent-251, score-0.273]

94 2(d) shows the number of coefficients retained in the estimate. [sent-253, score-0.049]

95 This number grows with n as a consequence of the fact that the threshold decreases with n, while the number of accurately estimated coefficients increases. [sent-254, score-0.07]

96 The true density f has 40 parameters: 9 to specify the weights of the components, 3 per component to locate the indices of the nonuniform covariates, and the single Bernoulli parameter of the nonuniform covariates. [sent-255, score-0.17]

97 Overall, these preliminary simulation results show that our implemented estimator behaves in accordance with the theory even in the small-sample regime. [sent-257, score-0.132]

98 The performance of the logarithmic thresholding scheme is especially encouraging, suggesting that it may be possible to trade off MSE against complexity in a way that will scale to large values of d. [sent-258, score-0.341]

99 To model their densities, we plan to experiment with Walsh bases with η biased toward unity. [sent-266, score-0.08]

100 Some classification procedures for multivariate binary data using orthogonal functions. [sent-297, score-0.137]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('walsh', 0.479), ('frwt', 0.342), ('mse', 0.313), ('wu', 0.278), ('bd', 0.188), ('ecursive', 0.183), ('fd', 0.17), ('recursive', 0.166), ('coef', 0.163), ('uv', 0.163), ('fu', 0.162), ('cients', 0.152), ('thresholding', 0.14), ('calls', 0.137), ('estimator', 0.106), ('densities', 0.094), ('density', 0.09), ('minimax', 0.085), ('complexity', 0.079), ('reject', 0.071), ('log', 0.067), ('adapts', 0.065), ('bk', 0.057), ('nc', 0.057), ('covariates', 0.057), ('sparsity', 0.056), ('logarithmic', 0.055), ('multivariate', 0.052), ('duke', 0.051), ('durham', 0.051), ('get', 0.051), ('retained', 0.049), ('bernoulli', 0.047), ('threshold', 0.046), ('chapel', 0.046), ('crk', 0.046), ('goldreich', 0.046), ('lexicographic', 0.046), ('parseval', 0.046), ('renormalization', 0.046), ('binary', 0.046), ('nonuniform', 0.04), ('lazebnik', 0.04), ('hypercube', 0.04), ('orthogonal', 0.039), ('decay', 0.039), ('decaying', 0.039), ('trade', 0.039), ('talagrand', 0.037), ('willett', 0.037), ('call', 0.036), ('shall', 0.035), ('econometrics', 0.034), ('arranged', 0.034), ('biased', 0.034), ('attains', 0.032), ('sparser', 0.032), ('suitably', 0.031), ('hill', 0.03), ('runs', 0.029), ('estimators', 0.029), ('tensor', 0.028), ('scheme', 0.028), ('estimating', 0.028), ('participants', 0.027), ('running', 0.027), ('estimate', 0.027), ('inequality', 0.027), ('truth', 0.026), ('empirical', 0.026), ('implemented', 0.026), ('statistic', 0.026), ('balls', 0.026), ('ground', 0.025), ('asymptotic', 0.025), ('product', 0.025), ('orthonormal', 0.024), ('sup', 0.024), ('panel', 0.024), ('fk', 0.024), ('mn', 0.024), ('bases', 0.024), ('computationally', 0.024), ('estimated', 0.024), ('correct', 0.024), ('cm', 0.023), ('decays', 0.023), ('fourier', 0.023), ('basis', 0.023), ('nonparametric', 0.022), ('write', 0.022), ('plan', 0.022), ('prohibitive', 0.022), ('controlling', 0.022), ('counting', 0.022), ('sample', 0.022), ('absolute', 0.022), ('constant', 0.021), ('groups', 0.021), ('unknown', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube

Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva

Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.

2 0.092378609 24 nips-2008-An improved estimator of Variance Explained in the presence of noise

Author: Ralf M. Haefner, Bruce G. Cumming

Abstract: A crucial part of developing mathematical models of information processing in the brain is the quantification of their success. One of the most widely-used metrics yields the percentage of the variance in the data that is explained by the model. Unfortunately, this metric is biased due to the intrinsic variability in the data. We derive a simple analytical modification of the traditional formula that significantly improves its accuracy (as measured by bias) with similar or better precision (as measured by mean-square error) in estimating the true underlying Variance Explained by the model class. Our estimator advances on previous work by a) accounting for overfitting due to free model parameters mitigating the need for a separate validation data set, b) adjusting for the uncertainty in the noise estimate and c) adding a conditioning term. We apply our new estimator to binocular disparity tuning curves of a set of macaque V1 neurons and find that on a population level almost all of the variance unexplained by Gabor functions is attributable to noise. 1

3 0.074748904 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1

4 0.070813209 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss

Author: Ingo Steinwart, Andreas Christmann

Abstract: In this paper lower and upper bounds for the number of support vectors are derived for support vector machines (SVMs) based on the -insensitive loss function. It turns out that these bounds are asymptotically tight under mild assumptions on the data generating distribution. Finally, we briefly discuss a trade-off in between sparsity and accuracy if the SVM is used to estimate the conditional median. 1

5 0.067794159 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

Author: Aarti Singh, Robert Nowak, Xiaojin Zhu

Abstract: Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.

6 0.065005019 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models

7 0.064226419 99 nips-2008-High-dimensional support union recovery in multivariate regression

8 0.061586015 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields

9 0.060636591 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions

10 0.056349158 106 nips-2008-Inferring rankings under constrained sensing

11 0.054576166 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy

12 0.05360695 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

13 0.053115226 75 nips-2008-Estimating vector fields using sparse basis field expansions

14 0.05226393 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

15 0.051361006 179 nips-2008-Phase transitions for high-dimensional joint support recovery

16 0.050747279 118 nips-2008-Learning Transformational Invariants from Natural Movies

17 0.05073059 228 nips-2008-Support Vector Machines with a Reject Option

18 0.050125547 62 nips-2008-Differentiable Sparse Coding

19 0.047882847 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform

20 0.047394991 235 nips-2008-The Infinite Hierarchical Factor Regression Model


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.148), (1, 0.011), (2, -0.036), (3, 0.063), (4, 0.047), (5, -0.01), (6, -0.031), (7, 0.045), (8, 0.032), (9, 0.054), (10, -0.044), (11, -0.018), (12, 0.046), (13, -0.024), (14, -0.129), (15, -0.005), (16, 0.036), (17, 0.028), (18, 0.006), (19, 0.056), (20, -0.013), (21, 0.087), (22, 0.068), (23, -0.055), (24, 0.022), (25, 0.095), (26, -0.007), (27, -0.086), (28, 0.09), (29, 0.001), (30, 0.036), (31, -0.015), (32, 0.033), (33, 0.043), (34, -0.0), (35, 0.055), (36, 0.007), (37, 0.022), (38, 0.088), (39, -0.009), (40, -0.072), (41, -0.117), (42, 0.005), (43, -0.134), (44, -0.171), (45, -0.114), (46, 0.022), (47, -0.002), (48, -0.057), (49, 0.043)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93173146 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube

Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva

Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.

2 0.63036537 24 nips-2008-An improved estimator of Variance Explained in the presence of noise

Author: Ralf M. Haefner, Bruce G. Cumming

Abstract: A crucial part of developing mathematical models of information processing in the brain is the quantification of their success. One of the most widely-used metrics yields the percentage of the variance in the data that is explained by the model. Unfortunately, this metric is biased due to the intrinsic variability in the data. We derive a simple analytical modification of the traditional formula that significantly improves its accuracy (as measured by bias) with similar or better precision (as measured by mean-square error) in estimating the true underlying Variance Explained by the model class. Our estimator advances on previous work by a) accounting for overfitting due to free model parameters mitigating the need for a separate validation data set, b) adjusting for the uncertainty in the noise estimate and c) adding a conditioning term. We apply our new estimator to binocular disparity tuning curves of a set of macaque V1 neurons and find that on a population level almost all of the variance unexplained by Gabor functions is attributable to noise. 1

3 0.61971432 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform

Author: Guangzhi Cao, Charles Bouman

Abstract: Covariance estimation for high dimensional vectors is a classically difficult problem in statistical analysis and machine learning. In this paper, we propose a maximum likelihood (ML) approach to covariance estimation, which employs a novel sparsity constraint. More specifically, the covariance is constrained to have an eigen decomposition which can be represented as a sparse matrix transform (SMT). The SMT is formed by a product of pairwise coordinate rotations known as Givens rotations. Using this framework, the covariance can be efficiently estimated using greedy minimization of the log likelihood function, and the number of Givens rotations can be efficiently computed using a cross-validation procedure. The resulting estimator is positive definite and well-conditioned even when the sample size is limited. Experiments on standard hyperspectral data sets show that the SMT covariance estimate is consistently more accurate than both traditional shrinkage estimates and recently proposed graphical lasso estimates for a variety of different classes and sample sizes. 1

4 0.60070437 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss

Author: Ingo Steinwart, Andreas Christmann

Abstract: In this paper lower and upper bounds for the number of support vectors are derived for support vector machines (SVMs) based on the -insensitive loss function. It turns out that these bounds are asymptotically tight under mild assumptions on the data generating distribution. Finally, we briefly discuss a trade-off in between sparsity and accuracy if the SVM is used to estimate the conditional median. 1

5 0.59449941 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling

Author: Ping Li, Kenneth W. Church, Trevor J. Hastie

Abstract: Conditional Random Sampling (CRS) was originally proposed for efficiently computing pairwise (l2 , l1 ) distances, in static, large-scale, and sparse data. This study modifies the original CRS and extends CRS to handle dynamic or streaming data, which much better reflect the real-world situation than assuming static data. Compared with many other sketching algorithms for dimension reductions such as stable random projections, CRS exhibits a significant advantage in that it is “one-sketch-for-all.” In particular, we demonstrate the effectiveness of CRS in efficiently computing the Hamming norm, the Hamming distance, the lp distance, and the χ2 distance. A generic estimator and an approximate variance formula are also provided, for approximating any type of distances. We recommend CRS as a promising tool for building highly scalable systems, in machine learning, data mining, recommender systems, and information retrieval. 1

6 0.52547008 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy

7 0.51721275 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models

8 0.47252896 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

9 0.44841912 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime

10 0.44416511 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning

11 0.43289858 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models

12 0.41757569 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

13 0.41739258 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection

14 0.41500059 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables

15 0.38234776 153 nips-2008-Nonlinear causal discovery with additive noise models

16 0.36549219 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

17 0.36277574 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

18 0.36103767 233 nips-2008-The Gaussian Process Density Sampler

19 0.35979149 75 nips-2008-Estimating vector fields using sparse basis field expansions

20 0.35341099 106 nips-2008-Inferring rankings under constrained sensing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.014), (6, 0.104), (7, 0.088), (12, 0.048), (28, 0.158), (57, 0.08), (59, 0.035), (63, 0.021), (71, 0.025), (75, 0.239), (77, 0.033), (78, 0.017), (83, 0.043), (94, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82328093 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube

Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva

Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.

2 0.78112304 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation

Author: Indraneel Mukherjee, David M. Blei

Abstract: Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as O(k − 1) + log m/m, where k is the number of topics in the model and m is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation. 1

3 0.75830835 84 nips-2008-Fast Prediction on a Tree

Author: Mark Herbster, Massimiliano Pontil, Sergio R. Galeano

Abstract: Given an n-vertex weighted tree with structural diameter S and a subset of m vertices, we present a technique to compute a corresponding m × m Gram matrix of the pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. We discuss the application of this technique to fast label prediction on a generic graph. We approximate the graph with a spanning tree and then we predict with the kernel perceptron. We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. We present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 vertices and more than 10,000,000 edges. The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information. 1

4 0.6883384 62 nips-2008-Differentiable Sparse Coding

Author: J. A. Bagnell, David M. Bradley

Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1

5 0.68100446 75 nips-2008-Estimating vector fields using sparse basis field expansions

Author: Stefan Haufe, Vadim V. Nikulin, Andreas Ziehe, Klaus-Robert Müller, Guido Nolte

Abstract: We introduce a novel framework for estimating vector fields using sparse basis field expansions (S-FLEX). The notion of basis fields, which are an extension of scalar basis functions, arises naturally in our framework from a rotational invariance requirement. We consider a regression setting as well as inverse problems. All variants discussed lead to second-order cone programming formulations. While our framework is generally applicable to any type of vector field, we focus in this paper on applying it to solving the EEG/MEG inverse problem. It is shown that significantly more precise and neurophysiologically more plausible location and shape estimates of cerebral current sources from EEG/MEG measurements become possible with our method when comparing to the state-of-the-art. 1

6 0.67553943 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

7 0.67294693 202 nips-2008-Robust Regression and Lasso

8 0.6703862 226 nips-2008-Supervised Dictionary Learning

9 0.67000622 143 nips-2008-Multi-label Multiple Kernel Learning

10 0.66961384 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

11 0.66679448 66 nips-2008-Dynamic visual attention: searching for coding length increments

12 0.66657865 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

13 0.66625351 194 nips-2008-Regularized Learning with Networks of Features

14 0.66607594 179 nips-2008-Phase transitions for high-dimensional joint support recovery

15 0.66536766 118 nips-2008-Learning Transformational Invariants from Natural Movies

16 0.66474515 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields

17 0.66290867 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

18 0.66254866 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform

19 0.66248697 24 nips-2008-An improved estimator of Variance Explained in the presence of noise

20 0.66238052 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations