nips nips2008 nips2008-149 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva
Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2^d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. [sent-8, score-0.397]
2 For d covariates, there are 2^d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. [sent-9, score-0.069]
3 1 Introduction Multivariate binary data arise in a variety of fields, such as biostatistics [1], econometrics [2] or artificial intelligence [3]. [sent-12, score-0.08]
4 In these and other settings, it is often necessary to estimate a probability density from a number of independent observations. [sent-13, score-0.117]
5 samples from a probability density f (with respect to the counting measure) on the d-dimensional binary hypercube B^d, B = {0, 1}, and seek an estimate f̂ of f with a small mean-squared error MSE(f, f̂) = E Σ_{x∈B^d} (f̂(x) − f(x))². [sent-17, score-0.179]
6 In many cases of practical interest, the number of covariates d is much larger than log n, so direct estimation of f as a multinomial density with 2^d parameters is both unreliable and impractical. [sent-18, score-0.214]
7 Thus, one has to resort to “nonparametric” methods and search for good estimators in a suitably defined class whose complexity grows with n. [sent-19, score-0.139]
8 Some nonparametric methods proposed in the literature, such as kernels [4] and orthogonal expansions [5, 6], either have very slow rates of MSE convergence or are computationally prohibitive for large d. [sent-20, score-0.107]
9 For example, the kernel method [4] requires O(n²d) operations to compute the estimate at any x ∈ B^d, yet its MSE decays as O(n^{−4/(4+d)}) [7], which is extremely slow when d is large. [sent-21, score-0.05]
10 In contrast, orthogonal function methods generally have much better MSE decay rates, but rely on estimating 2^d coefficients in a fixed basis, which requires enormous computational resources for large d. [sent-22, score-0.106]
11 For instance, using the Fast Hadamard Transform to estimate the coefficients in the so-called Walsh basis using n samples requires O(nd 2^d) operations [5]. [sent-23, score-0.05]
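To make the 2^d bottleneck concrete, here is a minimal sketch (not taken from [5]; it assumes the standard η = 1/2 Walsh basis ϕ_s(x) = 2^{−d/2}(−1)^{⟨s,x⟩}, which is consistent with the coefficient bound R(1/2) = 2^{−d/2} quoted below) of estimating all empirical coefficients θ̂_s = (1/n) Σ_i ϕ_s(X_i) via a fast Walsh–Hadamard transform of the empirical histogram. Both the histogram and the transform have 2^d entries, so time and memory grow exponentially in d regardless of how the transform is organized.

```python
import numpy as np

def empirical_walsh_coefficients(X):
    """Estimate all 2^d coefficients theta_hat_s = (1/n) sum_i phi_s(X_i) in the
    standard (eta = 1/2) Walsh basis phi_s(x) = 2^{-d/2} (-1)^{<s,x>} from an
    (n, d) binary array X.  Time and memory are Theta(d * 2^d)."""
    n, d = X.shape
    # Empirical pmf on B^d, indexed by the integer encoding of each observation.
    idx = X.dot(1 << np.arange(d))
    p_hat = np.bincount(idx, minlength=2 ** d).astype(float) / n
    # In-place fast Walsh-Hadamard transform (butterfly over the 2^d bins).
    theta = p_hat
    h = 1
    while h < theta.size:
        for start in range(0, theta.size, 2 * h):
            a = theta[start:start + h].copy()
            b = theta[start + h:start + 2 * h].copy()
            theta[start:start + h] = a + b
            theta[start + h:start + 2 * h] = a - b
        h *= 2
    # The transform gives sum_x p_hat(x) (-1)^{<s,x>}; rescale to the orthonormal basis.
    return theta * 2 ** (-d / 2)
```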
12 In this paper we take up the problem of accurate, computationally tractable estimation of a density on the binary hypercube. [sent-24, score-0.16]
13 We take the minimax point of view, where we assume that f comes from a particular function class F and seek an estimator that approximately attains the minimax MSE R_n^*(F) = inf_{f̂} sup_{f∈F} MSE(f, f̂), where the infimum is over all estimators f̂ based on n i.i.d. samples. [sent-25, score-0.361]
14 We will define our function class to reflect another feature often encountered in situations involving multivariate binary data: namely, that the shape of the underlying density is strongly influenced by small constellations of the d covariates. [sent-29, score-0.188]
15 To model such “constellation effects” mathematically, we will consider classes of densities that satisfy a particular sparsity condition. [sent-31, score-0.15]
16 The algorithm entails recursively examining empirical estimates of whole blocks of the 2^d basis coefficients. [sent-33, score-0.049]
17 An additional attractive feature of our approach is that it gives us a principled way of trading off MSE against computational complexity by controlling the decay of the threshold as a function of the recursion depth. [sent-36, score-0.186]
18 Let µ_d denote the counting measure on the d-dimensional binary hypercube B^d. [sent-41, score-0.108]
19 When η = 1/2, we get the standard Walsh system used in [5, 6]; in that case, we shall omit the index η = 1/2 for simplicity. [sent-57, score-0.086]
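Definitions (1) and (2) are not reproduced in this excerpt, so the check below covers only the unbiased case, under the same assumption as above that the η = 1/2 system is ϕ_s(x) = 2^{−d/2}(−1)^{⟨s,x⟩}; it verifies by brute force that these functions are orthonormal in L²(µ_d) for a small d. The biased (η ≠ 1/2) system is likewise a product of univariate functions, but its exact form follows the paper's (1)-(2) and is not shown.

```python
import itertools
import numpy as np

def walsh(s, x):
    """Standard (eta = 1/2) Walsh function phi_s(x) = 2^{-d/2} (-1)^{<s,x>} on B^d."""
    d = len(s)
    return 2 ** (-d / 2) * (-1) ** (int(np.dot(s, x)) % 2)

# Brute-force orthonormality check w.r.t. the counting measure mu_d for small d:
# sum_x phi_s(x) phi_t(x) should be 1 if s == t and 0 otherwise.
d = 4
cube = list(itertools.product((0, 1), repeat=d))
for s in cube:
    for t in cube:
        inner = sum(walsh(s, x) * walsh(t, x) for x in cube)
        assert abs(inner - (1.0 if s == t else 0.0)) < 1e-12
```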
20 The product structure of the biased Walsh bases makes them especially convenient for statistical applications as it allows for a computationally efficient recursive method for computing accurate estimates of squared coefficients in certain hierarchically structured sets. [sent-58, score-0.273]
21 We are interested in densities whose representations in some biased Walsh basis satisfy a certain sparsity constraint. [sent-60, score-0.207]
22 It is not hard to show that the coefficients of any probability density on B^d in Φ_{d,η} are bounded by R(η) = [η ∨ (1 − η)]^{d/2}. [sent-71, score-0.09]
23 When η = 1/2, with R(η) = 2^{−d/2}, we shall write simply F_d(p). [sent-74, score-0.057]
24 Given any f ∈ F_d(p, η) and denoting by f_k the function obtained from it by retaining only the k largest coefficients, we get from Parseval’s identity that ‖f − f_k‖_{L²(µ_d)} ≤ C R k^{−r}. [sent-84, score-0.099]
25 Other densities in {F_d(p, η) : 0 < p < 2} include, for example, mixtures of components that, up to a permutation of {1, . . . , d}, [sent-89, score-0.094]
26 can be written as a tensor product of a large number of Bernoulli(η*) densities and some other density. [sent-92, score-0.147]
27 3 Density estimation via recursive Walsh thresholding We now turn to our problem of estimating a density f on B^d from a sample {X_i}_{i=1}^n when f ∈ F_d(p) for some unknown 0 < p < 2. [sent-105, score-0.467]
28 The minimax theory for weak-ℓ_p balls [10] says that R_n^*(F_d(p)) ≥ C M^{−p/2} n^{−2r/(2r+1)}, r = 1/p − 1/2, where M = 2^d. [sent-106, score-0.111]
29 We shall construct an estimator that adapts to the unknown sparsity of f, in the sense that it achieves this minimax rate up to a logarithmic factor without prior knowledge of p, and whose computational complexity improves as p → 0. [sent-107, score-0.502]
30 Our method is based on the thresholding of empirical Walsh coefficients. [sent-108, score-0.166]
31 A thresholding estimator is any estimator of the form f̂ = Σ_{s∈B^d} I{T(θ̂_s) ≥ λ_n} θ̂_s ϕ_s, where θ̂_s = (1/n) Σ_{i=1}^n ϕ_s(X_i) are empirical estimates of the Walsh coefficients of f, T(·) is some statistic, and I{·} is an indicator function. [sent-109, score-0.378]
32 For example, in [5, 6] the statistic T(θ̂_s) = θ̂_s² was used with the threshold λ_n = 1/[M(n + 1)]. [sent-111, score-0.072]
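As a point of comparison for the recursive procedure introduced next, here is a sketch of that term-by-term rule (assumptions: the η = 1/2 basis and the empirical_walsh_coefficients helper sketched earlier). It must examine every one of the M = 2^d empirical coefficients, which is exactly the cost the recursive approach is designed to avoid.

```python
import numpy as np

def term_by_term_estimate(X):
    """Term-by-term thresholding in the eta = 1/2 Walsh basis, in the spirit of [5, 6]:
    keep theta_hat_s when theta_hat_s^2 >= lambda_n = 1 / (M (n + 1)), M = 2^d.
    Returns the estimated density f_hat as a length-2^d array over B^d."""
    n, d = X.shape
    M = 2 ** d
    theta_hat = empirical_walsh_coefficients(X)          # length-M array (sketched earlier)
    lam = 1.0 / (M * (n + 1))
    theta_kept = np.where(theta_hat ** 2 >= lam, theta_hat, 0.0)
    # Synthesis: f_hat(x) = sum_s theta_s phi_s(x) = 2^{-d/2} * (WHT of theta_kept)[x].
    f_hat = theta_kept.copy()
    h = 1
    while h < f_hat.size:
        for start in range(0, f_hat.size, 2 * h):
            a = f_hat[start:start + h].copy()
            b = f_hat[start + h:start + 2 * h].copy()
            f_hat[start:start + h] = a + b
            f_hat[start + h:start + 2 * h] = a - b
        h *= 2
    return f_hat * 2 ** (-d / 2)
```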
33 While this is not an issue when d ≍ log n, it is clearly impractical when d ≫ log n. [sent-114, score-0.134]
34 To deal with this issue, we will consider a recursive thresholding approach that will allow us to reject whole groups of coefficients based on efficiently computable statistics. [sent-115, score-0.398]
35 For any 1 ≤ k ≤ d, we can write any f ∈ L²(µ_d) with the Walsh coefficients θ(f) as f = Σ_{u∈B^k} Σ_{v∈B^{d−k}} θ_{uv} ϕ_{uv} = Σ_{u∈B^k} f_u ⊗ ϕ_u, where uv denotes the concatenation of u ∈ B^k and v ∈ B^{d−k} and, for each u ∈ B^k, f_u = Σ_{v∈B^{d−k}} θ_{uv} ϕ_v lies in L²(µ_{d−k}). [sent-117, score-0.509]
36 By Parseval’s identity, W_u = ‖f_u‖²_{L²(µ_{d−k})} = Σ_{v∈B^{d−k}} θ²_{uv}. [sent-118, score-0.162]
37 We will use the following fact, easily proved using the definitions (1) and (2) of the Walsh functions: for any density f on B^d, any k and u ∈ B^k, we have f_u(y) = E_f[ϕ_u(π_k(X)) I{σ_k(X) = y}], ∀y ∈ B^{d−k}, and W_u = E_f{ϕ_u(π_k(X)) f_u(σ_k(X))}, where π_k(x) = (x(1), . . . , x(k)) denotes the first k coordinates of x and σ_k(x) the remaining d − k coordinates. [sent-127, score-0.252]
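The plug-in versions of these two identities are what the recursive procedure actually computes; a minimal sketch (my own, for the η = 1/2 basis) is below. Because f̂_u is supported on at most n observed suffixes, each block costs O(nd) work rather than O(2^{d−k}), and the resulting Ŵ_u equals ‖f̂_u‖²_{L²(µ_{d−k})}.

```python
from collections import defaultdict
import numpy as np

def block_weight_estimate(X, u):
    """Plug-in estimates, for a prefix u in B^k (eta = 1/2 basis):
      f_u_hat(y) = (1/n) sum_i phi_u(pi_k(X_i)) I{sigma_k(X_i) = y}
      W_u_hat    = (1/n) sum_i phi_u(pi_k(X_i)) f_u_hat(sigma_k(X_i))
    Cost is O(n d), independent of 2^{d-k}."""
    n, d = X.shape
    k = len(u)
    phi_u = 2 ** (-k / 2) * (-1) ** (X[:, :k].dot(u) % 2)   # phi_u(pi_k(X_i)), one value per sample
    suffixes = [tuple(row) for row in X[:, k:]]              # sigma_k(X_i), the last d - k coordinates
    f_u_hat = defaultdict(float)
    for val, suf in zip(phi_u, suffixes):
        f_u_hat[suf] += val / n
    W_u_hat = sum(val * f_u_hat[suf] for val, suf in zip(phi_u, suffixes)) / n
    return f_u_hat, W_u_hat
```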
38 Now we can define our density estimation procedure. [sent-140, score-0.09]
39 Instead of using a single threshold for all 1 ≤ k ≤ d, we consider a more flexible strategy: for every k, we shall compare each W_u to a threshold that depends not only on n, but also on k. [sent-141, score-0.127]
40 Specifically, we will let λ_{k,n} = α_k log n / n, 1 ≤ k ≤ d, (7) where α = {α_k}_{k=1}^d satisfies α_1 ≥ α_k ≥ α_d > 0. [sent-142, score-0.067]
41 Given λ = {λ_{k,n}}_{k=1}^d, define the set A(λ) = {s ∈ B^d : W_{π_k(s)} ≥ λ_{k,n}, ∀1 ≤ k ≤ d} and the corresponding estimator f̂_RWT = Σ_{s∈B^d} I{s∈A(λ)} θ̂_s ϕ_s, (8) where RWT stands for “recursive Walsh thresholding. [sent-144, score-0.106]
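Algorithm 1 itself is not reproduced in this excerpt, so the following is only a sketch of the recursion that the definition of A(λ) suggests, reusing block_weight_estimate from the previous snippet: a prefix u of length k is extended to u0 and u1 only if its estimated weight clears λ_{k,n}, and at depth d the surviving empirical coefficient is kept. The paper's actual Algorithm 1 presumably updates f̂_u incrementally while descending rather than recomputing it from the full sample at every node, as this sketch does.

```python
import numpy as np

def recursive_walsh_thresholding(X, lam):
    """Sketch of recursive Walsh thresholding (RWT) in the eta = 1/2 basis.
    lam[k-1] is the threshold lambda_{k,n} applied to prefixes of length k.
    Returns {s: theta_hat_s} for the surviving indices s, i.e. the nonzero
    terms of the estimator f_RWT in (8)."""
    n, d = X.shape
    kept = {}

    def recurse(u):
        k = len(u)
        f_u_hat, W_u_hat = block_weight_estimate(X, np.array(u, dtype=int))
        if W_u_hat < lam[k - 1]:
            return                        # reject the whole block of 2^{d-k} coefficients at once
        if k == d:
            kept[tuple(u)] = f_u_hat[()]  # at depth d the block is the single coefficient theta_hat_s
        else:
            recurse(u + [0])
            recurse(u + [1])

    recurse([0])
    recurse([1])
    return kept
```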
42 We now turn to the asymptotic analysis of the MSE and the computational complexity of fRWT . [sent-147, score-0.104]
43 We first prove that f̂_RWT adapts to the unknown sparsity of f: Theorem 3. [sent-148, score-0.142]
44 1 Suppose the threshold sequence λ = {λ_{k,n}}_{k=1}^d is such that α_d ≥ (20d + 25)²/2^d. [sent-149, score-0.046]
45 k=1 Then for all 0 < p < 2 the estimator (8) satisfies sup MSE(f, fRWT ) = f ∈Fd (p) sup Ef f − fRWT f ∈Fd (p) 2 L2 (µd ) ≤ C 2d 2d α1 log n n 2r/(2r+1) where the constant C depends only on p. [sent-150, score-0.242]
46 Defining the sets A_1 = {s ∈ B^d : θ̂²_s ≥ λ_{d,n}} and A_2 = {s ∈ B^d : θ̂²_s < λ_{1,n}}, we get T_1 ≤ Σ_s I{s∈A_1} (θ̂_s − θ_s)² and T_2 ≤ Σ_s I{s∈A_2} θ²_s. [sent-153, score-0.051]
47 Applying (4), (5) and a bit of algebra, we get E T_{12} ≤ (1/(Mn)) · #{s ∈ B^d : θ²_s ≥ λ_{d,n}/2} ≤ (1/(Mn)) (2/(M λ_{d,n}))^{p/2} ≤ (1/M) n^{−2r/(2r+1)}, and E T_{22} ≤ Σ_{s∈B^d} I{θ²_s < (3α_1/2) log n/n} θ²_s ≤ (C/M) (M α_1 log n / n)^{2r/(2r+1)}. (10) [sent-156, score-0.185]
48 Using Cauchy–Schwarz, we get E T_{11} ≤ Σ_s [E(θ̂_s − θ_s)^4 · P(s ∈ A_1 ∩ B)]^{1/2}. (12) [sent-158, score-0.051]
49 To estimate the fourth moment in (12), we use Rosenthal’s inequality [14] to get E(θ̂_s − θ_s)^4 ≤ c/(M²n²). [sent-159, score-0.105]
50 To bound the probability that s ∈ A_1 ∩ B, we observe that s ∈ A_1 ∩ B implies that |θ̂_s − θ_s| ≥ (1/5)√λ_{d,n}, and then use Bernstein’s inequality [14] to get P(|θ̂_s − θ_s| ≥ (1/5)√λ_{d,n}) ≤ 2 exp(−β² log n / [2(1 + 2β/3)]) = 2n^{−β²/[2(1+2β/3)]}, with β = (1/5)√(M α_d) ≥ 4d + 5. [sent-160, score-0.145]
51 Using the same argument as above, we get P(s ∈ A_2 ∩ S) ≤ 2n^{−(γ−1)²/2}, where γ = (1/5)√(M α_1). [sent-163, score-0.051]
52 Combining (10), (11), (13), and (14), we get (9), and the theorem is proved. [sent-166, score-0.051]
53 2 Given any δ ∈ (0, 1), provided each α_k is chosen so that √(2^k α_k n log n) ≥ 5 [C_2 √n + (log(d/δ) + k)/ log e], (15) Algorithm 1 runs in O(n² d (n/(M log n))^{p/2} K(α, p)) time with probability at least 1 − δ. [sent-169, score-0.23]
54 Proof: The complexity is determined by the number of calls to RECURSIVEWALSH. [sent-171, score-0.216]
55 Let us say that a call to RECURSIVEWALSH(u, λ) is correct if W_u ≥ λ_{k,n}/2. [sent-173, score-0.06]
56 We will show that, with probability at least 1 − δ, only the correct calls are made. [sent-174, score-0.161]
57 For a given u ∈ B^k, Ŵ_u ≥ λ_{k,n} and W_u < λ_{k,n}/2 together imply that ‖f̂_u − f_u‖_{L²(µ_{d−k})} ≥ (1/5)√λ_{k,n}, where f̂_u = Σ_{v∈B^{d−k}} θ̂_{uv} ϕ_v. [sent-176, score-0.486]
58 Now, it can be shown that, for every u ∈ B^k, the norm ‖f̂_u − f_u‖_{L²(µ_{d−k})} can be expressed as a supremum of an empirical process [15] over a certain function class that depends on k (details are omitted for lack of space). [sent-177, score-0.35]
59 We can then use Talagrand’s concentration-of-measure inequality for empirical processes [16] to get P(Ŵ_u ≥ λ_{k,n}, W_u < λ_{k,n}/2) ≤ exp(−nC_1 (2^k a²_{k,n} ∧ 2^{k/2} a_{k,n})), where a_{k,n} = (1/5)√(α_k log n/n) − C_2/√(2^k n), and C_1, C_2 are the absolute constants in Talagrand’s bound. [sent-178, score-0.193]
60 Summing over k, u ∈ B^k, we see that, with probability ≥ 1 − δ, only the correct calls will be made. [sent-180, score-0.161]
61 Hence, the number of correct recursive calls is bounded by N = Σ_{k=1}^d (2/(M λ_{k,n}))^{p/2} = (2n/(M log n))^{p/2} K(α, p). [sent-184, score-0.394]
62 Therefore, with probability at least 1 − δ, the time complexity will be as stated in the theorem. [sent-186, score-0.079]
63 By controlling the rate at which the sequence αk decays with k, we can trade off MSE against complexity. [sent-189, score-0.084]
64 However, it has K(α, p) = O(M^{p/2} d), resulting in O(d² n² (n/log n)^{p/2}) complexity. [sent-195, score-0.067]
65 The second case, which leads to a very severe estimator that will tend to reject a lot of coefficients, has MSE of O((log n/n)^{2r/(2r+1)} M^{−1/(2r+1)}), but K(α, p) = O(M^{p/2}), leading to a considerably better O(d n² (n/log n)^{p/2}) complexity. [sent-196, score-0.244]
66 However, this reduction in complexity will be offset by a corresponding increase in MSE. [sent-198, score-0.079]
67 In fact, using exponentially decaying αk ’s in practice is not advisable as its low complexity is mainly due to the fact that it will tend to reject even the big coefficients very early on, especially when d is large. [sent-199, score-0.189]
68 To achieve a good balance between complexity and MSE, a moderately decaying threshold sequence might be best, e. [sent-200, score-0.164]
69 As p → 0, the effect of λ on complexity becomes negligible, and the complexity tends to O(n²d). [sent-203, score-0.158]
70 In practice renormalization may be computationally expensive when d is very large. [sent-208, score-0.07]
71 If the estimate is suitably sparse, however, the renormalization can be carried out approximately using Monte-Carlo methods. [sent-209, score-0.104]
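The Monte Carlo scheme is not spelled out here; one simple possibility, sketched under the assumption that f̂_RWT is stored as the sparse dictionary kept returned by the recursion above, is to clip the estimate at zero and approximate its normalizer Z = Σ_x max(f̂(x), 0) by uniform sampling of B^d, which costs O(#samples × #retained coefficients) rather than O(2^d).

```python
import numpy as np

def evaluate_estimate(kept, x):
    """f_hat(x) = sum over retained s of theta_hat_s * phi_s(x), eta = 1/2 basis."""
    d = len(x)
    return sum(theta * 2 ** (-d / 2) * (-1) ** (int(np.dot(s, x)) % 2)
               for s, theta in kept.items())

def monte_carlo_normalizer(kept, d, num_samples=10000, rng=None):
    """Estimate Z = sum_x max(f_hat(x), 0) = 2^d * E_{x ~ Unif(B^d)}[max(f_hat(x), 0)].
    One possible Monte Carlo renormalization; the paper does not specify its scheme."""
    rng = np.random.default_rng() if rng is None else rng
    xs = rng.integers(0, 2, size=(num_samples, d))
    vals = [max(evaluate_estimate(kept, x), 0.0) for x in xs]
    return 2 ** d * float(np.mean(vals))
```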
72 4 Simulations The focus of our work is theoretical, consisting in the derivation of a recursive thresholding procedure for estimating multivariate binary densities (Algorithm 1), with a proof of its near-minimaxity and an asymptotic analysis of its complexity. [sent-210, score-0.551]
73 Although an extensive empirical evaluation is outside the scope of this paper, we have implemented the proposed estimator, and now present some simulation results to demonstrate its small-sample performance. [sent-211, score-0.052]
74 We generated synthetic observations from a mixture density f on a 15-dimensional binary hypercube. [sent-212, score-0.136]
75 The mixture has 10 components, where each component is a product density with 12 randomly chosen covariates having Bernoulli(1/2) distributions, and the other three having Bernoulli(0. [sent-213, score-0.172]
76 As can be seen from the coefficient profile in the bottom of the figure, this density is clearly sparse. [sent-218, score-0.09]
77 Fig. 1 also shows the estimated probabilities and the Walsh coefficients for sample sizes n = 5000 (middle) and n = 10000 (right). [sent-220, score-0.046]
78 Figure 1: Ground truth f (left) and estimated density f̂_RWT for n = 5000 (middle) and n = 10000 (right) with constant thresholding. [sent-221, score-0.187]
79 Top: true and estimated probabilities (clipped at zero and renormalized) arranged in lexicographic order. [sent-222, score-0.104]
80 Bottom: absolute values of true and estimated Walsh coefficients arranged in lexicographic order. [sent-223, score-0.126]
81 For the estimated densities, the coefficient plots also show the threshold level (dotted line) and absolute values of the rejected coefficients (lighter color). [sent-224, score-0.092]
82 [Figure 2: MSE (× 2^d) and running time in seconds for the constant, log, and linear thresholding schemes.] [sent-231, score-0.088]
83 All results are averaged over five independent runs for each sample size (the error bars show the standard deviations). [sent-233, score-0.051]
84 To study the trade-off between MSE and complexity, we implemented three different thresholding schemes: (1) constant, λ_{k,n} = 2 log n/(2^d n), (2) logarithmic, λ_{k,n} = 2 log(d − k + 2) log n/(2^d n), and (3) linear, λ_{k,n} = 2(d − k + 1) log n/(2^d n). [sent-234, score-0.367]
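For reference, the three schedules can be produced with a small helper like the one below (an illustration only; log is taken as the natural logarithm, and the returned list is in the format expected by the recursive_walsh_thresholding sketch above).

```python
import numpy as np

def threshold_schedule(scheme, d, n):
    """The three schedules compared in the simulations, with M = 2^d:
      constant:     lambda_{k,n} = 2 log n / (M n)
      logarithmic:  lambda_{k,n} = 2 log(d - k + 2) log n / (M n)
      linear:       lambda_{k,n} = 2 (d - k + 1) log n / (M n)"""
    M = 2 ** d
    base = 2.0 * np.log(n) / (M * n)
    k = np.arange(1, d + 1)
    if scheme == "constant":
        return np.full(d, base)
    if scheme == "logarithmic":
        return np.log(d - k + 2) * base
    if scheme == "linear":
        return (d - k + 1) * base
    raise ValueError("unknown scheme: %s" % scheme)

# Example: lam = threshold_schedule("logarithmic", d=15, n=5000)
```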
85 Up to the log n factor (dictated by the theory), the thresholds at k = d are set to twice the variance of the empirical estimate of any coefficient whose value is zero; this forces the estimator to reject empirical coefficients whose values cannot be reliably distinguished from zero. [sent-235, score-0.323]
86 Occasionally, spurious coefficients get retained, as can be seen in Fig. [sent-236, score-0.051]
87 In agreement with the theory, MSE is the smallest for the constant thresholding scheme [which is simply an efficient recursive implementation of a term-by-term thresholding estimator with λn ∼ log n/(M n)], and then it increases for the logarithmic and for the linear schemes. [sent-242, score-0.723]
88 Fig. 2(b,c) shows the running time (in seconds) and the number of recursive calls made to RECURSIVEWALSH vs. [sent-244, score-0.33]
89 The number of recursive calls is a platform-independent way of gauging the computational complexity of the algorithm, although it should be kept in mind that each recursive call has O(n²d) overhead. [sent-246, score-0.584]
90 The running time increases polynomially with n, and is the largest for the constant scheme, followed by the logarithmic and the linear schemes. [sent-247, score-0.103]
91 We see that, while the MSE of the logarithmic scheme is fairly close to that of the constant scheme, its complexity is considerably lower, in terms of both the number of recursive calls and the running time. [sent-248, score-0.513]
92 In all three cases, the number of recursive calls decreases with n due to the fact that weight estimates become increasingly accurate with n, which causes the expected number of false discoveries (i. [sent-249, score-0.303]
93 , making a recursive call at an internal node of the tree only to reject its descendants later) to decrease. [sent-251, score-0.273]
94 Fig. 2(d) shows the number of coefficients retained in the estimate. [sent-253, score-0.049]
95 This number grows with n as a consequence of the fact that the threshold decreases with n, while the number of accurately estimated coefficients increases. [sent-254, score-0.07]
96 The true density f has 40 parameters: 9 to specify the weights of the components, 3 per component to locate the indices of the nonuniform covariates, and the single Bernoulli parameter of the nonuniform covariates. [sent-255, score-0.17]
97 Overall, these preliminary simulation results show that our implemented estimator behaves in accordance with the theory even in the small-sample regime. [sent-257, score-0.132]
98 The performance of the logarithmic thresholding scheme is especially encouraging, suggesting that it may be possible to trade off MSE against complexity in a way that will scale to large values of d. [sent-258, score-0.341]
99 To model their densities, we plan to experiment with Walsh bases with η biased toward unity. [sent-266, score-0.08]
100 Some classification procedures for multivariate binary data using orthogonal functions. [sent-297, score-0.137]
wordName wordTfidf (topN-words)
[('walsh', 0.479), ('frwt', 0.342), ('mse', 0.313), ('wu', 0.278), ('bd', 0.188), ('ecursive', 0.183), ('fd', 0.17), ('recursive', 0.166), ('coef', 0.163), ('uv', 0.163), ('fu', 0.162), ('cients', 0.152), ('thresholding', 0.14), ('calls', 0.137), ('estimator', 0.106), ('densities', 0.094), ('density', 0.09), ('minimax', 0.085), ('complexity', 0.079), ('reject', 0.071), ('log', 0.067), ('adapts', 0.065), ('bk', 0.057), ('nc', 0.057), ('covariates', 0.057), ('sparsity', 0.056), ('logarithmic', 0.055), ('multivariate', 0.052), ('duke', 0.051), ('durham', 0.051), ('get', 0.051), ('retained', 0.049), ('bernoulli', 0.047), ('threshold', 0.046), ('chapel', 0.046), ('crk', 0.046), ('goldreich', 0.046), ('lexicographic', 0.046), ('parseval', 0.046), ('renormalization', 0.046), ('binary', 0.046), ('nonuniform', 0.04), ('lazebnik', 0.04), ('hypercube', 0.04), ('orthogonal', 0.039), ('decay', 0.039), ('decaying', 0.039), ('trade', 0.039), ('talagrand', 0.037), ('willett', 0.037), ('call', 0.036), ('shall', 0.035), ('econometrics', 0.034), ('arranged', 0.034), ('biased', 0.034), ('attains', 0.032), ('sparser', 0.032), ('suitably', 0.031), ('hill', 0.03), ('runs', 0.029), ('estimators', 0.029), ('tensor', 0.028), ('scheme', 0.028), ('estimating', 0.028), ('participants', 0.027), ('running', 0.027), ('estimate', 0.027), ('inequality', 0.027), ('truth', 0.026), ('empirical', 0.026), ('implemented', 0.026), ('statistic', 0.026), ('balls', 0.026), ('ground', 0.025), ('asymptotic', 0.025), ('product', 0.025), ('orthonormal', 0.024), ('sup', 0.024), ('panel', 0.024), ('fk', 0.024), ('mn', 0.024), ('bases', 0.024), ('computationally', 0.024), ('estimated', 0.024), ('correct', 0.024), ('cm', 0.023), ('decays', 0.023), ('fourier', 0.023), ('basis', 0.023), ('nonparametric', 0.022), ('write', 0.022), ('plan', 0.022), ('prohibitive', 0.022), ('controlling', 0.022), ('counting', 0.022), ('sample', 0.022), ('absolute', 0.022), ('constant', 0.021), ('groups', 0.021), ('unknown', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva
Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.
2 0.092378609 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
Author: Ralf M. Haefner, Bruce G. Cumming
Abstract: A crucial part of developing mathematical models of information processing in the brain is the quantification of their success. One of the most widely-used metrics yields the percentage of the variance in the data that is explained by the model. Unfortunately, this metric is biased due to the intrinsic variability in the data. We derive a simple analytical modification of the traditional formula that significantly improves its accuracy (as measured by bias) with similar or better precision (as measured by mean-square error) in estimating the true underlying Variance Explained by the model class. Our estimator advances on previous work by a) accounting for overfitting due to free model parameters mitigating the need for a separate validation data set, b) adjusting for the uncertainty in the noise estimate and c) adding a conditioning term. We apply our new estimator to binocular disparity tuning curves of a set of macaque V1 neurons and find that on a population level almost all of the variance unexplained by Gabor functions is attributable to noise. 1
3 0.074748904 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos
Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1
4 0.070813209 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss
Author: Ingo Steinwart, Andreas Christmann
Abstract: In this paper lower and upper bounds for the number of support vectors are derived for support vector machines (SVMs) based on the -insensitive loss function. It turns out that these bounds are asymptotically tight under mild assumptions on the data generating distribution. Finally, we briefly discuss a trade-off in between sparsity and accuracy if the SVM is used to estimate the conditional median. 1
5 0.067794159 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
Author: Aarti Singh, Robert Nowak, Xiaojin Zhu
Abstract: Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
6 0.065005019 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
7 0.064226419 99 nips-2008-High-dimensional support union recovery in multivariate regression
8 0.061586015 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
9 0.060636591 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions
10 0.056349158 106 nips-2008-Inferring rankings under constrained sensing
11 0.054576166 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
12 0.05360695 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
13 0.053115226 75 nips-2008-Estimating vector fields using sparse basis field expansions
14 0.05226393 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
15 0.051361006 179 nips-2008-Phase transitions for high-dimensional joint support recovery
16 0.050747279 118 nips-2008-Learning Transformational Invariants from Natural Movies
17 0.05073059 228 nips-2008-Support Vector Machines with a Reject Option
18 0.050125547 62 nips-2008-Differentiable Sparse Coding
19 0.047882847 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
20 0.047394991 235 nips-2008-The Infinite Hierarchical Factor Regression Model
topicId topicWeight
[(0, -0.148), (1, 0.011), (2, -0.036), (3, 0.063), (4, 0.047), (5, -0.01), (6, -0.031), (7, 0.045), (8, 0.032), (9, 0.054), (10, -0.044), (11, -0.018), (12, 0.046), (13, -0.024), (14, -0.129), (15, -0.005), (16, 0.036), (17, 0.028), (18, 0.006), (19, 0.056), (20, -0.013), (21, 0.087), (22, 0.068), (23, -0.055), (24, 0.022), (25, 0.095), (26, -0.007), (27, -0.086), (28, 0.09), (29, 0.001), (30, 0.036), (31, -0.015), (32, 0.033), (33, 0.043), (34, -0.0), (35, 0.055), (36, 0.007), (37, 0.022), (38, 0.088), (39, -0.009), (40, -0.072), (41, -0.117), (42, 0.005), (43, -0.134), (44, -0.171), (45, -0.114), (46, 0.022), (47, -0.002), (48, -0.057), (49, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.93173146 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva
Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.
2 0.63036537 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
Author: Ralf M. Haefner, Bruce G. Cumming
Abstract: A crucial part of developing mathematical models of information processing in the brain is the quantification of their success. One of the most widely-used metrics yields the percentage of the variance in the data that is explained by the model. Unfortunately, this metric is biased due to the intrinsic variability in the data. We derive a simple analytical modification of the traditional formula that significantly improves its accuracy (as measured by bias) with similar or better precision (as measured by mean-square error) in estimating the true underlying Variance Explained by the model class. Our estimator advances on previous work by a) accounting for overfitting due to free model parameters mitigating the need for a separate validation data set, b) adjusting for the uncertainty in the noise estimate and c) adding a conditioning term. We apply our new estimator to binocular disparity tuning curves of a set of macaque V1 neurons and find that on a population level almost all of the variance unexplained by Gabor functions is attributable to noise. 1
3 0.61971432 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
Author: Guangzhi Cao, Charles Bouman
Abstract: Covariance estimation for high dimensional vectors is a classically difficult problem in statistical analysis and machine learning. In this paper, we propose a maximum likelihood (ML) approach to covariance estimation, which employs a novel sparsity constraint. More specifically, the covariance is constrained to have an eigen decomposition which can be represented as a sparse matrix transform (SMT). The SMT is formed by a product of pairwise coordinate rotations known as Givens rotations. Using this framework, the covariance can be efficiently estimated using greedy minimization of the log likelihood function, and the number of Givens rotations can be efficiently computed using a cross-validation procedure. The resulting estimator is positive definite and well-conditioned even when the sample size is limited. Experiments on standard hyperspectral data sets show that the SMT covariance estimate is consistently more accurate than both traditional shrinkage estimates and recently proposed graphical lasso estimates for a variety of different classes and sample sizes. 1
4 0.60070437 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss
Author: Ingo Steinwart, Andreas Christmann
Abstract: In this paper lower and upper bounds for the number of support vectors are derived for support vector machines (SVMs) based on the -insensitive loss function. It turns out that these bounds are asymptotically tight under mild assumptions on the data generating distribution. Finally, we briefly discuss a trade-off in between sparsity and accuracy if the SVM is used to estimate the conditional median. 1
5 0.59449941 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: Conditional Random Sampling (CRS) was originally proposed for efficiently computing pairwise (l2 , l1 ) distances, in static, large-scale, and sparse data. This study modifies the original CRS and extends CRS to handle dynamic or streaming data, which much better reflect the real-world situation than assuming static data. Compared with many other sketching algorithms for dimension reductions such as stable random projections, CRS exhibits a significant advantage in that it is “one-sketch-for-all.” In particular, we demonstrate the effectiveness of CRS in efficiently computing the Hamming norm, the Hamming distance, the lp distance, and the χ2 distance. A generic estimator and an approximate variance formula are also provided, for approximating any type of distances. We recommend CRS as a promising tool for building highly scalable systems, in machine learning, data mining, recommender systems, and information retrieval. 1
6 0.52547008 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
7 0.51721275 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
8 0.47252896 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
9 0.44841912 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
10 0.44416511 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
11 0.43289858 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
12 0.41757569 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
13 0.41739258 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
14 0.41500059 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
15 0.38234776 153 nips-2008-Nonlinear causal discovery with additive noise models
16 0.36549219 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
17 0.36277574 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
18 0.36103767 233 nips-2008-The Gaussian Process Density Sampler
19 0.35979149 75 nips-2008-Estimating vector fields using sparse basis field expansions
20 0.35341099 106 nips-2008-Inferring rankings under constrained sensing
topicId topicWeight
[(4, 0.014), (6, 0.104), (7, 0.088), (12, 0.048), (28, 0.158), (57, 0.08), (59, 0.035), (63, 0.021), (71, 0.025), (75, 0.239), (77, 0.033), (78, 0.017), (83, 0.043), (94, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.82328093 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva
Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.
2 0.78112304 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
Author: Indraneel Mukherjee, David M. Blei
Abstract: Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as O(k − 1) + log m/m, where k is the number of topics in the model and m is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation. 1
3 0.75830835 84 nips-2008-Fast Prediction on a Tree
Author: Mark Herbster, Massimiliano Pontil, Sergio R. Galeano
Abstract: Given an n-vertex weighted tree with structural diameter S and a subset of m vertices, we present a technique to compute a corresponding m × m Gram matrix of the pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. We discuss the application of this technique to fast label prediction on a generic graph. We approximate the graph with a spanning tree and then we predict with the kernel perceptron. We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. We present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 vertices and more than 10,000,000 edges. The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information. 1
4 0.6883384 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
5 0.68100446 75 nips-2008-Estimating vector fields using sparse basis field expansions
Author: Stefan Haufe, Vadim V. Nikulin, Andreas Ziehe, Klaus-Robert Müller, Guido Nolte
Abstract: We introduce a novel framework for estimating vector fields using sparse basis field expansions (S-FLEX). The notion of basis fields, which are an extension of scalar basis functions, arises naturally in our framework from a rotational invariance requirement. We consider a regression setting as well as inverse problems. All variants discussed lead to second-order cone programming formulations. While our framework is generally applicable to any type of vector field, we focus in this paper on applying it to solving the EEG/MEG inverse problem. It is shown that significantly more precise and neurophysiologically more plausible location and shape estimates of cerebral current sources from EEG/MEG measurements become possible with our method when comparing to the state-of-the-art. 1
6 0.67553943 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
7 0.67294693 202 nips-2008-Robust Regression and Lasso
8 0.6703862 226 nips-2008-Supervised Dictionary Learning
9 0.67000622 143 nips-2008-Multi-label Multiple Kernel Learning
10 0.66961384 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
11 0.66679448 66 nips-2008-Dynamic visual attention: searching for coding length increments
12 0.66657865 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
13 0.66625351 194 nips-2008-Regularized Learning with Networks of Features
14 0.66607594 179 nips-2008-Phase transitions for high-dimensional joint support recovery
15 0.66536766 118 nips-2008-Learning Transformational Invariants from Natural Movies
16 0.66474515 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
17 0.66290867 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
18 0.66254866 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
19 0.66248697 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
20 0.66238052 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations