nips nips2011 nips2011-305 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Samory Kpotufe
Abstract: Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. [sent-3, score-0.669]
2 These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e. [sent-4, score-0.876]
3 We show that k-NN regression is also adaptive to intrinsic dimension. [sent-7, score-0.315]
4 In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. [sent-8, score-0.494]
5 Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. [sent-9, score-0.852]
6 We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. [sent-10, score-1.301]
7 1 Introduction We derive new rates of convergence in terms of dimension for the popular approach of Nearest Neighbor (k-NN) regression. [sent-11, score-0.421]
8 Our motivation is that, for good performance, k-NN regression can require a number of samples exponential in the dimension of the input space X . [sent-12, score-0.419]
9 Formally stated, the curse of dimension is the fact that, for any nonparametric regressor, there exists a distribution in R^D such that, given a training size n, the regressor converges at a rate no better than n^{-1/O(D)}. [sent-14, score-0.656]
10 Fortunately it often occurs that high-dimensional data has low intrinsic dimension: typical examples are data lying near low-dimensional manifolds [3, 4, 5]. [sent-17, score-0.243]
11 We would hope that in these cases nonparametric regressors can escape the curse of dimension, i.e. [sent-18, score-0.258]
12 their performance should only depend on the intrinsic dimension of the data, appropriately formalized. [sent-20, score-0.421]
13 In other words, if the data in R^D has intrinsic dimension d < D, we would hope for a better convergence rate of the form n^{-1/O(d)} instead of n^{-1/O(D)}. [sent-21, score-0.525]
14 This has recently been shown to indeed be the case for methods such as kernel regression [6], tree-based regression [7] and variants of these methods [8]. [sent-22, score-0.274]
15 In the case of k-NN regression however, it is only known that 1-NN regression (where k = 1) converges at a rate that depends on intrinsic dimension [9]. [sent-23, score-0.757]
16 First we show that, for a wide range of values of k ensuring consistency, k-NN regression converges at a rate that only depends on the intrinsic dimension in a neighborhood of a query x. [sent-28, score-0.844]
17 Our local notion of dimension in a neighborhood of a point x relies on the well-studied notion of doubling measure (see Section 2). [sent-29, score-1.137]
18 In particular our dimension quantifies how the mass of balls varies with radius, and this captures standard examples of data with low intrinsic dimension. [sent-31, score-0.673]
19 Our second, and perhaps most important contribution, is a simple procedure for choosing k = k(x) so as to nearly achieve the minimax rate of O(n^{-2/(2+d)}) in terms of the unknown dimension d in a neighborhood of x. [sent-32, score-0.787]
20 Our final contribution is in showing that this minimax rate holds for any metric space and doubling measure. [sent-33, score-0.874]
21 In other words the hardness of the regression problem is not tied to a particular choice of metric space X or doubling measure µ, but depends only on how the doubling measure µ expands on a metric space X. [sent-34, score-1.474]
22 Thus, for any marginal µ on X with expansion constant Θ(2^d), the minimax rate for the measure space (X, µ) is Ω(n^{-2/(2+d)}). [sent-35, score-0.409]
23 Discussion. It is desirable to express regression rates in terms of a local notion of dimension rather than a global one because the complexity of data can vary considerably over regions of space. [sent-37, score-0.825]
24 A simple example of this is a so-called space filling curve where a low-dimensional manifold curves enough that globally it seems to fill up space. [sent-43, score-0.158]
25 The behavior of k-NN regression is rather controlled by the often smaller local dimension in a neighborhood B(x, r) of x, where the neighborhood size r shrinks with k/n. [sent-45, score-0.829]
26 Given such a neighborhood B(x, r) of x, how does one choose k = k(x) optimally relative to the unknown local dimension in B(x, r)? [sent-46, score-0.591]
27 Another possibility is to estimate the dimension of the data in the vicinity of x, and use this estimate to set k. [sent-49, score-0.297]
28 However, for optimal rates, we would have to estimate the dimension exactly, and we know of no finite-sample result that guarantees exact estimation of the intrinsic dimension. [sent-50, score-0.421]
29 Our method consists of finding a value of k that balances quantities which control estimator variance and bias at x, namely 1/k and distances to x’s k nearest neighbors. [sent-51, score-0.197]
30 The method guarantees, uniformly over all x ∈ X, a near-optimal rate of O(n^{-2/(2+d)}) where d = d(x) is exactly the unknown local dimension on a neighborhood B(x, r) of x, where r → 0 as n → ∞. [sent-52, score-0.717]
31 We are given n i.i.d. samples (X, Y) = {(X_i, Y_i)}_{i=1}^n from some unknown distribution where the input variable X belongs to a metric space (X, ρ), and the output Y is a real number. [sent-55, score-0.189]
32 We assume that the class B of balls on (X , ρ) has finite VC dimension VB . [sent-56, score-0.412]
33 The VC assumption is however irrelevant to the minimax result of Theorem 2. [sent-61, score-0.216]
34 Regression function and noise. The regression function f(x) = E[Y | X = x] is assumed to be λ-Lipschitz, i.e. |f(x) − f(x′)| ≤ λρ(x, x′) for all x, x′ ∈ X. [sent-64, score-0.185]
35 We assume a simple but general noise model: the distributions of the noise at points x ∈ X have uniformly bounded tails and variance. [sent-67, score-0.193]
36 As a last example, in the case of Gaussian (or sub-Gaussian) noise, it’s not hard to see that ∀δ > 0, t_Y(δ) ≤ O(√(ln(1/δ))). [sent-76, score-0.231]
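A one-line justification of this bound, assuming the noise is σ-sub-Gaussian (so P(|Y − f(x)| > t) ≤ 2e^{-t²/(2σ²)}) and that t_Y(δ) denotes a uniform bound on the δ-quantile of the noise magnitude (the precise definition of t_Y is not reproduced on this page):

```latex
2e^{-t^2/(2\sigma^2)} \le \delta
\quad\Longleftrightarrow\quad
t \ge \sigma\sqrt{2\ln(2/\delta)},
\qquad\text{so}\quad
t_Y(\delta) \le \sigma\sqrt{2\ln(2/\delta)} = O\!\big(\sqrt{\ln(1/\delta)}\big).
```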
37 Weighted k-NN regression estimate. We assume a kernel function K : R+ → R+, non-increasing, such that K(1) > 0 and K(ρ) = 0 for ρ > 1. [sent-78, score-0.137]
38 The regression estimate at x given the n-sample (X, Y) is then defined as f_{n,k}(x) = Σ_i w_{i,k}(x) Y_i, where the weights w_{i,k}(x) ∝ K(ρ(x, X_i)/r_{k,n}(x)) sum to 1 and r_{k,n}(x) denotes the distance from x to its k-th nearest neighbor among X_1, …, X_n. [sent-80, score-0.137]
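A minimal sketch of a weighted k-NN estimate of this form, assuming a Euclidean metric ρ and using the boxcar kernel as the default K; the function name and array layout are illustrative, not from the paper:

```python
import numpy as np

def weighted_knn_estimate(x, X, Y, k, kernel=lambda u: (u <= 1.0).astype(float)):
    """Sketch of f_{n,k}(x) = sum_i w_{i,k}(x) * Y_i with kernel weights.

    X: (n, D) array of inputs, Y: (n,) array of outputs.
    kernel: non-increasing K with K(1) > 0 and K(u) = 0 for u > 1
            (the default boxcar kernel recovers plain k-NN averaging).
    """
    dists = np.linalg.norm(X - x, axis=1)       # rho(x, X_i) for the Euclidean metric
    r_kn = np.sort(dists)[k - 1]                # r_{k,n}(x): distance to the k-th nearest neighbor
    raw = kernel(dists / max(r_kn, 1e-12))      # K(rho(x, X_i) / r_{k,n}(x))
    w = raw / raw.sum()                         # normalized weights w_{i,k}(x)
    return float(w @ Y)                         # f_{n,k}(x)
```

With the boxcar kernel every one of the k nearest neighbors receives weight 1/k, which matches the 1/k variance term used later in the text.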
39 Notion of dimension. We start with the following definition of doubling measure which will lead to the notion of local dimension used in this work. [sent-82, score-1.117]
40 See [. . ., 11, 12] for thorough overviews of the topic of metric space dimension and doubling measures. [sent-84, score-0.846]
41 The marginal µ is a doubling measure if there exists C_db > 0 such that for any x ∈ X and r ≥ 0, we have µ(B(x, r)) ≤ C_db µ(B(x, r/2)). [sent-86, score-0.514]
42 An equivalent definition is that µ is doubling if there exist C and d such that for any x ∈ X, any r ≥ 0 and any 0 < ε < 1, we have µ(B(x, r)) ≤ Cε^{-d} µ(B(x, εr)). [sent-88, score-0.43]
43 A simple example of a doubling measure is the Lebesgue volume in the Euclidean space Rd . [sent-94, score-0.509]
44 For any x ∈ R^d and r > 0, vol(B(x, r)) = vol(B(x, 1)) · r^d. [sent-95, score-0.42]
45 Thus vol(B(x, r)) / vol(B(x, εr)) = ε^{-d} for any x ∈ R^d, r > 0 and 0 < ε < 1. [sent-96, score-0.322]
46 Building upon the doubling behavior of volumes in Rd , we can construct various examples of doubling probability measures. [sent-97, score-0.889]
47 Let X ⊂ R^D be a subset of a d-dimensional hyperplane, and let X satisfy for all balls B(x, r) with x ∈ X, vol(B(x, r) ∩ X) = Θ(r^d), the volume being with respect to the containing hyperplane. [sent-99, score-0.366]
48 Now let µ be approximately uniform, that is µ satisfies for all such balls B(x, r), µ(B(x, r) ∩ X ) = Θ(vol (B(x, r) ∩ X )). [sent-100, score-0.205]
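For intuition only, the relation µ(B(x, r)) ≈ ε^{-d} µ(B(x, εr)) suggests reading off a local dimension from how empirical ball masses grow with radius. The sketch below does this with a fixed radius r and shrink factor ε, both free parameters not specified by the paper (whose adaptive procedure deliberately avoids explicit dimension estimation):

```python
import numpy as np

def local_dimension_estimate(x, X, r, eps=0.5):
    """Rough estimate of the local dimension d(x) from empirical ball masses,
    via mu(B(x, r)) ~ eps^{-d} * mu(B(x, eps * r)). Illustrative only."""
    dists = np.linalg.norm(X - x, axis=1)
    n_big = np.sum(dists <= r)           # number of samples in B(x, r)
    n_small = np.sum(dists <= eps * r)   # number of samples in B(x, eps * r)
    if n_small == 0:
        return float("nan")              # too few samples at this scale
    return float(np.log(n_big / n_small) / np.log(1.0 / eps))
```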
49 Unfortunately a global notion of dimension such as the above definition of d is rather restrictive as it requires the same complexity globally and locally. [sent-102, score-0.448]
50 However a data space can be complex globally and have small complexity locally. [sent-103, score-0.145]
51 The manifold might be globally complex but the restriction of µ to a ball B(x, r), x ∈ X , is doubling with local dimension d, provided r is sufficiently small and certain conditions on curvature hold. [sent-105, score-0.893]
52 The above example motivates the following definition of local dimension d. [sent-109, score-0.344]
53 In particular, another space with small local dimension is a sparse data space X ⊂ R^D where each vector x has at most d non-zero coordinates, i.e. [sent-115, score-0.422]
54 X is a collection of finitely many hyperplanes of dimension at most d. [sent-117, score-0.243]
55 all have local dimension d on B, then µ is also (C, d)-homogeneous on B for some C. [sent-121, score-0.344]
56 We want rates of convergence which hold uniformly over all regions where µ is doubling. [sent-122, score-0.284]
57 The marginal µ is (C0, d0)-maximally-homogeneous for some C0 ≥ 1 and d0 ≥ 1, if the following holds for all x ∈ X and r > 0: if there exist C ≥ 1 and d ≥ 1 such that µ is (C, d)-homogeneous on B(x, r), then µ is (C0, d0)-homogeneous on B(x, r). [sent-126, score-0.208]
58 We note that, rather than assuming as in Definition 3 that all local dimensions are at most d0 , we can express our results in terms of the subset of X where local dimensions are at most d0 . [sent-127, score-0.202]
59 Local rates for fixed k. The first result below establishes the rates of convergence for any k ≳ ln n in terms of the (unknown) complexity on B(x, r), where r is any r satisfying µ(B(x, r)) > Ω(k/n) (we need at least Ω(k) samples in the relevant neighborhoods of x). [sent-131, score-0.65]
60 Suppose µ is (C0 , d0 )-maximally-homogeneous, and B has finite VC dimension VB . [sent-133, score-0.243]
61 With probability at least 1 − 2δ over the choice of (X, Y), the following holds simultaneously for all x ∈ X and k satisfying n > k ≥ VB ln 2n + ln(8/δ). [sent-135, score-0.456]
62 Note that the above rates hold uniformly over x, over k ≳ ln n, and over any r where µ(B(x, r)) ≥ Ω(k/n). [sent-140, score-0.429]
63 The rate also depends on µ(B(x, r)) and suggests that the best scenario is that where x has a small neighborhood of large mass and small dimension d. [sent-141, score-0.524]
64 Minimax rates for a doubling measure. Our next result shows that the hardness of the regression problem is not tied to a particular choice of the metric X or the doubling measure µ. [sent-143, score-1.427]
65 The result relies mainly on the fact that µ is doubling on X . [sent-144, score-0.459]
66 This does not however make the lower-bound less expressive, as it still tells us which rates to expect locally. [sent-146, score-0.136]
67 Thus if µ is (C, d)-homogeneous near x, we cannot expect a better rate than O(n^{-2/(2+d)}) (assuming a Lipschitz regression function f). [sent-147, score-0.229]
68 Let µ be a doubling measure on a metric space (X, ρ) of diameter 1, and suppose µ satisfies, for all x ∈ X, all r > 0 and 0 < ε < 1, C1 ε^{-d} µ(B(x, εr)) ≤ µ(B(x, r)) ≤ C2 ε^{-d} µ(B(x, εr)), where C1, C2 and d are positive constants independent of x, r, and ε. [sent-149, score-0.699]
69 Fix a sample size n > 0 and let fn denote any regressor on samples (X, Y) of size n. [sent-152, score-0.247]
70 Choosing k for near-optimal rates at x. Our last result shows a practical and simple way to choose k locally so as to nearly achieve the minimax rate at x, i.e. [sent-157, score-0.468]
71 a rate that depends on the unknown local dimension in a neighborhood B(x, r) of x, where again, r satisfies µ(B(x, r)) > Ω(k/n) for good choices of k. [sent-159, score-0.625]
72 The intuition, similar to that of a method for tree-pruning in [7], is to look for a k that balances the variance (roughly 1/k) and the squared bias (roughly r_{k,n}^2(x)) of the estimate. [sent-162, score-0.143]
73 The procedure is as follows: Choosing k at x: Pick ∆ ≥ max_i ρ(x, X_i), and pick θ_{n,δ} ≥ ln(n/δ). [sent-163, score-0.318]
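A sketch of the balancing idea behind this procedure: treat θ_{n,δ}/k as the variance proxy and the normalized squared k-NN radius as the squared-bias proxy, and take the largest k for which the bias proxy is still dominated. This is one natural instantiation of the balance described in the text, assuming a Euclidean metric; it is not claimed to be the paper's exact selection rule.

```python
import numpy as np

def choose_k(x, X, theta, k_min=1):
    """Pick k = k(x) by balancing a variance proxy theta / k against a squared-bias
    proxy (r_{k,n}(x) / Delta)^2, with Delta >= max_i rho(x, X_i). Sketch only."""
    dists = np.sort(np.linalg.norm(X - x, axis=1))
    delta = max(dists[-1], 1e-12)                # Delta >= max_i rho(x, X_i)
    best_k = k_min
    for k in range(k_min, len(dists) + 1):
        r_kn = dists[k - 1]                      # k-NN radius r_{k,n}(x)
        if (r_kn / delta) ** 2 <= theta / k:     # squared-bias proxy still below variance proxy
            best_k = k                           # keep the largest admissible k
    return best_k
```

Since r_{k,n}(x) is non-decreasing in k while θ_{n,δ}/k decreases, the largest admissible k sits near the crossing point of the two curves, which is the balance the text describes.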
74 Suppose µ is (C0 , d0 )-maximally-homogeneous, and B has finite VC dimension VB . [sent-170, score-0.243]
75 Let 0 < δ < 1 and suppose n^{4/(6+3d_0)} > (V_B ln 2n + ln(8/δ)) / θ_{n,δ}. [sent-172, score-0.316]
76 With probability at least 1 − 2δ over the choice of (X, Y), the following holds simultaneously for all x ∈ X . [sent-173, score-0.186]
77 Thus ideally we want to set θ_{n,δ} to the order of t_Y^2(δ/2n) · ln(n/δ). [sent-181, score-0.231]
78 Just as in Theorem 1, the rates of Theorem 3 hold uniformly for all x ∈ X, and all 0 < r < ∆ where µ(B(x, r)) > Ω(n^{-1/3}). [sent-182, score-0.198]
79 As n grows, any neighborhood B(x, r) of x, 0 < r < sup_{x′∈X} ρ(x, x′), becomes admissible. [sent-187, score-0.174]
80 Once a neighborhood B(x, r) is admissible for some n, our procedure nearly attains the minimax rates in terms of the local dimension on B(x, r), provided µ is doubling on B(x, r). [sent-188, score-1.398]
81 Again, the mass of an admissible neighborhood affects the rate, and the bound in Theorem 3 is best for an admissible neighborhood with large mass µ(B(x, r)) and small dimension d. [sent-189, score-0.922]
82 Local rates for fixed k: bias and variance at x. In this section we bound the bias and variance terms of equation (1) with high probability, uniformly over x ∈ X. [sent-197, score-0.473]
83 Let B denote the class of balls on X , with VC-dimension VB . [sent-201, score-0.169]
84 Let 0 < δ < 1, and define αn = (VB ln 2n + ln(8/δ)) /n. [sent-202, score-0.231]
85 The following holds with probability at least 1 − δ for all balls in B. [sent-203, score-0.289]
86 We start with the bias which is simpler to handle: it is easy to show that the bias of the estimate at x depends on the radius rk,n (x). [sent-206, score-0.176]
87 This radius can then be bounded, first in expectation using the doubling assumption on µ, then by calling on the above lemma to relate this expected bound to rk,n (x) with high probability. [sent-207, score-0.641]
88 With probability at least 1 − δ over the choice of X, the following holds simultaneously for all x ∈ X and k satisfying n > k ≥ VB ln 2n + ln(8/δ). [sent-211, score-0.456]
89 Set ε = (3Ck / (n µ(B(x, r))))^{1/d}, so that ε < 1 by the bound on µ(B(x, r)); then by the local doubling assumption on B(x, r), we have µ(B(x, εr)) ≥ C^{-1} ε^d µ(B(x, r)) ≥ 3k/n. [sent-224, score-0.598]
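For concreteness, the algebra behind the last step, under this choice of ε, is:

```latex
C^{-1}\varepsilon^{d}\,\mu(B(x,r))
  \;=\; C^{-1}\cdot\frac{3Ck}{n\,\mu(B(x,r))}\cdot\mu(B(x,r))
  \;=\; \frac{3k}{n}.
```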
90 With probability at least 1 − 2δ over the choice of (X, Y), the following then holds simultaneously for all x ∈ X and k ∈ [n]: |f_{n,k}(x) − f̃_{n,k}(x)|^2 ≤ 2 K(0) V_B · (t_Y^2(δ/2n) · ln(2n/δ) + σ_Y^2) · (. . .). [sent-230, score-0.186]
91 For X fixed, the number of such subsets Y_{x,k} is therefore at most the number of ways we can intersect balls in B with the sample X; this is in turn upper-bounded by n^{V_B} as is well-known in VC theory. [sent-234, score-0.154]
92 (3) i We can now bound i 2 wi,k (x) as follows: 2 wi,k (x) ≤ max wi,k (x) = max i∈[n] i ≤ i∈[n] K (ρ(x, xi )/rk,n (x)) ≤ j K (ρ(x, xj )/rk,n (x)) K(0) K (ρ(x, xj )/rk,n (x)) j K(0) K(0) ≤ . [sent-256, score-0.154]
93 Minimax rates for a doubling measure. The minimax rates of Theorem 2 (proved in the long version [15]) are obtained, as is commonly done, by constructing a regression problem that reduces to the problem of binary classification. [sent-259, score-0.147]
94 We will consider a class of candidate regression functions such that each function f alternates between positive and negative in neighboring regions (f is depicted as the dashed line below). [sent-266, score-0.181]
95 The reduction relies on the simple observation that for a regressor fn to approximate the right f from data it needs to at least identify the sign of f in the various regions of space. [sent-267, score-0.323]
96 Now by the first inequality of (4) we also have µ(B(x, r)) > 6Cθ_{n,δ} n^{-1/3} ≥ 6Cθ_{n,δ} n^{-d/(2+d)} ≥ (θ_{n,δ}/n) · n^{4/(6+3d)} ≥ (θ_{n,δ}/n) · n^{4/(6+3d_0)} ≥ κ/n ≥ α_n, where α_n = (V_B ln 2n + ln(8/δ))/n is as defined in Lemma 1. [sent-279, score-0.264]
97 Therefore, supposing k1 ≤ κ, it must be that k2 > κ; otherwise k2 would be the highest integer satisfying the condition, contradicting our choice of k1. [sent-284, score-0.159]
98 Final remark. The problem of choosing k = k(x) optimally at x is similar to the problem of local bandwidth selection for kernel-based methods (see e.g. [sent-293, score-0.215]
99 [16, 17]), and our method for choosing k might yield insights into bandwidth selection, since k-NN and kernel regression methods only differ in their notion of neighborhood of a query x. [sent-295, score-0.507]
100 Rate of convergence for the wild bootstrap in nonparametric regression. [sent-386, score-0.147]
wordName wordTfidf (topN-words)
[('doubling', 0.43), ('vb', 0.358), ('dimension', 0.243), ('ln', 0.231), ('minimax', 0.186), ('intrinsic', 0.178), ('ty', 0.176), ('neighborhood', 0.174), ('balls', 0.169), ('cdb', 0.166), ('vol', 0.161), ('regression', 0.137), ('rates', 0.136), ('fn', 0.123), ('metric', 0.105), ('lemma', 0.104), ('curse', 0.102), ('local', 0.101), ('nvb', 0.1), ('rd', 0.098), ('vc', 0.095), ('admissible', 0.088), ('regressor', 0.088), ('pick', 0.087), ('suppose', 0.085), ('globally', 0.079), ('bias', 0.068), ('samory', 0.066), ('regressors', 0.066), ('uniformly', 0.062), ('rate', 0.062), ('nition', 0.06), ('notion', 0.06), ('fix', 0.057), ('yi', 0.057), ('nearest', 0.054), ('vicinity', 0.054), ('theorem', 0.052), ('holds', 0.052), ('query', 0.05), ('bandwidth', 0.049), ('noise', 0.048), ('nonparametric', 0.046), ('mass', 0.045), ('unknown', 0.045), ('regions', 0.044), ('marginal', 0.044), ('locally', 0.044), ('escape', 0.044), ('convergence', 0.042), ('zi', 0.042), ('xi', 0.041), ('measure', 0.04), ('radius', 0.04), ('manifold', 0.04), ('nearly', 0.04), ('ey', 0.04), ('least', 0.039), ('global', 0.039), ('space', 0.039), ('satisfying', 0.039), ('balances', 0.039), ('expansion', 0.038), ('xj', 0.038), ('vary', 0.038), ('hardness', 0.038), ('choosing', 0.037), ('bound', 0.037), ('variance', 0.036), ('let', 0.036), ('tied', 0.036), ('adapts', 0.036), ('bounded', 0.035), ('choice', 0.035), ('manifolds', 0.035), ('event', 0.035), ('neighbor', 0.033), ('inequality', 0.033), ('cn', 0.032), ('simultaneously', 0.031), ('assumption', 0.03), ('equation', 0.03), ('near', 0.03), ('nitely', 0.03), ('relies', 0.029), ('rki', 0.029), ('krzyzak', 0.029), ('birkhauser', 0.029), ('overviews', 0.029), ('kohler', 0.029), ('fractal', 0.029), ('wildly', 0.029), ('probability', 0.029), ('optimally', 0.028), ('affects', 0.028), ('proceed', 0.027), ('complexity', 0.027), ('exists', 0.027), ('dimensionality', 0.027), ('grow', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
Author: Samory Kpotufe
Abstract: Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. 1
2 0.20100044 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
Author: Shinichi Nakajima, Masashi Sugiyama, S. D. Babacan
Abstract: Variational Bayesian matrix factorization (VBMF) efficiently approximates the posterior distribution of factorized matrices by assuming matrix-wise independence of the two factors. A recent study on fully-observed VBMF showed that, under a stronger assumption that the two factorized matrices are column-wise independent, the global optimal solution can be analytically computed. However, it was not clear how restrictive the column-wise independence assumption is. In this paper, we prove that the global solution under matrix-wise independence is actually column-wise independent, implying that the column-wise independence assumption is harmless. A practical consequence of our theoretical finding is that the global solution under matrix-wise independence (which is a standard setup) can be obtained analytically in a computationally very efficient way without any iterative algorithms. We experimentally illustrate advantages of using our analytic solution in probabilistic principal component analysis. 1
3 0.15253848 25 nips-2011-Adaptive Hedge
Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald
Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1
4 0.13495931 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Author: Eric Moulines, Francis R. Bach
Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
5 0.12433445 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1
6 0.10682553 263 nips-2011-Sparse Manifold Clustering and Embedding
7 0.10113021 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
8 0.098658599 162 nips-2011-Lower Bounds for Passive and Active Learning
9 0.09731669 109 nips-2011-Greedy Model Averaging
10 0.087665267 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
11 0.078041971 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
12 0.076597013 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
13 0.076274596 177 nips-2011-Multi-armed bandits on implicit metric spaces
14 0.076189019 282 nips-2011-The Fast Convergence of Boosting
15 0.075282693 198 nips-2011-On U-processes and clustering performance
16 0.073687784 171 nips-2011-Metric Learning with Multiple Kernels
17 0.072645433 217 nips-2011-Practical Variational Inference for Neural Networks
18 0.072173208 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
19 0.069479913 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
20 0.068027362 80 nips-2011-Efficient Online Learning via Randomized Rounding
topicId topicWeight
[(0, 0.21), (1, -0.084), (2, -0.042), (3, -0.122), (4, 0.001), (5, 0.004), (6, 0.02), (7, -0.044), (8, -0.002), (9, 0.017), (10, -0.088), (11, 0.029), (12, 0.011), (13, -0.088), (14, 0.082), (15, -0.005), (16, -0.037), (17, -0.045), (18, -0.096), (19, 0.148), (20, -0.034), (21, -0.017), (22, 0.129), (23, 0.194), (24, -0.046), (25, 0.121), (26, 0.059), (27, 0.039), (28, -0.032), (29, -0.071), (30, -0.044), (31, 0.076), (32, -0.01), (33, -0.057), (34, 0.259), (35, -0.082), (36, 0.045), (37, 0.11), (38, 0.079), (39, 0.047), (40, 0.082), (41, 0.006), (42, -0.082), (43, -0.067), (44, 0.132), (45, -0.058), (46, 0.0), (47, 0.061), (48, -0.03), (49, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.95788884 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
Author: Samory Kpotufe
Abstract: Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. 1
2 0.68766111 109 nips-2011-Greedy Model Averaging
Author: Dong Dai, Tong Zhang
Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst √ case regret of the latter decays at the rate of O(1/ n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples. 1
3 0.65425026 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
Author: Shinichi Nakajima, Masashi Sugiyama, S. D. Babacan
Abstract: Variational Bayesian matrix factorization (VBMF) efficiently approximates the posterior distribution of factorized matrices by assuming matrix-wise independence of the two factors. A recent study on fully-observed VBMF showed that, under a stronger assumption that the two factorized matrices are column-wise independent, the global optimal solution can be analytically computed. However, it was not clear how restrictive the column-wise independence assumption is. In this paper, we prove that the global solution under matrix-wise independence is actually column-wise independent, implying that the column-wise independence assumption is harmless. A practical consequence of our theoretical finding is that the global solution under matrix-wise independence (which is a standard setup) can be obtained analytically in a computationally very efficient way without any iterative algorithms. We experimentally illustrate advantages of using our analytic solution in probabilistic principal component analysis. 1
4 0.60570323 25 nips-2011-Adaptive Hedge
Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald
Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1
5 0.57689387 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
Author: Joseph Keshet, David A. McAllester
Abstract: We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized on a given sample. 1
6 0.56157589 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
7 0.50515002 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
8 0.48574725 241 nips-2011-Scalable Training of Mixture Models via Coresets
9 0.46930617 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
10 0.43836576 217 nips-2011-Practical Variational Inference for Neural Networks
11 0.42762873 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
12 0.42512906 69 nips-2011-Differentially Private M-Estimators
13 0.42510387 162 nips-2011-Lower Bounds for Passive and Active Learning
14 0.41654462 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
15 0.41155177 263 nips-2011-Sparse Manifold Clustering and Embedding
16 0.40901577 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
17 0.4071888 198 nips-2011-On U-processes and clustering performance
18 0.39259887 202 nips-2011-On the Universality of Online Mirror Descent
19 0.39089635 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
20 0.38068673 231 nips-2011-Randomized Algorithms for Comparison-based Search
topicId topicWeight
[(0, 0.032), (4, 0.035), (20, 0.485), (26, 0.031), (31, 0.057), (33, 0.01), (43, 0.083), (45, 0.08), (57, 0.027), (74, 0.043), (99, 0.04)]
simIndex simValue paperId paperTitle
1 0.90290886 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
Author: Joseph J. Lim, Antonio Torralba, Ruslan Salakhutdinov
Abstract: Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples. To overcome this lack of training data for certain classes, we propose a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes. Our model learns which training instances from other classes to borrow and how to transform the borrowed examples so that they become more similar to instances from the target class. Our experimental results demonstrate that our new object detector, with borrowed and transformed examples, improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset. 1
same-paper 2 0.90255451 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
Author: Samory Kpotufe
Abstract: Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. 1
3 0.89410359 275 nips-2011-Structured Learning for Cell Tracking
Author: Xinghua Lou, Fred A. Hamprecht
Abstract: We study the problem of learning to track a large quantity of homogeneous objects such as cell tracking in cell culture study and developmental biology. Reliable cell tracking in time-lapse microscopic image sequences is important for modern biomedical research. Existing cell tracking methods are usually kept simple and use only a small number of features to allow for manual parameter tweaking or grid search. We propose a structured learning approach that allows to learn optimum parameters automatically from a training set. This allows for the use of a richer set of features which in turn affords improved tracking compared to recently reported methods on two public benchmark sequences. 1
4 0.89028698 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation
Author: Sungwoong Kim, Sebastian Nowozin, Pushmeet Kohli, Chang D. Yoo
Abstract: For many of the state-of-the-art computer vision algorithms, image segmentation is an important preprocessing step. As such, several image segmentation algorithms have been proposed, however, with certain reservation due to high computational load and many hand-tuning parameters. Correlation clustering, a graphpartitioning algorithm often used in natural language processing and document clustering, has the potential to perform better than previously proposed image segmentation algorithms. We improve the basic correlation clustering formulation by taking into account higher-order cluster relationships. This improves clustering in the presence of local boundary ambiguities. We first apply the pairwise correlation clustering to image segmentation over a pairwise superpixel graph and then develop higher-order correlation clustering over a hypergraph that considers higher-order relations among superpixels. Fast inference is possible by linear programming relaxation, and also effective parameter learning framework by structured support vector machine is possible. Experimental results on various datasets show that the proposed higher-order correlation clustering outperforms other state-of-the-art image segmentation algorithms.
5 0.79528981 260 nips-2011-Sparse Features for PCA-Like Linear Regression
Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail
Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.
6 0.54623061 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
7 0.53239071 154 nips-2011-Learning person-object interactions for action recognition in still images
8 0.51780117 227 nips-2011-Pylon Model for Semantic Segmentation
9 0.49649674 303 nips-2011-Video Annotation and Tracking with Active Learning
10 0.49559107 59 nips-2011-Composite Multiclass Losses
11 0.48647925 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
12 0.48580202 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
13 0.47601074 76 nips-2011-Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials
14 0.47553933 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
15 0.47282887 263 nips-2011-Sparse Manifold Clustering and Embedding
16 0.46746135 25 nips-2011-Adaptive Hedge
17 0.46463752 22 nips-2011-Active Ranking using Pairwise Comparisons
18 0.46463591 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
19 0.46381178 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
20 0.46333224 180 nips-2011-Multiple Instance Filtering