nips nips2013 nips2013-31 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Samory Kpotufe, Vikas Garg
Abstract: We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown Hölder-continuity of the regression function at x. The result holds with high probability simultaneously at all points x in a general metric space X of unknown structure.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown Hölder-continuity of the regression function at x. [sent-3, score-1.125]
2 The result holds with high probability simultaneously at all points x in a general metric space X of unknown structure. [sent-4, score-0.314]
3 tree-based, k-NN and kernel regression) where regression performance is particularly sensitive to how well the method is tuned to the unknown problem parameters. [sent-10, score-0.366]
4 In this work, we present an adaptive kernel regression procedure, i.e. [sent-11, score-0.304]
5 a procedure which self-tunes, optimally, to the unknown parameters of the problem at hand. [sent-13, score-0.134]
6 We consider regression on a general metric space X of unknown metric dimension, where the output Y is given as f (x) + noise. [sent-14, score-0.501]
7 We are interested in adaptivity at any input point x ∈ X : the algorithm must self-tune to the unknown local parameters of the problem at x. [sent-15, score-0.356]
8 [1, 2]), are (1) the unknown smoothness of f , and (2) the unknown intrinsic dimension, both defined over a neighborhood of x. [sent-18, score-0.474]
9 Existing results on adaptivity have typically treated these two problem parameters separately, resulting in methods that solve only part of the self-tuning problem. [sent-19, score-0.143]
10 In kernel regression, the main algorithmic parameter to tune is the bandwidth h of the kernel. [sent-20, score-0.339]
11 The problem of (local) bandwidth selection at a point x ∈ X has received considerable attention in both the theoretical and applied literature (see e. [sent-21, score-0.27]
12 In this paper we present the first method which provably adapts to both the unknown local intrinsic dimension and the unknown Hölder-continuity of the regression function f at any point x in a metric space of unknown structure. [sent-24, score-1.018]
13 The intrinsic dimension and Hölder-continuity are allowed to vary with x in the space, and the algorithm must thus choose the bandwidth h as a function of the query x, for all possible x ∈ X. [sent-25, score-0.586]
14 It is unclear how to extend global bandwidth selection methods such as cross-validation to the local bandwidth selection problem at x. [sent-26, score-0.718]
15 The main difficulty is that of evaluating the regression error at x, since the output Y at x is unobserved. [sent-27, score-0.147]
16 We do have the labeled training sample to guide us in selecting h(x), and we will show an approach that guarantees a regression rate optimal in terms of the local problem complexity at x. [sent-28, score-0.271]
17 In particular, such Lepski-style adaptive methods consist of monitoring the change in regression estimates f_{n,h}(x) as the bandwidth h is varied. [sent-32, score-0.476]
18 The selected estimate has to meet some stability criteria. [sent-33, score-0.076]
19 The stability criterion is designed to ensure that the selected f_{n,h}(x) is sufficiently close to a target estimate f_{n,h̃}(x) for a bandwidth h̃ known to yield an optimal regression rate. [sent-34, score-0.462]
20 These methods, however, are generally instantiated for regression in R, but extend to high-dimensional regression if the dimension of the input space X is known. [sent-35, score-0.462]
21 In this work, however, the dimension of X is unknown, and in fact X is allowed to be a general metric space with significantly less regularity than usual Euclidean spaces. [sent-36, score-0.286]
22 To adapt to local dimension we build on recent insights of [9] where a k-NN procedure is shown to adapt locally to intrinsic dimension. [sent-37, score-0.533]
23 The general idea for selecting k = k(x) is to balance surrogates of the unknown bias and variance of the estimate. [sent-38, score-0.324]
24 Since Lipschitz-continuity is a special case of Hölder-continuity, the work of [9] corresponds in the present context to knowing the smoothness of f everywhere. [sent-40, score-0.131]
25 In this work we do not assume knowledge of the smoothness of f, but simply that f is locally Hölder-continuous with unknown Hölder parameters. [sent-41, score-0.54]
26 If we knew the smoothness of f at x, then we could derive an approach for selecting h(x), similar to that of [9], by balancing the proper surrogates for the bias and variance of a kernel estimate. [sent-42, score-0.506]
27 Since we don't actually know the local smoothness of f, our approach, similar to Lepski's, is to monitor the change in estimates f_{n,h}(x) as h varies, and pick the estimate f_{n,ĥ}(x) which is deemed close to the hypothetical estimate f_{n,h̄}(x) under some stability condition. [sent-44, score-0.381]
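To make this monitoring idea concrete, the following is a minimal Python sketch of a Lepski-style stability rule of this general flavour; the interval half-widths, the stopping criterion, and the two callables estimate and half_width (standing in for f_{n,h}(x) and a multiple of the variance surrogate σ̂_h) are illustrative assumptions rather than the paper's exact procedure.

    def lepski_select(x, bandwidths, estimate, half_width):
        """Lepski-style selection sketch: scan h from small to large, form the interval
        D_h = [f_{n,h}(x) - half_width(x, h), f_{n,h}(x) + half_width(x, h)], and keep
        going while D_h intersects every interval built so far; return the last stable h."""
        chosen, intervals = None, []
        for h in sorted(bandwidths):
            f_h = estimate(x, h)              # stands in for f_{n,h}(x)
            w = half_width(x, h)              # stands in for a multiple of sigma_hat_h
            D_h = (f_h - w, f_h + w)
            if all(D_h[0] <= hi and lo <= D_h[1] for lo, hi in intervals):
                chosen = (h, f_h)             # h still passes the stability check
                intervals.append(D_h)
            else:
                break                         # first violation: stop scanning
        return chosen

A call such as lepski_select(x, H, estimate=lambda x, h: f_nh(x, X, Y, h), half_width=lambda x, h: 2.0 * sigma_hat(x, X, Y, h) ** 0.5) would wire this to the kernel estimate and variance surrogate sketched further below; both names are hypothetical helpers introduced here for illustration.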
28 We prove nearly optimal local rates Õ(λ^{2d/(2α+d)} n^{−2α/(2α+d)}) in terms of the local dimension d at any point x and Hölder parameters λ, α depending also on x. [sent-45, score-0.586]
29 Furthermore, the result holds with high probability, simultaneously at all x ∈ X, for n sufficiently large. [sent-46, score-0.078]
30 Note that we cannot union-bound over all x ∈ X, so the uniform result relies on proper conditioning on particular events in our variance bounds on estimates f_{n,h}(·). [sent-47, score-0.137]
31 1 Setup and Notation Distribution and sample We assume the input X belongs to a metric space (X , ρ) of bounded diameter ∆X ≥ 1. [sent-52, score-0.271]
32 The output Y belongs to a space Y of bounded diameter ∆Y . [sent-53, score-0.153]
33 We assume the regression function f(x) = E[Y|x] satisfies local Hölder assumptions: for every x ∈ X and r > 0, there exist λ, α > 0 depending on x and r, such that f is (λ, α)-Hölder at x on B(x, r): for all x' ∈ B(x, r), |f(x) − f(x')| ≤ λ ρ(x, x')^α. [sent-65, score-0.756]
34 We note that the α parameter is usually assumed to be in the interval (0, 1] for global definitions of Hölder continuity, since a global α > 1 implies that f is constant (for differentiable f). [sent-66, score-0.453]
35 For instance the function f(x) = x^α is clearly locally α-Hölder at x = 0 with constant λ = 1 for any α > 0. [sent-68, score-0.32]
36 With higher α = α(x), f gets flatter locally at x, and regression gets easier. [sent-69, score-0.21]
37 2 Notion of dimension We use the following notion of metric-dimension, also employed in [9]. [sent-70, score-0.178]
38 This notion extends some global notions of metric dimension to local regions of space. [sent-71, score-0.536]
39 Thus it allows for the intrinsic dimension of the data to vary over space. [sent-72, score-0.318]
40 As argued in [9] (see also [10] for a more general theory) it often coincides with other natural measures of dimension such as manifold dimension. [sent-73, score-0.174]
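The displayed definition itself appears to have been lost in extraction; as a hedged reconstruction (the exact constants and quantifiers are assumptions, based on the doubling-measure form used in [9]), the condition reads roughly:

    \mu \text{ is } (C, d)\text{-homogeneous on } B(x, r) \text{ if }\quad
    \mu\big(B(x', \epsilon r')\big) \;\ge\; C^{-1}\,\epsilon^{d}\,\mu\big(B(x', r')\big)
    \quad \text{for all } B(x', r') \subseteq B(x, r) \text{ and } \epsilon \in (0, 1].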
41 In the above definition, d will be viewed as the local dimension at x. [sent-78, score-0.234]
42 We will require a general upper-bound d0 on the local dimension d(x) over any x in the space. [sent-79, score-0.234]
43 This is defined below and can be viewed as the worst-case intrinsic dimension over regions of space. [sent-80, score-0.287]
44 the following holds for all x ∈ X and r > 0: if there exist C ≥ 1 and d ≥ 1 such that µ is (C, d)-homogeneous on B(x, r), then µ is (C0, d0)-homogeneous on B(x, r). [sent-84, score-0.068]
45 Thus, C0 , d0 can be viewed as global upper-bounds on the local homogeneity constants. [sent-86, score-0.208]
46 The metric entropy is relatively easy to estimate since it is a global quantity independent of any particular query x. [sent-90, score-0.201]
47 Finally we require that the local dimension is tight in small regions. [sent-91, score-0.234]
48 This last assumption extends (to local regions of space) the common assumption that µ has an upperbounded density (relative to Lebesgue). [sent-95, score-0.128]
49 2 Kernel Regression We consider a positive kernel K on [0, 1] highest at 0, decreasing on [0, 1], and 0 outside [0, 1]. [sent-98, score-0.1]
50 The kernel estimate is defined as follows: if B(x, h) ∩ X ≠ ∅, f_{n,h}(x) = Σ_i w_i(x) Y_i, where w_i(x) = K(ρ(x, X_i)/h) / Σ_j K(ρ(x, X_j)/h). [sent-99, score-0.406]
51 We set w_i(x) = 1/n, ∀i ∈ [n], if B(x, h) ∩ X = ∅. [sent-100, score-0.153]
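A minimal Python sketch of this estimator; the Euclidean stand-in for the metric ρ and the box kernel (the kernel used in the experiments below) are assumptions made for illustration.

    import numpy as np

    def kernel_weights(x, X, h):
        """Weights w_i(x) = K(rho(x, X_i)/h) / sum_j K(rho(x, X_j)/h) for a box kernel
        (K = 1 on [0, 1], 0 outside); uniform weights 1/n when B(x, h) contains no sample."""
        dist = np.linalg.norm(X - x, axis=1)   # Euclidean stand-in for rho(x, X_i)
        k = (dist <= h).astype(float)          # box-kernel values K(rho(x, X_i)/h)
        total = k.sum()
        if total == 0.0:                       # B(x, h) contains no sample point
            return np.full(len(X), 1.0 / len(X))
        return k / total

    def f_nh(x, X, Y, h):
        """Kernel regression estimate f_{n,h}(x) = sum_i w_i(x) Y_i."""
        return kernel_weights(x, X, h) @ Y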
52 (Global cover size) Let N_ρ(ε) denote an upper-bound on the size of the smallest ε-cover of (X, ρ). [sent-102, score-0.051]
53 We assume the global quantity N_ρ(ε) is known or pre-estimated. [sent-104, score-0.083]
54 Recall that, as discussed in Section 2, d0 can be picked to satisfy ln(N_ρ(ε)) = O(d0 log(Δ_X/ε)); in other words the procedure requires only knowledge of upper-bounds N_ρ(ε) on global cover sizes. [sent-105, score-0.179]
55 Fix ε = Δ_X/n. For any x ∈ X, the set of admissible bandwidths is given as Ĥ_x = { h = Δ_X/2^i, 0 ≤ i ≤ log(Δ_X/ε), h ≥ 16ε : μ_n(B(x, h/32)) ≥ 32 ln(N_ρ(ε/2)/δ)/n }. [sent-107, score-0.057]
56 Let C_{n,δ} ≥ (2K(0)/K(1)) (9 C_0 4^{d_0} + 4 ln(N_ρ(ε/2)/δ)), and define the variance surrogate σ̂_h² = C_{n,δ} Δ_Y² / (n μ_n(B(x, h))). [sent-108, score-0.136]
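A hedged sketch of how such an admissible set might be computed from empirical ball masses; the dyadic grid, the constants 16 and 32, and the log-cover term are read off the (partially garbled) display above and should be treated as assumptions.

    import numpy as np

    def admissible_bandwidths(x, X, diam_X, log_cover_term):
        """Candidate bandwidths: dyadic grid h = diam_X / 2^i, restricted to h >= 16*eps
        and to empirical mass mu_n(B(x, h/32)) >= 32 * log_cover_term / n."""
        n = len(X)
        eps = diam_X / n
        dist = np.linalg.norm(X - x, axis=1)                 # Euclidean stand-in for rho
        n_levels = int(np.ceil(np.log2(diam_X / eps))) + 1
        grid = [diam_X / 2.0 ** i for i in range(n_levels)]
        H_x = []
        for h in grid:
            if h < 16.0 * eps:
                continue
            mass = np.mean(dist <= h / 32.0)                 # mu_n(B(x, h/32))
            if mass >= 32.0 * log_cover_term / n:
                H_x.append(h)
        return H_x

Here log_cover_term stands in for ln(N_ρ(ε/2)/δ), which the procedure assumes known or pre-estimated.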
57 There exists N such that, for n > N , the following holds with probability at 6C0 least 1 − 2δ over the choice of (X, Y), simultaneously for all x ∈ X and all r satisfying rµ > r > rn 2 2 2d0 C0 ∆d0 X C2 λ2 1/(2α+d0 ) ∆2 Cn,δ Y n 1/(2α+d0 ) . [sent-113, score-0.111]
58 Let x ∈ X, and suppose f is (λ, α)-Hölder at x on B(x, r). [sent-114, score-0.291]
59 The result holds with high probability for all x ∈ X, and for all r_µ > r > r_n, where r_n → 0 as n → ∞. [sent-119, score-0.1]
60 Thus, as n grows, the procedure is eventually adaptive to the Hölder parameters in any neighborhood of x. [sent-120, score-0.409]
61 Note that the dimension d is the same for all r < rµ by definition of rµ . [sent-121, score-0.139]
62 As previously discussed, the definition of rµ corresponds to a requirement that the intrinsic dimension is tight in small enough regions. [sent-122, score-0.297]
63 Since the result holds simultaneously for all rµ > r > rn , the best tradeoff in terms of smoothness and size of r is achieved. [sent-126, score-0.242]
64 As previously mentioned, the main idea behind the proof is to introduce hypothetical bandwidths h̄ and h̃ which balance, respectively, σ̂_h and λ²h^{2α}, and O(Δ_Y²/(nh^d)) and λ²h^{2α} (see Figure 1). [sent-128, score-0.177]
65 In the figure, d and α are the unknown parameters in some neighborhood of point x. [sent-129, score-0.139]
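For intuition, balancing the second pair of curves, O(Δ_Y²/(nh^d)) against λ²h^{2α}, with constants suppressed (a standard calculation added here for illustration, not a step quoted from the proof), gives

    \frac{\Delta_Y^2}{n h^{d}} \;\asymp\; \lambda^2 h^{2\alpha}
    \;\;\Longrightarrow\;\;
    h \;\asymp\; \Big(\frac{\Delta_Y^2}{\lambda^2 n}\Big)^{1/(2\alpha + d)},
    \qquad
    \lambda^2 h^{2\alpha} \;\asymp\; \lambda^{2d/(2\alpha + d)} \Big(\frac{\Delta_Y^2}{n}\Big)^{2\alpha/(2\alpha + d)},

which recovers the adaptive rate Õ(λ^{2d/(2α+d)} n^{−2α/(2α+d)}) stated in the introduction, with the Δ_Y factor absorbed into the constants there.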
66 The first part of the proof consists in showing that the variance of the estimate using a bandwidth h is at most σ̂_h. [sent-130, score-0.304]
67 The argument is a bit more nuanced than just described above and in Figure 1: the respective curves O(Δ_Y²/(nh^d)) and λ²h^{2α} are changing with h, since dimension and smoothness at x depend on the size of the region considered. [sent-135, score-0.356]
68 [Figure 1 plot: NMSE versus training size (3000 to 10000), comparing Cross Validation and the Adaptive method.] Figure 1: (Left) The proof argues over h̄, h̃, which balance, respectively, σ̂_h and λ²h^{2α}, and O(Δ_Y²/(nh^d)) and λ²h^{2α}. [sent-137, score-0.076]
69 The estimate under the ĥ selected by the procedure is shown to be close to that of h̄, which in turn is shown to be close to that of h̃, which is of the right adaptive form. [sent-138, score-0.165]
70 (Right) Simulation results comparing the error of the proposed method to that of a global h selected by cross-validation. [sent-139, score-0.113]
71 X ⊂ R^70 has diameter 1, and is a collection of 3 disjoint flats (clusters) of dimension d1 = 2, d2 = 5, d3 = 10, and equal mass 1/3. [sent-141, score-0.203]
72 For the implementation of the proposed method, we set σ̂_h(x) = var̂_Y / (n µ_n(B(x, h))), where var̂_Y is the variance of Y on the training sample. [sent-146, score-0.065]
73 For both our method and cross-validation, we use a box-kernel, and we vary h on an equidistant 100-knot grid on the interval from the smallest to largest interpoint distance on the training sample. [sent-147, score-0.134]
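A sketch of this practical variant; the NumPy/SciPy helpers below and the Euclidean distance are assumptions made for illustration, mirroring the description above (box kernel, empirical variance of Y, 100-knot grid).

    import numpy as np
    from scipy.spatial.distance import pdist

    def sigma_hat(x, X, Y, h):
        """Practical variance surrogate: var(Y) / (n * mu_n(B(x, h)))."""
        n = len(X)
        mass = np.mean(np.linalg.norm(X - x, axis=1) <= h)   # mu_n(B(x, h))
        return np.var(Y) / max(n * mass, 1.0)                # guard against empty balls

    def bandwidth_grid(X, n_knots=100):
        """Equidistant grid from the smallest to the largest interpoint distance."""
        d = pdist(X)                                         # pairwise Euclidean distances
        return np.linspace(d.min(), d.max(), n_knots)

Together with the f_nh and lepski_select sketches given earlier, the local bandwidth at a query x would then be chosen over bandwidth_grid(X).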
74 For any x ∈ X and bandwidth h, define the expected regression estimate f̃_{n,h}(x) = Σ_i w_i(x) f(X_i). [sent-149, score-0.386]
75 |f_{n,h}(x) − f(x)|² ≤ 2 |f_{n,h}(x) − f̃_{n,h}(x)|² + 2 |f̃_{n,h}(x) − f(x)|². (1) The bias term above is easily bounded in a standard way. [sent-152, score-0.075]
76 Let x ∈ X, and suppose f is (λ, α)-Hölder at x on B(x, h). [sent-155, score-0.291]
77 We have |f̃_{n,h}(x) − f(x)| ≤ Σ_i w_i(x) |f(X_i) − f(x)| ≤ λh^α. [sent-158, score-0.153]
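Spelling out this step (an added derivation using only that the weights are nonnegative, sum to 1, and vanish whenever ρ(x, X_i) > h when B(x, h) contains sample points, so the Hölder condition applies to every X_i with nonzero weight):

    \big|\tilde f_{n,h}(x) - f(x)\big|
    \;=\; \Big|\sum_i w_i(x)\,\big(f(X_i) - f(x)\big)\Big|
    \;\le\; \sum_i w_i(x)\,\lambda\,\rho(x, X_i)^{\alpha}
    \;\le\; \lambda h^{\alpha}.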
78 The rest of this section is dedicated to the analysis of the variance term of (1). [sent-159, score-0.065]
79 We will need various supporting Lemmas relating the empirical mass of balls to their true mass. [sent-160, score-0.076]
80 1 Supporting Lemmas We often argue over the following distributional counterpart to Ĥ_x(ε). [sent-164, score-0.046]
81 Apply Bernstein's inequality followed by a union bound on Z and S_ε. [sent-174, score-0.076]
82 The following two lemmas result from the above Lemma 2. [sent-175, score-0.053]
83 Again, let Z be an ε/2-cover and define S_ε and γ_n as in Lemma 2. [sent-183, score-0.051]
84 2 Bound on the variance The following two results, Lemmas 5 and 6, serve to bound the variance of the kernel estimate. [sent-188, score-0.23]
85 The main result of this section is the variance bound of Lemma 7. [sent-190, score-0.065]
86 This last lemma bounds the variance term of (1) with high probability simultaneously for all x ∈ X and for values of h relevant to the algorithm. [sent-191, score-0.275]
87 For any x ∈ X and h > 0: E_{Y|X} |f_{n,h}(x) − f̃_{n,h}(x)|² ≤ Σ_i w_i²(x) Δ_Y². [sent-193, score-0.153]
88 Define C_{n,δ} = (2K(0)/K(1)) (9 C_0 4^{d_0} + 4 ln(N_ρ(ε/2)/δ)). With probability at least 1 − 3δ over the choice of (X, Y), simultaneously for all x ∈ X and h ∈ Ĥ_x, |f_{n,h}(x) − f̃_{n,h}(x)|² ≤ C_{n,δ} Δ_Y² / (n µ_n(B(x, h))). [sent-200, score-0.136]
89 We note that changing any Yi value changes φ(Yz,h ) by at most ∆Y wi (z). [sent-223, score-0.193]
90 Applying McDiarmid's inequality and taking a union bound over z ∈ Z and h ∈ H_z, we get P(∃z ∈ Z, ∃h ∈ S_ε : φ(Y_{z,h}) > Eφ(Y_{z,h}) + t) ≤ N_ρ(ε/2)² exp(−2t² / (Σ_i w_i²(z) Δ_Y²)). [sent-224, score-0.076]
91 We then have with probability at least 1 − 2δ, for all z ∈ Z and h ∈ H_z, |f_{n,h}(z) − f̃_{n,h}(z)|² ≤ 2 E_{Y|X} |f_{n,h}(z) − f̃_{n,h}(z)|² + 2 ln(N_ρ(ε/2)²/δ) · Δ_Y² · Σ_i w_i²(z) ≤ 4 ln(N_ρ(ε/2)/δ) · K(0) Δ_Y² / (K(1) · n µ_n(B(z, h))), where we apply Lemmas 5 and 6 for the last inequality. [sent-225, score-0.578]
92 By what we just argued, changing any Y_i makes ψ_h(z), the supremum of β_{n,h}(x, z) over x with ρ(x, z) ≤ ε/2, vary by at most Δ_Y. [sent-234, score-0.104]
93 To bound the above expectation for any z and h ∈ H_z, consider a sequence {x_i}_{i=1}^∞, x_i ∈ B(z, ε/2), such that ψ_h(x_i, z) → ψ_h(z) as i → ∞. [sent-236, score-0.09]
94 Since ψ_h(x_i, z) is bounded for all x_i ∈ B(z, ε), the Dominated Convergence Theorem yields E_{Y|X} ψ_h(z) = lim_{i→∞} E_{Y|X} ψ_h(x_i, z) ≤ 2Δ_Y √(K(0) / (n K(1) µ_n(B(z, h)))). [sent-239, score-0.121]
95 3 2 + 2 |φ(Yx,h ) − φ(Yz,h ))| 2 2∆2 Y · 9C0 4d0 + 4 ln (Nρ ( /2)/δ) . [sent-247, score-0.136]
96 As previously discussed, the main part of the argument consists of relating the error of f_{n,h̄}(x) to that of f_{n,h̃}(x), which is of the right form, for B(x, r) appropriately defined as in the theorem statement. [sent-249, score-0.08]
97 To relate the error of f_{n,ĥ}(x) to that of f_{n,h̄}(x), we employ a simple argument inspired by Lepski's adaptivity work. [sent-250, score-0.189]
98 Therefore, by Lemmas 1 and 7, for any h < h̄, |f_{n,h}(x) − f(x)|² ≤ 2σ̂_h, so the intervals D_h must all contain f(x) and therefore must intersect. [sent-252, score-0.089]
99 By the same argument, ĥ ≥ h̄, and D_{ĥ} and D_{h̄} must intersect. [sent-253, score-0.075]
100 Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. [sent-294, score-0.404]
wordName wordTfidf (topN-words)
[('hx', 0.448), ('lepski', 0.275), ('lder', 0.257), ('bandwidth', 0.239), ('ey', 0.187), ('dh', 0.172), ('hz', 0.17), ('lemma', 0.166), ('nhd', 0.157), ('wi', 0.153), ('cn', 0.152), ('regression', 0.147), ('adaptivity', 0.143), ('dimension', 0.139), ('ln', 0.136), ('smoothness', 0.131), ('yxi', 0.118), ('metric', 0.118), ('intrinsic', 0.115), ('kernel', 0.1), ('fix', 0.097), ('local', 0.095), ('nk', 0.095), ('xi', 0.09), ('unknown', 0.089), ('global', 0.083), ('hypothetical', 0.076), ('mcdiarmid', 0.069), ('samory', 0.069), ('variance', 0.065), ('vary', 0.064), ('diameter', 0.064), ('adapts', 0.064), ('locally', 0.063), ('bandwidths', 0.057), ('adaptive', 0.057), ('lemmas', 0.053), ('surrogates', 0.053), ('toyota', 0.053), ('cover', 0.051), ('cr', 0.051), ('technological', 0.051), ('neighborhood', 0.05), ('yi', 0.049), ('stability', 0.046), ('argument', 0.046), ('argue', 0.046), ('balancing', 0.045), ('procedure', 0.045), ('simultaneously', 0.044), ('continuity', 0.044), ('bias', 0.044), ('balance', 0.044), ('nonparametric', 0.044), ('requirement', 0.043), ('nition', 0.042), ('supporting', 0.042), ('notice', 0.04), ('statement', 0.04), ('changing', 0.04), ('proper', 0.039), ('notion', 0.039), ('union', 0.038), ('inequality', 0.038), ('adapt', 0.038), ('argued', 0.035), ('interpoint', 0.035), ('ats', 0.035), ('equidistant', 0.035), ('holder', 0.035), ('kohler', 0.035), ('krzyzak', 0.035), ('holds', 0.034), ('relating', 0.034), ('wj', 0.034), ('suppose', 0.034), ('rn', 0.033), ('estimates', 0.033), ('regions', 0.033), ('smoothing', 0.032), ('argues', 0.032), ('garg', 0.032), ('inhomogeneous', 0.032), ('mammen', 0.032), ('selection', 0.031), ('bounded', 0.031), ('intervals', 0.031), ('selected', 0.03), ('atter', 0.03), ('fh', 0.03), ('gyor', 0.03), ('homogeneity', 0.03), ('kpotufe', 0.03), ('nmse', 0.03), ('differentiable', 0.03), ('tuned', 0.03), ('space', 0.029), ('selecting', 0.029), ('belongs', 0.029), ('must', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
Author: Samory Kpotufe, Vikas Garg
Abstract: We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown Hölder-continuity of the regression function at x. The result holds with high probability simultaneously at all points x in a general metric space X of unknown structure.
2 0.14770578 269 nips-2013-Regression-tree Tuning in a Streaming Setting
Author: Samory Kpotufe, Francesco Orabona
Abstract: We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time O(log n) at any time step n while achieving a nearly-optimal regression rate of Õ(n^{−2/(2+d)}) in terms of the unknown metric dimension d. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting.
3 0.12235564 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
Author: Ilya O. Tolstikhin, Yevgeny Seldin
Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1
4 0.10016588 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty
Author: Haichao Zhang, David Wipf
Abstract: Typical blur from camera shake often deviates from the standard uniform convolutional assumption, in part because of problematic rotations which create greater blurring away from some unknown center point. Consequently, successful blind deconvolution for removing shake artifacts requires the estimation of a spatiallyvarying or non-uniform blur operator. Using ideas from Bayesian inference and convex analysis, this paper derives a simple non-uniform blind deblurring algorithm with a spatially-adaptive image penalty. Through an implicit normalization process, this penalty automatically adjust its shape based on the estimated degree of local blur and image structure such that regions with large blur or few prominent edges are discounted. Remaining regions with modest blur and revealing edges therefore dominate on average without explicitly incorporating structureselection heuristics. The algorithm can be implemented using an optimization strategy that is virtually tuning-parameter free and simpler than existing methods, and likely can be applied in other settings such as dictionary learning. Detailed theoretical analysis and empirical comparisons on real images serve as validation.
5 0.098885849 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
Author: Carl Doersch, Abhinav Gupta, Alexei A. Efros
Abstract: Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements, that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset. 1
6 0.096172757 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
7 0.090484396 224 nips-2013-On the Sample Complexity of Subspace Learning
8 0.083151184 91 nips-2013-Dirty Statistical Models
9 0.077340126 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
10 0.076898962 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
11 0.076747052 63 nips-2013-Cluster Trees on Manifolds
12 0.071449332 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
13 0.070973143 244 nips-2013-Parametric Task Learning
14 0.069693185 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
15 0.064567432 158 nips-2013-Learning Multiple Models via Regularized Weighting
16 0.064116955 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
17 0.063897304 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
18 0.058374215 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
19 0.058292698 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
20 0.058030669 137 nips-2013-High-Dimensional Gaussian Process Bandits
topicId topicWeight
[(0, 0.177), (1, 0.043), (2, 0.071), (3, -0.009), (4, 0.027), (5, 0.068), (6, -0.003), (7, 0.014), (8, -0.069), (9, 0.05), (10, -0.002), (11, 0.02), (12, -0.046), (13, -0.007), (14, 0.105), (15, 0.04), (16, 0.004), (17, 0.037), (18, -0.027), (19, -0.001), (20, 0.003), (21, 0.074), (22, 0.043), (23, 0.06), (24, -0.138), (25, -0.01), (26, -0.059), (27, -0.005), (28, -0.039), (29, -0.054), (30, -0.088), (31, -0.029), (32, -0.046), (33, -0.012), (34, 0.012), (35, 0.051), (36, -0.072), (37, -0.056), (38, 0.105), (39, -0.037), (40, 0.034), (41, -0.001), (42, -0.089), (43, -0.056), (44, -0.069), (45, -0.149), (46, 0.022), (47, 0.032), (48, 0.068), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.96251714 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
Author: Samory Kpotufe, Vikas Garg
Abstract: We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown Hölder-continuity of the regression function at x. The result holds with high probability simultaneously at all points x in a general metric space X of unknown structure.
2 0.71339595 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
Author: Po-Ling Loh, Martin J. Wainwright
Abstract: We establish theoretical results concerning local optima of regularized M estimators, where both loss and penalty functions are allowed to be nonconvex. Our results show that as long as the loss satisfies restricted strong convexity and the penalty satisfies suitable regularity conditions, any local optimum of the composite objective lies within statistical precision of the true parameter vector. Our theory covers a broad class of nonconvex objective functions, including corrected versions of the Lasso for errors-in-variables linear models and regression in generalized linear models using nonconvex regularizers such as SCAD and MCP. On the optimization side, we show that a simple adaptation of composite gradient descent may be used to compute a global optimum up to the statistical precision stat in log(1/ stat ) iterations, the fastest possible rate for any first-order method. We provide simulations to illustrate the sharpness of our theoretical predictions. 1
3 0.6809423 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
Author: Matus Telgarsky, Sanjoy Dasgupta
Abstract: Suppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with p 4 bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as mmin{ 1/4, 1/2+2/p} . The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of k-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for k-means instances possessing some cluster structure. 1
4 0.6694361 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
Author: Ilya O. Tolstikhin, Yevgeny Seldin
Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1
5 0.65675598 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions
Author: Suvrit Sra, Reshad Hosseini
Abstract: Hermitian positive definite (hpd) matrices recur throughout machine learning, statistics, and optimisation. This paper develops (conic) geometric optimisation on the cone of hpd matrices, which allows us to globally optimise a large class of nonconvex functions of hpd matrices. Specifically, we first use the Riemannian manifold structure of the hpd cone for studying functions that are nonconvex in the Euclidean sense but are geodesically convex (g-convex), hence globally optimisable. We then go beyond g-convexity, and exploit the conic geometry of hpd matrices to identify another class of functions that remain amenable to global optimisation without requiring g-convexity. We present key results that help recognise g-convexity and also the additional structure alluded to above. We illustrate our ideas by applying them to likelihood maximisation for a broad family of elliptically contoured distributions: for this maximisation, we derive novel, parameter free fixed-point algorithms. To our knowledge, ours are the most general results on geometric optimisation of hpd matrices known so far. Experiments show that advantages of using our fixed-point algorithms. 1
6 0.64894587 244 nips-2013-Parametric Task Learning
7 0.6338281 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
9 0.60243106 158 nips-2013-Learning Multiple Models via Regularized Weighting
10 0.5988583 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
11 0.59697104 76 nips-2013-Correlated random features for fast semi-supervised learning
12 0.59550768 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
13 0.58355933 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
14 0.57329005 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport
15 0.56660354 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty
16 0.56505936 63 nips-2013-Cluster Trees on Manifolds
17 0.53800011 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
18 0.53060496 256 nips-2013-Probabilistic Principal Geodesic Analysis
19 0.53014654 269 nips-2013-Regression-tree Tuning in a Streaming Setting
20 0.52981389 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
topicId topicWeight
[(16, 0.014), (33, 0.094), (34, 0.055), (56, 0.671), (85, 0.021), (89, 0.02), (93, 0.016), (95, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.99211258 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
Author: Samory Kpotufe, Vikas Garg
Abstract: We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown Hölder-continuity of the regression function at x. The result holds with high probability simultaneously at all points x in a general metric space X of unknown structure.
2 0.98428118 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
Author: Amit Daniely, Nati Linial, Shai Shalev-Shwartz
Abstract: The increased availability of data in recent years has led several authors to ask whether it is possible to use data as a computational resource. That is, if more data is available, beyond the sample complexity limit, is it possible to use the extra examples to speed up the computation time required to perform the learning task? We give the first positive answer to this question for a natural supervised learning problem — we consider agnostic PAC learning of halfspaces over 3-sparse vectors in {−1, 1, 0}n . This class is inefficiently learnable using O n/ 2 examples. Our main contribution is a novel, non-cryptographic, methodology for establishing computational-statistical gaps, which allows us to show that, under a widely believed assumption that refuting random 3CNF formulas is hard, it is impossible to efficiently learn this class using only O n/ 2 examples. We further show that under stronger hardness assumptions, even O n1.499 / 2 examples do not suffice. On the other hand, we show a new algorithm that learns this class efficiently ˜ using Ω n2 / 2 examples. This formally establishes the tradeoff between sample and computational complexity for a natural supervised learning problem. 1
3 0.96316803 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
Author: Franz Kiraly, Louis Theran
Abstract: We propose a general framework for reconstructing and denoising single entries of incomplete and noisy entries. We describe: effective algorithms for deciding if and entry can be reconstructed and, if so, for reconstructing and denoising it; and a priori bounds on the error of each entry, individually. In the noiseless case our algorithm is exact. For rank-one matrices, the new algorithm is fast, admits a highly-parallel implementation, and produces an error minimizing estimate that is qualitatively close to our theoretical and the state-of-the-art Nuclear Norm and OptSpace methods. 1
4 0.95280987 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
Author: Tzu-Kuo Huang, Jeff Schneider
Abstract: Learning dynamic models from observed data has been a central issue in many scientific studies or engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such Alzheimer’s, or certain biological processes. Existing methods for learning dynamic model from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method [2] to provably recover firstorder Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. Preliminary simulation results confirm our theoretical findings. 1
5 0.95186192 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
6 0.93138444 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces
7 0.91942936 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
8 0.90813553 269 nips-2013-Regression-tree Tuning in a Streaming Setting
9 0.89610463 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
10 0.86516976 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
11 0.85323524 42 nips-2013-Auditing: Active Learning with Outcome-Dependent Query Costs
12 0.85171604 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
13 0.82987309 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
14 0.82641655 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
15 0.82548314 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
16 0.82253343 66 nips-2013-Computing the Stationary Distribution Locally
17 0.81813031 230 nips-2013-Online Learning with Costly Features and Labels
18 0.80828279 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
19 0.8073054 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
20 0.79562259 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality