nips nips2013 nips2013-144 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Qichao Que, Mikhail Belkin
Abstract: We address the problem of estimating the ratio q/p where p is a density function and q is another density or, more generally, an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization, leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework.
Reference: text
sentIndex sentText sentNum sentScore
1 We address the problem of estimating the ratio q/p where p is a density function and q is another density or, more generally, an arbitrary function. [sent-3, score-0.292]
2 Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. [sent-4, score-0.182]
3 It is also closely related to the problem of covariate shift in transfer learning. [sent-5, score-0.284]
4 Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. [sent-6, score-0.424]
5 This formulation, combined with the techniques of regularization, leads to a principled framework for constructing algorithms and for analyzing them theoretically. [sent-7, score-0.078]
6 We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. [sent-9, score-0.253]
7 Model selection for unsupervised or semi-supervised inference is generally a difficult problem. [sent-10, score-0.083]
8 It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. [sent-11, score-0.333]
9 We show encouraging experimental results including applications to classification within the covariate shift framework. [sent-13, score-0.248]
10 1 Introduction. In this paper we address the problem of estimating the ratio of two functions, q(x)/p(x), where p is given by a sample and q(x) is either a known function or another probability density function given by a sample. [sent-14, score-0.34]
11 Recently there has been a significant amount of work on estimating the density ratio (also known as the importance function) from sampled data, e.g. [sent-16, score-0.395]
12 Many of these papers consider this problem in the context of the covariate shift assumption [19] or the so-called selection bias [27]. [sent-19, score-0.289]
13 The approach taken in our paper is based on reformulating the density ratio estimation as an integral equation, known as the Fredholm equation of the first kind, and solving it using the tools of regularization and Reproducing Kernel Hilbert Spaces. [sent-20, score-0.523]
14 That allows us to develop simple and flexible algorithms for density ratio estimation within the popular kernel learning framework. [sent-21, score-0.473]
15 The connection to the classical operator theory setting makes it easier to apply the standard tools of spectral and Fourier analysis to obtain theoretical results. [sent-22, score-0.118]
16 We start with the following simple equality underlying the importance sampling method: E_q(h(x)) = ∫ h(x) q(x) dx = ∫ h(x) (q(x)/p(x)) p(x) dx = E_p(h(x) q(x)/p(x)) (1). By replacing the function h(x) with a kernel k(x, y), we obtain Kp(q/p)(x) := ∫ k(x, y) (q(y)/p(y)) p(y) dy = ∫ k(x, y) q(y) dy =: Kq 1(x) (2). [sent-23, score-0.35]
17 Thinking of the function q(x)/p(x) as an unknown quantity and assuming that the right-hand side is known, this becomes a Fredholm integral equation. [sent-24, score-0.128]
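As a quick numerical illustration of the equality in Eq. 1 (our own sketch, not part of the paper; the choice of densities, test function h, and sample sizes below is arbitrary), one can check that the two expectations agree on a pair of one-dimensional Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_pdf(x):   # p = N(0, 1)
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):   # q = N(0.5, 0.8^2)
    return np.exp(-0.5 * ((x - 0.5) / 0.8) ** 2) / (0.8 * np.sqrt(2 * np.pi))

h = np.cos                              # any bounded test function
xp = rng.normal(0.0, 1.0, 200_000)      # sample from p
xq = rng.normal(0.5, 0.8, 200_000)      # sample from q

lhs = h(xq).mean()                                # E_q[h(x)]
rhs = (h(xp) * q_pdf(xp) / p_pdf(xp)).mean()      # E_p[h(x) q(x)/p(x)]
print(lhs, rhs)                                   # the two estimates agree up to Monte Carlo error
```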
18 Note that the right-hand side can be estimated given a sample from q while the operator on the left can be estimated using a sample from p. [sent-25, score-0.151]
19 To push this idea further, suppose kt(x, y) is a "local" kernel, e.g., the Gaussian. [sent-26, score-0.113]
20 Specifically, kt(x, y) = (2πt)^{-d/2} exp(-||x − y||^2 / (2t)), normalized so that ∫_{Rd} kt(x, y) dx = 1. [sent-28, score-0.226]
21 When we use δ-kernels, like the Gaussian, and f satisfies some smoothness conditions, we have ∫_{Rd} kt(x, y) f(x) dx = f(y) + O(t) (see [24]). [sent-29, score-0.113]
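This δ-kernel property can be checked numerically; the following sketch (ours, with an arbitrary smooth f and a few bandwidths t) shows the error of the Gaussian convolution shrinking roughly linearly in t:

```python
import numpy as np

def k_t(x, y, t):
    """One-dimensional Gaussian delta-kernel, normalized to integrate to 1 in x."""
    return np.exp(-(x - y) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

f = np.sin                                  # a smooth test function
y = 0.3
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]

for t in (0.1, 0.01, 0.001):
    conv = np.sum(k_t(x, y, t) * f(x)) * dx     # Riemann sum for the integral of k_t(x, y) f(x) dx
    print(t, conv - f(y))                       # the error shrinks roughly like O(t)
```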
22 Thus we get another (approximate) integral equality: Kt,p(q/p)(y) := ∫_{Rd} kt(x, y) (q(x)/p(x)) p(x) dx ≈ q(y), which becomes an integral equation for q(x)/p(x). [sent-31, score-0.4]
23 We address these inverse problems by formulating them within the classical framework of Tikhonov-Phillips regularization with the penalty term corresponding to the norm of the function in the Reproducing Kernel Hilbert Space H with kernel kH used in many machine learning algorithms. [sent-33, score-0.489]
24 [Type I]: q/p ≈ arg min_{f∈H} ||Kp f − Kq 1||^2_{2,p} + λ||f||^2_H; [Type II]: q/p ≈ arg min_{f∈H} ||Kt,p f − q||^2_{2,p} + λ||f||^2_H. Importantly, given a sample x_1, ..., x_n from p, [sent-34, score-0.148]
25 the integral operator Kp f applied to a function f can be approximated by the corresponding discrete sum Kp f(x) ≈ (1/n) Σ_i f(x_i) k(x_i, x), while the L_{2,p} norm is approximated by an average: ||f||^2_{2,p} ≈ (1/n) Σ_i f(x_i)^2. [sent-37, score-0.293]
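A minimal sketch of these plug-in approximations (our illustration; the kernel bandwidth and test function are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)                    # x_1, ..., x_n drawn from p = N(0, 1)
f = lambda z: np.exp(-z ** 2)               # any sufficiently smooth test function

# Empirical L_{2,p} norm:  ||f||^2_{2,p} ≈ (1/n) sum_i f(x_i)^2
norm_sq = np.mean(f(x) ** 2)

# Empirical integral operator at a point y:  (Kp f)(y) ≈ (1/n) sum_i k(x_i, y) f(x_i)
t = 0.5
def k(a, b):
    return np.exp(-(a - b) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

y = 0.0
Kpf_y = np.mean(k(x, y) * f(x))
print(norm_sq, Kpf_y)
```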
26 We see that the Type I formulation is useful when q is a density and samples from both p and q are available, while the Type II is useful when the values of q (which does not have to be a density function at all) are known at the data points sampled from p. [sent-39, score-0.438]
27 Norms and loss functions other than L_{2,p} can also be used in our setting as long as they can be approximated from a sample using function evaluations. [sent-43, score-0.117]
28 Perhaps the most interesting is the L_{2,q} norm, available in the Type I setting when a sample from the probability distribution q is available. [sent-45, score-0.126]
29 In fact, given a sample from both p and q we can use the combined empirical norm γ · L2,p + (1 − γ) · L2,q . [sent-46, score-0.126]
30 Optimization using those norms leads to some interesting kernel algorithms described in Section 2. [sent-47, score-0.269]
31 We note that the solution is still a linear combination of kernel functions centered on the sample from p and can still be written explicitly. [sent-48, score-0.313]
32 In the Type I formulation, if the kernels k(x, y) and kH(x, y) coincide, it is possible to use the RKHS norm ||·||_H instead of L_{2,p}. [sent-50, score-0.135]
33 Since we are dealing with a classical inverse problem for integral operators, our formulation allows for theoretical analysis using the methods of spectral theory. [sent-52, score-0.309]
34 The standard kernel density estimator for q divided by a thresholded kernel density estimator for p. [sent-56, score-0.813]
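This baseline (presumably the "TIKDE" estimator referenced in the experiments) can be sketched in a few lines; the bandwidth t and threshold eps below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def gaussian_kde(data, x, t):
    """Gaussian kernel density estimate at query points x from a 1-d sample `data`."""
    d = x[:, None] - data[None, :]
    return np.exp(-d ** 2 / (2 * t)).mean(axis=1) / np.sqrt(2 * np.pi * t)

def tikde_ratio(xp, xq, x, t=0.1, eps=1e-2):
    """Thresholded-KDE ratio estimate: a KDE of q divided by max(KDE of p, eps)."""
    q_hat = gaussian_kde(np.asarray(xq, float), np.asarray(x, float), t)
    p_hat = np.maximum(gaussian_kde(np.asarray(xp, float), np.asarray(x, float), t), eps)
    return q_hat / p_hat
```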
35 We provide a formulation of estimating the density ratio (importance function) as a classical inverse problem, known as the Fredholm equation, establishing a connection to the methods of classical analysis. [sent-58, score-0.476]
36 The underlying idea is to “linearize” the properties of the density by studying an associated integral operator. [sent-59, score-0.292]
37 To solve the resulting inverse problems we apply regularization with an RKHS norm penalty. [sent-61, score-0.233]
38 This provides a flexible and principled framework, with a variety of different norms and regularization techniques available. [sent-62, score-0.124]
39 It separates the underlying inverse problem from the necessary regularization and leads to a family of very simple and direct algorithms within the kernel learning framework in machine learning. [sent-63, score-0.378]
40 Recently the problem of density ratio estimation has received significant attention due in part to the increased interest in transfer learning [15] and, in particular, to the form of transfer learning known as covariate shift [19]. [sent-71, score-0.57]
41 To give a brief summary, given the feature space X and the label space Y, two probability distributions p and q on X × Y satisfy the covariate shift assumption if for all x, y, p(y|x) = q(y|x). [sent-72, score-0.145]
42 It is easy to see that training a classifier to minimize the error for q, given a sample from p, requires estimating the ratio of the marginal distributions qX(x)/pX(x). [sent-73, score-0.176]
43 The work on covariate shift, density ratio estimation and related settings includes [27, 2, 6, 10, 22, 9, 23, 14, 7]. [sent-74, score-0.423]
44 The quantity on the right can be estimated given a sample from p and a sample from q and the minimization becomes a quadratic optimization problem over the values of β at the points sampled from p. [sent-78, score-0.193]
45 Recalling that Φ(x) = kH(x, ·), we see that the equality E_q(Φ(x)) = E_p((q/p) Φ(x)) is equivalent to the integral equation (2). [sent-81, score-0.19]
46 Thus the problem of KMM can be viewed within our Type I setting (see Remark 2 in the introduction), with an RKHS norm but a different optimization algorithm. [sent-83, score-0.106]
47 Another related recent algorithm is Least-Squares Importance Fitting (LSIF) [10], which attempts to estimate the density ratio by choosing a parametric linear family of functions and choosing a function from this family to minimize the L_{2,p} distance to the density ratio. [sent-87, score-0.456]
48 We note that our method for unsupervised parameter selection in Section 4 is related to their ideas. [sent-90, score-0.083]
49 We note that our methods are closely related to a large body of work on kernel methods in machine learning and statistical estimation (e.g. [sent-92, score-0.223]
50 [3, 20]) in the Tikhonov regularization or other regularization frameworks. [sent-97, score-0.156]
51 In particular, we note interesting methods for density estimation proposed in [12] and estimating the support of density through spectral regularization in [4], as well as robust density estimation using RKHS formulations [11] and conditional density [8]. [sent-98, score-0.806]
52 Among those works that provide theoretical analysis of algorithms for estimating density ratios, [14] establishes minimax rates for likelihood ratio estimation. [sent-100, score-0.292]
53 Another recent theoretical analysis of KMM in [26] contains bounds for the output of the corresponding integral operators. [sent-101, score-0.128]
54 As usual, the norm in the space of square-integrable functions with respect to a measure ρ is defined as follows: L_{2,ρ} = { f : ∫_Ω |f(x)|^2 dρ < ∞ }. [sent-104, score-0.12]
55 Given a kernel k(x, y) we define the operator Kρ by Kρ f(y) := ∫_Ω k(x, y) f(x) dρ(x). [sent-106, score-0.278]
56 We will use the notation Kt,ρ to explicitly refer to the parameter of the kernel function kt (x, y), when it is a δ-family. [sent-107, score-0.336]
57 We recall the key property of the kernel kH: for any f ∈ H, ⟨f, kH(x, ·)⟩_H = f(x). [sent-109, score-0.223]
58 The Representer Theorem allows us to write solutions to various optimization problems over H in terms of linear combinations of kernels supported on sample points (see [21] for an in-depth discussion of the RKHS theory and the issues related to learning). [sent-110, score-0.169]
59 Given a sample x_1, ..., x_n from p, one can approximate the L_{2,p} norm of a sufficiently smooth function f by ||f||^2_{2,p} ≈ (1/n) Σ_i |f(x_i)|^2, and similarly, the integral operator Kp f(x) ≈ (1/n) Σ_i k(x_i, x) f(x_i). [sent-114, score-0.293]
60 The Type II setting comes from the fact that Kt,p f(x) ≈ f(x) p(x) + O(t) for a "δ-function-like" kernel, and we keep t in the notation in that case. [sent-119, score-0.223]
61 These become integral equations for q/p, known as Fredholm equations of the first kind. [sent-121, score-0.128]
62 To provide a framework for solving these inverse problems, we apply the classical techniques of regularization combined with the RKHS norm popular in machine learning. [sent-123, score-0.266]
63 Our Type I formulation (Eq. 5), with the L_{2,p} norm, is as follows: [Type I]: fλ^I = arg min_{f∈H} ||Kp f − Kq 1||^2_{2,p} + λ||f||^2_H (5). Here H is an appropriate Reproducing Kernel Hilbert Space. [sent-125, score-0.128]
64 Given an i.i.d. sample from p, z^p = {x_i}_{i=1}^n, and an i.i.d. sample from q, z^q = {x'_j}_{j=1}^m (z for the combined sample), we can approximate the integral operators Kp and Kq by Kzp f(x) = (1/n) Σ_{x_i ∈ z^p} k(x_i, x) f(x_i) and Kzq f(x) = (1/m) Σ_{x'_j ∈ z^q} k(x'_j, x) f(x'_j). [sent-129, score-0.399]
65 Eq. 5 then becomes fλ,z^I = arg min_{f∈H} (1/n) Σ_{x_i ∈ z^p} ((Kzp f)(x_i) − (Kzq 1)(x_i))^2 + λ||f||^2_H (7). The first term of the optimization problem involves only evaluations of the function f at the points of the sample. [sent-131, score-0.143]
66 By the Representer Theorem, the solution has the form f(x) = Σ_{x_i ∈ z^p} v_i kH(x_i, x) (8), where the kernel matrices are defined as follows: (Kp,p)_{ij} = (1/n) k(x_i, x_j) and (KH)_{ij} = kH(x_i, x_j) for x_i, x_j ∈ z^p, and Kp,q is defined as (Kp,q)_{ij} = (1/m) k(x_i, x'_j) for x_i ∈ z^p and x'_j ∈ z^q. [sent-133, score-1.042]
67 If KH and Kp,p are the same kernel we simply have: v = (1/n) (Kp,p^3 + λI)^{-1} Kp,p Kp,q 1_{z^q}. [sent-134, score-0.223]
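To make the closed form above concrete, here is a minimal sketch of the Type I FIRE estimator for the special case kH = kt (Gaussian kernel), following the reconstruction of Eqs. 7-8 above; the helper names and the particular choices of bandwidth t and regularization λ are ours, not the paper's:

```python
import numpy as np

def gaussian_gram(a, b, t):
    """Gram matrix of the Gaussian kernel kt(x, y) = (2*pi*t)^(-d/2) exp(-||x-y||^2 / (2t))."""
    a = np.asarray(a, float).reshape(len(a), -1)
    b = np.asarray(b, float).reshape(len(b), -1)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    d = a.shape[1]
    return np.exp(-d2 / (2 * t)) / (2 * np.pi * t) ** (d / 2)

def fire_type1_fit(xp, xq, t, lam):
    """Type I FIRE with kH = kt: returns coefficients v of f(x) = sum_i v_i kt(x_i, x)."""
    n, m = len(xp), len(xq)
    Kpp = gaussian_gram(xp, xp, t) / n            # (Kp,p)_ij = kt(x_i, x_j) / n
    Kpq = gaussian_gram(xp, xq, t) / m            # (Kp,q)_ij = kt(x_i, x'_j) / m
    rhs = Kpp @ (Kpq @ np.ones(m))                # Kp,p Kp,q 1_{z^q}
    A = np.linalg.matrix_power(Kpp, 3) + lam * np.eye(n)
    return np.linalg.solve(A, rhs) / n            # v = (1/n)(Kp,p^3 + lam*I)^{-1} Kp,p Kp,q 1

def fire_eval(v, xp, x, t):
    """Evaluate the estimated ratio f(x) = sum_i v_i kt(x_i, x) at query points x."""
    return gaussian_gram(x, xp, t) @ v
```

A typical call would be v = fire_type1_fit(Xp, Xq, t, lam) followed by fire_eval(v, Xp, Xquery, t), with (t, lam) chosen by the CD-CV procedure described in Section 4.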
68 First we state the continuous version of the problem: fλ* = arg min_{f∈H} γ||Kp f − Kq 1||^2_{2,p} + (1 − γ)||Kp f − Kq 1||^2_{2,q} + λ||f||^2_H (9). Given a sample from p, z^p = {x_1, x_2, ...}. [sent-138, score-0.098]
69 Despite the loss function combining both samples, the solution is still a summation of kernels over the points in the sample from p. [sent-146, score-0.168]
70 In addition to using the RKHS norm as a regularization norm, we can also use it as a loss function: fλ* = arg min_{f∈H} ||Kp f − Kq 1||^2_{H'} + λ||f||^2_H. Here the Hilbert space H' must correspond to the kernel k and can potentially be different from the space H used for regularization. [sent-148, score-0.456]
71 The resulting solution (10) is again a sum over x_i ∈ z^p and is somewhat similar to our Type I formulation with the L_{2,p} norm. [sent-151, score-0.179]
72 We note the connection between this formulation of using the RKHS norm as a loss function and the KMM algorithm [9]. [sent-152, score-0.146]
73 In the Type II setting we assume that we have a sample z = {x_i}_{i=1}^n drawn from p and that we know the function values q(x_i) at the points of the sample. [sent-156, score-0.084]
74 The solution is again a sum over x_i ∈ z, where the kernel matrix K is defined by K_{ij} = (1/n) kt(x_i, x_j), (KH)_{ij} = kH(x_i, x_j), and q_i = q(x_i). [sent-158, score-0.636]
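A matching sketch for the Type II estimator, reusing the gaussian_gram helper from the Type I sketch above and again specialized to kH = kt; the closed form below comes from repeating the same representer-theorem calculation as in the Type I case, so treat its exact normalization as our assumption rather than a formula quoted from the paper:

```python
import numpy as np
# gaussian_gram(a, b, t) as defined in the Type I sketch above

def fire_type2_fit(xp, q_vals, t, lam):
    """Type II FIRE with kH = kt: q is known only through its values q(x_i) on the
    sample from p. Returns coefficients v with f(x) = sum_i v_i kt(x_i, x)."""
    n = len(xp)
    K = gaussian_gram(xp, xp, t) / n                      # K_ij = kt(x_i, x_j) / n
    A = np.linalg.matrix_power(K, 3) + lam * np.eye(n)
    return np.linalg.solve(A, K @ np.asarray(q_vals, float)) / n
```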
75 In the Type II setting q does not have to be a density function (i.e., it need not be nonnegative or integrate to one). [sent-161, score-0.164]
76 Several norms are available in the Type I setting, but only the L2,p norm is available for Type II. [sent-177, score-0.124]
77 To simplify the theoretical development, the integral operator and the RKHS H will correspond to the same Gaussian kernel kt (x, y). [sent-180, score-0.519]
78 Assumptions: The set Ω, where the density function p is defined, could be one of the following: (1) the whole Rd; (2) a compact smooth Riemannian sub-manifold M of dimension d in Rn. [sent-182, score-0.164]
79 Let p be a density function on Ω and q be a function satisfying the assumptions. [sent-207, score-0.164]
80 Note that learning q/p is unsupervised and our algorithms typically have two parameters: the kernel width t and the regularization parameter λ. [sent-234, score-0.343]
81 For a given function u, we have the following importance sampling equality (Eq. 1). [sent-237, score-0.127]
82 If f is an approximation of the true ratio q(x)/p(x), and X^p, X^q are samples from p, q respectively, we will have the following approximation to the previous equation: (1/n) Σ_{i=1}^n u(x_i^p) f(x_i^p) ≈ (1/m) Σ_{j=1}^m u(x_j^q). [sent-239, score-0.086]
83 However, there a similar measure is used to construct an approximation to the ratio q/p using functions u_1, ... [sent-246, score-0.128]
84 (e.g., linear functions), which are poorly suited as a basis for approximating the density ratio. [sent-252, score-0.164]
85 We will use the following two families of test functions for parameter selection: (1) sets of random linear functions u_i(x) = β^T x where β ∼ N(0, Id); (2) sets of random half-space indicator functions, u_i(x) = 1_{β^T x > 0}. [sent-253, score-0.084]
86 The range we use for the kernel width t is (t_0, 2t_0, ...). [sent-256, score-0.223]
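Putting the pieces of CD-CV together, a candidate estimate f (fitted for one (t, λ) pair) can be scored by comparing the two sides of the importance-sampling equality over random linear test functions; averaging the squared discrepancies, as below, is our reading of the description, not necessarily the paper's exact criterion:

```python
import numpy as np

def cdcv_score(f, xp, xq, n_test=50, seed=0):
    """Cross-Density CV score for a candidate ratio estimate f (lower is better).
    Compares (1/n) sum_i u(x_i^p) f(x_i^p) against (1/m) sum_j u(x_j^q) over
    random linear test functions u(x) = beta^T x with beta ~ N(0, Id)."""
    rng = np.random.default_rng(seed)
    Xp = np.asarray(xp, float).reshape(len(xp), -1)
    Xq = np.asarray(xq, float).reshape(len(xq), -1)
    fp = np.asarray(f(xp))                             # f evaluated on the sample from p
    betas = rng.normal(size=(n_test, Xp.shape[1]))
    lhs = (Xp @ betas.T * fp[:, None]).mean(axis=0)    # (1/n) sum_i u_l(x_i^p) f(x_i^p)
    rhs = (Xq @ betas.T).mean(axis=0)                  # (1/m) sum_j u_l(x_j^q)
    return np.mean((lhs - rhs) ** 2)
```

Model selection then amounts to evaluating cdcv_score over the grid of kernel widths and regularization parameters and keeping the minimizer.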
87 Given a data set {(x_1, y_1), ..., (x_n, y_n)} and denoting by P_i the probability of the i-th instance being chosen, we resample as follows: (1) Resampling using features (labels y_i are not used). [sent-270, score-0.138]
88 P_i = e^{(a⟨x_i, e_1⟩ − b)/σ_v} / (1 + e^{(a⟨x_i, e_1⟩ − b)/σ_v}), where a, b are the resampling parameters, e_1 is the first principal component, and σ_v is the standard deviation of the projections onto e_1. [sent-271, score-0.263]
89 This resampling method will be denoted by PCA(a, b). [sent-272, score-0.125]
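The PCA(a, b) resampling can be sketched directly from the formula for P_i above; whether the data are centered before projecting onto e_1 and whether instances are kept independently (Bernoulli draws) are our assumptions:

```python
import numpy as np

def pca_resample(X, a, b, seed=0):
    """Biased resampling PCA(a, b): keep instance i with probability
    P_i = sigmoid((a * <x_i, e1> - b) / sigma_v), where e1 is the first principal
    component and sigma_v the std of the projections onto e1."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    Xc = X - X.mean(axis=0)                      # centering is an assumption of this sketch
    e1 = np.linalg.svd(Xc, full_matrices=False)[2][0]   # first principal direction
    proj = Xc @ e1
    sigma_v = proj.std()
    z = (a * proj - b) / sigma_v
    P = 1.0 / (1.0 + np.exp(-z))                 # equals e^z / (1 + e^z)
    keep = rng.random(len(X)) < P                # independent keep/drop decisions (an assumption)
    return X[keep], keep                         # mask can be used to subset the labels as well
```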
90 One issue with this heuristic is that it could underestimate the ratio in regions with high density ratio, due to the uniform thresholding of the estimate of p. [sent-290, score-0.164]
91 We use different validation functions f^cv (columns) and error-measuring functions f^err (rows). [sent-295, score-0.142]
92 (3): random linear combinations of kernel functions centered at training data, f(x) = γ^T K where γ ∼ N(0, Id) and K_{ij} = k(x_i, x_j) for x_i from the training set; (4) kernel indicator functions. [sent-301, score-0.484]
93 Table 1: USPS data set with resampling using PCA(5, σv ) with |X p | = 500, |X q | = 1371. [sent-304, score-0.125]
94 CPUsmall, resampled by PCA(5, σ_v); Kin8nm, resampled by PCA(1, σ_v). [sent-403, score-0.098]
95 The left half of the table uses resampling PCA(5, σ_v). [sent-486, score-0.125]
96 Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. [sent-665, score-0.086]
97 Improving predictive inference under covariate shift by weighting the loglikelihood function. [sent-692, score-0.248]
98 Covariate shift adaptation by importance weighted cross validation. [sent-709, score-0.173]
99 Direct importance estimation with model selection and its application to covariate shift adaptation. [sent-712, score-0.359]
100 The effect of the input density distribution on kernel-based classifiers. [sent-721, score-0.164]
wordName wordTfidf (topN-words)
[('kh', 0.373), ('kmm', 0.234), ('kq', 0.234), ('lsif', 0.234), ('tikde', 0.234), ('kernel', 0.223), ('hs', 0.196), ('rkhs', 0.193), ('fredholm', 0.187), ('density', 0.164), ('kp', 0.156), ('covariate', 0.145), ('firep', 0.14), ('fireq', 0.14), ('xi', 0.138), ('fire', 0.133), ('integral', 0.128), ('resampling', 0.125), ('type', 0.118), ('kt', 0.113), ('shift', 0.103), ('kzp', 0.093), ('kzq', 0.093), ('representer', 0.089), ('ii', 0.086), ('ratio', 0.086), ('xp', 0.083), ('xj', 0.081), ('regularization', 0.078), ('norm', 0.078), ('inverse', 0.077), ('hilbert', 0.077), ('importance', 0.07), ('xq', 0.065), ('usps', 0.059), ('rd', 0.057), ('kernels', 0.057), ('tikhonov', 0.057), ('cpusmall', 0.057), ('pca', 0.057), ('operator', 0.055), ('euclidean', 0.055), ('ep', 0.055), ('gretton', 0.054), ('reproducing', 0.054), ('eq', 0.053), ('arg', 0.05), ('resampled', 0.049), ('dx', 0.049), ('sample', 0.048), ('equalities', 0.047), ('kliep', 0.047), ('uf', 0.047), ('xcv', 0.047), ('xtrain', 0.047), ('norms', 0.046), ('ij', 0.043), ('functions', 0.042), ('unsupervised', 0.042), ('estimating', 0.042), ('vito', 0.041), ('formulation', 0.041), ('selection', 0.041), ('belkin', 0.04), ('thresholded', 0.039), ('jcd', 0.038), ('operators', 0.037), ('transfer', 0.036), ('points', 0.036), ('reformulating', 0.036), ('rosasco', 0.036), ('lder', 0.034), ('sch', 0.034), ('sampled', 0.033), ('classical', 0.033), ('iis', 0.032), ('xn', 0.032), ('lq', 0.031), ('vi', 0.031), ('equality', 0.031), ('equation', 0.031), ('spectral', 0.03), ('err', 0.03), ('gr', 0.03), ('ul', 0.03), ('classi', 0.03), ('jmlr', 0.03), ('concentration', 0.03), ('borgwardt', 0.029), ('nx', 0.029), ('labeled', 0.029), ('kij', 0.028), ('cv', 0.028), ('id', 0.028), ('settings', 0.028), ('optimization', 0.028), ('sugiyama', 0.028), ('dy', 0.027), ('loss', 0.027), ('sampling', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
Author: Qichao Que, Mikhail Belkin
Abstract: We address the problem of estimating the ratio q/p where p is a density function and q is another density or, more generally, an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization, leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1
2 0.18787648 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar
Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n^2 p). Our algorithm Subsampled Randomized Hadamard Transform-Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. 1
3 0.18119504 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
Author: Xinhua Zhang, Wee Sun Lee, Yee Whye Teh
Abstract: Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are compactly encoded as linear functionals whose value are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. 1
4 0.15619323 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
Author: Le Song, Bo Dai
Abstract: Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernels embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. 1
5 0.14840332 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1
6 0.14067881 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
7 0.13198471 173 nips-2013-Least Informative Dimensions
8 0.10778508 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
9 0.099977195 137 nips-2013-High-Dimensional Gaussian Process Bandits
10 0.099577926 224 nips-2013-On the Sample Complexity of Subspace Learning
11 0.096389711 9 nips-2013-A Kernel Test for Three-Variable Interactions
12 0.091452919 289 nips-2013-Scalable kernels for graphs with continuous attributes
13 0.090349607 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
14 0.08871223 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
15 0.080331519 63 nips-2013-Cluster Trees on Manifolds
16 0.076210149 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
17 0.073908828 91 nips-2013-Dirty Statistical Models
18 0.073317274 75 nips-2013-Convex Two-Layer Modeling
19 0.070419453 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
20 0.069693185 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
topicId topicWeight
[(0, 0.205), (1, 0.089), (2, 0.055), (3, 0.037), (4, 0.026), (5, 0.081), (6, 0.014), (7, 0.05), (8, -0.109), (9, 0.037), (10, -0.038), (11, -0.013), (12, -0.099), (13, -0.039), (14, 0.252), (15, 0.14), (16, -0.024), (17, 0.058), (18, -0.052), (19, -0.113), (20, 0.008), (21, 0.02), (22, 0.058), (23, 0.035), (24, -0.08), (25, 0.04), (26, 0.074), (27, -0.005), (28, 0.066), (29, 0.038), (30, 0.084), (31, 0.005), (32, 0.076), (33, -0.03), (34, -0.025), (35, -0.127), (36, -0.059), (37, -0.034), (38, -0.028), (39, -0.037), (40, 0.007), (41, -0.051), (42, -0.014), (43, 0.014), (44, -0.139), (45, -0.015), (46, -0.04), (47, -0.047), (48, 0.076), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.94156051 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
Author: Qichao Que, Mikhail Belkin
Abstract: We address the problem of estimating the ratio q/p where p is a density function and q is another density or, more generally, an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization, leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1
2 0.80428344 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1
3 0.80347162 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
Author: Corinna Cortes, Marius Kloft, Mehryar Mohri
Abstract: We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks. 1
4 0.80162776 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
Author: Xinhua Zhang, Wee Sun Lee, Yee Whye Teh
Abstract: Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are compactly encoded as linear functionals whose value are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. 1
5 0.76298732 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
Author: Michel Besserve, Nikos K. Logothetis, Bernhard Schölkopf
Abstract: Many applications require the analysis of complex interactions between time series. These interactions can be non-linear and involve vector valued as well as complex data structures such as graphs or strings. Here we provide a general framework for the statistical analysis of these dependencies when random variables are sampled from stationary time-series of arbitrary objects. To achieve this goal, we study the properties of the Kernel Cross-Spectral Density (KCSD) operator induced by positive definite kernels on arbitrary input domains. This framework enables us to develop an independence test between time series, as well as a similarity measure to compare different types of coupling. The performance of our test is compared to the HSIC test using i.i.d. assumptions, showing improvements in terms of detection errors, as well as the suitability of this approach for testing dependency in complex dynamical systems. This similarity measure enables us to identify different types of interactions in electrophysiological neural time series. 1
6 0.71964663 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
7 0.70845002 9 nips-2013-A Kernel Test for Three-Variable Interactions
8 0.69171125 76 nips-2013-Correlated random features for fast semi-supervised learning
9 0.62542862 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
10 0.62526375 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
11 0.6035012 173 nips-2013-Least Informative Dimensions
12 0.56611007 327 nips-2013-The Randomized Dependence Coefficient
13 0.53939813 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
14 0.51515698 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
15 0.51160121 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
16 0.49111143 244 nips-2013-Parametric Task Learning
17 0.49035901 289 nips-2013-Scalable kernels for graphs with continuous attributes
18 0.48321903 224 nips-2013-On the Sample Complexity of Subspace Learning
19 0.4825789 256 nips-2013-Probabilistic Principal Geodesic Analysis
20 0.47425553 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
topicId topicWeight
[(13, 0.01), (16, 0.026), (19, 0.034), (33, 0.141), (34, 0.165), (41, 0.021), (49, 0.031), (56, 0.113), (70, 0.019), (78, 0.251), (85, 0.042), (89, 0.036), (93, 0.031), (95, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.7983458 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
Author: Qichao Que, Mikhail Belkin
Abstract: We address the problem of estimating the ratio q/p where p is a density function and q is another density or, more generally, an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization, leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1
2 0.77550822 63 nips-2013-Cluster Trees on Manifolds
Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman
Abstract: unkown-abstract
3 0.71538818 230 nips-2013-Online Learning with Costly Features and Labels
Author: Navid Zolghadr, Gabor Bartok, Russell Greiner, András György, Csaba Szepesvari
Abstract: This paper introduces the online probing problem: In each round, the learner is able to purchase the values of a subset of feature values. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying to see the loss function that he is evaluated against. Either way, the learner pays for both the errors of his predictions and also whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether the learner can observe the label for free or not. We provide algorithms and upper and lower bounds on the regret for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem. 1
4 0.70132184 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stockl, January Grewe, January Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
5 0.69694155 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1
6 0.69592881 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
7 0.6954211 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
8 0.69490105 201 nips-2013-Multi-Task Bayesian Optimization
9 0.69198269 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
10 0.6917643 348 nips-2013-Variational Policy Search via Trajectory Optimization
11 0.69078875 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
12 0.69075769 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
13 0.69056714 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
14 0.69042826 294 nips-2013-Similarity Component Analysis
15 0.69015902 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
16 0.68952161 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
17 0.68843067 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
18 0.68813145 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models
19 0.68754292 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
20 0.68664569 143 nips-2013-Integrated Non-Factorized Variational Inference