jmlr jmlr2013 jmlr2013-57 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kenji Fukumizu, Le Song, Arthur Gretton
Abstract: A kernel method for realizing Bayes’ rule is proposed, based on representations of probabilities in reproducing kernel Hilbert spaces. Probabilities are uniquely characterized by the mean of the canonical map to the RKHS. The prior and conditional probabilities are expressed in terms of RKHS functions of an empirical sample: no explicit parametric model is needed for these quantities. The posterior is likewise an RKHS mean of a weighted sample. The estimator for the expectation of a function of the posterior is derived, and rates of consistency are shown. Some representative applications of the kernel Bayes’ rule are presented, including Bayesian computation without likelihood and filtering with a nonparametric state-space model. Keywords: kernel method, Bayes’ rule, reproducing kernel Hilbert space
Reference: text
sentIndex sentText sentNum sentScore
1 A kernel method for realizing Bayes’ rule is proposed, based on representations of probabilities in reproducing kernel Hilbert spaces. [sent-7, score-0.249]
2 The estimator for the expectation of a function of the posterior is derived, and rates of consistency are shown. [sent-11, score-0.108]
3 Some representative applications of the kernel Bayes’ rule are presented, including Bayesian computation without likelihood and filtering with a nonparametric state-space model. [sent-12, score-0.17]
4 Keywords: kernel method, Bayes’ rule, reproducing kernel Hilbert space 1. [sent-13, score-0.211]
5 Introduction Kernel methods have long provided powerful tools for generalizing linear statistical approaches to nonlinear settings, through an embedding of the sample to a high dimensional feature space, namely a reproducing kernel Hilbert space (RKHS) (Schölkopf and Smola, 2002). [sent-14, score-0.135]
6 Examples include support vector machines, kernel PCA, and kernel CCA, among others. [sent-15, score-0.18]
7 In these cases, data are mapped via a canonical feature map to a reproducing kernel Hilbert space (of high or even infinite dimension), in which the linear operations that define the algorithms are implemented. [sent-16, score-0.121]
8 The inner product between feature mappings need never be computed explicitly, but is given by a positive definite kernel function unique to the RKHS: this permits efficient computation without the need to deal explicitly with the feature representation. [sent-17, score-0.09]
9 kernel means of the underlying random variables. [sent-22, score-0.09]
10 With an appropriate choice of positive definite kernel, the kernel mean on the RKHS uniquely determines the distribution of the variable (Fukumizu et al., 2010). [sent-23, score-0.108]
11 Statistical inference problems on distributions can then be solved via operations on the kernel means. [sent-25, score-0.113]
12 In this paper, we propose a novel, nonparametric approach to Bayesian inference, making use of kernel means of probabilities. [sent-34, score-0.115]
13 Our main result is a nonparametric estimate of the posterior kernel mean, given kernel mean representations of the prior and likelihood. [sent-36, score-0.3]
14 A valuable property of the kernel Bayes’ rule is that the kernel posterior mean is estimated nonparametrically from data. [sent-38, score-0.294]
15 The present kernel approach provides an alternative strategy for Bayesian inference in these settings. [sent-47, score-0.113]
16 We demonstrate consistency for our posterior kernel mean estimate, and derive convergence rates for the expectation of functions computed using this estimate. [sent-48, score-0.186]
17 An alternative to the kernel mean representation would be to use nonparametric density estimates for the posterior. [sent-49, score-0.151]
18 Classical approaches include kernel density estimation (KDE) or distribution estimation on a finite partition of the domain. [sent-50, score-0.108]
19 By contrast, the proposed kernel mean representation is defined as an integral or moment of the distribution, taking the form of a function in an RKHS. [sent-53, score-0.108]
20 By contrast, the kernel mean has a straightforward empirical estimate, and conditioning and marginalization can be implemented easily, at a reasonable computational cost. [sent-58, score-0.108]
21 In this earlier work, a heuristic approximation was used, where the kernel mean of the new hidden state was estimated by adding kernel mean estimates from the previous hidden state and the observation. [sent-61, score-0.25]
22 We begin in Section 2 with a review of RKHS terminology and of kernel mean embeddings. [sent-65, score-0.108]
23 In Section 3, we derive an expression for Bayes’ rule in terms of kernel means, and provide consistency guarantees. [sent-66, score-0.148]
24 We apply the kernel Bayes’ rule in Section 4 to various inference problems, with numerical results and comparisons with existing methods in Section 5. [sent-67, score-0.151]
25 We begin with a review of positive definite kernels, and of statistics on the associated reproducing kernel Hilbert spaces (Aronszajn, 1950; Berlinet and Thomas-Agnan, 2004; Fukumizu et al., 2004). [sent-73, score-0.121]
26 Given a set Ω, a (R-valued) positive definite kernel k on Ω is a symmetric kernel k : Ω × Ω → R such that ∑_{i,j=1}^n c_i c_j k(x_i, x_j) ≥ 0 for an arbitrary number of points x_1, . . . , x_n in Ω and real numbers c_1, . . . , c_n. [sent-75, score-0.18]
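As an illustrative check of this definition (a minimal sketch; the synthetic points, coefficients, and the Gaussian kernel below are assumptions for illustration, not taken from the paper), the Gram matrix of a positive definite kernel always yields a nonnegative quadratic form:

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 arbitrary points in R^3
c = rng.normal(size=20)               # arbitrary real coefficients
G = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
print(c @ G @ c >= -1e-10)            # sum_{i,j} c_i c_j k(x_i, x_j) >= 0
print(np.min(np.linalg.eigvalsh(G)) >= -1e-10)   # equivalently, all eigenvalues >= 0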
27 The Hilbert space H is called the reproducing kernel Hilbert space (RKHS) associated with k, since the function kx = k(·, x) serves as the reproducing kernel: ⟨f, kx⟩ = f(x) for f ∈ H. [sent-84, score-0.632]
28 A positive definite kernel on Ω is said to be bounded if there is M > 0 such that k(x, x) ≤ M for any x ∈ Ω. [sent-85, score-0.09]
29 Let (X, BX) be a measurable space, X be a random variable taking values in X with distribution PX, and k be a measurable positive definite kernel on X such that E[√k(X, X)] < ∞. [sent-86, score-0.128]
30 The kernel mean mX (also written mPX) of X on the RKHS H is defined as the mean E[k(·, X)] of the H-valued random variable k(·, X). [sent-88, score-0.126]
31 The existence of the kernel mean is guaranteed by E[‖k(·, X)‖] = E[√k(X, X)] < ∞. [sent-89, score-0.108]
32 By the reproducing property, the kernel mean satisfies the relation ⟨f, mX⟩ = E[f(X)] (Equation 2) for any f ∈ H. [sent-91, score-0.139]
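A minimal numerical sketch of the kernel mean and the reproducing-property relation above (the synthetic data, the Gaussian kernel, and the RKHS function built from a few centers are all illustrative assumptions):

import numpy as np

def gram(A, B, sigma=1.0):
    # Gaussian Gram matrix: exp(-||a_i - b_j||^2 / (2 sigma^2))
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 1))              # sample from P_X
Z = np.array([[-1.0], [0.0], [1.0]])     # centers of an RKHS function f = sum_j a_j k(., Z_j)
a = np.array([0.5, -1.0, 2.0])

# by the reproducing property, <f, m_X> = (1/n) sum_i f(X_i)
inner = a @ gram(Z, X) @ np.full(n, 1.0 / n)
# Monte Carlo value of E[f(X)] from a much larger independent sample
X_big = rng.normal(size=(200000, 1))
print(inner, (a @ gram(Z, X_big)).mean())   # the two estimates should be close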
33 The kernel mean mX is also denoted by mPX , as it depends only on the distribution PX with k fixed. [sent-93, score-0.108]
34 This operator CY X can be identified with m(Y X) in the product space HY ⊗ HX , which is given by the product kernel kY kX on Y × X (Aronszajn, 1950), by the standard identification between the linear maps and the tensor product. [sent-96, score-0.112]
35 A bounded measurable positive definite kernel k on a measurable space (Ω, B) is called characteristic if the mapping from a probability Q on (Ω, B) to the kernel mean mQ ∈ H is injective (Fukumizu et al.). [sent-100, score-0.261]
36 This is equivalent to assuming that EX∼P [k(·, X)] = EX ′ ∼Q [k(·, X ′ )] implies P = Q: probabilities are uniquely determined by their kernel means on the associated RKHS. [sent-103, score-0.09]
37 With this property, problems of statistical inference can be cast as inference on the kernel means. [sent-104, score-0.136]
38 A popular example of a characteristic kernel defined on Euclidean space is the Gaussian RBF kernel k(x, y) = exp(−‖x − y‖^2/(2σ^2)). [sent-105, score-0.205]
39 A bounded measurable positive definite kernel on a measurable space (Ω, B ) with corresponding RKHS H is characteristic if and only if H + R is dense in L2 (P) for arbitrary probability P on (Ω, B ), where H + R is the direct sum of two RKHSs H and R (Aronszajn, 1950). [sent-106, score-0.153]
40 This implies that the RKHS defined by a characteristic kernel is rich enough to be dense in L2 space up to the constant functions. [sent-107, score-0.115]
41 Other useful conditions for a kernel to be characteristic can be found in Sriperumbudur et al. [sent-108, score-0.115]
42 Given an i.i.d. sample (X1, Y1), . . . , (Xn, Yn) with law P, the empirical estimators of the kernel mean and covariance operator are given straightforwardly by mX^(n) = (1/n) ∑_{i=1}^n kX(·, Xi) and CYX^(n) = (1/n) ∑_{i=1}^n kY(·, Yi) ⊗ kX(·, Xi), where CYX^(n) is written in tensor form. [sent-120, score-0.147]
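A small sketch of how these Gram-matrix estimators are used (illustrative data; the function f below is an assumed RKHS expansion): applying the empirical covariance operator CYX to f gives the expansion (1/n) ∑_i f(Xi) kY(·, Yi), represented simply by a coefficient vector over the sample {Yi}:

import numpy as np

def gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(n, 1))    # paired sample (X_i, Y_i)

Z = np.array([[-1.0], [1.0]])                    # f = sum_j a_j k_X(., Z_j)
a = np.array([1.0, -0.5])
f_at_X = a @ gram(Z, X)                          # (f(X_1), ..., f(X_n))

coeffs = f_at_X / n                              # C_YX f = sum_i coeffs_i k_Y(., Y_i)
y0 = np.array([[0.3]])
print(coeffs @ gram(Y, y0))                      # (C_YX f) evaluated at y0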
43 Kernel Expression of Bayes’ Rule We review Bayes’ rule and the notion of kernel conditional mean embeddings in Section 3. [sent-131, score-0.195]
44 We provide consistency results for the empirical estimators of the conditional mean embedding for the posterior in Section 3. [sent-134, score-0.134]
45 Kernel Bayes’ Rule We first review Bayes’ rule in a general form without using density functions, since the kernel Bayes’ rule can be applied to situations where density functions are not available. [sent-137, score-0.202]
46 In kernel methods, however, the information on the relation between variables is expressed by covariance, which leads to finite sample estimates in terms of Gram matrices, as we see below. [sent-161, score-0.09]
47 The goal of this subsection is to derive an estimator of the kernel mean of the posterior mQX|y. [sent-164, score-0.196]
48 Using Theorem 1, we have the following result, which expresses the kernel mean of QY , and implements the Sum Rule in terms of mean embeddings. [sent-169, score-0.126]
49 Theorem 2 (Song et al., 2009, Equation 6) Let mΠ and mQY be the kernel means of Π in HX and QY in HY, respectively. [sent-171, score-0.09]
50 Following Song et al. (2009), we can regard the operator CYX CXX^{-1} as the kernel expression of the conditional probability PY|x or p(y|x). [sent-176, score-0.136]
51 Then, E[g(Y)|X = x] is a constant function of x, which is known not to be included in an RKHS given by a Gaussian kernel (Steinwart and Christmann, 2008, Corollary 4.44). [sent-184, score-0.09]
52 If we stipulate that F* ∈ HY ⊗ HX, and regularize by the squared Hilbert-Schmidt norm of F*, we obtain the ridge regression problem argmin_{F* ∈ HY ⊗ HX} E‖kY(Y, ·) − F*[kX(X, ·)]‖_{HY}^2 + ε‖F*‖_{HS}^2, which has as its solution the regularized kernel conditional mean embedding. [sent-197, score-0.132]
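A hedged sketch of the resulting Gram-matrix formula for the regularized conditional mean embedding: for a conditioning point x, the weights are w(x) = (GX + nεI)^{-1} kX(x), and the point summary ∑_i w_i(x) Yi is ordinary kernel ridge regression (the regularization scaling nε and the toy data below are assumptions):

import numpy as np

def gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(3)
n, eps = 400, 0.01
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(n, 1))

x0 = np.array([[1.0]])
# weights of the estimated embedding of P(Y | X = x0): sum_i w_i k_Y(., Y_i)
w = np.linalg.solve(gram(X, X) + n * eps * np.eye(n), gram(X, x0)).ravel()
# point summary sum_i w_i Y_i (kernel ridge regression of Y on X)
print(w @ Y.ravel(), np.sin(1.0))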
53 This is the estimator proposed by Song et al. (2009, 2010a) as the kernel mean of the conditional probability. [sent-209, score-0.108]
54 Suppose CXX hx = kX(·, x) were to hold for some hx ∈ HX. [sent-211, score-0.566]
55 Taking the inner product with kX(·, x) would then imply kX(x, x) = ∫ hx(x′) kX(x, x′) dPX(x′), which is not possible for many popular kernels, including the Gaussian kernel. [sent-212, score-0.283]
56 To derive a kernel realization of Bayes’ rule, suppose that we know the covariance operators CZW and CWW for the random variable (Z, W) ∼ Q, where Q is defined by Equation (5). [sent-214, score-0.126]
57 The conditional expectation E[kX(·, Z)|W = y] is then exactly the kernel mean of the posterior distribution for observation y ∈ Y. [sent-215, score-0.19]
58 Equation (9) gives the regularized approximation of the kernel mean of the posterior, CZW (CWW + δI)^{-1} kY(·, y) (Equation 10), where δ is a positive regularization constant. [sent-216, score-0.108]
59 This can be done by recalling that the kernel mean mQ = m(ZW) ∈ HX ⊗ HY can be identified with the covariance operator CZW : HY → HX by the standard identification of a tensor ∑_i fi ⊗ gi (fi ∈ HX and gi ∈ HY) and a Hilbert-Schmidt operator h ↦ ∑_i ⟨gi, h⟩ fi. [sent-218, score-0.169]
60 From Theorem 2, the kernel mean mQ is given by the following tensor representation: mQ = C(YX)X CXX^{-1} mΠ ∈ HY ⊗ HX, where the covariance operator C(YX)X : HX → HY ⊗ HX is defined by the random variable ((Y, X), X) taking values on (Y × X) × X. [sent-219, score-0.147]
61 Similarly, the kernel mean m(WW) on the product space HY ⊗ HY is identified with CWW, and the expression m(WW) = C(YY)X CXX^{-1} mΠ gives a way of estimating the operator CWW. [sent-222, score-0.13]
62 Such negative weights may appear in successive applications of the kernel Bayes rule, as in the state-space example of Section 4. [sent-236, score-0.09]
63 We call Equations (12) and (13) the kernel Bayes’ rule (KBR). [sent-257, score-0.128]
64 (ii) {(Uj, γj)}_{j=1}^ℓ : weighted sample to express the kernel mean of the prior mΠ. [sent-267, score-0.127]
65 Given conditioning value y, the kernel mean of the posterior q(x|y) is estimated by the weighted sample {(Xi, ρi)}_{i=1}^n with weights ρ = RX|Y kY(y), where kY(y) = (kY(Yi, y))_{i=1}^n. [sent-278, score-0.166]
66 It is obvious from the reproducing property that this theorem also guarantees the consistency of the posterior expectation in Equation (14). [sent-294, score-0.109]
67 In fact, in the case of kernel ridge regression, the optimal rates are known under additional information on the spectrum of covariance operators (Caponnetto and De Vito, 2007). [sent-314, score-0.126]
68 Given that KBR provides a posterior estimate in the form of a kernel mean (which uniquely determines the distribution when a characteristic kernel is used), we now describe how our kernel approach applies to problems in Bayesian inference. [sent-330, score-0.371]
69 In the case of the Gaussian kernel exp(−‖x − y‖^2/(2σ^2)), the fixed point method x^(t+1) = ∑_{i=1}^n Xi ρi exp(−‖Xi − x^(t)‖^2/(2σ^2)) / ∑_{i=1}^n ρi exp(−‖Xi − x^(t)‖^2/(2σ^2)), where ρ = RX|Y kY(y), can be used to optimize x sequentially (Mika et al., 1999). [sent-338, score-0.09]
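A sketch of this fixed-point iteration (the data and the placeholder weights ρ below are assumptions; in practice ρ would come from the KBR weights above):

import numpy as np

def preimage_fixed_point(X, rho, sigma=1.0, n_iter=50):
    # x <- sum_i X_i rho_i exp(-||X_i - x||^2/(2 sigma^2)) / sum_i rho_i exp(-||X_i - x||^2/(2 sigma^2))
    x = (rho @ X) / rho.sum()                 # start from the weighted average
    for _ in range(n_iter):
        w = rho * np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))
        if abs(w.sum()) < 1e-12:              # guard: some rho_i may be negative
            break
        x = (w @ X) / w.sum()
    return x

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
rho = rng.dirichlet(np.ones(200))             # placeholder posterior weights
print(preimage_fixed_point(X, rho))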
70 • Even if explicit forms for the likelihood and prior are available, and standard sampling methods such as MCMC or sequential MC are applicable, the computation of a posterior estimate given y might still be computationally costly, making real-time applications infeasible. [sent-358, score-0.094]
71 Third, kernel choice or model selection is key to effective performance of any kernel method. [sent-370, score-0.18]
72 In the case of KBR, we have three model parameters: the kernel (or its parameter, e.g., the bandwidth), and the regularization parameters ε and δ. [sent-371, score-0.09]
73 We are thus able to compare the discrepancy between the empirical kernel mean of PX and the average of the estimators mQX|y=Yi over Yi. [sent-380, score-0.108]
74 Dividing {1, . . . , n} into K disjoint subsets {Ta}_{a=1}^K, let mQX|y^[−a] be the kernel mean of the posterior computed using Gram matrices on the data {(Xi, Yi)}_{i∈Ta}, and based on the prior mean mX with the data {Xi}_{i∈Ta}. [sent-384, score-0.203]
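The validation score above compares two kernel means; their squared RKHS distance reduces to Gram-matrix terms, ‖∑_i a_i k(·, Ai) − ∑_j b_j k(·, Bj)‖^2 = aᵀ KAA a − 2 aᵀ KAB b + bᵀ KBB b, as in this sketch (the exact cross-validation criterion of the paper is assumed to reduce to such computations):

import numpy as np

def gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def rkhs_dist2(A, a, B, b, sigma=1.0):
    # || sum_i a_i k(., A_i) - sum_j b_j k(., B_j) ||_H^2
    return (a @ gram(A, A, sigma) @ a
            - 2.0 * a @ gram(A, B, sigma) @ b
            + b @ gram(B, B, sigma) @ b)

rng = np.random.default_rng(6)
A = rng.normal(size=(100, 1)); a = np.full(100, 1.0 / 100)
B = rng.normal(size=(150, 1)); b = np.full(150, 1.0 / 150)
print(rkhs_dist2(A, a, B, b))   # small: both weighted samples estimate the same kernel mean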
75 Alternatively, nonparametric estimates of conditional density functions can be employed, including kernel density estimation or distribution estimates on a partitioning of the space (Monbet et al. [sent-390, score-0.175]
76 This is related to the approach of Song et al. (2010b), in which the kernel means and covariance operators are used to implement a nonparametric HMM. [sent-401, score-0.151]
77 Given observations ỹ1, . . . , ỹt, we wish to estimate the current hidden state xt. [sent-423, score-0.106]
78 The sequential estimate for the kernel mean of p(xt | ỹ1, . . . , ỹt) is derived as follows. [sent-424, score-0.108]
79 Suppose we already have an estimator of the kernel mean of p(xt | ỹ1, . . . , ỹt). [sent-428, score-0.138]
80 Since p(xt+1 | ỹ1, . . . , ỹt) = ∫ p(xt+1 | xt) p(xt | ỹ1, . . . , ỹt) dxt, based on Equation (9) the kernel mean of xt+1 given ỹ1, . . . , ỹt can be estimated from the current estimate. [sent-452, score-0.157]
81 Combining this with the observation model, we have an estimate for the kernel mean of the prediction p(ỹt+1 | ỹ1, . . . , ỹt). [sent-471, score-0.157]
82 The posterior p(xt+1 | ỹ1, . . . , ỹt+1) is then estimated by applying the kernel Bayes’ rule with the prior p(xt+1 | ỹ1, . . . , ỹt). [sent-491, score-0.196]
83 If the prior π(x1) is available, the posterior estimate at x1 given ỹ1 is obtained by the kernel Bayes’ rule. [sent-502, score-0.167]
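A simplified one-step sketch of such sequential estimation (toy data; this follows a generic conditional-embedding prediction step via the kernel Sum Rule rather than the paper's exact filtering algorithm, and the update step would then apply the KBR weights from the earlier sketch with the predicted embedding as prior):

import numpy as np

def gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(7)
traj = np.cumsum(0.1 * rng.normal(size=(301, 1)), axis=0)   # toy state trajectory
Xc, Xn = traj[:-1], traj[1:]            # training transition pairs (x_t, x_{t+1})
n, eps = len(Xc), 0.1
GC = gram(Xc, Xc)

alpha = np.full(n, 1.0 / n)             # weights of the embedding of p(x_t | y_1..t) on Xc
# prediction step (kernel Sum Rule): embedding of p(x_{t+1} | y_1..t), expressed on Xn
alpha_pred = np.linalg.solve(GC + n * eps * np.eye(n), GC @ alpha)
print(alpha_pred @ Xn.ravel())          # a point summary of the predicted state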
84 In the same setting as ABC, KBR gives the following sampling-based method for computing the kernel posterior mean. [sent-512, score-0.148]
85 Alternatively, since (Xt, Yt) is a sample from Q, it is possible to simply use Equation (8) for the kernel mean of the conditional probability q(x|y). [sent-527, score-0.132]
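A hedged sketch of this likelihood-free use of the conditional mean embedding (the simulator, constants, and regularization scaling are illustrative assumptions): draw Xt from the prior, simulate Yt, and condition on the observed y via Equation (8):

import numpy as np

def gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(8)
T, eps = 1000, 0.01
Xs = rng.normal(size=(T, 1))                   # X_t drawn from the prior
Ys = Xs + 0.2 * rng.normal(size=(T, 1))        # Y_t produced by the sampler/simulator
y_obs = np.array([[0.7]])

w = np.linalg.solve(gram(Ys, Ys) + T * eps * np.eye(T), gram(Ys, y_obs)).ravel()
print(w @ Xs.ravel())                          # posterior mean estimate of X given y_obs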
86 The distribution of a sample generated by ABC approaches the true posterior as τ goes to zero, while empirical estimates via the kernel approaches converge to the true posterior mean embedding in the limit of infinite sample size. [sent-530, score-0.238]
87 Given a posterior mean obtained by one of the kernel methods, however, we may only obtain expectations of functions in the RKHS, meaning that certain statistics (such as confidence intervals) are not straightforward to compute. [sent-533, score-0.166]
88 In addition to the basic comparison with kernel density estimation, we show simple experiments for Bayesian inference without likelihood and filtering with nonparametric hidden Markov models. [sent-538, score-0.19]
89 Nonparametric Inference of Posterior The first numerical example is a comparison between KBR and a kernel density estimation (KDE) approach to obtaining conditional densities. [sent-544, score-0.132]
90 This function does not belong to the Gaussian kernel RKHS, hence the consistency result of Theorem 6 does not apply to it. [sent-564, score-0.11]
91 Bayesian Computation Without Likelihood We compare ABC and the kernel methods, KBR and the conditional kernel mean, in terms of estimation accuracy and computational time, since they have an obvious tradeoff. [sent-584, score-0.114]
92 For the kernel methods, the sample size n is varied. [sent-594, score-0.09]
93 The kernels in the kernel methods are Gaussian (see below). (Figure: CPU time vs. Error, 2 dim.) [sent-598, score-0.11]
94 The experimental results indicate that kernel methods achieve more accurate results than ABC at a given computational cost. [sent-621, score-0.09]
95 The conditional kernel mean yields the best results, since in this instance, it is not necessary to correct for a difference in distribution between Π and PX . [sent-622, score-0.132]
96 KBR has the regularization parameters εT, δT, and kernel parameters for kX and kY (e.g., the Gaussian bandwidths). [sent-627, score-0.09]
97 To reduce the search space and attendant computational cost, we used a simpler procedure, setting δT = 2εT, and Gaussian kernel bandwidths βσX and βσY, where σX and σY were the medians of pairwise distances in the training samples (Gretton et al.). [sent-632, score-0.09]
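A sketch of this bandwidth choice (the median heuristic), with β an assumed grid-search scaling factor:

import numpy as np

def median_bandwidth(X):
    # median of pairwise Euclidean distances among the training points
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    d = np.sqrt(np.maximum(d2, 0.0))
    return np.median(d[np.triu_indices(len(X), k=1)])

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 5))
beta = 1.0                                  # scaling factor, e.g., chosen over a small grid
print(beta * median_bandwidth(X))           # candidate sigma_X for the Gaussian kernel on X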
98 In the second setting, we exploited the fact that St ∈ SO(3): for the Kalman filter, St was represented by a quaternion, which is a standard vector representation of rotations; for the KBR filter the kernel k(A, B) = Tr[A B^T] was used for St, and St was estimated within SO(3). [sent-703, score-0.09]
99 Before stating the main proofs, we provide a proof of the consistency (Proposition 3) of the empirical counterpart to the kernel Sum Rule (Theorem 2). [sent-711, score-0.148]
100 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-873, score-0.121]
wordName wordTfidf (topN-words)
[('cww', 0.505), ('cxx', 0.461), ('kbr', 0.304), ('ky', 0.289), ('hx', 0.283), ('hy', 0.223), ('kx', 0.195), ('cy', 0.139), ('mqy', 0.122), ('czw', 0.103), ('kernel', 0.09), ('cw', 0.079), ('mqx', 0.078), ('retton', 0.073), ('ukumizu', 0.073), ('fukumizu', 0.072), ('rkhs', 0.072), ('qy', 0.07), ('kenji', 0.069), ('gretton', 0.068), ('rx', 0.068), ('posterior', 0.058), ('arthur', 0.055), ('dpx', 0.054), ('song', 0.052), ('ernel', 0.052), ('abc', 0.052), ('bayes', 0.052), ('yt', 0.049), ('gy', 0.043), ('kde', 0.042), ('jy', 0.042), ('py', 0.042), ('xt', 0.04), ('ekf', 0.039), ('hilbert', 0.039), ('ong', 0.038), ('rule', 0.038), ('gx', 0.033), ('gram', 0.032), ('reproducing', 0.031), ('mx', 0.031), ('estimator', 0.03), ('cyy', 0.029), ('sriperumbudur', 0.029), ('px', 0.029), ('kalman', 0.028), ('bernhard', 0.026), ('bayesian', 0.026), ('iw', 0.025), ('characteristic', 0.025), ('nonparametric', 0.025), ('embeddings', 0.025), ('dpy', 0.024), ('ukf', 0.024), ('wy', 0.024), ('conditional', 0.024), ('equation', 0.024), ('inference', 0.023), ('bx', 0.023), ('ta', 0.023), ('operator', 0.022), ('kt', 0.021), ('lter', 0.021), ('kernels', 0.02), ('ft', 0.02), ('consistency', 0.02), ('bharath', 0.02), ('cxy', 0.02), ('marjoram', 0.02), ('mxt', 0.02), ('nikodym', 0.02), ('sisson', 0.02), ('zw', 0.02), ('measurable', 0.019), ('ww', 0.019), ('injective', 0.019), ('prior', 0.019), ('operators', 0.019), ('lder', 0.019), ('density', 0.018), ('mean', 0.018), ('ltering', 0.018), ('radon', 0.017), ('berlinet', 0.017), ('ib', 0.017), ('hidden', 0.017), ('likelihood', 0.017), ('covariance', 0.017), ('gr', 0.017), ('aronszajn', 0.016), ('inversion', 0.016), ('genetics', 0.015), ('gxx', 0.015), ('maceachern', 0.015), ('nishiyama', 0.015), ('tavar', 0.015), ('sch', 0.015), ('dynamics', 0.014), ('embedding', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
Author: Kenji Fukumizu, Le Song, Arthur Gretton
Abstract: A kernel method for realizing Bayes’ rule is proposed, based on representations of probabilities in reproducing kernel Hilbert spaces. Probabilities are uniquely characterized by the mean of the canonical map to the RKHS. The prior and conditional probabilities are expressed in terms of RKHS functions of an empirical sample: no explicit parametric model is needed for these quantities. The posterior is likewise an RKHS mean of a weighted sample. The estimator for the expectation of a function of the posterior is derived, and rates of consistency are shown. Some representative applications of the kernel Bayes’ rule are presented, including Bayesian computation without likelihood and filtering with a nonparametric state-space model. Keywords: kernel method, Bayes’ rule, reproducing kernel Hilbert space
2 0.055036329 76 jmlr-2013-Nonparametric Sparsity and Regularization
Author: Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, Alessandro Verri
Abstract: In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods. Keywords: sparsity, nonparametric, variable selection, regularization, proximal methods, RKHS
3 0.05182638 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
Author: Aleksandr Y. Aravkin, James V. Burke, Gianluigi Pillonetto
Abstract: We introduce a new class of quadratic support (QS) functions, many of which already play a crucial role in a variety of applications, including machine learning, robust statistical inference, sparsity promotion, and inverse problems such as Kalman smoothing. Well known examples of QS penalties include the ℓ2 , Huber, ℓ1 and Vapnik losses. We build on a dual representation for QS functions, using it to characterize conditions necessary to interpret these functions as negative logs of true probability densities. This interpretation establishes the foundation for statistical modeling with both known and new QS loss functions, and enables construction of non-smooth multivariate distributions with specified means and variances from simple scalar building blocks. The main contribution of this paper is a flexible statistical modeling framework for a variety of learning applications, together with a toolbox of efficient numerical methods for estimation. In particular, a broad subclass of QS loss functions known as piecewise linear quadratic (PLQ) penalties has a dual representation that can be exploited to design interior point (IP) methods. IP methods solve nonsmooth optimization problems by working directly with smooth systems of equations characterizing their optimality. We provide several numerical examples, along with a code that can be used to solve general PLQ problems. The efficiency of the IP approach depends on the structure of particular applications. We consider the class of dynamic inverse problems using Kalman smoothing. This class comprises a wide variety of applications, where the aim is to reconstruct the state of a dynamical system with known process and measurement models starting from noisy output samples.
5 0.0411397 32 jmlr-2013-Differential Privacy for Functions and Functional Data
Author: Rob Hall, Alessandro Rinaldo, Larry Wasserman
Abstract: Differential privacy is a rigorous cryptographically-motivated characterization of data privacy which may be applied when releasing summaries of a database. Previous work has focused mainly on methods for which the output is a finite dimensional vector, or an element of some discrete set. We develop methods for releasing functions while preserving differential privacy. Specifically, we show that adding an appropriate Gaussian process to the function of interest yields differential privacy. When the functions lie in the reproducing kernel Hilbert space (RKHS) generated by the covariance kernel of the Gaussian process, then the correct noise level is established by measuring the “sensitivity” of the function in the RKHS norm. As examples we consider kernel density estimation, kernel support vector machines, and functions in RKHSs. Keywords: differential privacy, density estimation, Gaussian processes, reproducing kernel Hilbert space
6 0.034391239 90 jmlr-2013-Quasi-Newton Method: A New Direction
7 0.031756204 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
8 0.029584486 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
9 0.027740438 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
10 0.027197342 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
11 0.026354009 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
12 0.026183857 108 jmlr-2013-Stochastic Variational Inference
13 0.025970582 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
14 0.025362464 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
15 0.025186207 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
16 0.024679696 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
17 0.023951331 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
18 0.023858283 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
19 0.023503952 121 jmlr-2013-Variational Inference in Nonconjugate Models
20 0.021872234 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
topicId topicWeight
[(0, -0.118), (1, -0.01), (2, 0.029), (3, -0.015), (4, -0.001), (5, -0.052), (6, 0.008), (7, 0.05), (8, 0.012), (9, -0.063), (10, 0.01), (11, 0.003), (12, 0.028), (13, -0.044), (14, -0.011), (15, -0.013), (16, 0.008), (17, -0.068), (18, 0.172), (19, 0.045), (20, -0.202), (21, -0.034), (22, -0.043), (23, 0.09), (24, -0.044), (25, -0.055), (26, 0.042), (27, -0.094), (28, -0.131), (29, 0.151), (30, 0.215), (31, 0.03), (32, 0.191), (33, -0.036), (34, -0.053), (35, 0.103), (36, 0.254), (37, -0.016), (38, 0.035), (39, 0.016), (40, 0.056), (41, -0.058), (42, 0.016), (43, -0.022), (44, 0.098), (45, -0.107), (46, -0.104), (47, 0.045), (48, 0.239), (49, 0.202)]
simIndex simValue paperId paperTitle
same-paper 1 0.93145233 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
Author: Kenji Fukumizu, Le Song, Arthur Gretton
Abstract: A kernel method for realizing Bayes’ rule is proposed, based on representations of probabilities in reproducing kernel Hilbert spaces. Probabilities are uniquely characterized by the mean of the canonical map to the RKHS. The prior and conditional probabilities are expressed in terms of RKHS functions of an empirical sample: no explicit parametric model is needed for these quantities. The posterior is likewise an RKHS mean of a weighted sample. The estimator for the expectation of a function of the posterior is derived, and rates of consistency are shown. Some representative applications of the kernel Bayes’ rule are presented, including Bayesian computation without likelihood and filtering with a nonparametric state-space model. Keywords: kernel method, Bayes’ rule, reproducing kernel Hilbert space
2 0.41727996 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
Author: Wei Wu, Zhengdong Lu, Hang Li
Abstract: The task of matching data from two heterogeneous domains naturally arises in various areas such as web search, collaborative filtering, and drug design. In web search, existing work has designed relevance models to match queries and documents by exploiting either user clicks or content of queries and documents. To the best of our knowledge, however, there has been little work on principled approaches to leveraging both clicks and content to learn a matching model for search. In this paper, we propose a framework for learning to match heterogeneous objects. The framework learns two linear mappings for two objects respectively, and matches them via the dot product of their images after mapping. Moreover, when different regularizations are enforced, the framework renders a rich family of matching models. With orthonormal constraints on mapping functions, the framework subsumes Partial Least Squares (PLS) as a special case. Alternatively, with a ℓ1 +ℓ2 regularization, we obtain a new model called Regularized Mapping to Latent Structures (RMLS). RMLS enjoys many advantages over PLS, including lower time complexity and easy parallelization. To further understand the matching framework, we conduct generalization analysis and apply the result to both PLS and RMLS. We apply the framework to web search and implement both PLS and RMLS using a click-through bipartite with metadata representing features of queries and documents. We test the efficacy and scalability of RMLS and PLS on large scale web search problems. The results show that both PLS and RMLS can significantly outperform baseline methods, while RMLS substantially speeds up the learning process. Keywords: web search, partial least squares, regularized mapping to latent structures, generalization analysis
3 0.36660862 76 jmlr-2013-Nonparametric Sparsity and Regularization
Author: Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, Alessandro Verri
Abstract: In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods. Keywords: sparsity, nonparametric, variable selection, regularization, proximal methods, RKHS
4 0.34516037 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
5 0.3027027 90 jmlr-2013-Quasi-Newton Method: A New Direction
Author: Philipp Hennig, Martin Kiefel
Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes
6 0.27868327 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
8 0.24077968 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
10 0.21445265 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
11 0.21110681 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
12 0.20706248 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
13 0.20353746 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
14 0.18259048 15 jmlr-2013-Bayesian Canonical Correlation Analysis
15 0.17680302 22 jmlr-2013-Classifying With Confidence From Incomplete Information
16 0.17617092 32 jmlr-2013-Differential Privacy for Functions and Functional Data
17 0.15936038 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
18 0.15620235 62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion
19 0.15206052 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
20 0.14042407 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
topicId topicWeight
[(0, 0.021), (5, 0.097), (6, 0.034), (10, 0.051), (20, 0.016), (23, 0.028), (41, 0.016), (44, 0.01), (53, 0.014), (68, 0.01), (70, 0.023), (71, 0.016), (73, 0.445), (75, 0.033), (85, 0.038), (87, 0.024), (93, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.6993382 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
Author: Kenji Fukumizu, Le Song, Arthur Gretton
Abstract: A kernel method for realizing Bayes’ rule is proposed, based on representations of probabilities in reproducing kernel Hilbert spaces. Probabilities are uniquely characterized by the mean of the canonical map to the RKHS. The prior and conditional probabilities are expressed in terms of RKHS functions of an empirical sample: no explicit parametric model is needed for these quantities. The posterior is likewise an RKHS mean of a weighted sample. The estimator for the expectation of a function of the posterior is derived, and rates of consistency are shown. Some representative applications of the kernel Bayes’ rule are presented, including Bayesian computation without likelihood and filtering with a nonparametric state-space model. Keywords: kernel method, Bayes’ rule, reproducing kernel Hilbert space
2 0.67253125 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks
Author: Vinod K. Valsalam, Risto Miikkulainen
Abstract: Sorting networks are an interesting class of parallel sorting algorithms with applications in multiprocessor computers and switching networks. They are built by cascading a series of comparisonexchange units called comparators. Minimizing the number of comparators for a given number of inputs is a challenging optimization problem. This paper presents a two-pronged approach called Symmetry and Evolution based Network Sort Optimization (SENSO) that makes it possible to scale the solutions to networks with a larger number of inputs than previously possible. First, it uses the symmetry of the problem to decompose the minimization goal into subgoals that are easier to solve. Second, it minimizes the resulting greedy solutions further by using an evolutionary algorithm to learn the statistical distribution of comparators in minimal networks. The final solutions improve upon half-century of results published in patents, books, and peer-reviewed literature, demonstrating the potential of the SENSO approach for solving difficult combinatorial problems. Keywords: symmetry, evolution, estimation of distribution algorithms, sorting networks, combinatorial optimization
3 0.27870724 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
4 0.27357972 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
5 0.27344254 76 jmlr-2013-Nonparametric Sparsity and Regularization
Author: Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, Alessandro Verri
Abstract: In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods. Keywords: sparsity, nonparametric, variable selection, regularization, proximal methods, RKHS
6 0.27213404 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
7 0.2704176 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
8 0.27022794 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
10 0.26952991 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
11 0.26944345 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
12 0.26932544 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
14 0.2688809 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
15 0.26830348 73 jmlr-2013-Multicategory Large-Margin Unified Machines
16 0.26821986 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
17 0.26724151 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
18 0.26712972 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
19 0.26702613 121 jmlr-2013-Variational Inference in Nonconjugate Models
20 0.26700136 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning