jmlr jmlr2009 jmlr2009-1 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Takafumi Kanamori, Shohei Hido, Masashi Sugiyama
Abstract: We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally highly efficient and simple to implement. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bounds. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches. Keywords: importance sampling, covariate shift adaptation, novelty detection, regularization path, leave-one-out cross validation
Reference: text
sentIndex sentText sentNum sentScore
1 The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. [sent-10, score-0.268]
2 Keywords: importance sampling, covariate shift adaptation, novelty detection, regularization path, leave-one-out cross validation 1. [sent-15, score-0.234]
3 The problem of estimating the importance is attracting a great deal of attention these days since the importance can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. [sent-17, score-0.353]
4 A naive approach to estimating the importance is to first estimate the training and test density functions from the sets of training and test samples separately, and then take the ratio of the estimated densities. [sent-42, score-0.223]
5 We experimentally show that the accuracy of uLSIF is comparable to the best existing method while its computation is faster than other methods in covariate shift adaptation and outlier detection scenarios. [sent-83, score-0.223]
6 i.i.d. training samples {x_i^tr}_{i=1}^{ntr} from a training distribution with density ptr(x) and i.i.d. test samples {x_j^te}_{j=1}^{nte} from a test distribution with density pte(x). [sent-106, score-0.229]
7 The goal of this paper is to estimate the importance w(x) from {x_i^tr}_{i=1}^{ntr} and {x_j^te}_{j=1}^{nte}: w(x) = pte(x)/ptr(x), which is non-negative by definition. [sent-117, score-0.322]
8 Our key restriction is that we want to avoid estimating densities pte (x) and ptr (x) when estimating the importance w(x). [sent-118, score-0.322]
9 Let E be the expectation over all possible training samples of size ntr and all possible test samples of size nte . [sent-162, score-1.0]
10 Theorem 2 Let P be the probability over all possible training samples of size ntr and test samples of size nte . [sent-209, score-1.0]
11 Then, there exists a positive constant c > 0 and a natural number N such that for min{ntr, nte} ≥ N, P(Â ≠ A) < e^{−c min{ntr, nte}}. [sent-211, score-0.274]
12 Then, for any λ ≥ 0, we have E[J(α(λ))] = J(α∗(λ)) + (1/(2ntr)) tr(A(Cw∗,w∗ − 2λCw∗,v)) + o(1/ntr). (19) [sent-221, score-0.69]
13 However, the value of the cost function J is inaccessible since it includes the expectation over unknown probability density functions ptr (x) and pte (x). [sent-231, score-0.267]
14 Information Criterion. In the same way as Theorem 3, we can obtain an asymptotic expansion of the empirical cost E[Ĵ(α(λ))] as follows: E[Ĵ(α(λ))] = J(α∗(λ)) − (1/(2ntr)) tr(A(Cw∗,w∗ + 2λCw∗,v)) + o(1/ntr). (20) [sent-236, score-0.627]
15 Combining Eqs. (19) and (20), we have E[J(α(λ))] = E[Ĵ(α(λ))] + (1/ntr) tr(A Cw∗,w∗) + o(1/ntr). [sent-238, score-1.254]
16 From this, we can immediately obtain an information criterion (Akaike, 1974; Konishi and Kitagawa, 1996) for LSIF: J^(IC) = Ĵ(α(λ)) + (1/ntr) tr(A Cw,w), where A is defined by Eq. [sent-239, score-0.627]
17 Then an importance estimate w_{X_r^tr, X_r^te}(x) is obtained using {X_j^tr}_{j≠r} and {X_j^te}_{j≠r} (that is, without X_r^tr and X_r^te), and the cost J is approximated using the held-out samples X_r^tr and X_r^te as J^(CV)_{X_r^tr, X_r^te} = (1/(2|X_r^tr|)) Σ_{x^tr ∈ X_r^tr} w_{X_r^tr, X_r^te}(x^tr)² − (1/|X_r^te|) Σ_{x^te ∈ X_r^te} w_{X_r^tr, X_r^te}(x^te). [sent-251, score-0.328]
18 This procedure is repeated for r = 1, 2, . [sent-252, score-0.274]
19 As model candidates, we propose using a Gaussian kernel model centered at the test points {x_j^te}_{j=1}^{nte}, that is, w(x) = Σ_{ℓ=1}^{nte} αℓ Kσ(x, x_ℓ^te), where Kσ(x, x′) is the Gaussian kernel with kernel width σ: Kσ(x, x′) = exp(−‖x − x′‖²/(2σ²)). [sent-261, score-0.62]
20 When nte is large, just using all the test points {x_j^te}_{j=1}^{nte} as Gaussian centers is already computationally rather demanding. [sent-269, score-0.325]
21 where {cℓ}_{ℓ=1}^{b} are template points randomly chosen from {x_j^te}_{j=1}^{nte} without replacement and b (≤ nte) is a prefixed number. [sent-273, score-0.285]
22 In the rest of this paper, we usually fix the number of template points at b = min(100, nte ), and optimize the kernel width σ and the regularization parameter λ by cross-validation with grid search. [sent-274, score-0.422]
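A minimal sketch of this kernel model (NumPy assumed; the helper names gaussian_basis and choose_centers are mine, not the paper's):

```python
import numpy as np

def gaussian_basis(x, centers, sigma):
    """Design matrix of Gaussian basis functions: phi_l(x) = exp(-||x - c_l||^2 / (2 sigma^2))."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    c = np.asarray(centers, dtype=float).reshape(len(centers), -1)
    sq_dist = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)   # (n, b) squared distances
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def choose_centers(x_te, b_max=100, seed=None):
    """Pick b = min(b_max, n_te) template points from the test set without replacement."""
    x_te = np.asarray(x_te, dtype=float).reshape(len(x_te), -1)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_te), size=min(b_max, len(x_te)), replace=False)
    return x_te[idx]
```

With centers = choose_centers(x_te) and Phi = gaussian_basis(x, centers, sigma), the kernel width σ and the regularization parameter λ are then tuned by the grid search described above.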
23 Then, there exists a positive constant c and a natural number N such that for min{ntr, nte} ≥ N, P(B̂ ≠ B) < e^{−c min{ntr, nte}}. [sent-372, score-0.274]
24 Then, for any λ ≥ 0, we have E[J(β(λ))] = J(β∗(λ)) + (1/(2ntr)) tr(Bλ^{-1} D H D Bλ^{-1} Cw◦,w◦ + 2 Bλ^{-1} Cw◦,u) + o(1/ntr). [sent-389, score-0.627]
25 In the theoretical analysis below, we assume Σ_{i=1}^{ntr} w(x_i^tr; β(λ)) ≠ 0. [sent-402, score-0.627]
26 Then we have diff(λ) ≤ ‖β(λ)‖_H / Σ_{i=1}^{ntr} w(x_i^tr; β(λ)) (28) ≤ (b²(1 + 1/b) / (λ minℓ Σ_{i=1}^{ntr} ϕℓ(x_i^tr))) · (nte / minℓ Σ_{j=1}^{nte} ϕℓ(x_j^te)), (29) where b is the number of basis functions. [sent-407, score-1.25]
27 Theorem 7 (Bridge bound) For any λ ≥ 0, the following inequality holds: diff(λ) ≤ (λ(‖γ(λ)‖₁ · ‖γ(λ)‖∞ − ‖γ(λ)‖₂²) + ‖γ(λ) − β(λ)‖_H) / Σ_{i=1}^{ntr} w(x_i^tr; β(λ)). [sent-425, score-0.627]
28 For simplicity, we assume that ntr < nte and the i-th training sample x_i^tr and the i-th test sample x_i^te are held out at the same time; the test samples {x_j^te}_{j=ntr+1}^{nte} are always used for importance estimation. [sent-434, score-1.58]
29 Let w^(i)(x) be an estimate of the importance obtained without the i-th training sample x_i^tr and the i-th test sample x_i^te. [sent-436, score-0.565]
30 Then the LOOCV score is expressed as LOOCV = (1/ntr) Σ_{i=1}^{ntr} [ (1/2) (w^(i)(x_i^tr))² − w^(i)(x_i^te) ]. (32) [sent-437, score-0.702]
31 Our approach to efficiently computing the LOOCV score is to use the Sherman-Woodbury-Morrison formula (Golub and Van Loan, 1996) for computing matrix inverses: for an invertible square matrix A and vectors u and v such that v⊤A^{-1}u ≠ −1, (A + uv⊤)^{-1} = A^{-1} − (A^{-1}uv⊤A^{-1})/(1 + v⊤A^{-1}u). [sent-438, score-0.639]
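A small self-contained check of this rank-one identity, including the 1 + v⊤A^{-1}u denominator (the matrix and vectors here are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5)) + 5.0 * np.eye(5)   # a well-conditioned invertible matrix
u = rng.normal(size=(5, 1))
v = rng.normal(size=(5, 1))

A_inv = np.linalg.inv(A)
denom = 1.0 + (v.T @ A_inv @ u).item()
# Sherman-Morrison(-Woodbury): (A + u v^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u)
lhs = np.linalg.inv(A + u @ v.T)
rhs = A_inv - (A_inv @ u @ v.T @ A_inv) / denom
print(np.allclose(lhs, rhs))  # True
```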
32 Setup. Let the dimension of the domain be d = 1 and the training and test densities be ptr(x) = N(x; 1, (1/2)²), pte(x) = N(x; 2, (1/4)²), where N(x; µ, σ²) denotes the Gaussian density with mean µ and variance σ². [sent-447, score-0.306]
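For this illustrative setup the true importance is available in closed form; a quick sketch (SciPy assumed; the sample sizes are only for illustration) generates the data and evaluates the exact ratio for reference:

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D setup from the text: p_tr = N(1, (1/2)^2), p_te = N(2, (1/4)^2).
def true_importance(x):
    return norm.pdf(x, loc=2.0, scale=0.25) / norm.pdf(x, loc=1.0, scale=0.5)

rng = np.random.default_rng(0)
x_tr = rng.normal(1.0, 0.5, size=200)    # training samples (size chosen for illustration)
x_te = rng.normal(2.0, 0.25, size=1000)  # test samples
print(true_importance(np.array([1.0, 1.5, 2.0])))
```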
33–36 uLSIF with LOOCV (Figure 2), core computations: for ℓ, ℓ′ = 1, 2, . . . , b, Hℓ,ℓ′ ←− (1/ntr) Σ_{i=1}^{ntr} exp(−(‖x_i^tr − cℓ‖² + ‖x_i^tr − cℓ′‖²)/(2σ²)); for ℓ = 1, 2, . . . , b, hℓ ←− (1/nte) Σ_{j=1}^{nte} exp(−‖x_j^te − cℓ‖²/(2σ²)); α ←− max(0_b, (H + λI_b)^{-1} h); w(x) ←− Σ_{ℓ=1}^{b} αℓ exp(−‖x − cℓ‖²/(2σ²)). Figure 2: Pseudo code of uLSIF algorithm with LOOCV. [sent-474, score-0.274]
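A compact, self-contained sketch of the uLSIF estimate itself (variable names are mine; the analytic LOOCV selection of (σ, λ) is omitted for brevity), assuming the centers cℓ have already been drawn from the test points:

```python
import numpy as np

def ulsif_fit(x_tr, x_te, centers, sigma, lam):
    """uLSIF core: alpha = max(0_b, (H + lam I_b)^{-1} h), as in the pseudo code above."""
    def phi(x):  # Gaussian basis design matrix, shape (n, b)
        x = np.asarray(x, dtype=float).reshape(len(x), -1)
        c = np.asarray(centers, dtype=float).reshape(len(centers), -1)
        return np.exp(-((x[:, None, :] - c[None, :, :]) ** 2).sum(-1) / (2.0 * sigma ** 2))
    Phi_tr, Phi_te = phi(x_tr), phi(x_te)
    H = Phi_tr.T @ Phi_tr / len(x_tr)   # H_{l,l'} = (1/n_tr) sum_i phi_l(x_i^tr) phi_l'(x_i^tr)
    h = Phi_te.mean(axis=0)             # h_l    = (1/n_te) sum_j phi_l(x_j^te)
    alpha = np.maximum(0.0, np.linalg.solve(H + lam * np.eye(len(h)), h))
    return alpha, (lambda x: phi(x) @ alpha)   # coefficients and the estimate w(x)
```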
37 We set the number of training and test samples at ntr = 200 and nte = 1000, respectively. [sent-484, score-0.97]
38 We set the number of training and test samples at ntr = 50 and nte = 100, respectively. [sent-504, score-0.97]
39 We set the number of training and test samples at ntr = 200 and nte = 1000, respectively. [sent-628, score-0.97]
40 We set the number of training and test samples at ntr = 200 and nte = 1000, respectively. [sent-690, score-0.97]
41 The Gaussian kernel density estimator is given by p(x) = (1/(ntr (2πσ²)^{d/2})) Σ_{k=1}^{ntr} exp(−‖x − x_k^tr‖²/(2σ²)). The performance of KDE depends on the choice of the kernel width σ. [sent-716, score-0.716]
42 KDE can be used for importance estimation by first obtaining density estimators ptr(x) and pte(x) separately from {x_i^tr}_{i=1}^{ntr} and {x_j^te}_{j=1}^{nte}, and then estimating the importance by w(x) = pte(x)/ptr(x). [sent-745, score-0.692]
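A baseline sketch of this two-step KDE route for one-dimensional data (SciPy's gaussian_kde with its default bandwidth stands in for the kernel-width selection discussed in the text):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_importance(x_tr, x_te, x_query):
    """Two-step KDE route (1-D): w(x) = p_te(x) / p_tr(x), each density fit separately."""
    p_tr = gaussian_kde(x_tr)   # density estimated from the training samples
    p_te = gaussian_kde(x_te)   # density estimated from the test samples
    return p_te(x_query) / np.maximum(p_tr(x_query), 1e-12)  # guard against division by ~0
```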
43 The basic idea of KMM is to find w(x) such that the mean discrepancy between nonlinearly transformed samples drawn from pte (x) and ptr (x) is minimized in a universal reproducing kernel Hilbert space (Steinwart, 2001). [sent-753, score-0.294]
44 An empirical version of the above problem is reduced to the following quadratic program: min_{{wi}_{i=1}^{ntr}} (1/2) Σ_{i,i′=1}^{ntr} wi wi′ Kσ(x_i^tr, x_{i′}^tr) − Σ_{i=1}^{ntr} wi κi subject to |Σ_{i=1}^{ntr} wi − ntr| ≤ ntr ε and 0 ≤ w1, w2, . [sent-755, score-3.265]
45 , wntr ≤ B, where κi = (ntr/nte) Σ_{j=1}^{nte} Kσ(x_i^tr, x_j^te). [sent-758, score-1.357]
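One way to prototype this quadratic program (a sketch only: SciPy's general-purpose SLSQP solver is used here instead of a dedicated QP solver, and the Gaussian kernel width σ is left to the caller):

```python
import numpy as np
from scipy.optimize import minimize

def kmm_weights(x_tr, x_te, sigma, B=1000.0, eps=None):
    """Kernel mean matching: minimize 0.5 w^T K w - kappa^T w under the constraints above."""
    x_tr = np.asarray(x_tr, dtype=float).reshape(len(x_tr), -1)
    x_te = np.asarray(x_te, dtype=float).reshape(len(x_te), -1)
    n_tr = len(x_tr)
    if eps is None:
        eps = (np.sqrt(n_tr) - 1.0) / np.sqrt(n_tr)
    gauss = lambda a, b: np.exp(-((a[:, None, :] - b[None, :, :]) ** 2).sum(-1) / (2.0 * sigma ** 2))
    K = gauss(x_tr, x_tr)
    kappa = (n_tr / len(x_te)) * gauss(x_tr, x_te).sum(axis=1)
    cons = [{"type": "ineq", "fun": lambda w: n_tr * eps - (w.sum() - n_tr)},
            {"type": "ineq", "fun": lambda w: n_tr * eps + (w.sum() - n_tr)}]
    res = minimize(lambda w: 0.5 * w @ K @ w - kappa @ w, np.ones(n_tr),
                   jac=lambda w: K @ w - kappa, bounds=[(0.0, B)] * n_tr,
                   constraints=cons, method="SLSQP")
    return res.x   # importance weights at the training points
```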
46 Let us assign a selector variable η = −1 to training samples and η = 1 to test samples, that is, the training and test densities are written as ptr(x) = p(x|η = −1), pte(x) = p(x|η = 1). [sent-770, score-0.345]
47 The importance can then be expressed as w(x) = (p(η = −1)/p(η = 1)) · (p(η = 1|x)/p(η = −1|x)). The probability ratio of test and training samples may be simply estimated by the ratio of the numbers of samples: p(η = −1)/p(η = 1) ≈ ntr/nte. [sent-774, score-0.696]
48 The conditional probability p(η|x) could be approximated by discriminating test samples from training samples using a logistic regression (LogReg) classifier, where η plays the role of a class variable. [sent-775, score-0.373]
49 The parameter ζ is learned so that the negative regularized log-likelihood is minimized: ζ = argmin_ζ [ Σ_{i=1}^{ntr} log(1 + exp(Σ_{ℓ=1}^{m} ζℓ φℓ(x_i^tr))) + Σ_{j=1}^{nte} log(1 + exp(−Σ_{ℓ=1}^{m} ζℓ φℓ(x_j^te))) + λ ζ⊤ζ ]. [sent-778, score-0.901]
50 Then the importance estimate is given by w(x) = (ntr/nte) exp(Σ_{ℓ=1}^{m} ζℓ φℓ(x)). [sent-780, score-0.986]
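A sketch of this LogReg route with scikit-learn (assumed available; the raw inputs play the role of the basis functions φℓ and the regularization strength is left at the library default):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logreg_importance(x_tr, x_te, x_query):
    """w(x) ~ (n_tr / n_te) * p(eta = +1 | x) / p(eta = -1 | x); eta = +1 marks test samples."""
    x_tr = np.asarray(x_tr, dtype=float).reshape(len(x_tr), -1)
    x_te = np.asarray(x_te, dtype=float).reshape(len(x_te), -1)
    X = np.vstack([x_tr, x_te])
    eta = np.hstack([-np.ones(len(x_tr)), np.ones(len(x_te))])
    clf = LogisticRegression(max_iter=1000).fit(X, eta)
    proba = clf.predict_proba(np.asarray(x_query, dtype=float).reshape(len(x_query), -1))
    p_train, p_test = proba[:, 0], proba[:, 1]   # columns follow clf.classes_ = [-1, 1]
    return (len(x_tr) / len(x_te)) * p_test / np.maximum(p_train, 1e-12)
```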
51 An estimate of the test density pte (x) is given by using the model w(x) as pte (x) = w(x)ptr (x). [sent-786, score-0.259]
52 In KLIEP, the parameters α are determined so that the Kullback-Leibler divergence from pte(x) to its model p̂te(x) (= w(x)ptr(x)) is minimized: KL[pte(x) ‖ p̂te(x)] = ∫_D pte(x) log( pte(x)/(w(x)ptr(x)) ) dx = ∫_D pte(x) log( pte(x)/ptr(x) ) dx − ∫_D pte(x) log w(x) dx. (35) [sent-787, score-0.661]
53 The first term is a constant, so it can be safely ignored. [sent-788, score-0.378]
54 Since p̂te(x) (= w(x)ptr(x)) is a probability density function, it should satisfy 1 = ∫_D p̂te(x) dx = ∫_D w(x)ptr(x) dx. (36) [sent-789, score-0.238]
55 Approximating Eqs. (35) and (36) with empirical averages as follows: max_{{αℓ}_{ℓ=1}^{b}} Σ_{j=1}^{nte} log( Σ_{ℓ=1}^{b} αℓ ϕℓ(x_j^te) ) subject to Σ_{ℓ=1}^{b} αℓ Σ_{i=1}^{ntr} ϕℓ(x_i^tr) = ntr and α1, α2, . . . , αb ≥ 0. [sent-791, score-1.541]
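A rough projected-gradient sketch of this constrained maximization (not the authors' implementation; the step size, iteration count, and enforcing the equality constraint by rescaling are my own choices; Phi_tr and Phi_te are precomputed basis design matrices):

```python
import numpy as np

def kliep_fit(Phi_tr, Phi_te, n_iter=2000, lr=1e-3):
    """Maximize sum_j log((Phi_te alpha)_j) subject to alpha >= 0 and
    sum_l alpha_l sum_i phi_l(x_i^tr) = n_tr, the constraint being enforced by rescaling."""
    n_tr = Phi_tr.shape[0]
    c = Phi_tr.sum(axis=0)                       # sum_i phi_l(x_i^tr), one entry per basis
    alpha = np.ones(Phi_tr.shape[1])
    alpha *= n_tr / (c @ alpha)                  # start on the constraint surface
    for _ in range(n_iter):
        w_te = np.maximum(Phi_te @ alpha, 1e-12)
        alpha = np.maximum(0.0, alpha + lr * (Phi_te.T @ (1.0 / w_te)))  # gradient ascent step
        alpha *= n_tr / max(c @ alpha, 1e-12)    # rescale back onto the equality constraint
    return alpha
```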
56 LogReg and KLIEP also do not involve density estimation, but unlike KMM, they give an estimate of the entire importance function, not only the values of the importance at training points. [sent-812, score-0.218]
57 The task is to estimate the importance at training points: wi = w(x_i^tr) = pte(x_i^tr)/ptr(x_i^tr) for i = 1, 2, . . . , ntr. [sent-835, score-0.353]
58 We set B = 1000 and ε = (√ntr − 1)/√ntr following the original paper (Huang et al. [sent-841, score-1.254]
59 We fixed the number of test points at nte = 1000 and consider the following two setups for the number ntr of training samples and the input dimensionality d: (a) ntr is fixed at ntr = 100 and d is changed as d = 1, 2, . [sent-852, score-2.236]
60 , 20, (b) d is fixed at d = 10 and ntr is changed as ntr = 50, 60, . [sent-855, score-1.266]
61 We run the experiments 100 times for each d, each ntr, and each method, and evaluate the quality of the importance estimates {ŵi}_{i=1}^{ntr} by the normalized mean squared error (NMSE): NMSE = (1/ntr) Σ_{i=1}^{ntr} ( ŵi/Σ_{i′=1}^{ntr} ŵi′ − wi/Σ_{i′=1}^{ntr} wi′ )². [sent-859, score-1.992]
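The same evaluation measure written out as a small helper (a direct transcription of the formula above, with w_hat the estimated and w_true the true importance values):

```python
import numpy as np

def nmse(w_hat, w_true):
    """Normalized mean squared error between estimated and true importance values."""
    a = np.asarray(w_hat, dtype=float)
    b = np.asarray(w_true, dtype=float)
    return np.mean((a / a.sum() - b / b.sum()) ** 2)
```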
62 NMSEs averaged over 100 trials (a) as a function of input dimensionality d and (b) as a function of the training sample size ntr are plotted in log scale in Figure 8. [sent-869, score-0.664]
63 Covariate Shift Adaptation in Regression and Classification. Next, we illustrate how the importance estimation methods could be used in covariate shift adaptation (Shimodaira, 2000; Zadrozny, 2004; Sugiyama and Müller, 2005; Huang et al. [sent-893, score-0.228]
64 In addition to training input samples {x_i^tr}_{i=1}^{ntr} drawn from a training input density ptr(x) and test input samples {x_j^te}_{j=1}^{nte} drawn from a test input density pte(x), suppose that we are given training output samples {y_i^tr}_{i=1}^{ntr} at the training input points {x_i^tr}_{i=1}^{ntr}. [sent-915, score-0.501]
65 (Shimodaira, 2000; Sugiyama and Müller, 2005): θ_IWRLS ≡ argmin_θ [ Σ_{i=1}^{ntr} w(x_i^tr) ( f(x_i^tr; θ) − y_i^tr )² + γ ‖θ‖² ]. (37) [sent-920, score-0.662]
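For a model f(x; θ) that is linear in θ with design matrix Phi, this importance-weighted objective has the familiar weighted ridge solution; a sketch (the design matrix, weights, and γ are whatever the surrounding pipeline supplies):

```python
import numpy as np

def iwrls(Phi, y, w, gamma):
    """theta = argmin sum_i w_i (Phi_i . theta - y_i)^2 + gamma ||theta||^2
            = (Phi^T W Phi + gamma I)^{-1} Phi^T W y, with W = diag(w)."""
    Phi = np.asarray(Phi, dtype=float)
    W = np.diag(np.asarray(w, dtype=float))
    A = Phi.T @ W @ Phi + gamma * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ W @ np.asarray(y, dtype=float))
```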
66 The kernel width h and the regularization parameter γ in IWRLS (37) are chosen by importance weighted CV (IWCV) (Sugiyama et al. [sent-936, score-0.222]
67 Then a function f_r(x) is learned using {Z_j^tr}_{j≠r} by IWRLS and its mean test error for the remaining samples Z_r^tr is computed: (1/|Z_r^tr|) Σ_{(x,y)∈Z_r^tr} w(x) loss( f_r(x), y ), where loss(ŷ, y) = (ŷ − y)² (Regression) or (1/2)(1 − sign{ŷ y}) (Classification). [sent-939, score-0.303]
68 Then we remove xk from the pool regardless of its rejection or acceptance, and repeat this procedure until nte samples are accepted. [sent-1088, score-0.322]
69 We set the number of samples at ntr = 100 and nte = 500 for all data sets. [sent-1091, score-0.931]
70 We run the experiments 100 times for each data set and evaluate the mean test error: (1/nte) Σ_{j=1}^{nte} loss( f(x_j^te), y_j^te ). [sent-1093, score-0.569]
71 Defining the importance over two sets of samples, we can see that the importance values for regular samples are close to one, while those for outliers tend to deviate significantly from one. [sent-1110, score-0.218]
72 Kernel density estimator (KDE′): A naive density estimation of all data samples {x_k}_{k=1}^{n} can also be used for outlier detection since the density value itself could be regarded as an outlier score. [sent-1138, score-0.279]
73 Conclusions The importance is useful in various machine learning scenarios such as covariate shift adaptation and outlier detection. [sent-1163, score-0.268]
74 We carried out extensive simulations in covariate shift adaptation and outlier detection, and experimentally confirmed that the proposed uLSIF is computationally more efficient than existing approaches, while the accuracy of uLSIF is comparable to the best existing methods. [sent-1169, score-0.21]
75 Due to the large deviation principle (Dembo and Zeitouni, 1998), there is a positive constant c such that −(1/min{ntr, nte}) log P((H, h) ∈ Bε) > c > 0 if min{ntr, nte} is large enough. [sent-1582, score-0.548]
76 From the central limit theorem and the assumption (18), we have ĥ = h + Op(1/√nte) = h + op(1/ntr), (41) where Op and op denote the asymptotic order in probability. [sent-1591, score-0.901]
77 Combining Eqs. (12), (40), (41), and (42), we have (α(λ), ξ′(λ))⊤ = Ĝ^{-1} G (α∗(λ), ξ∗′(λ))⊤ + op(1/ntr). [sent-1596, score-0.627]
78 (44) The matrix Taylor expansion (Petersen and Pedersen, 2007) yields Ĝ^{-1} = G^{-1} − G^{-1} δG G^{-1} + G^{-1} δG G^{-1} δG G^{-1} − · · · , (45) and the central limit theorem asserts that δH = Op(1/√ntr). [sent-1597, score-0.627]
79 Combining Eqs. (44), (45), (14), and (46), we have δα = α(λ) − α∗(λ) (47) = −A δH α∗(λ) + A δH A δH α∗(λ) + o(1/ntr). [sent-1599, score-0.627]
80 Combining Eqs. (46), (48), and (49), we have E[δα⊤ H δα] = tr(H E[δα δα⊤]) = tr(A H A E[(δHα∗(λ))(δHα∗(λ))⊤]) + o(1/ntr) = tr(A E[(δHα∗(λ))(δHα∗(λ))⊤]) + o(1/ntr). [sent-1610, score-1.254]
81 Combining Eqs. (48), (51), and (52), we have E[δα⊤ (Hα∗(λ) − h)] = −E[(δHα∗(λ) − δH A δHα∗(λ))⊤ A (Hα∗(λ) − h)] + o(1/ntr) = E[(δHα∗(λ) − δH A δHα∗(λ))⊤ λ A 1_b] + o(1/ntr) = −λ tr(A E[(δHα∗(λ))(δH A 1_b)⊤]) + o(1/ntr). [sent-1612, score-1.881]
82 Combining Eqs. (53), (54), and (55), we have (1/(2ntr)) tr(A E[(√ntr δHα∗(λ))(√ntr δHα∗(λ))⊤]) − (λ/ntr) tr(A E[(√ntr δHα∗(λ))(√ntr δH A 1_b)⊤]) + o(1/ntr). [sent-1614, score-2.508]
83 Then we have lim_{ntr→∞} E[(√ntr δHα∗(λ))(√ntr δHα∗(λ))⊤] = Cw∗,w∗ and lim_{ntr→∞} E[(√ntr δHα∗(λ))(√ntr δH A 1_b)⊤] = Cw∗,v, where Cw,w′ is the b × b covariance matrix with the (ℓ, ℓ′)-th element being the covariance between w(x)ϕℓ(x) and w′(x)ϕℓ′(x) under ptr(x). [sent-1616, score-3.895]
84 Due to the large deviation principle (Dembo and Zeitouni, 1998), there is a positive constant c such that −(1/min{ntr, nte}) log P((H, h) ∈ Bε) > c > 0 if min{ntr, nte} is large enough. [sent-1650, score-0.548]
85 Combining Eqs. (59), (41), (58), and (24), we have δβ = β(λ) − β∗(λ) = D B̂λ^{-1} ĥ − D Bλ^{-1} h = −D Bλ^{-1} δH β◦(λ) + D Bλ^{-1} δH Bλ^{-1} δH β◦(λ) + o(1/ntr). [sent-1666, score-0.627]
86 Combining Eqs. (46) and (60), we have E[δβ⊤ H δβ] = tr(Bλ^{-1} D H D Bλ^{-1} E[(δHβ◦(λ))(δHβ◦(λ))⊤]) + o(1/ntr). [sent-1668, score-0.627]
87 Combining Eqs. (52) and (24), we have E[δβ⊤ (Hβ∗(λ) − h)] = E[(−δHβ◦(λ) + δH Bλ^{-1} δHβ◦(λ))⊤ Bλ^{-1} D (Hβ∗(λ) − h)] + o(1/ntr) = E[tr(Bλ^{-1} (δHβ◦(λ))(δH Bλ^{-1} D (Hβ∗(λ) − h))⊤)] + o(1/ntr). [sent-1670, score-1.254]
88 Using the weighted norm (27), we can express diff(λ) as diff(λ) = inf_{λ′≥0} ‖α(λ′) − β(λ)‖_H / Σ_{i=1}^{ntr} w(x_i^tr; β(λ)). [sent-1677, score-0.627]
89 Then we immediately have diff(λ) ≤ ‖β(λ)‖_H / Σ_{i=1}^{ntr} w(x_i^tr; β(λ)), which proves Eq. [sent-1680, score-0.627]
90 Now we have β(λ) H ntr ∑i=1 w(xtr ; β(λ)) i ≤ = √ 1 ntr ∑i=1 w(xtr ; β(λ)) i κmax h λ 2 √ κmax h 1 ntr ∑i=1 ∑b ϕℓ (xtr )βℓ (λ)/ ℓ=1 i β(λ) 1 λ β(λ) 2 . [sent-1686, score-1.881]
91 For the denominator of the above expression, we have Σ_{i=1}^{ntr} Σ_{ℓ=1}^{b} ϕℓ(x_i^tr) βℓ(λ)/‖β(λ)‖₁ ≥ (min_{ℓ′} Σ_{i=1}^{ntr} ϕℓ′(x_i^tr)) · Σ_{ℓ=1}^{b} βℓ(λ)/‖β(λ)‖₁ = minℓ Σ_{i=1}^{ntr} ϕℓ(x_i^tr), where the last equality follows from the non-negativity of βℓ(λ). [sent-1687, score-1.895]
92 Therefore, we have the inequality √ κmax h 2 1 n λ ∑ w(xtr ; β(λ)) i=1 i ≤ b b κmax 1 + κmax 1 nte ntr tr ) · min ′ nte ϕ ′ (xte ) . [sent-1720, score-1.252]
93 From Eq. (64), we obtain the inequality a⊤ H a = (1/ntr) Σ_{i=1}^{ntr} ( Σ_{ℓ=1}^{b} aℓ ϕℓ(x_i^tr) )² ≤ (1/ntr) Σ_{i=1}^{ntr} ( Σ_{ℓ=1}^{b} … )² for any a ∈ R^b. [sent-1729, score-2.508]
94 max_{‖a‖₂=1, a≥0_b} (1/ntr) Σ_{i=1}^{ntr} ( Σ_{ℓ=1}^{b} aℓ )² ≤ max_{‖a‖₂=1, a≥0_b} b · Σ_{ℓ=1}^{b} aℓ² = b, (67) where the Schwarz inequality for a and 1_b is used in the last inequality. [sent-1734, score-1.28]
95 (68) We derive an upper bound of the first term. [sent-1743, score-0.69]
96 Combining Eqs. (68), (76), (77), and (78), we obtain diff(λ) ≤ (λ(‖γ(λ)‖₁ · ‖γ(λ)‖∞ − ‖γ(λ)‖₂²) + ‖γ(λ) − β(λ)‖_H) / Σ_{i=1}^{ntr} w(x_i^tr; β(λ)). [sent-1766, score-0.627]
97 Then the matrix H and the vector h are expressed as H = (1/ntr) Σ_{i=1}^{ntr} ϕ(x_i^tr) ϕ(x_i^tr)⊤, h = (1/nte) Σ_{j=1}^{nte} ϕ(x_j^te), and the coefficients β(λ) can be computed by β(λ) = Bλ^{-1} h. [sent-1773, score-1.802]
98 Let β^(i) be the estimator obtained without the i-th training sample x_i^tr and the i-th test sample x_i^te. [sent-1774, score-0.48]
99 Then the estimator has the following closed form: β^(i)(λ) = max(0_b, β̃^(i)(λ)), β̃^(i)(λ) = ( (1/(ntr − 1)) (ntr H − ϕ(x_i^tr) ϕ(x_i^tr)⊤) + λ I_b )^{-1} (1/(nte − 1)) (nte h − ϕ(x_i^te)). [sent-1775, score-0.627]
100 Let B = H + (λ(ntr − 1)/ntr) I_b and β̃ = B^{-1} h in the following calculation. [sent-1776, score-0.337]
wordName wordTfidf (topN-words)
[('ntr', 0.627), ('ulsif', 0.347), ('lsif', 0.284), ('nte', 0.274), ('xtr', 0.259), ('logreg', 0.182), ('xte', 0.182), ('ptr', 0.133), ('kmm', 0.119), ('kliep', 0.113), ('pte', 0.104), ('anamori', 0.095), ('ido', 0.095), ('ugiyama', 0.095), ('cv', 0.087), ('importance', 0.085), ('irect', 0.08), ('mportance', 0.08), ('pproach', 0.08), ('kde', 0.071), ('loocv', 0.063), ('tr', 0.063), ('width', 0.062), ('sugiyama', 0.061), ('outlier', 0.058), ('stimation', 0.058), ('covariate', 0.054), ('regularization', 0.048), ('shift', 0.047), ('lof', 0.039), ('osvm', 0.039), ('rb', 0.037), ('dx', 0.037), ('ytr', 0.035), ('path', 0.033), ('diff', 0.032), ('density', 0.03), ('samples', 0.03), ('te', 0.029), ('tracking', 0.029), ('gaussian', 0.028), ('squares', 0.027), ('cw', 0.027), ('kernel', 0.027), ('ha', 0.027), ('detection', 0.025), ('iwrls', 0.025), ('bridge', 0.024), ('med', 0.024), ('adaptation', 0.024), ('complementarity', 0.024), ('hb', 0.021), ('tends', 0.021), ('imdk', 0.021), ('lsifq', 0.021), ('test', 0.021), ('bickel', 0.019), ('trials', 0.019), ('centers', 0.018), ('outliers', 0.018), ('xk', 0.018), ('training', 0.018), ('estimation', 0.018), ('hido', 0.018), ('nmse', 0.018), ('kkt', 0.017), ('numerically', 0.017), ('experimentally', 0.015), ('solution', 0.014), ('quadratic', 0.014), ('min', 0.014), ('karatzoglou', 0.014), ('nearesti', 0.014), ('petersen', 0.014), ('xrtr', 0.014), ('active', 0.014), ('widths', 0.013), ('wi', 0.013), ('max', 0.013), ('pseudo', 0.013), ('lagrange', 0.012), ('changed', 0.012), ('lkopf', 0.012), ('basis', 0.012), ('computationally', 0.012), ('delve', 0.012), ('ida', 0.012), ('rdle', 0.012), ('shimodaira', 0.012), ('zadrozny', 0.012), ('score', 0.012), ('hi', 0.012), ('theoretically', 0.012), ('tuning', 0.012), ('ji', 0.012), ('multiplier', 0.012), ('song', 0.011), ('template', 0.011), ('analytically', 0.011), ('ideal', 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
Author: Takafumi Kanamori, Shohei Hido, Masashi Sugiyama
Abstract: We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally highly efficient and simple to implement. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bounds. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches. Keywords: importance sampling, covariate shift adaptation, novelty detection, regularization path, leave-one-out cross validation
2 0.070871577 23 jmlr-2009-Discriminative Learning Under Covariate Shift
Author: Steffen Bickel, Michael Brückner, Tobias Scheffer
Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning
3 0.025131917 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
Author: Jacob Abernethy, Francis Bach, Theodoros Evgeniou, Jean-Philippe Vert
Abstract: We present a general approach for collaborative filtering (CF) using spectral regularization to learn linear operators mapping a set of “users” to a set of possibly desired “objects”. In particular, several recent low-rank type matrix-completion methods for CF are shown to be special cases of our proposed framework. Unlike existing regularization-based CF, our approach can be used to incorporate additional information such as attributes of the users/objects—a feature currently lacking in existing regularization-based CF approaches—using popular and well-known kernel methods. We provide novel representer theorems that we use to develop new estimation methods. We then provide learning algorithms based on low-rank decompositions and test them on a standard CF data set. The experiments indicate the advantages of generalizing the existing regularization-based CF methods to incorporate related information about users and objects. Finally, we show that certain multi-task learning methods can be also seen as special cases of our proposed approach. Keywords: collaborative filtering, matrix completion, kernel methods, spectral regularization
4 0.024583727 29 jmlr-2009-Estimating Labels from Label Proportions
Author: Novi Quadrianto, Alex J. Smola, Tibério S. Caetano, Quoc V. Le
Abstract: Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Keywords: unsupervised learning, Gaussian processes, classification and prediction, probabilistic models, missing variables
5 0.022559056 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression
Author: Saharon Rosset
Abstract: We show how to follow the path of cross validated solutions to families of regularized optimization problems, defined by a combination of a parameterized loss function and a regularization term. A primary example is kernel quantile regression, where the parameter of the loss function is the quantile being estimated. Even though the bi-level optimization problem we encounter for every quantile is non-convex, the manner in which the optimal cross-validated solution evolves with the parameter of the loss function allows tracking of this solution. We prove this property, construct the resulting algorithm, and demonstrate it on real and artificial data. This algorithm allows us to efficiently solve the whole family of bi-level problems. We show how it can be extended to cover other modeling problems, like support vector regression, and alternative in-sample model selection approaches.1
6 0.022038611 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models (Machine Learning Open Source Software Paper)
7 0.021354556 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
9 0.020987647 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
10 0.02097171 91 jmlr-2009-Subgroup Analysis via Recursive Partitioning
11 0.020451603 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
12 0.01809437 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
13 0.017761871 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
14 0.017553955 38 jmlr-2009-Hash Kernels for Structured Data
15 0.017418699 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
16 0.017309953 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
17 0.016020902 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
18 0.015749034 78 jmlr-2009-Refinement of Reproducing Kernels
19 0.015693687 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
20 0.015670912 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
topicId topicWeight
[(0, 0.094), (1, -0.019), (2, 0.047), (3, -0.029), (4, 0.01), (5, -0.005), (6, 0.003), (7, -0.008), (8, 0.017), (9, -0.008), (10, 0.077), (11, -0.006), (12, 0.044), (13, -0.082), (14, 0.051), (15, -0.074), (16, 0.089), (17, -0.016), (18, 0.022), (19, -0.029), (20, 0.072), (21, 0.028), (22, -0.071), (23, -0.116), (24, 0.203), (25, 0.079), (26, 0.018), (27, 0.001), (28, 0.181), (29, -0.207), (30, -0.044), (31, -0.224), (32, 0.205), (33, 0.083), (34, 0.18), (35, 0.283), (36, -0.181), (37, -0.182), (38, -0.099), (39, -0.377), (40, 0.146), (41, 0.001), (42, 0.024), (43, 0.403), (44, 0.098), (45, -0.015), (46, -0.007), (47, -0.054), (48, -0.122), (49, -0.2)]
simIndex simValue paperId paperTitle
same-paper 1 0.95259267 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
Author: Takafumi Kanamori, Shohei Hido, Masashi Sugiyama
Abstract: We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally highly efficient and simple to implement. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bounds. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches. Keywords: importance sampling, covariate shift adaptation, novelty detection, regularization path, leave-one-out cross validation
2 0.27435717 23 jmlr-2009-Discriminative Learning Under Covariate Shift
Author: Steffen Bickel, Michael Brückner, Tobias Scheffer
Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning
3 0.14563513 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
Author: Volkan Vural, Glenn Fung, Balaji Krishnapuram, Jennifer G. Dy, Bharat Rao
Abstract: Most classification methods assume that the samples are drawn independently and identically from an unknown data generating distribution, yet this assumption is violated in several real life problems. In order to relax this assumption, we consider the case where batches or groups of samples may have internal correlations, whereas the samples from different batches may be considered to be uncorrelated. This paper introduces three algorithms for classifying all the samples in a batch jointly: one based on a probabilistic formulation, and two based on mathematical programming analysis. Experiments on three real-life computer aided diagnosis (CAD) problems demonstrate that the proposed algorithms are significantly more accurate than a naive support vector machine which ignores the correlations among the samples. Keywords: batch-wise classification, support vector machine, linear programming, machine learning, statistical methods, unconstrained optimization
Author: Eugene Tuv, Alexander Borisov, George Runger, Kari Torkkola
Abstract: Predictive models benefit from a compact, non-redundant subset of features that improves interpretability and generalization. Modern data sets are wide, dirty, mixed with both numerical and categorical predictors, and may contain interactive effects that require complex models. This is a challenge for filters, wrappers, and embedded feature selection methods. We describe details of an algorithm using tree-based ensembles to generate a compact subset of non-redundant features. Parallel and serial ensembles of trees are combined into a mixed method that can uncover masking and detect features of secondary effect. Simulated and actual examples illustrate the effectiveness of the approach. Keywords: trees, resampling, importance, masking, residuals
5 0.1301593 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
6 0.12357659 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
7 0.12041553 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning
8 0.12017411 29 jmlr-2009-Estimating Labels from Label Proportions
9 0.11190066 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
10 0.10894192 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models (Machine Learning Open Source Software Paper)
11 0.10366562 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
12 0.10118416 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
13 0.095705569 38 jmlr-2009-Hash Kernels for Structured Data
14 0.09567596 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
15 0.094629876 46 jmlr-2009-Learning Halfspaces with Malicious Noise
16 0.0891473 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
17 0.088906333 91 jmlr-2009-Subgroup Analysis via Recursive Partitioning
18 0.087745197 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
19 0.082438603 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
20 0.079488359 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression
topicId topicWeight
[(8, 0.017), (11, 0.029), (19, 0.014), (26, 0.019), (38, 0.023), (47, 0.018), (52, 0.075), (55, 0.042), (58, 0.026), (66, 0.065), (68, 0.024), (90, 0.044), (93, 0.473), (96, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.68797857 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
Author: Takafumi Kanamori, Shohei Hido, Masashi Sugiyama
Abstract: We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally highly efficient and simple to implement. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bounds. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches. Keywords: importance sampling, covariate shift adaptation, novelty detection, regularization path, leave-one-out cross validation
2 0.66561282 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning
Author: Vitaly Feldman
Abstract: We study the properties of the agnostic learning framework of Haussler (1992) and Kearns, Schapire, and Sellie (1994). In particular, we address the question: is there any situation in which membership queries are useful in agnostic learning? Our results show that the answer is negative for distribution-independent agnostic learning and positive for agnostic learning with respect to a specific marginal distribution. Namely, we give a simple proof that any concept class learnable agnostically by a distribution-independent algorithm with access to membership queries is also learnable agnostically without membership queries. This resolves an open problem posed by Kearns et al. (1994). For agnostic learning with respect to the uniform distribution over {0, 1}n we show a concept class that is learnable with membership queries but computationally hard to learn from random examples alone (assuming that one-way functions exist). Keywords: agnostic learning, membership query, separation, PAC learning
3 0.24682994 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
Author: Babak Shahbaba, Radford Neal
Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on two classification problems: identifying the folding class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture components), our model can sometimes provide insight into hidden structure in the data. Keywords: mixture models, Dirichlet process, classification
4 0.24611452 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
5 0.24469711 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
6 0.24377273 38 jmlr-2009-Hash Kernels for Structured Data
7 0.24338402 48 jmlr-2009-Learning Nondeterministic Classifiers
8 0.23884413 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
9 0.2380614 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
10 0.23671173 46 jmlr-2009-Learning Halfspaces with Malicious Noise
11 0.23634149 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
13 0.23478195 50 jmlr-2009-Learning When Concepts Abound
14 0.23407727 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
15 0.23369069 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
16 0.23345913 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
17 0.2330545 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
18 0.23291294 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
19 0.23176466 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
20 0.22961953 13 jmlr-2009-Bounded Kernel-Based Online Learning