nips nips2011 nips2011-238 knowledge-graph by maker-knowledge-mining

238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison


Source: pdf

Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama

Abstract: Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. [sent-17, score-0.383]

2 In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. [sent-18, score-0.191]

3 Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. [sent-19, score-0.174]

4 Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. [sent-20, score-0.836]

5 , outlier detection [1, 2], two-sample homogeneity test [3, 4], and transfer learning [5, 6]. [sent-25, score-0.439]

6 A standard approach to comparing probability densities p(x) and p'(x) would be to estimate a divergence from p(x) to p'(x), such as the Kullback-Leibler (KL) divergence [7]: KL[p(x), p'(x)] := E_{p(x)}[log r(x)], r(x) := p(x)/p'(x), where E_{p(x)} denotes the expectation over p(x). [sent-26, score-0.755]

7 A naive way to estimate the KL divergence is to separately approximate the densities p(x) and p'(x) from data and plug the estimated densities in the above definition. [sent-27, score-0.436]

8 However, since density estimation is known to be a hard task [8], this approach does not work well unless a good parametric model is available. [sent-28, score-0.114]

9 Recently, a divergence estimation approach which directly approximates the density-ratio r(x) without going through separate approximation of the densities p(x) and p'(x) has been proposed [9, 10]. [sent-29, score-0.479]

10 Such density-ratio approximation methods were proved to achieve the optimal non-parametric convergence rate in the mini-max sense. [sent-30, score-0.087]

11 However, the KL divergence estimation via density-ratio approximation is computationally rather expensive due to the non-linearity introduced by the ‘log’ term. [sent-31, score-0.412]

12 To cope with this problem, another divergence called the Pearson (PE) divergence [11] is useful. [sent-32, score-0.716]

13 The PE divergence is defined as PE[p(x), p'(x)] := (1/2) E_{p'(x)}[(r(x) − 1)^2]. [sent-33, score-0.358]

14 The PE divergence is a squared-loss variant of the KL divergence, and they both belong to the class of the Ali-Silvey-Csiszár divergences (which is also known as the f-divergences, see [12, 13]). [sent-34, score-0.396]

15 The practical usefulness of the uLSIF-based PE divergence estimator was demonstrated in various applications such as outlier detection [2], two-sample homogeneity test [4], and dimensionality reduction [15]. [sent-39, score-0.919]

16 In this paper, we first establish the non-parametric convergence rate of the uLSIF-based PE divergence estimator, which elucidates its superior theoretical properties. [sent-40, score-0.416]

17 This implies that, in the region where the denominator density p'(x) takes small values, the density-ratio r(x) = p(x)/p'(x) tends to take large values and therefore the overall convergence speed becomes slow. [sent-42, score-0.156]

18 This makes the paradigm of divergence estimation based on density-ratio approximation unreliable. [sent-46, score-0.412]

19 In order to overcome this fundamental problem, we propose an alternative approach to distribution comparison called α-relative divergence estimation. [sent-47, score-0.358]

20 In the proposed approach, we estimate the α-relative divergence, which is the divergence from p(x) to the α-mixture density q_α(x) = αp(x) + (1 − α)p'(x) for 0 ≤ α < 1. [sent-48, score-0.386]

21 For example, the α-relative PE divergence is given by PE_α[p(x), p'(x)] := PE[p(x), q_α(x)] = (1/2) E_{q_α(x)}[(r_α(x) − 1)^2] (1), where r_α(x) is the α-relative density-ratio of p(x) and p'(x): r_α(x) := p(x)/q_α(x) = p(x)/(αp(x) + (1 − α)p'(x)) (2). [sent-49, score-0.358]
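As a quick numerical illustration (not part of the paper; the synthetic Gaussian densities and the value α = 0.5 are arbitrary choices), the following Python snippet evaluates both ratios on a grid and confirms that r_α never exceeds 1/α while the ordinary ratio r grows without bound in the tail:

```python
# A minimal sketch: for two Gaussian densities p and p', the ordinary ratio
# r(x) = p(x)/p'(x) diverges in the tail, while the alpha-relative ratio
# r_alpha(x) = p(x)/(alpha*p(x) + (1-alpha)*p'(x)) stays bounded by 1/alpha.
import numpy as np
from scipy.stats import norm

alpha = 0.5
x = np.linspace(-5.0, 5.0, 1001)
p = norm.pdf(x, loc=0.0, scale=1.0)        # numerator density p(x)
p_prime = norm.pdf(x, loc=1.0, scale=1.0)  # denominator density p'(x)

r = p / p_prime                                    # ordinary density-ratio
r_alpha = p / (alpha * p + (1 - alpha) * p_prime)  # alpha-relative density-ratio

print(r.max())        # grows without bound as the grid is extended to the left
print(r_alpha.max())  # never exceeds 1/alpha = 2
```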

22 We propose to estimate the α-relative divergence by direct approximation of the α-relative density-ratio. [sent-50, score-0.419]

23 Based on this feature, we theoretically show that the α-relative PE divergence estimator based on α-relative density-ratio approximation is more favorable than the ordinary density-ratio approach in terms of the non-parametric convergence speed. [sent-52, score-0.581]

24 We further prove that, under a correctly-specified parametric setup, the asymptotic variance of our α-relative PE divergence estimator does not depend on the model complexity. [sent-53, score-0.625]

25 This means that the proposed α-relative PE divergence estimator hardly overfits even with complex models. [sent-54, score-0.541]

26 Through experiments on outlier detection, two-sample homogeneity test, and transfer learning, we demonstrate that our proposed α-relative PE divergence estimator compares favorably with alternative approaches. [sent-55, score-0.911]

27 (i.i.d.) samples {x_i}_{i=1}^n from a d-dimensional distribution P with density p(x) and i.i.d. [sent-59, score-0.092]

28 samples {x'_j}_{j=1}^{n'} from another d-dimensional distribution P' with density p'(x). [sent-62, score-0.092]

29 Our goal is to compare the two underlying distributions P and P' using only the two sets of samples {x_i}_{i=1}^n and {x'_j}_{j=1}^{n'}. [sent-63, score-0.082]

30 i=1 j=1 In this section, we give a method for estimating the α-relative PE divergence based on direct approximation of the α-relative density-ratio. [sent-64, score-0.447]

31 Finally, a density-ratio estimator is given as r̂_α(x) := g(x; θ̂) = Σ_{ℓ=1}^n θ̂_ℓ K(x, x_ℓ). [sent-76, score-0.11]

32 When α = 0, the above method is reduced to a direct density-ratio estimator called unconstrained least-squares importance fitting (uLSIF) [14]. [sent-77, score-0.196]
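For concreteness, a minimal sketch of this kernel-model fit is given below. It assumes the standard uLSIF/RuLSIF closed-form solution θ̂ = (Ĥ + λI)^{-1}ĥ of the least-squares objective (stated abstractly as Eq. (6) later in this excerpt); the Gaussian kernel, the placement of kernel centers on the numerator samples, and the hyperparameters `sigma` and `lam` are illustrative choices rather than the authors' settings, and in practice they would be selected by model selection as discussed next:

```python
# A sketch of a RuLSIF-style fit for the linear-in-parameter model
# g(x; theta) = sum_l theta_l * K(x, x_l); not the authors' code.
import numpy as np

def gauss_kernel(a, b, sigma):
    # Pairwise Gaussian kernel matrix K[i, l] = exp(-||a_i - b_l||^2 / (2 sigma^2)).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_relative_ratio(x_num, x_den, alpha=0.5, sigma=1.0, lam=0.1):
    """Return a function x -> r_alpha_hat(x), with numerator samples x_num (~p)
    and denominator samples x_den (~p')."""
    centers = x_num                              # kernel centers on numerator samples
    Kn = gauss_kernel(x_num, centers, sigma)     # n  x b
    Kd = gauss_kernel(x_den, centers, sigma)     # n' x b
    # Quadratic and linear terms of the least-squares objective.
    H = alpha * Kn.T @ Kn / len(x_num) + (1 - alpha) * Kd.T @ Kd / len(x_den)
    h = Kn.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    # Note: outputs may be slightly negative; in practice they are often clipped at 0.
    return lambda x: gauss_kernel(x, centers, sigma) @ theta

rng = np.random.default_rng(0)
x_num = rng.normal(0.0, 1.0, size=(200, 1))
x_den = rng.normal(0.5, 1.5, size=(200, 1))
r_hat = fit_relative_ratio(x_num, x_den, alpha=0.5)
print(r_hat(x_num[:5]).round(3))
```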

33 The performance of RuLSIF depends on the choice of the kernel function (the kernel width in the case of the Gaussian kernel) and the regularization parameter λ. [sent-80, score-0.086]

34 Using an estimator of the α-relative density-ratio r_α(x), we can construct estimators of the α-relative PE divergence (1). [sent-82, score-0.503]

35 After a few lines of calculation, we can show that the α-relative PE divergence (1) is equivalently expressed as PE_α = −(α/2) E_{p(x)}[r_α(x)^2] − ((1−α)/2) E_{p'(x)}[r_α(x)^2] + E_{p(x)}[r_α(x)] − 1/2 = (1/2) E_{p(x)}[r_α(x)] − 1/2. [sent-83, score-0.358]
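The "few lines of calculation" can be made explicit; the following derivation (added here for completeness) uses only r_α(x) q_α(x) = p(x):

```latex
% Derivation of the identity above.
\begin{align*}
\mathrm{PE}_\alpha
&= \tfrac{1}{2}\,\mathbb{E}_{q_\alpha(x)}\!\bigl[(r_\alpha(x)-1)^2\bigr]
 = \tfrac{1}{2}\,\mathbb{E}_{q_\alpha(x)}\!\bigl[r_\alpha(x)^2\bigr]
   - \mathbb{E}_{q_\alpha(x)}\!\bigl[r_\alpha(x)\bigr] + \tfrac{1}{2},\\
\mathbb{E}_{q_\alpha(x)}\!\bigl[r_\alpha(x)\bigr] &= \int p(x)\,dx = 1,\qquad
\mathbb{E}_{q_\alpha(x)}\!\bigl[r_\alpha(x)^2\bigr]
 = \alpha\,\mathbb{E}_{p(x)}\!\bigl[r_\alpha(x)^2\bigr]
 + (1-\alpha)\,\mathbb{E}_{p'(x)}\!\bigl[r_\alpha(x)^2\bigr]
 = \mathbb{E}_{p(x)}\!\bigl[r_\alpha(x)\bigr],\\
\mathrm{PE}_\alpha
&= -\tfrac{\alpha}{2}\,\mathbb{E}_{p(x)}\!\bigl[r_\alpha(x)^2\bigr]
   -\tfrac{1-\alpha}{2}\,\mathbb{E}_{p'(x)}\!\bigl[r_\alpha(x)^2\bigr]
   +\mathbb{E}_{p(x)}\!\bigl[r_\alpha(x)\bigr] - \tfrac{1}{2}
 = \tfrac{1}{2}\,\mathbb{E}_{p(x)}\!\bigl[r_\alpha(x)\bigr] - \tfrac{1}{2}.
\end{align*}
```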

36 Note that the middle expression can also be obtained via the Legendre-Fenchel convex duality of the divergence functional [17]. [sent-84, score-0.358]

37 Replacing the expectations in these expressions with empirical averages and r_α with its estimate r̂_α yields the two estimators PÊ_α := −(α/(2n)) Σ_{i=1}^n r̂_α(x_i)^2 − ((1−α)/(2n')) Σ_{j=1}^{n'} r̂_α(x'_j)^2 + (1/n) Σ_{i=1}^n r̂_α(x_i) − 1/2 (4) and PÊ_α := (1/(2n)) Σ_{i=1}^n r̂_α(x_i) − 1/2 (5). We note that the α-relative PE divergence (1) can have further different expressions than the above ones, and corresponding estimators can also be constructed similarly. [sent-86, score-0.393]
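A small sketch of how these two plug-in estimators might be computed from fitted ratio values; the function and variable names are illustrative, not from the authors' code:

```python
# Hedged sketch of the two PE_alpha plug-in estimators, obtained by replacing
# expectations with empirical averages and r_alpha with its estimate r_hat.
import numpy as np

def pe_alpha_estimators(r_hat_num, r_hat_den, alpha):
    """r_hat_num: estimated ratio values at numerator samples {x_i};
       r_hat_den: estimated ratio values at denominator samples {x'_j}."""
    # Empirical version of the middle expression (corresponds to Eq. (4)).
    pe1 = (-0.5 * alpha * np.mean(r_hat_num ** 2)
           - 0.5 * (1 - alpha) * np.mean(r_hat_den ** 2)
           + np.mean(r_hat_num) - 0.5)
    # Empirical version of the last expression (corresponds to Eq. (5)).
    pe2 = 0.5 * np.mean(r_hat_num) - 0.5
    return pe1, pe2
```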

38 However, the above two expressions will be particularly useful: the first estimator PEα has superior theoretical properties (see Section 3) and the second one PEα is simple to compute. [sent-87, score-0.11]

39 3 Theoretical Analysis In this section, we analyze theoretical properties of the proposed PE divergence estimators. [sent-88, score-0.386]

40 For theoretical analysis, let us consider a rather abstract form of our relative density-ratio estimator, described as argmin_{g∈G} [ (α/(2n)) Σ_{i=1}^n g(x_i)^2 + ((1−α)/(2n')) Σ_{j=1}^{n'} g(x'_j)^2 − (1/n) Σ_{i=1}^n g(x_i) + (λ/2) R(g)^2 ] (6), where G is some function space (i. [sent-90, score-0.172]

41 Non-Parametric Convergence Analysis: First, we elucidate the non-parametric convergence rate of the proposed PE estimators. [sent-93, score-0.086]

42 Here, we practically regard the function space G as an infinite-dimensional reproducing kernel Hilbert space (RKHS) [18] such as the Gaussian kernel space, and R(·) as the associated RKHS norm. [sent-94, score-0.086]

43 We analyze the convergence rate of our PE divergence estimators as n̄ := min(n, n') tends to infinity, for λ = λ_n̄ under λ_n̄ = o(1) and λ_n̄^{−1} = o(n̄^{2/(2+γ)}). [sent-96, score-0.484]

44 (i.e., the first terms) of the asymptotic convergence rates become smaller as ‖r_α‖_∞ gets smaller. [sent-103, score-0.109]

45 Since ‖r_α‖_∞ = ‖(α + (1 − α)/r(x))^{−1}‖_∞ < 1/α for α > 0, a larger α would be more preferable in terms of the asymptotic approximation error. [sent-104, score-0.102]
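Spelled out, the bound used here follows in one line from the definition of r_α (a derivation added for completeness, assuming p(x) > 0):

```latex
% Why the alpha-relative ratio is bounded: divide numerator and denominator by p(x).
\[
r_\alpha(x) \;=\; \frac{p(x)}{\alpha\,p(x) + (1-\alpha)\,p'(x)}
 \;=\; \Bigl(\alpha + (1-\alpha)\,\tfrac{p'(x)}{p(x)}\Bigr)^{-1}
 \;=\; \Bigl(\alpha + \tfrac{1-\alpha}{r(x)}\Bigr)^{-1}
 \;\le\; \frac{1}{\alpha} \qquad (0 < \alpha \le 1),
\]
```

so ‖r_α‖_∞ ≤ 1/α no matter how wildly the ordinary ratio r(x) fluctuates.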

46 Thus, our proposed approach of estimating the α-relative PE divergence (with α > 0) would be more advantageous than the naive approach of estimating the plain PE divergence (which corresponds to α = 0) in terms of the non-parametric convergence rate. [sent-106, score-0.895]

47 The above results also show that PEα and PEα have different asymptotic convergence rates. [sent-107, score-0.109]

48 4 Parametric Variance Analysis: Next, we analyze the asymptotic variance of the PE divergence estimator PEα (4) under a parametric setup. [sent-124, score-0.625]

49 Let us denote the variance of PEα (4) by V[PEα], where randomness comes from the draw of samples {x_i}_{i=1}^n and {x'_j}_{j=1}^{n'}. [sent-134, score-0.087]

50 Eq. (9) shows that, up to O(n^{−1}, n'^{−1}), the variance of PEα depends only on the true relative density-ratio r_α(x), not on the estimator of r_α(x). [sent-139, score-0.203]

51 Therefore, overfitting would hardly occur in the estimation of the relative PE divergence even when complex models are used. [sent-141, score-0.49]

52 We note that the above superior property is applicable only to relative PE divergence estimation, not to relative density-ratio estimation. [sent-142, score-0.482]

53 This implies that overfitting occurs in relative density-ratio estimation, but the approximation error cancels out in relative PE divergence estimation. [sent-143, score-0.511]

54 Since ‖r_α‖_∞ monotonically decreases as α increases, our proposed approach of estimating the α-relative PE divergence (with α > 0) would be more advantageous than the naive approach of estimating the plain PE divergence (which corresponds to α = 0) in terms of the parametric asymptotic variance. [sent-150, score-0.985]

55 4 Experiments In this section, we experimentally evaluate the performance of the proposed method in two-sample homogeneity test, outlier detection, and transfer learning tasks. [sent-152, score-0.378]

56 Two-Sample Homogeneity Test: First, we apply the proposed divergence estimator to the two-sample homogeneity test. [sent-153, score-0.69]

57 Given two sets of samples X = {x_i}_{i=1}^n ∼ P and X' = {x'_j}_{j=1}^{n'} ∼ P', the goal of the two-sample homogeneity test is to test the null hypothesis that the probability distributions P and P' are the same against its complementary alternative (i. [sent-160, score-0.446]

58 By using an estimator Div of some divergence between the two distributions P and P', homogeneity of the two distributions can be tested based on the permutation test procedure [20]. [sent-163, score-0.692]
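A sketch of the permutation-test wrapper described here; the divergence estimator `div_estimator`, the number of permutations, and the significance level are placeholders supplied by the user, not values from the paper:

```python
# Hedged sketch of a permutation two-sample test built on any divergence estimator
# (e.g., a PE_alpha estimator): reject P = P' when the observed divergence is large
# relative to divergences computed on randomly re-partitioned pooled samples.
import numpy as np

def permutation_test(X, Xp, div_estimator, n_perm=100, alpha_level=0.05, seed=0):
    rng = np.random.default_rng(seed)
    observed = div_estimator(X, Xp)
    pooled = np.vstack([X, Xp])
    n = len(X)
    permuted = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        permuted.append(div_estimator(pooled[idx[:n]], pooled[idx[n:]]))
    p_value = np.mean(np.asarray(permuted) >= observed)
    return observed, p_value, bool(p_value < alpha_level)  # True = reject null
```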

59 The mean (with standard deviation in brackets) rate of accepting the null hypothesis (i. [sent-165, score-0.161]

60 Right: when the set of samples corresponding to the numerator of the density-ratio are taken from the positive training set and the set of samples corresponding to the denominator of the density-ratio are taken from the positive training set and the negative training set (i. [sent-173, score-0.231]

61 The best method (i.e., the one with the lowest mean acceptance rate) and methods judged comparable to it by the two-sample t-test at the 5% significance level are shown in bold face. [sent-176, score-0.101]

62 Datasets with dimensionality d and sample size n = n': banana (2, 100), thyroid (5, 19), titanic (5, 21), diabetes (8, 85), b-cancer (9, 29), f-solar (9, 100), heart (13, 38), german (20, 100), ringnorm (20, 100), waveform (21, 66). [sent-177, score-0.181]

63 When an asymmetric divergence such as the KL divergence [7] or the PE divergence [11] is adopted for the two-sample test, the test results depend on the choice of direction: a divergence from P to P' or from P' to P. [sent-343, score-1.469]

64 We test LSTT with the RuLSIF-based PE divergence estimator for α = 0, 0. [sent-350, score-0.505]

65 First, we investigate the rate of accepting the null hypothesis when the null hypothesis is correct (i. [sent-356, score-0.257]

66 We split all the positive training samples into two sets and perform two-sample test for the two sets of samples. [sent-359, score-0.093]

67 Next, we consider the situation where the null hypothesis is not correct (i. [sent-363, score-0.096]

68 The numerator samples are generated in the same way as above, but half of the denominator samples are replaced with negative training samples. [sent-366, score-0.231]

69 Thus, while the numerator sample set contains only positive training samples, the denominator sample set includes both positive and negative training samples. [sent-367, score-0.119]

70 The proposed LSTT with α = 0.5 is shown to be a useful method for the two-sample homogeneity test. [sent-374, score-0.135]

71 Inlier-Based Outlier Detection: Next, we apply the proposed method to outlier detection. [sent-376, score-0.169]

72 Let us consider an outlier detection problem of finding irregular samples in a dataset (called an “evaluation dataset”) based on another dataset (called a “model dataset”) that only contains regular samples. [sent-377, score-0.299]

73 Defining the density-ratio over the two sets of samples, we can see that the density-ratio values tend to deviate from one for outliers. (Table 2: Experimental results of outlier detection.) [sent-378, score-0.141]

74 Since the evaluation dataset usually has a wider support than the model dataset, we regard the evaluation dataset as samples corresponding to the denominator density p'(x), and the model dataset as samples corresponding to the numerator density p(x). [sent-627, score-0.378]

75 Thus, density-ratio approximators can be used for outlier detection. [sent-631, score-0.141]
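A minimal sketch of the resulting scoring rule, assuming a fitted relative density-ratio function `r_hat` (for example, the one from the earlier RuLSIF sketch) with the model dataset in the numerator and the evaluation dataset in the denominator; small ratio values flag outlier candidates:

```python
# Hedged sketch of inlier-based outlier scoring: evaluation points with small
# r_hat(x) look unlike the model (inlier) data and are treated as outlier candidates.
import numpy as np

def rank_outliers(r_hat, x_eval, k=10):
    """Return indices of the k most outlier-like evaluation points and all scores."""
    scores = np.asarray(r_hat(x_eval))      # inlier scores; smaller = more anomalous
    return np.argsort(scores)[:k], scores
```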

76 We evaluate the proposed method using various datasets: the IDA benchmark repository [21], an in-house French speech dataset, the 20 Newsgroups dataset, and the USPS hand-written digit dataset (the detailed specification of the datasets is explained in the supplementary material). [sent-632, score-0.11]

77 Transfer Learning: Finally, we apply the proposed method to transfer learning. [sent-643, score-0.102]

78 Let us consider a transductive transfer learning setup where labeled training samples {(x^tr_j, y^tr_j)}_{j=1}^{n_tr} drawn i.i.d. [sent-644, score-0.28]

79 from p(y|x)p_tr(x) and unlabeled test samples {x^te_i}_{i=1}^{n_te} drawn i.i.d. [sent-647, score-0.127]

80 from p_te(x) (which is generally different from p_tr(x)) are available. [sent-650, score-0.158]

81 The use of exponentially-weighted importance weighting was shown to be useful for adaptation from p_tr(x) to p_te(x) [5]: min_{f∈F} (1/n_tr) Σ_{j=1}^{n_tr} (p_te(x^tr_j)/p_tr(x^tr_j))^τ loss(y^tr_j, f(x^tr_j)), where f(x) is a learned function and 0 ≤ τ ≤ 1 is the exponential flattening parameter. [sent-651, score-0.605]

82 We compare the plain kernel logistic regression (KLR) without importance weights, KLR with relative importance weights (RIW-KLR), KLR with exponentially-weighted importance weights (EIW-KLR), and KLR with plain importance weights (IW-KLR). [sent-656, score-0.439]

83 Here we propose to use relative importance weights instead: min_{f∈F} (1/n_tr) Σ_{j=1}^{n_tr} [p_te(x^tr_j) / ((1−α)p_te(x^tr_j) + αp_tr(x^tr_j))] loss(y^tr_j, f(x^tr_j)). [sent-690, score-0.462]
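A hedged sketch of the resulting weighted training step, using a plain logistic loss and a simple gradient-descent solver in place of the kernel logistic regression used in the paper; the weights `w` would be the estimated relative importance weights evaluated at the training inputs, and the learning rate and iteration count are illustrative:

```python
# Minimal sketch of importance-weighted training: a logistic loss reweighted by
# relative importance weights w_j, minimized by plain gradient descent.
import numpy as np

def weighted_logreg(X, y, w, lr=0.1, n_iter=500):
    """X: (n, d) training inputs, y: (n,) labels in {0, 1},
       w: (n,) non-negative relative importance weights."""
    n, d = X.shape
    theta, bias = np.zeros(d), 0.0
    for _ in range(n_iter):
        z = X @ theta + bias
        prob = 1.0 / (1.0 + np.exp(-z))
        grad = w * (prob - y)              # per-sample weighted logistic-loss gradient
        theta -= lr * (X.T @ grad) / n
        bias -= lr * grad.mean()
    return theta, bias
```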

84 We apply the above transfer learning technique to human activity recognition using accelerometer data. [sent-691, score-0.133]

85 However, since the new user is not willing to label his/her accelerometer data because of the effort required, no labeled samples are available for the new user. [sent-695, score-0.114]

86 On the other hand, unlabeled samples for the new user and labeled data obtained from existing users are available. [sent-696, score-0.145]

87 Let the labeled training data {(x^tr_j, y^tr_j)}_{j=1}^{n_tr} be the set of labeled accelerometer data for 20 existing users. [sent-697, score-0.159]

88 Each user has at most 100 labeled samples for each action. [sent-698, score-0.111]

89 Let the unlabeled test data {x^te_i}_{i=1}^{n_te} be unlabeled accelerometer data obtained from the new user. [sent-699, score-0.164]

90 The experiments are repeated 100 times with different sample choices for n_tr = 500 and n_te = 200. [sent-700, score-0.155]

91 The classification accuracies for 800 test samples from the new user (which are different from the 200 unlabeled samples) are summarized in Table 3, showing that the proposed method using relative importance weights for α = 0.5 compares favorably with the alternatives. [sent-701, score-0.303]

92 In this paper, we proposed to use a relative divergence for robust distribution comparison. [sent-703, score-0.448]

93 We gave a computationally efficient method for estimating the relative Pearson divergence based on direct relative density-ratio approximation. [sent-704, score-0.542]

94 We theoretically elucidated the convergence rate of the proposed divergence estimator under a non-parametric setup, which showed that the proposed approach of estimating the relative Pearson divergence is preferable to the existing approach of estimating the plain Pearson divergence. [sent-705, score-1.117]

95 Furthermore, we proved that the asymptotic variance of the proposed divergence estimator is independent of the model complexity under a correctly-specified parametric setup. [sent-706, score-0.653]

96 Thus, the proposed divergence estimator hardly overfits even with complex models. [sent-707, score-0.541]

97 Experimentally, we demonstrated the practical usefulness of the proposed divergence estimator in two-sample homogeneity test, inlier-based outlier detection, and transfer learning tasks. [sent-708, score-0.873]

98 Thus, it would be promising to explore more applications of the proposed relative density-ratio approximator beyond two-sample homogeneity test, inlier-based outlier detection, and transfer learning. [sent-710, score-0.44]

99 Estimating divergence functionals and the likelihood ratio by convex risk minimization. [sent-787, score-0.358]

100 A general class of coefficients of divergence of one distribution from another. [sent-798, score-0.358]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pe', 0.649), ('divergence', 0.358), ('ida', 0.219), ('lstt', 0.219), ('rulsif', 0.169), ('outlier', 0.141), ('homogeneity', 0.135), ('ep', 0.122), ('usps', 0.121), ('xtr', 0.119), ('estimator', 0.11), ('ntr', 0.104), ('vp', 0.097), ('mmd', 0.095), ('klr', 0.084), ('osvm', 0.084), ('pte', 0.084), ('ptr', 0.074), ('transfer', 0.074), ('asymptotic', 0.073), ('numerator', 0.068), ('ulsif', 0.067), ('null', 0.064), ('relative', 0.062), ('twosample', 0.059), ('kanamori', 0.059), ('accelerometer', 0.059), ('plain', 0.059), ('pearson', 0.058), ('samples', 0.056), ('sugiyama', 0.056), ('importance', 0.054), ('parametric', 0.053), ('detection', 0.052), ('kl', 0.051), ('denominator', 0.051), ('nte', 0.051), ('suzuki', 0.046), ('hardly', 0.045), ('kernel', 0.043), ('accepting', 0.043), ('favorably', 0.042), ('bracket', 0.041), ('densities', 0.039), ('divergences', 0.038), ('auc', 0.037), ('test', 0.037), ('bold', 0.036), ('cance', 0.036), ('density', 0.036), ('convergence', 0.036), ('xj', 0.035), ('estimators', 0.035), ('unlabeled', 0.034), ('banana', 0.034), ('hido', 0.034), ('titanic', 0.034), ('yamada', 0.034), ('tends', 0.033), ('sch', 0.033), ('supplementary', 0.032), ('user', 0.032), ('direct', 0.032), ('hypothesis', 0.032), ('variance', 0.031), ('walks', 0.031), ('tokyo', 0.03), ('diabetes', 0.03), ('thyroid', 0.03), ('ringnorm', 0.03), ('lkopf', 0.03), ('gretton', 0.029), ('approximation', 0.029), ('rkhs', 0.029), ('material', 0.028), ('proposed', 0.028), ('estimating', 0.028), ('tr', 0.027), ('kashima', 0.027), ('hachiya', 0.027), ('yj', 0.027), ('usefulness', 0.027), ('covariate', 0.027), ('distributions', 0.026), ('estimation', 0.025), ('favorable', 0.025), ('op', 0.025), ('dataset', 0.025), ('repository', 0.025), ('xi', 0.024), ('judged', 0.024), ('labeled', 0.023), ('ordinary', 0.023), ('rasch', 0.023), ('waveform', 0.023), ('compares', 0.023), ('shift', 0.023), ('setup', 0.022), ('rate', 0.022), ('bicycle', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama

Abstract: Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach. 1

2 0.096064813 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems

Author: David R. Karger, Sewoong Oh, Devavrat Shah

Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1

3 0.081574187 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

Author: Guido F. Montufar, Johannes Rauh, Nihat Ay

Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. 1

4 0.080953926 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Author: Siwei Lyu

Abstract: When used to learn high dimensional parametric probabilistic models, the classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active developments of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated, and there lacks a unified framework to understand them. In this work, based on an information geometric view of parametric learning, we introduce a general non-ML learning principle termed as minimum KL contraction, where we seek optimal parameters that minimizes the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operators. 1

5 0.072034709 306 nips-2011-t-divergence Based Approximate Inference

Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan

Abstract: Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student’s t-prior. 1

6 0.067815006 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection

7 0.054646656 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation

8 0.054332212 291 nips-2011-Transfer from Multiple MDPs

9 0.05036433 240 nips-2011-Robust Multi-Class Gaussian Process Classification

10 0.049043678 162 nips-2011-Lower Bounds for Passive and Active Learning

11 0.04896583 4 nips-2011-A Convergence Analysis of Log-Linear Training

12 0.044801541 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors

13 0.04475743 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions

14 0.042292409 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs

15 0.041806612 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation

16 0.041643325 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression

17 0.040282134 165 nips-2011-Matrix Completion for Multi-label Image Classification

18 0.039671995 139 nips-2011-Kernel Bayes' Rule

19 0.039433263 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint

20 0.039313454 275 nips-2011-Structured Learning for Cell Tracking


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.124), (1, 0.006), (2, -0.005), (3, -0.039), (4, -0.039), (5, -0.026), (6, 0.009), (7, -0.077), (8, 0.007), (9, 0.008), (10, -0.072), (11, -0.022), (12, 0.055), (13, 0.009), (14, -0.007), (15, 0.002), (16, 0.009), (17, 0.001), (18, 0.073), (19, 0.096), (20, -0.073), (21, 0.05), (22, 0.016), (23, -0.04), (24, 0.075), (25, -0.017), (26, 0.003), (27, -0.054), (28, -0.024), (29, 0.05), (30, 0.012), (31, 0.007), (32, -0.035), (33, 0.062), (34, -0.089), (35, -0.011), (36, 0.077), (37, 0.06), (38, 0.031), (39, 0.027), (40, -0.117), (41, -0.052), (42, 0.097), (43, -0.041), (44, 0.019), (45, 0.063), (46, 0.058), (47, -0.057), (48, 0.01), (49, 0.115)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93523848 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama

Abstract: Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach. 1

2 0.76038629 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Author: Siwei Lyu

Abstract: When used to learn high dimensional parametric probabilistic models, the classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active developments of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated, and there lacks a unified framework to understand them. In this work, based on an information geometric view of parametric learning, we introduce a general non-ML learning principle termed as minimum KL contraction, where we seek optimal parameters that minimizes the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operators. 1

3 0.73486197 306 nips-2011-t-divergence Based Approximate Inference

Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan

Abstract: Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student’s t-prior. 1

4 0.61022598 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Author: Sham M. Kakade, Varun Kanade, Ohad Shamir, Adam Kalai

Abstract: Generalized Linear Models (GLMs) and Single Index Models (SIMs) provide powerful generalizations of linear regression, where the target variable is assumed to be a (possibly unknown) 1-dimensional function of a linear predictor. In general, these problems entail non-convex estimation procedures, and, in practice, iterative local search heuristics are often used. Kalai and Sastry (2009) provided the first provably efficient method, the Isotron algorithm, for learning SIMs and GLMs, under the assumption that the data is in fact generated under a GLM and under certain monotonicity and Lipschitz (bounded slope) constraints. The Isotron algorithm interleaves steps of perceptron-like updates with isotonic regression (fitting a one-dimensional non-decreasing function). However, to obtain provable performance, the method requires a fresh sample every iteration. In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient. We modify the isotonic regression step in Isotron to fit a Lipschitz monotonic function, and also provide an efficient O(n log(n)) algorithm for this step, improving upon the previous O(n2 ) algorithm. We provide a brief empirical study, demonstrating the feasibility of our algorithms in practice. 1

5 0.59346139 69 nips-2011-Differentially Private M-Estimators

Author: Jing Lei

Abstract: This paper studies privacy preserving M-estimators using perturbed histograms. The proposed approach allows the release of a wide class of M-estimators with both differential privacy and statistical utility without knowing a priori the particular inference procedure. The performance of the proposed method is demonstrated through a careful study of the convergence rates. A practical algorithm is given and applied on a real world data set containing both continuous and categorical variables. 1

6 0.558819 123 nips-2011-How biased are maximum entropy models?

7 0.54445398 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers

8 0.51431054 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

9 0.50963271 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

10 0.50363022 288 nips-2011-Thinning Measurement Models and Questionnaire Design

11 0.49995047 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs

12 0.48549965 27 nips-2011-Advice Refinement in Knowledge-Based SVMs

13 0.48539221 240 nips-2011-Robust Multi-Class Gaussian Process Classification

14 0.46901071 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression

15 0.46015373 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks

16 0.45944989 232 nips-2011-Ranking annotators for crowdsourced labeling tasks

17 0.43431753 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection

18 0.43084198 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation

19 0.43052965 33 nips-2011-An Exact Algorithm for F-Measure Maximization

20 0.42656502 120 nips-2011-History distribution matching method for predicting effectiveness of HIV combination therapies


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (4, 0.038), (20, 0.019), (26, 0.02), (31, 0.062), (33, 0.011), (43, 0.051), (45, 0.573), (57, 0.022), (74, 0.035), (83, 0.023), (99, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99855322 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data

Author: Youwei Zhang, Laurent E. Ghaoui

Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1

2 0.9968698 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach

Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1

3 0.99443245 28 nips-2011-Agnostic Selective Classification

Author: Yair Wiener, Ran El-Yaniv

Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this √ rejected region of the domain is shown to be diminishing at rate O(Bθ( 1/m)β ), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1

4 0.99393243 272 nips-2011-Stochastic convex optimization with bandit feedback

Author: Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, Alexander Rakhlin

Abstract: This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value f (x) at any query point x ∈ X . We demonstrate √ a generalization of the ellipsoid algorithm that √ incurs O(poly(d) T ) regret. Since any algorithm has regret at least Ω( T ) on this problem, our algorithm is optimal in terms of the scaling with T . 1

5 0.98953074 143 nips-2011-Learning Anchor Planes for Classification

Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari

Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1

6 0.98758674 293 nips-2011-Understanding the Intrinsic Memorability of Images

same-paper 7 0.98742586 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

8 0.95539212 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes

9 0.92255831 19 nips-2011-Active Classification based on Value of Classifier

10 0.90730429 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers

11 0.90421754 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

12 0.89977866 220 nips-2011-Prediction strategies without loss

13 0.89867717 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition

14 0.89711207 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing

15 0.89641285 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent

16 0.8961044 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

17 0.89312506 78 nips-2011-Efficient Methods for Overlapping Group Lasso

18 0.88351059 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

19 0.88133878 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

20 0.88068444 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation