nips nips2010 nips2010-142 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Corinna Cortes, Yishay Mansour, Mehryar Mohri
Abstract: This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Rényi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights which are also commonly used.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. [sent-6, score-0.604]
2 We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. [sent-7, score-0.604]
3 These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. [sent-9, score-0.252]
4 We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. [sent-10, score-0.131]
5 Finally, we analyze the properties of normalized importance weights which are also commonly used. [sent-11, score-0.521]
6 Similarly, in credit default analyses, there is typically some information available about the credit defaults of customers who were granted credit, but no such information is at hand about rejected customers. [sent-14, score-0.126]
7 In other problems such as adaptation, the training data available is drawn from a source domain different from the target domain. [sent-15, score-0.101]
8 There is also a large body of literature dealing with different techniques for sample bias correction [11, 29, 16, 8, 25, 6] or domain adaptation [3, 7, 19, 10, 17] in the recent machine learning and natural language processing literature. [sent-17, score-0.217]
9 A common technique used in several of these publications for correcting the bias or discrepancy is based on the so-called importance weighting technique. [sent-18, score-0.715]
10 A common definition of the weight for point x is w(x) = P (x)/Q(x) where P is the target or test distribution and Q is the distribution according to which training points are drawn. [sent-21, score-0.098]
11 A favorable property of this definition, which is not hard to verify, is that it leads to unbiased estimates of the generalization error [8]. [sent-22, score-0.128]
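As a quick numerical illustration of this unbiasedness (a minimal sketch, not taken from the paper; the discrete distributions and per-point losses below are made up), averaging the importance-weighted empirical error over many training samples drawn from Q recovers the error under P:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete setup: test distribution P, training distribution Q, and a 0/1 loss per point.
xs = np.arange(5)
P = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
Q = np.array([0.1, 0.2, 0.3, 0.2, 0.2])
loss = np.array([1.0, 0.0, 1.0, 0.0, 1.0])    # L_h(x) for some fixed hypothesis h

w = P / Q                                     # importance weights w(x) = P(x)/Q(x)
true_risk = np.dot(P, loss)                   # R(h) = E_{x~P}[L_h(x)]

# Average of importance-weighted empirical risks over many training samples drawn from Q.
m, trials = 200, 2000
estimates = []
for _ in range(trials):
    sample = rng.choice(xs, size=m, p=Q)
    estimates.append(np.mean(w[sample] * loss[sample]))

print(true_risk, np.mean(estimates))          # the two values should be close
```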
12 This paper presents an analysis of importance weighting for learning from finite samples. [sent-23, score-0.604]
13 Figure 1: Example of importance weighting (right plots: error vs. training set size for 100, 500, and 5000 points). [sent-40, score-0.335]
14 The hypothesis class is that of hyperplanes tangent to the unit sphere. [sent-43, score-0.148]
15 Right figures: plots of test error vs training sample size using importance weighting for two different values of the ratio σQ /σP . [sent-44, score-0.733]
16 The hypothesis class is that of hyperplanes tangent to the unit sphere. [sent-47, score-0.148]
17 For σQ/σP = .3, the error of the hypothesis learned using importance weighting is close to 50% even for a training sample of 5,000 points and the standard deviation of the error is quite high. [sent-50, score-0.821]
18 In Section 4, we discuss other examples where importance weighting does not succeed. [sent-53, score-0.626]
19 We show using standard generalization bounds that importance weighting can succeed when the weights are bounded. [sent-57, score-0.837]
20 We also show that, remarkably, convergence guarantees can be given even for unbounded weights under the weak assumption that the second moment of the weights is bounded, a condition that relates to the Rényi divergence of P and Q. [sent-59, score-1.109]
21 We further extend these bounds to guarantees for other possible reweightings. [sent-60, score-0.128]
22 This setting is closely related to the problem of importance sampling in statistics which is that of estimating the expectation of a random variable according to P while using a sample drawn according to Q, with w given [18]. [sent-66, score-0.359]
23 Here, we are concerned with the effect of the weights on learning from finite samples. [sent-67, score-0.143]
24 Section 2 introduces the definition of the Rényi divergences and gives some basic properties of the importance weights. [sent-72, score-0.618]
25 In Section 3, we give generalization bounds for importance weighting in the bounded case. [sent-73, score-0.772]
26 We also present a general lower bound indicating the key role played by the Rényi divergence of P and Q in this context. [sent-74, score-0.441]
27 Section 4 deals with the more frequent case of unbounded w. [sent-75, score-0.158]
28 Standard generalization bounds do not apply here since the loss function is unbounded. [sent-76, score-0.121]
29 We give novel generalization bounds for unbounded loss functions under the assumption that the second moment is bounded (see Appendix) and use them to derive learning guarantees for importance weighting in this more general setting. [sent-77, score-1.19]
30 We also discuss why the commonly used remedy of truncating or capping importance weights may not always provide the desired effect of improved performance. [sent-79, score-0.541]
31 Finally, in Section 6, we study the properties of an alternative reweighting, also commonly used, which is based on normalized importance weights, and discuss its relationship with the (unnormalized) weights w. [sent-80, score-0.589]
32 We also denote by H the hypothesis set used by the learning algorithm and by f : X → Y the target labeling function. [sent-83, score-0.129]
33 Rényi divergences: Our analysis makes use of the notion of Rényi divergence, an information theoretical measure of the difference between two distributions directly relevant to the study of importance weighting. [sent-85, score-0.869]
34 For α ≥ 0, the Rényi divergence Dα(P‖Q) between distributions P and Q is defined by [23]: Dα(P‖Q) = (1/(α−1)) log2 Σ_x P(x) (P(x)/Q(x))^(α−1). (1) [sent-86, score-0.376]
35 The Rényi divergence is a non-negative quantity and for any α > 0, Dα(P‖Q) = 0 iff P = Q. [sent-87, score-0.376]
36 We denote by dα(P‖Q) the exponential in base 2 of the Rényi divergence Dα(P‖Q): dα(P‖Q) = 2^Dα(P‖Q) = (Σ_x P^α(x)/Q^(α−1)(x))^(1/(α−1)). [sent-89, score-0.376]
37 Importance weights: The importance weight for distributions P and Q is defined by w(x) = P(x)/Q(x). [sent-91, score-0.508]
38 The second moment of w can be expressed as follows in terms of the Rényi divergence: E_Q[w²] = Σ_{x∈X} w²(x) Q(x) = Σ_{x∈X} (P(x)/Q(x))² Q(x) = Σ_{x∈X} P(x) (P(x)/Q(x)) = d2(P‖Q). [sent-97, score-0.394]
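This identity is easy to check numerically for discrete distributions; the sketch below (illustrative only, with made-up distributions) computes dα(P‖Q) directly from its definition and compares d2(P‖Q) with the second moment of the weights under Q:

```python
import numpy as np

def d_alpha(P, Q, alpha):
    """Exponential (base 2) of the Renyi divergence: (sum_x P^alpha / Q^(alpha-1))^(1/(alpha-1))."""
    return np.sum(P**alpha / Q**(alpha - 1)) ** (1.0 / (alpha - 1))

P = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
Q = np.array([0.1, 0.2, 0.3, 0.2, 0.2])
w = P / Q

second_moment = np.sum(Q * w**2)        # E_Q[w^2]
print(second_moment, d_alpha(P, Q, 2))  # equal: both are d_2(P||Q)
```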
39 For any hypothesis h ∈ H, we denote by R(h) its loss and by Rw(h) its weighted empirical loss: R(h) = E_{x∼P}[L(h(x), f(x))] and Rw(h) = (1/m) Σ_{i=1}^{m} w(xi) L(h(xi), f(xi)). [sent-99, score-0.127]
40 Note that the unnormalized importance weighting of the loss is unbiased: E_Q[w(x)Lh(x)] = Σ_x Q(x) (P(x)/Q(x)) Lh(x) = Σ_x P(x) Lh(x) = R(h). [sent-101, score-0.718]
41 For all α > 0 and x ∈ X, the second moment of the importance weighted loss can be bounded as follows: E_Q[w²(x) Lh²(x)] ≤ dα+1(P‖Q) R(h)^(1−1/α). [sent-104, score-0.587]
42 The upper bound M, though finite, can be quite large. [sent-114, score-0.09]
43 The following theorem provides a more favorable bound as a function of the ratio M/m when any of the moments of w, dα+1(P‖Q), is finite, which is the case when d∞(P‖Q) < ∞ since the Rényi divergence is a non-decreasing function of α [23, 2]; in particular, ∀α > 0, dα+1(P‖Q) ≤ d∞(P‖Q). [sent-115, score-0.556]
44 Then, for any α ≥ 1, for any δ > 0, with probability at least 1−δ, the following bound holds for the importance weighting method: R(h) ≤ Rw(h) + 2M log(1/δ)/(3m) + √(2 [dα+1(P‖Q) R(h)^(1−1/α) − R(h)²] log(1/δ)/m). [sent-118, score-0.775]
45 By Lemma 2, the variance of the random variable Z can be bounded in terms of the Rényi divergence dα+1(P‖Q): σ²(Z) = E[w²(x) Lh(x)²] − R(h)² ≤ dα+1(P‖Q) R(h)^(1−1/α) − R(h)². [sent-123, score-0.501]
46 These results can be straightforwardly extended to general hypothesis sets. [sent-127, score-0.123]
47 In particular, for a finite hypothesis set and for α = 1, the application of the union bound yields the following result. [sent-128, score-0.161]
48 Then, for any δ > 0, with probability at least 1−δ, the following bound holds for the importance weighting method: R(h) ≤ Rw(h) + 2M(log |H| + log(1/δ))/(3m) + √(2 d2(P‖Q)(log |H| + log(1/δ))/m). (7) [sent-131, score-0.775]
49 For infinite hypothesis sets, a similar result can be shown straightforwardly using covering numbers instead of |H| or a related measure based on samples of size m [20]. [sent-132, score-0.123]
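For concreteness, here is a small helper that evaluates the right-hand side of the bound in (7) (a sketch assuming the Bernstein-type form stated above; the input values plugged in are arbitrary illustration numbers):

```python
import numpy as np

def bound_finite_H(emp_weighted_risk, M, d2, H_size, m, delta):
    """Right-hand side of equation (7): Rw(h) + 2M(log|H|+log(1/delta))/(3m)
    + sqrt(2 d_2(P||Q)(log|H|+log(1/delta))/m)."""
    log_term = np.log(H_size) + np.log(1.0 / delta)
    return (emp_weighted_risk
            + 2.0 * M * log_term / (3.0 * m)
            + np.sqrt(2.0 * d2 * log_term / m))

# Example: modest weight bound M, small d_2, 1000 hypotheses, 10^4 samples.
print(bound_finite_H(emp_weighted_risk=0.12, M=5.0, d2=2.0, H_size=1000, m=10_000, delta=0.05))
```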
50 In the following proposition, we give a lower bound that further emphasizes the role of the Rényi divergence of the second order in the convergence of importance weighting in the bounded case. [sent-133, score-1.158]
51 Assume that H contains a hypothesis h0 such that Lh0 (x) = 1 for all x. [sent-136, score-0.096]
52 For Gaussian distributions P and Q with means µ and µ′ and standard deviations σP and σQ, w(x) = P(x)/Q(x) = (σQ/σP) exp(−(σQ²(x − µ)² − σP²(x − µ′)²)/(2σP²σQ²)); thus, even for σP = σQ and µ ≠ µ′, supx P(x)/Q(x) = +∞, the importance weights are unbounded, d∞(P‖Q) = +∞, and the bound of Theorem 1 is not informative. [sent-145, score-0.587]
53 The Rényi divergence of the second order is given by: d2(P‖Q) = (σQ/σP) ∫_{−∞}^{+∞} exp(−(σQ²(x − µ)² − σP²(x − µ′)²)/(2σP²σQ²)) P(x) dx = (σQ/(√(2π)σP²)) ∫_{−∞}^{+∞} exp(−(2σQ²(x − µ)² − σP²(x − µ′)²)/(2σP²σQ²)) dx. [sent-146, score-0.422]
54 That is, for σQ > σP/√2 the variance of the importance weights is bounded. [sent-147, score-0.501]
55 By the additivity property of the Rényi divergence, a similar situation holds for products and sums of such Gaussian distributions. [sent-148, score-0.279]
56 Hence, in the rightmost example of Figure 1, the importance weights are unbounded, but their second moment is bounded. [sent-149, score-0.621]
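A quick numerical check of this threshold behavior (an illustrative sketch; the means, the sample size, and the two values of σQ chosen on either side of σP/√2 are arbitrary): the empirical second moment of the weights stabilizes in one regime and blows up in the other.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(1)
mu_P, mu_Q, sigma_P = 0.0, 0.5, 1.0   # arbitrary illustration values
m = 200_000

for sigma_Q in (0.75, 0.3):           # one value above and one below sigma_P / sqrt(2) ~ 0.707
    x = rng.normal(mu_Q, sigma_Q, size=m)                       # training sample drawn from Q
    w = normal_pdf(x, mu_P, sigma_P) / normal_pdf(x, mu_Q, sigma_Q)
    # Empirical second moment of the weights: stable above the threshold, divergent below it.
    print(sigma_Q, np.mean(w ** 2))
```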
57 For σQ = .3σP, the same favorable guarantees do not hold, and, as illustrated in Figure 1, learning is significantly more difficult. [sent-152, score-0.113]
58 This example of Gaussians can further illustrate what can go wrong in importance weighting. [sent-153, score-0.335]
59 One could have expected this to be an easy case for importance weighting since sampling from Q provides useful information about P . [sent-155, score-0.604]
60 For a sample of size m and σQ = 1, the expected value of an extreme point is √(2 log m) − o(1) and its weight will be in the order of m^(−1/σP² + 1/σQ²) = m^(.5). [sent-157, score-0.093]
61 Therefore, a few extreme points will dominate all other weights and necessarily have a huge influence on the selection of a hypothesis by the learning algorithm. [sent-159, score-0.239]
62 If µ is large enough compared to log(m), then, with high probability, all the weights will be negligible. [sent-162, score-0.143]
63 Importance weighting learning bounds - unbounded case: As in these examples, in practice, the importance weights are typically not bounded. [sent-166, score-0.968]
64 However, we shall show that, remarkably, under the weak assumption that the second moment of the weights w, d2 (P Q), is bounded, generalization bounds can be given for this case as well. [sent-167, score-0.422]
65 The following result relies on a general learning bound for unbounded loss functions proven in the Appendix (Corollary 1). [sent-168, score-0.254]
66 Let H be a hypothesis set such that Pdim({Lh (x) : h ∈ H}) = p < ∞. [sent-171, score-0.096]
67 Since d2 (P Q) < +∞, the second moment of w(x)Lh (x) is finite and upper bounded by d2 (P Q) (Lemma 2). [sent-175, score-0.246]
68 Since by assumption w(xi) > 0 for all i ∈ [1, k], this implies that ∀xi ∈ B, Lh(xi) ≥ ri/w(xi) and ∀xi ∈ A − B, Lh(xi) < ri/w(xi). [sent-187, score-0.103]
69 Using the same observations, it is straightforward to see that conversely, any set shattered by H is shattered by H′. [sent-189, score-0.096]
70 The convergence rate of the bound is slightly weaker (O(m^(−3/8))) than in the bounded case (O(m^(−1/2))). [sent-190, score-0.178]
71 A faster convergence can be obtained however using the more precise bound of Theorem 8 at the expense of readability. [sent-191, score-0.1]
72 The Rényi divergence d2(P‖Q) seems to play a critical role in the bound and thus in the convergence of importance weighting in the unbounded case. [sent-192, score-1.264]
73 Let H be a hypothesis set such that Pdim({Lh (x) : h ∈ H}) = p < ∞. [sent-196, score-0.096]
74 By Corollary 2 applied to the function u·Lh, |EQ[u(x)Lh(x)] − Ru(h)| can be bounded, with probability 1 − δ, by 2^(5/4) max(√(EQ[u²(x)Lh²(x)]), √(ÊQ[u²(x)Lh²(x)])) ((p log(2me/p) + log(4/δ))/m)^(3/8), with p = Pdim({Lh(x) : h ∈ H}), by a proof similar to that of Theorem 3. [sent-230, score-0.117]
75 Here, we consider a family of functions U parameterized by the quantiles q of the weight function w. [sent-234, score-0.112]
76 For small values of γ, the bias term dominates, and very fine-grained quantiles minimize the bound of equation (11). [sent-236, score-0.175]
77 For large values of γ the variance term dominates and the bound is minimized by using just one quantile, corresponding to an even weighting of the training examples. [sent-237, score-0.424]
78 Hence by varying γ from small to large values, the algorithm interpolates between standard importance weighting with just one example per quantile, and unweighted learning where all examples are given the same weight. [sent-238, score-0.658]
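A minimal sketch of such a quantile-based reweighting (one plausible instantiation, not necessarily the paper's exact construction: here each weight is replaced by the average weight of its quantile bin, and the function and variable names are made up):

```python
import numpy as np

def quantile_weights(w, n_quantiles):
    """Replace each importance weight by the average weight of its quantile bin."""
    order = np.argsort(w)
    bins = np.array_split(order, n_quantiles)    # contiguous bins of sorted weights
    u = np.empty_like(w, dtype=float)
    for idx in bins:
        u[idx] = w[idx].mean()
    return u

w = np.array([0.1, 0.4, 0.5, 0.9, 1.2, 3.0, 7.5, 20.0])
print(quantile_weights(w, 1))   # all equal to mean(w): uniform weighting
print(quantile_weights(w, 4))   # coarse, smoothed weights
print(quantile_weights(w, 8))   # one example per bin: identical to w
```

With a single bin this reduces to an even weighting of the training examples; with one example per bin it reverts to the standard importance weights, matching the interpolation described above.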
79 We see that a more rapid convergence can be obtained by using these weights compared to the standard importance weights w. [sent-241, score-0.656]
80 Another natural family of functions is that of thresholded versions of the importance weights {uθ : θ > 0, ∀x ∈ X, uθ (x) = min(w(x), θ)}. [sent-242, score-0.504]
81 In fact, in practice, users often cap importance weights by choosing an arbitrary value θ. [sent-243, score-0.499]
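Capping itself is one line; a sketch mirroring uθ(x) = min(w(x), θ) (the threshold value here is arbitrary):

```python
import numpy as np

def capped_weights(w, theta):
    """Thresholded importance weights u_theta(x) = min(w(x), theta)."""
    return np.minimum(w, theta)

w = np.array([0.1, 0.4, 0.5, 0.9, 1.2, 3.0, 7.5, 20.0])
print(capped_weights(w, theta=5.0))   # only the largest weights are affected
```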
82 The advantage of this family is that, by definition, the weights are bounded. [sent-244, score-0.169]
83 However, in some cases, larger weights could be critical to achieve a better performance. [sent-245, score-0.169]
84 Compared to importance weighting, no change in performance is observed until the largest 1% of the weights are capped, in which case we only observe a performance degradation. [sent-247, score-0.478]
85 We expect the thresholding to be less beneficial when the large weights reflect the true w and are not an artifact of estimation uncertainties. [sent-248, score-0.143]
86 6 Relationship between normalized and unnormalized weights: An alternative approach based on the weight function w(x) = P(x)/Q(x) consists of normalizing the weights. [sent-249, score-0.299]
87 Thus, while in the unnormalized case the unweighted empirical error is replaced by (1/m) Σ_{i=1}^{m} w(xi) Lh(xi), in the normalized case it is replaced by Σ_{i=1}^{m} (w(xi)/W) Lh(xi), with W = Σ_{i=1}^{m} w(xi). [sent-250, score-0.211]
88 We refer to w̄(x) = w(x)/W as the normalized importance weight. [sent-251, score-0.378]
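The two estimators differ only in their normalizer; a sketch with synthetic stand-in weights and losses (not data from the paper):

```python
import numpy as np

def unnormalized_estimate(w, losses):
    """(1/m) * sum_i w(x_i) L_h(x_i): unbiased."""
    return np.mean(w * losses)

def normalized_estimate(w, losses):
    """sum_i (w(x_i)/W) L_h(x_i) with W = sum_i w(x_i): bounded weights, slightly biased."""
    return np.sum(w * losses) / np.sum(w)

rng = np.random.default_rng(2)
w = rng.lognormal(mean=0.0, sigma=0.5, size=1000)   # stand-in importance weights
losses = rng.uniform(0.0, 1.0, size=1000)           # stand-in losses L_h(x_i)
print(unnormalized_estimate(w, losses), normalized_estimate(w, losses))
```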
89 An advantage of the normalized weights is that they are by definition bounded by one. [sent-252, score-0.264]
90 However, the price to pay for this benefit is the fact that the weights are no longer unbiased. [sent-253, score-0.143]
91 In fact, several issues similar to those we pointed out in Section 4 affect the normalized weights as well. [sent-254, score-0.186]
92 Here, we maintain the assumption that the second moment of the importance weights is bounded and analyze the relationship between normalized and unnormalized weights. [sent-255, score-0.846]
93 We show that, under this assumption, normalized and unnormalized weights are in fact very close, with high probability. [sent-256, score-0.269]
94 Thus, since ES[W/m] = 1, by the corollary in the Appendix the following inequality holds with high probability: |1 − W/m| ≤ 2^(5/4) max(√(d2(P‖Q)), √(d̂2(P‖Q))) (log(4/δ)/m)^(3/8), which implies the same upper bound on |w̄(xi) − w(xi)/m| = (w(xi)/W)|1 − W/m| ≤ |1 − W/m|. [sent-259, score-0.118]
95 7 Conclusion: We presented a series of theoretical results for importance weighting both in the bounded weights case and in the more general unbounded case under the assumption that the second moment of the weights is bounded. [sent-261, score-1.29]
96 We also initiated a preliminary exploration of alternative weights and showed their benefits. [sent-262, score-0.143]
97 Several of the learning guarantees we gave depend on the Rényi divergence of the distributions P and Q. [sent-264, score-0.441]
98 Finally, our novel unbounded loss learning bounds are of independent interest and could be useful in a variety of other contexts. [sent-266, score-0.252]
99 Improving predictive inference under covariate shift by weighting the log-likelihood function. [sent-402, score-0.294]
100 Direct importance estimation with model selection and its application to covariate shift adaptation. [sent-410, score-0.36]
wordName wordTfidf (topN-words)
[('lh', 0.621), ('importance', 0.335), ('weighting', 0.269), ('nyi', 0.251), ('rw', 0.226), ('unbounded', 0.158), ('weights', 0.143), ('moment', 0.143), ('divergence', 0.125), ('ru', 0.123), ('eq', 0.122), ('pdim', 0.117), ('hypothesis', 0.096), ('unnormalized', 0.083), ('xi', 0.08), ('bounded', 0.078), ('quantile', 0.075), ('adaptation', 0.071), ('mohri', 0.067), ('guarantees', 0.065), ('bound', 0.065), ('bounds', 0.063), ('correcting', 0.057), ('quantiles', 0.056), ('unweighted', 0.054), ('bias', 0.054), ('credit', 0.053), ('mansour', 0.05), ('shattered', 0.048), ('favorable', 0.048), ('corinna', 0.047), ('reweighting', 0.046), ('supx', 0.044), ('normalized', 0.043), ('ri', 0.041), ('capping', 0.041), ('log', 0.039), ('ratio', 0.039), ('training', 0.035), ('capped', 0.035), ('correction', 0.035), ('convergence', 0.035), ('bernstein', 0.033), ('uq', 0.033), ('target', 0.033), ('domain', 0.033), ('corollary', 0.033), ('dominates', 0.032), ('divergences', 0.032), ('error', 0.031), ('loss', 0.031), ('blitzer', 0.031), ('cortes', 0.031), ('weight', 0.03), ('mismatch', 0.03), ('von', 0.029), ('bene', 0.028), ('theorem', 0.028), ('proposition', 0.028), ('holds', 0.028), ('generalization', 0.027), ('hyperplanes', 0.027), ('straightforwardly', 0.027), ('gaussians', 0.027), ('critical', 0.026), ('family', 0.026), ('pr', 0.026), ('colt', 0.025), ('centered', 0.025), ('tangent', 0.025), ('covariate', 0.025), ('dasgupta', 0.025), ('weak', 0.025), ('upper', 0.025), ('nite', 0.025), ('sample', 0.024), ('du', 0.024), ('crammer', 0.024), ('lemma', 0.024), ('springer', 0.023), ('exp', 0.023), ('variance', 0.023), ('demonstrating', 0.022), ('unbiased', 0.022), ('fix', 0.022), ('boosting', 0.022), ('google', 0.022), ('discuss', 0.022), ('assumption', 0.021), ('ny', 0.021), ('users', 0.021), ('zadrozny', 0.02), ('ckner', 0.02), ('nau', 0.02), ('bureau', 0.02), ('asymptotics', 0.02), ('nakajima', 0.02), ('defaults', 0.02), ('probabilit', 0.02), ('reverting', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 142 nips-2010-Learning Bounds for Importance Weighting
Author: Corinna Cortes, Yishay Mansour, Mehryar Mohri
Abstract: This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Rényi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights which are also commonly used.
2 0.14712343 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
Author: Barnabás Póczos, Csaba Szepesvári, David Tax
Abstract: We present simple and computationally efficient nonparametric estimators of Rényi entropy and mutual information based on an i.i.d. sample drawn from an unknown, absolutely continuous distribution over Rd . The estimators are calculated as the sum of p-th powers of the Euclidean lengths of the edges of the ‘generalized nearest-neighbor’ graph of the sample and the empirical copula of the sample respectively. For the first time, we prove the almost sure consistency of these estimators and upper bounds on their rates of convergence, the latter of which under the assumption that the density underlying the sample is Lipschitz continuous. Experiments demonstrate their usefulness in independent subspace analysis. 1
3 0.14349648 27 nips-2010-Agnostic Active Learning Without Constraints
Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu
Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification. 1
4 0.11192702 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
5 0.096319795 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
6 0.085155584 243 nips-2010-Smoothness, Low Noise and Fast Rates
7 0.071568601 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
8 0.070962779 268 nips-2010-The Neural Costs of Optimal Control
9 0.066845089 191 nips-2010-On the Theory of Learnining with Privileged Information
10 0.066679686 22 nips-2010-Active Estimation of F-Measures
11 0.06595359 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm
12 0.064712361 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
13 0.064589806 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
14 0.059061669 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
15 0.05885791 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
16 0.055099044 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation
17 0.053052191 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
18 0.052037664 222 nips-2010-Random Walk Approach to Regret Minimization
19 0.051585879 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
20 0.0500005 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
topicId topicWeight
[(0, 0.157), (1, -0.001), (2, 0.09), (3, 0.011), (4, 0.01), (5, 0.077), (6, -0.058), (7, -0.059), (8, -0.003), (9, -0.082), (10, 0.018), (11, -0.095), (12, -0.068), (13, -0.056), (14, -0.081), (15, -0.001), (16, 0.069), (17, 0.01), (18, 0.007), (19, 0.031), (20, 0.008), (21, 0.006), (22, -0.003), (23, -0.001), (24, 0.018), (25, -0.045), (26, -0.0), (27, 0.014), (28, 0.09), (29, -0.02), (30, -0.052), (31, 0.024), (32, -0.018), (33, -0.031), (34, 0.042), (35, 0.0), (36, -0.039), (37, 0.073), (38, 0.009), (39, 0.123), (40, -0.051), (41, -0.169), (42, 0.046), (43, 0.128), (44, 0.058), (45, 0.149), (46, 0.019), (47, 0.075), (48, 0.017), (49, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.92487079 142 nips-2010-Learning Bounds for Importance Weighting
Author: Corinna Cortes, Yishay Mansour, Mehryar Mohri
Abstract: This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Rényi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights which are also commonly used.
2 0.69168591 27 nips-2010-Agnostic Active Learning Without Constraints
Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu
Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification. 1
3 0.67102808 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
Author: Barnabás Póczos, Csaba Szepesvári, David Tax
Abstract: We present simple and computationally efficient nonparametric estimators of Rényi entropy and mutual information based on an i.i.d. sample drawn from an unknown, absolutely continuous distribution over Rd . The estimators are calculated as the sum of p-th powers of the Euclidean lengths of the edges of the ‘generalized nearest-neighbor’ graph of the sample and the empirical copula of the sample respectively. For the first time, we prove the almost sure consistency of these estimators and upper bounds on their rates of convergence, the latter of which under the assumption that the density underlying the sample is Lipschitz continuous. Experiments demonstrate their usefulness in independent subspace analysis. 1
4 0.63552797 191 nips-2010-On the Theory of Learnining with Privileged Information
Author: Dmitry Pechyony, Vladimir Vapnik
Abstract: In Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies a learner with the privileged information in the correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes into account the privileged information in order to find a good function in the decision space. We outline the conditions on the correcting space that, if satisfied, allow Privileged ERM to have much faster learning rate in the decision space than the one of the regular empirical risk minimization. 1
5 0.61752325 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
6 0.59511375 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
7 0.56066066 22 nips-2010-Active Estimation of F-Measures
8 0.51699203 243 nips-2010-Smoothness, Low Noise and Fast Rates
9 0.48410195 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
10 0.47541076 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
11 0.47326827 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
12 0.46047744 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
13 0.45669207 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
14 0.44470641 63 nips-2010-Distributed Dual Averaging In Networks
15 0.43910441 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation
16 0.42562795 177 nips-2010-Multitask Learning without Label Correspondences
17 0.4251774 202 nips-2010-Parallelized Stochastic Gradient Descent
18 0.41732955 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
19 0.41666377 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
20 0.4026427 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
topicId topicWeight
[(13, 0.024), (27, 0.028), (30, 0.066), (35, 0.015), (45, 0.182), (50, 0.038), (52, 0.031), (60, 0.035), (77, 0.449), (78, 0.017), (90, 0.019)]
simIndex simValue paperId paperTitle
1 0.96882999 34 nips-2010-Attractor Dynamics with Synaptic Depression
Author: K. Wong, He Wang, Si Wu, Chi Fung
Abstract: Neuronal connection weights exhibit short-term depression (STD). The present study investigates the impact of STD on the dynamics of a continuous attractor neural network (CANN) and its potential roles in neural information processing. We find that the network with STD can generate both static and traveling bumps, and STD enhances the performance of the network in tracking external inputs. In particular, we find that STD endows the network with slow-decaying plateau behaviors, namely, the network being initially stimulated to an active state will decay to silence very slowly in the time scale of STD rather than that of neural signaling. We argue that this provides a mechanism for neural systems to hold short-term memory easily and shut off persistent activities naturally.
2 0.95912576 16 nips-2010-A VLSI Implementation of the Adaptive Exponential Integrate-and-Fire Neuron Model
Author: Sebastian Millner, Andreas Grübl, Karlheinz Meier, Johannes Schemmel, Marc-olivier Schwartz
Abstract: We describe an accelerated hardware neuron being capable of emulating the adaptive exponential integrate-and-fire neuron model. Firing patterns of the membrane stimulated by a step current are analyzed in transistor level simulations and in silicon on a prototype chip. The neuron is destined to be the hardware neuron of a highly integrated wafer-scale system reaching out for new computational paradigms and opening new experimentation possibilities. As the neuron is dedicated as a universal device for neuroscientific experiments, the focus lays on parameterizability and reproduction of the analytical model. 1
3 0.91012287 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
Author: Hairong Liu, Longin J. Latecki, Shuicheng Yan
Abstract: In this paper, we regard clustering as ensembles of k-ary affinity relations and clusters correspond to subsets of objects with maximal average affinity relations. The average affinity relation of a cluster is relaxed and well approximated by a constrained homogenous function. We present an efficient procedure to solve this optimization problem, and show that the underlying clusters can be robustly revealed by using priors systematically constructed from the data. Our method can automatically select some points to form clusters, leaving other points un-grouped; thus it is inherently robust to large numbers of outliers, which has seriously limited the applicability of classical methods. Our method also provides a unified solution to clustering from k-ary affinity relations with k ≥ 2, that is, it applies to both graph-based and hypergraph-based clustering problems. Both theoretical analysis and experimental results show the superiority of our method over classical solutions to the clustering problem, especially when there exists a large number of outliers.
4 0.89674658 15 nips-2010-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. 1
same-paper 5 0.84558553 142 nips-2010-Learning Bounds for Importance Weighting
Author: Corinna Cortes, Yishay Mansour, Mehryar Mohri
Abstract: This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Rényi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights which are also commonly used.
6 0.79589927 234 nips-2010-Segmentation as Maximum-Weight Independent Set
7 0.72481674 68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks
8 0.68380123 10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data
9 0.67649937 252 nips-2010-SpikeAnts, a spiking neuron network modelling the emergence of organization in a complex system
10 0.64916062 253 nips-2010-Spike timing-dependent plasticity as dynamic filter
11 0.63826698 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
12 0.63811177 8 nips-2010-A Log-Domain Implementation of the Diffusion Network in Very Large Scale Integration
13 0.62980354 244 nips-2010-Sodium entry efficiency during action potentials: A novel single-parameter family of Hodgkin-Huxley models
14 0.62178117 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
15 0.59695315 127 nips-2010-Inferring Stimulus Selectivity from the Spatial Structure of Neural Network Dynamics
16 0.59543383 117 nips-2010-Identifying graph-structured activation patterns in networks
17 0.58733207 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
18 0.57662791 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
19 0.57423276 182 nips-2010-New Adaptive Algorithms for Online Classification
20 0.56650776 96 nips-2010-Fractionally Predictive Spiking Neurons