nips nips2010 nips2010-191 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dmitry Pechyony, Vladimir Vapnik
Abstract: In the Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies the learner with privileged information in a correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes the privileged information into account in order to find a good function in the decision space. We outline conditions on the correcting space that, if satisfied, allow Privileged ERM to achieve a much faster learning rate in the decision space than that of regular empirical risk minimization.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: In the Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies the learner with privileged information in a correcting space. [sent-3, score-1.306]
2 The goal of the learner is to find a classifier with a low generalization error in the decision space. [sent-4, score-0.345]
3 We consider an empirical risk minimization algorithm, called Privileged ERM, that takes into account the privileged information in order to find a good function in the decision space. [sent-5, score-0.947]
4 We outline conditions on the correcting space that, if satisfied, allow Privileged ERM to achieve a much faster learning rate in the decision space than that of regular empirical risk minimization. [sent-6, score-0.858]
5 1 Introduction. In the classical supervised machine learning paradigm, the learner is given a labeled training set of examples and her goal is to find a decision function with a small generalization error on unknown test examples. [sent-7, score-0.466]
6 If the learner's space of decision functions contains one with zero generalization error, then, as the training set size increases, the decision function found by the learner converges quickly to the optimal one. [sent-10, score-0.571]
7 However, if the learning problem is hard and the learner's space of decision functions is large, then the convergence (or learning) rate is slow. [sent-11, score-0.342]
8 An example of such a hard learning problem is XOR when the space of decision functions consists of 2-dimensional hyperplanes. [sent-12, score-0.219]
9 The obvious question is: “Can we accelerate the learning rate if the learner is given additional information about the learning problem?” [sent-13, score-0.255]
10 In the LUPI paradigm, in addition to the standard training data (x, y) ∈ X × Y, a teacher supplies the learner with privileged information x∗ in the correcting space X∗. [sent-19, score-1.223]
11 The privileged information is only available for the training examples and is never available for the test examples. [sent-20, score-0.647]
12 The LUPI paradigm requires, given a training set {(x_i, x_i^*, y_i)}_{i=1}^n, to find a decision function h : X → Y with a small generalization error on unknown test examples x ∈ X. [sent-21, score-0.425]
13 The above question about accelerating the learning rate, reformulated in terms of the LUPI paradigm, is: “What kind of additional information should the teacher provide to the learner in order to accelerate her learning rate?” [sent-22, score-0.281]
14 In this paper we outline conditions on the additional information provided by the teacher that allow for a fast learning rate even in hard problems. [sent-26, score-0.266]
15 Let h(x) = sign(w · x + b) be a decision function and φ(x∗) = w∗ · x∗ + d be a correcting function. [sent-31, score-0.413]
16 Let ℓ_X(h(x), y) = [1 − y(w · x + b)]_+ be the hinge loss of the decision function h = (w, b) on the example (x, y), and let ℓ_X∗(φ(x∗)) = [w∗ · x∗ + d]_+ be the loss of the correcting function φ = (w∗, d) on the example x∗. [sent-37, score-0.553]
17 ∀ 1 ≤ i ≤ n, ℓ_X(h(x_i), y_i) ≤ ℓ_X∗(φ(x_i^*), y_i), (4) where ℓ_X and ℓ_X∗ are arbitrary bounded loss functions, H is a space of decision functions and Φ is a space of correcting functions. [sent-43, score-0.734]
18 Let C > 0 be a constant (that is defined later), [t]_+ = max(t, 0), and let ℓ((h, φ), (x, x∗, y)) = (1/C) · ℓ_X∗(φ(x∗), y) + [ℓ_X(h(x), y) − ℓ_X∗(φ(x∗), y)]_+ (5) be the loss of the composite hypothesis (h, φ) on the example (x, x∗, y). [sent-44, score-0.235]
19 In this paper we study the following relaxation of (3): min_{h∈H, φ∈Φ} Σ_{i=1}^n ℓ((h, φ), (x_i, x_i^*, y_i)). (6) We refer to the learning algorithm defined by the optimization problem (6) as empirical risk minimization with privileged information, or Privileged ERM for short. [sent-45, score-0.888]
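As a concrete illustration, the composite loss (5) and the Privileged ERM objective (6) can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: the linear forms of h and φ and the hinge-type losses follow the definitions above, and all function names are illustrative.

```python
def hinge_decision(w, b, x, y):
    # l_X(h(x), y) = [1 - y(w·x + b)]_+  -- hinge loss of h = (w, b)
    return max(1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b), 0.0)

def corr_loss(w_s, d, x_s):
    # l_X*(phi(x*)) = [w*·x* + d]_+  -- loss of the correcting function phi = (w*, d)
    return max(sum(wi * xi for wi, xi in zip(w_s, x_s)) + d, 0.0)

def privileged_erm_objective(w, b, w_s, d, data, C):
    # Objective (6): sum over the training set of the composite loss (5),
    #   (1/C) * l*(phi(x*), y) + [ l(h(x), y) - l*(phi(x*), y) ]_+
    total = 0.0
    for x, x_s, y in data:
        ld = hinge_decision(w, b, x, y)
        lc = corr_loss(w_s, d, x_s)
        total += lc / C + max(ld - lc, 0.0)
    return total
```

Note how the composite loss penalizes only violations of constraint (4): whenever the correcting loss upper-bounds the decision loss, the [·]_+ term vanishes and only the (1/C)-scaled correcting term remains.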
20 The basic assumption of Privileged ERM is that if we can achieve a small loss ℓ_X∗(φ(x∗), y) in the correcting space then we should also achieve a small loss ℓ_X(h(x), y) in the decision space. [sent-46, score-0.626]
21 This assumption reflects the human learning process, where the teacher tells the learner which examples are the most important (the ones with a small loss in the correcting space) to take into account in order to find a good decision rule. [sent-47, score-0.828]
22 The regular empirical risk minimization (ERM) finds a hypothesis h ∈ H that minimizes the training error Σ_{i=1}^n ℓ_X(h(x_i), y_i). [sent-48, score-0.563]
23 While the regular ERM minimizes the training error of h directly, Privileged ERM minimizes the training error of h indirectly, via the minimization of the training error of the correcting function φ and the relaxation of the constraint (4). [sent-49, score-1.253]
24 Let h∗ be the best possible decision function (in terms of generalization error) in the hypothesis space H. [sent-50, score-0.381]
25 Suppose that for each training example x_i an oracle gives us the value of the loss ℓ_X(h∗(x_i), y_i). [sent-51, score-0.235]
26 We use these fixed losses instead of ℓ_X∗(φ(x_i^*), y_i) and find h that satisfies the following system of inequalities: ∀ 1 ≤ i ≤ n, ℓ_X(h(x_i), y_i) ≤ ℓ_X(h∗(x_i), y_i). [sent-52, score-0.217]
27 A straightforward generalization of the proof of Proposition 1 of [9] shows that the generalization error of the hypothesis h found by OracleERM converges to that of h∗ at a rate of 1/n. [sent-54, score-0.455]
28 This rate is much faster than the worst-case convergence rate 1/√n of the regular ERM [3]. [sent-55, score-0.293]
A decision function h is uniformly better than the correcting function φ if for any example (x, x∗, y) that has non-zero probability, ℓ_X∗(φ(x∗), y) ≥ ℓ_X(h(x), y). [sent-58, score-0.581]
Given a space H of decision functions and a space Φ of correcting functions, we define Φ′ = {φ ∈ Φ | ∃ h ∈ H that is uniformly better than φ}. [sent-59, score-0.595]
Note that Φ′ ⊆ Φ and that Φ′ does not contain correcting functions that are too good for H. [sent-60, score-0.318]
There exists a correcting function φ ∈ Φ′ such that for any (x, x∗, y) that has non-zero probability, ℓ_X(h∗(x), y) = ℓ_X∗(φ(x∗), y). [sent-65, score-0.449]
Put another way, we assume the existence of a correcting function in Φ′ that mimics the losses of h∗. [sent-66, score-0.303]
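Over a finite support, "uniformly better" and the restricted class of correcting functions for which some decision function is uniformly better can be checked directly. In this hypothetical sketch, each function is represented simply by its vector of losses on the support points:

```python
def uniformly_better(dec_losses, corr_losses):
    # h is uniformly better than phi if on every example of non-zero
    # probability the decision loss does not exceed the correcting loss.
    return all(ld <= lc for ld, lc in zip(dec_losses, corr_losses))

def restrict_phi(h_loss_vectors, phi_loss_vectors):
    # Keep the indices of correcting functions phi for which some h in H
    # is uniformly better; correcting functions "too good" for H are dropped.
    return [j for j, lc in enumerate(phi_loss_vectors)
            if any(uniformly_better(ld, lc) for ld in h_loss_vectors)]
```

A φ whose losses undercut every h on some support point is excluded, which is exactly the sense in which the restricted class rules out correcting functions that are too good for H.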
Let r be the learning rate of Privileged ERM when it is run over the joint X × X∗ space with the space H × Φ of decision and correcting functions. [sent-67, score-0.619]
35 We develop an upper bound for the risk of the decision function found by Privileged ERM. [sent-68, score-0.359]
Under the above assumptions this bound converges to the risk of h∗ with the same rate r. [sent-69, score-0.219]
This implies that if the correcting space is good, so that Privileged ERM in the joint X × X∗ space has a fast learning rate (e.g., 1/n), then the same fast rate is obtained in the decision space. [sent-70, score-0.483]
That is true even if the decision space is hard and the regular ERM in the decision space has a slow learning rate (e.g., 1/√n). [sent-74, score-0.511]
We illustrate this result with an artificial learning problem, where the regular ERM in the decision space cannot learn at a rate faster than 1/√n, but the correcting space is good and Privileged ERM learns in the decision space at a rate of 1/n. [sent-77, score-0.927]
40 In Section 3 we review the existing risk bounds that are used to derive our results. [sent-80, score-0.182]
41 Section 4 contains the proof of the risk bound for Privileged ERM. [sent-81, score-0.215]
42 They developed a risk bound (Proposition 2 in [9]) for the decision function found by their algorithm. [sent-87, score-0.341]
43 The bound of [9] is tailored to the classification setting, with 0/1-loss functions in the decision and the correcting space. [sent-89, score-0.513]
By contrast, our bound holds for any bounded loss functions and allows the loss functions ℓ_X and ℓ_X∗ to be different. [sent-90, score-0.291]
45 The bound of [9] depends on generalization error of the correcting function φ found by Privileged ERM. [sent-91, score-0.467]
Vapnik and Vashist [9] concluded that if we could bound the convergence rate of φ, this would imply a bound on the convergence rate of the decision function found by their algorithm. [sent-92, score-0.595]
The spaces H and Φ of decision and correcting functions are chosen by the learner. [sent-96, score-0.444]
Let R(h) = E_{(x,y)∼D_X}{ℓ_X(h(x), y)} and R(φ) = E_{(x∗,y)∼D_X∗}{ℓ_X∗(φ(x∗), y)} be the generalization errors of the decision function h and the correcting function φ, respectively. [sent-97, score-0.484]
We denote by h∗ = arg min_{h∈H} R(h) and φ∗ = arg min_{φ∈Φ} R(φ) the decision and correcting functions with the minimal generalization error w.r.t. ℓ_X and ℓ_X∗, respectively. [sent-100, score-0.306]
…w.r.t. the 0/1 loss, and by h∗_{01} = arg min_{h∈H} R_{01}(h) the decision function in H with the minimal generalization 0/1 error. [sent-107, score-0.31]
Let R_n^ℓ(h, φ) = (1/n) Σ_{i=1}^n ℓ((h, φ), (x_i, x_i^*, y_i)) and R^ℓ(h, φ) = E_{(x,x∗,y)∼D}{ℓ((h, φ), (x, x∗, y))} (8) be, respectively, the empirical and generalization errors of the hypothesis (h, φ) w.r.t. the loss ℓ. [sent-108, score-0.302]
We denote by (ĥ, φ̂) = arg min_{(h,φ)∈H×Φ} R_n^ℓ(h, φ) the empirical risk minimizer and by (h′, φ′) = arg min_{(h,φ)∈H×Φ} R^ℓ(h, φ) the minimizer of the generalization error w.r.t. ℓ. [sent-112, score-0.366]
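The generalization errors above are plain expectations, so they can be approximated by Monte Carlo sampling. A minimal sketch (the sampler and the constant classifier below are hypothetical, purely for illustration):

```python
import random

def estimate_risk(h, loss, sampler, n=200000, seed=0):
    # Monte Carlo estimate of R(h) = E_{(x, y) ~ D}[ l(h(x), y) ]
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x, y = sampler(rng)
        total += loss(h(x), y)
    return total / n

# Hypothetical example: the constant classifier h(x) = +1, under labels that
# are +1 with probability 3/4, has 0/1 generalization error exactly 1/4.
zero_one = lambda y_hat, y: float(y_hat != y)
sampler = lambda rng: (rng.random(), 1 if rng.random() < 0.75 else -1)
risk = estimate_risk(lambda x: 1, zero_one, sampler)
```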
53 Later on we will show how we choose the value of C that optimizes the forthcoming risk bound. [sent-132, score-0.146]
54 The risk bounds presented in this paper are based on VC-dimension of various function classes. [sent-133, score-0.182]
We say that the set T = {(x_i, t_i)}_{i=1}^{|T|} ⊆ T(F) is shattered by F if for any T′ ⊆ T there exists a function f ∈ F such that for any (x_i, t_i) ∈ T′, |f(x_i)| ≤ t_i, and for any (x_i, t_i) ∈ T \ T′, |f(x_i)| > t_i. [sent-138, score-0.245]
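This shattering definition can be checked by brute force for a small finite class; the candidate class and points below are illustrative, not taken from the paper:

```python
from itertools import combinations

def is_shattered(points, funcs):
    # points: list of (x_i, t_i); funcs: a finite candidate class F.
    # T is shattered by F if for every subset T' of T there is some f in F
    # with |f(x_i)| <= t_i exactly on T' and |f(x_i)| > t_i off T'.
    idx = range(len(points))
    for r in range(len(points) + 1):
        for subset in combinations(idx, r):
            chosen = set(subset)
            if not any(all((abs(f(x)) <= t) == (i in chosen)
                           for i, (x, t) in enumerate(points))
                       for f in funcs):
                return False
    return True
```

Trying every subset is exponential in |T|, which is fine here since the definition is only used for small shattered sets when bounding a VC-type dimension.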
3 Review of existing excess risk bounds with fast convergence rates. We derive our risk bounds from generic excess risk bounds developed by Massart and Nedelec [6] and generalized by Gine and Koltchinskii [4] and Koltchinskii [5]. [sent-140, score-0.641]
Let F be a space of hypotheses f : S → S′, and let ℓ : S′ × {−1, +1} → R be a real-valued loss function such that 0 ≤ ℓ(f(x), y) ≤ 1 for any f ∈ F and any (x, y). [sent-142, score-0.153]
Figure 1: Visualization of the hypothesis spaces: (a) hypothesis space with small D; (b) hypothesis space with large D. [sent-143, score-0.227]
The horizontal axis measures the distance (in terms of the variance) between a hypothesis f and the best hypothesis f∗ in F. [sent-144, score-0.298]
The large value of D in the hypothesis space in graph (b) is caused by hypothesis A, which is significantly different from f∗ but has a nearly-optimal error. [sent-147, score-0.34]
Let f∗ = arg min_{f∈F} E_{(x,y)}{ℓ(f(x), y)}, let f̂_n = arg min_{f∈F} Σ_{i=1}^n ℓ(f(x_i), y_i), and let D > 0 be a constant such that for any f ∈ F, Var_{(x,y)}{ℓ(f(x), y) − ℓ(f∗(x), y)} ≤ D · E_{(x,y)}{ℓ(f(x), y) − ℓ(f∗(x), y)}. (9) [sent-148, score-0.201]
This condition is a generalization of Tsybakov's low-noise condition [7] to arbitrary loss functions and arbitrary hypothesis spaces. [sent-149, score-0.349]
63 The constant D in (9) characterizes the error surface of the hypothesis space F. [sent-150, score-0.248]
If the condition of Theorem 3.1 does not hold, namely if n ≤ V · D², then we can use the following fallback risk bound (Theorem 3.2). [sent-162, score-0.146]
For n ≤ T the bound (11) has a convergence rate of 1/n, and for n > T the bound (11) has a convergence rate of 1/√n. [sent-166, score-0.315]
The main difference between (10) and (11) is the fast convergence rate of 1/n vs. 1/√n. [sent-167, score-0.144]
Similarly, let L(H) = {ℓ_X(h(·), ·) | h ∈ H} and L(Φ) = {ℓ_X∗(φ(·), ·) | φ ∈ Φ} be the sets of loss functions that correspond to the hypotheses in H and Φ, and let V_{L(H)} and V_{L(Φ)} be the VC dimensions of L(H) and L(Φ), respectively. [sent-177, score-0.16]
We apply Theorem 3.1 to the hypothesis space H × Φ and the loss function ℓ((h, φ), (x, x∗, y)) and obtain that there exists a constant K > 0 such that if n > V_{L(H,Φ)} · D²_{H,Φ} then for any δ > 0, with probability at least 1 − δ, R^ℓ(ĥ, φ̂) ≤ R^ℓ(h′, φ′) + (K · D_{H,Φ} / n) · ( V_{L(H,Φ)} · ln( n / (V_{L(H,Φ)} · D²_{H,Φ}) ) + ln(1/δ) ). [sent-183, score-0.457]
We substitute (16) into (15) and obtain that there exists a constant K > 0 such that if n > V_{L(H,Φ)} · D²_{H,Φ} then for any δ > 0, with probability at least 1 − δ, R(ĥ) ≤ R(h∗) + (C · K · D_{H,Φ} / n) · ( V_{L(H,Φ)} · ln( n / (V_{L(H,Φ)} · D²_{H,Φ}) ) + ln(1/δ) ). [sent-188, score-0.203]
…and obtain our final risk bound, which is summarized in the following theorem (Theorem 4.1). [sent-190, score-0.146]
R(ĥ) ≤ R(h∗) + (C · K · D_{H,Φ} / n) · ( V_{L(H,Φ)} · ln( n / (V_{L(H,Φ)} · D²_{H,Φ}) ) + ln(1/δ) ). (17) According to this bound, R(ĥ) converges to R(h∗) at a rate of 1/n. [sent-199, score-0.2]
In this case the upper bound on R(ĥ) converges to R(φ′) at a rate of 1/n. [sent-202, score-0.219]
73 We now provide further analysis of the risk bound (17). [sent-203, score-0.215]
5 When Privileged ERM is provably better than the regular ERM. We show an example that demonstrates the difference between empirical risk minimization in the X space and empirical risk minimization with privileged information in the joint X × X∗ space. [sent-218, score-1.145]
…the learning rate of the regular ERM in the X space is 1/√n, while the learning rate of Privileged ERM in the joint X × X∗ space is 1/n. [sent-220, score-0.953]
The hypothesis space H consists of the hypotheses h_t(x) = sign(x − t) and h̄_t(x) = −sign(x − t). [sent-227, score-0.264]
The best hypothesis in H is h1 and its generalization error is 1/4 − 2ε. [sent-228, score-0.252]
The hypothesis space H also contains a hypothesis h3, which is slightly worse than h1 and has a generalization error of 1/4 + ε. [sent-229, score-0.436]
In order to use the risk bound (10) with our D_X and H, the condition n > V_H · D²_H = 1/(18ε²) (21) should be satisfied. [sent-235, score-0.233]
Hence, according to (11), for distributions D_X(ε) that satisfy T(1/4 − 2ε, 2, δ) ≤ 18ε² we obtain that R_{01}(ĥ) converges to R_{01}(h∗) at a rate of at least 1/√n. [sent-237, score-0.147]
The following lower bound shows that R_{01}(ĥ) converges to R_{01}(h∗) at a rate of at most 1/√n. [sent-238, score-0.201]
By combining the upper and lower bounds we obtain that the convergence rate of R_{01}(ĥ) to R_{01}(h∗) is exactly 1/√n. [sent-244, score-0.177]
Suppose that the teacher constructed the distribution D_X∗(ε) of examples in the X∗ space in the following way. [sent-246, score-0.159]
The hypothesis space Φ consists of the hypotheses φ_t(x∗) = sign(x∗ − t) and φ̄_t(x∗) = −sign(x∗ − t). [sent-250, score-0.184]
85 The best hypothesis in Φ is φ2 and its generalization error is 0. [sent-251, score-0.252]
The best hypothesis in Φ, among those that have a uniformly better hypothesis in H, is φ1, and its generalization error is 1/4 − 2ε. [sent-253, score-0.427]
…then R_{01}(ĥ) converges to R_{01}(h∗) at a rate of at least 1/n. [sent-275, score-0.147]
Since our bounds on D_Φ and D_{H,Φ} are independent of ε, the convergence rate of 1/n holds for any distribution in D_X. [sent-276, score-0.179]
For …7 < n ≤ 18ε², the upper bound (17) converges to R_{01}(h∗) at a rate of 1/n, while the upper bound (11) converges to R_{01}(h∗) at a rate of 1/√n. [sent-278, score-0.438]
The hypothesis h3 caused the value of D_H to be large and thus prevented us from obtaining a 1/n convergence rate for a large range of n's. [sent-280, score-0.279]
We constructed D_X∗(ε) and Φ in such a way that Φ does not have a hypothesis φ with exactly the same dichotomy as the bad hypothesis h3. [sent-281, score-0.298]
With such a construction, any φ ∈ Φ such that h3 is uniformly better than φ has a generalization error significantly larger than that of h3. [sent-282, score-0.145]
For example, the best hypothesis in Φ for which h3 is uniformly better is φ0, and its generalization error is 1/2. [sent-283, score-0.286]
6 Conclusions. We formulated the algorithm of empirical risk minimization with privileged information and derived a risk bound for it. [sent-284, score-1.036]
Our risk bound outlines conditions on the correcting space that, if satisfied, allow fast learning in the decision space, even if the original learning problem in the decision space is very hard. [sent-285, score-0.861]
96 We showed an example where the privileged information provably significantly improves the learning rate. [sent-286, score-0.666]
In this paper we showed that a good correcting space can improve the learning rate from 1/√n to 1/n. [sent-287, score-0.419]
But, given a good correcting space, can we achieve a learning rate faster than 1/n? [sent-288, score-0.392]
Finally, an important direction is to develop risk bounds for SVM+ (which is a regularized version of Privileged ERM) and to show when it is provably better than SVM. [sent-291, score-0.245]
100 2008 Saint Flour lectures: Oracle inequalities in empirical risk minimization and sparse recovery problems, 2008. [sent-315, score-0.219]
wordName wordTfidf (topN-words)
[('privileged', 0.624), ('dh', 0.331), ('correcting', 0.287), ('erm', 0.28), ('vl', 0.265), ('lupi', 0.176), ('risk', 0.146), ('dx', 0.146), ('hypothesis', 0.141), ('decision', 0.126), ('learner', 0.108), ('teacher', 0.099), ('paradigm', 0.098), ('rate', 0.089), ('vh', 0.074), ('generalization', 0.071), ('loss', 0.07), ('bound', 0.069), ('ln', 0.068), ('yi', 0.067), ('regular', 0.065), ('var', 0.06), ('ckdh', 0.059), ('vashist', 0.059), ('vapnik', 0.055), ('space', 0.043), ('converges', 0.043), ('lemma', 0.043), ('xi', 0.042), ('provably', 0.042), ('hypotheses', 0.04), ('accelerate', 0.04), ('error', 0.04), ('pechyony', 0.039), ('supplies', 0.039), ('ti', 0.038), ('bounds', 0.036), ('suppose', 0.035), ('uniformly', 0.034), ('gine', 0.034), ('convergence', 0.034), ('oracle', 0.033), ('minh', 0.031), ('massart', 0.031), ('satis', 0.031), ('functions', 0.031), ('sign', 0.031), ('minimizes', 0.03), ('assumption', 0.03), ('svm', 0.03), ('minf', 0.029), ('nec', 0.029), ('veri', 0.029), ('minimization', 0.028), ('exists', 0.028), ('appendix', 0.027), ('shattered', 0.027), ('laboratories', 0.027), ('arg', 0.026), ('constant', 0.024), ('empirical', 0.023), ('derivations', 0.023), ('training', 0.023), ('inequalities', 0.022), ('version', 0.021), ('theorem', 0.021), ('fast', 0.021), ('excess', 0.02), ('ht', 0.02), ('holds', 0.02), ('outline', 0.02), ('hard', 0.019), ('let', 0.019), ('princeton', 0.018), ('condition', 0.018), ('upper', 0.018), ('assumptions', 0.018), ('additional', 0.018), ('xor', 0.017), ('hy', 0.017), ('lowing', 0.017), ('devroye', 0.017), ('nato', 0.017), ('saint', 0.017), ('minimal', 0.017), ('minimizer', 0.017), ('classi', 0.016), ('losses', 0.016), ('faster', 0.016), ('axis', 0.016), ('concluded', 0.016), ('dichotomy', 0.016), ('accelerating', 0.016), ('dmitry', 0.016), ('wellknown', 0.016), ('least', 0.015), ('annals', 0.015), ('nj', 0.015), ('caused', 0.015), ('realvalued', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 191 nips-2010-On the Theory of Learning with Privileged Information
Author: Dmitry Pechyony, Vladimir Vapnik
Abstract: In the Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies the learner with privileged information in a correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes the privileged information into account in order to find a good function in the decision space. We outline conditions on the correcting space that, if satisfied, allow Privileged ERM to achieve a much faster learning rate in the decision space than that of regular empirical risk minimization.
2 0.20512016 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(H·R²_n + √(H·L∗)·R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
3 0.11614726 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
4 0.10748369 27 nips-2010-Agnostic Active Learning Without Constraints
Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu
Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification. 1
5 0.07966274 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
6 0.076612696 43 nips-2010-Bootstrapping Apprenticeship Learning
7 0.076001577 22 nips-2010-Active Estimation of F-Measures
8 0.075583048 269 nips-2010-Throttling Poisson Processes
9 0.074664973 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
10 0.073898666 282 nips-2010-Variable margin losses for classifier design
11 0.068832435 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
12 0.066845089 142 nips-2010-Learning Bounds for Importance Weighting
13 0.06307818 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
14 0.062824719 261 nips-2010-Supervised Clustering
15 0.059995942 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
16 0.058076195 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
17 0.054579619 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
18 0.051680423 250 nips-2010-Spectral Regularization for Support Estimation
19 0.051636193 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
20 0.050915819 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio
topicId topicWeight
[(0, 0.142), (1, 0.002), (2, 0.143), (3, 0.002), (4, 0.059), (5, 0.094), (6, -0.107), (7, -0.063), (8, -0.018), (9, -0.033), (10, -0.005), (11, -0.102), (12, -0.083), (13, -0.023), (14, -0.077), (15, 0.02), (16, 0.104), (17, 0.011), (18, 0.028), (19, -0.038), (20, 0.067), (21, -0.035), (22, 0.031), (23, 0.005), (24, 0.039), (25, 0.012), (26, -0.054), (27, 0.045), (28, 0.147), (29, 0.019), (30, 0.022), (31, 0.021), (32, 0.01), (33, -0.018), (34, 0.032), (35, -0.04), (36, 0.035), (37, -0.033), (38, 0.066), (39, -0.008), (40, -0.034), (41, -0.02), (42, 0.05), (43, 0.01), (44, -0.052), (45, 0.075), (46, 0.023), (47, 0.126), (48, -0.045), (49, 0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.92809272 191 nips-2010-On the Theory of Learning with Privileged Information
Author: Dmitry Pechyony, Vladimir Vapnik
Abstract: In the Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies the learner with privileged information in a correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes the privileged information into account in order to find a good function in the decision space. We outline conditions on the correcting space that, if satisfied, allow Privileged ERM to achieve a much faster learning rate in the decision space than that of regular empirical risk minimization.
2 0.82478929 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(H·R²_n + √(H·L∗)·R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
3 0.75085503 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
4 0.7255097 142 nips-2010-Learning Bounds for Importance Weighting
Author: Corinna Cortes, Yishay Mansour, Mehryar Mohri
Abstract: This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Rényi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights which are also commonly used.
5 0.66760659 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
Author: Malik Magdon-Ismail
Abstract: We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based (like the maximum discrepancy) on dependent sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.
6 0.64925879 27 nips-2010-Agnostic Active Learning Without Constraints
7 0.63116866 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
8 0.56637388 282 nips-2010-Variable margin losses for classifier design
9 0.56596756 22 nips-2010-Active Estimation of F-Measures
10 0.5473578 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
11 0.53370583 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
12 0.5036189 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
13 0.5018658 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
14 0.49214104 233 nips-2010-Scrambled Objects for Least-Squares Regression
15 0.47901285 290 nips-2010-t-logistic regression
16 0.47147509 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
17 0.46577865 269 nips-2010-Throttling Poisson Processes
18 0.46050602 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
19 0.45180291 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
20 0.43969941 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
topicId topicWeight
[(13, 0.042), (14, 0.304), (27, 0.024), (30, 0.053), (45, 0.186), (50, 0.026), (52, 0.034), (60, 0.062), (77, 0.055), (78, 0.027), (90, 0.072)]
simIndex simValue paperId paperTitle
1 0.74630094 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1
same-paper 2 0.72510517 191 nips-2010-On the Theory of Learning with Privileged Information
Author: Dmitry Pechyony, Vladimir Vapnik
Abstract: In the Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies the learner with privileged information in a correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes the privileged information into account in order to find a good function in the decision space. We outline conditions on the correcting space that, if satisfied, allow Privileged ERM to achieve a much faster learning rate in the decision space than that of regular empirical risk minimization.
3 0.58394533 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(H·R_n² + √(H·L*)·R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L* = 0) case and Õ(RH/n + √(L*·RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1
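A numeric sketch of the two rate regimes described in the abstract (constants and log factors dropped; the values of R, H, and L* are purely illustrative):

```python
import math

# With R_n = sqrt(R/n), the rate is ~ RH/n in the separable case (L* = 0)
# and ~ RH/n + sqrt(L* * RH/n) otherwise: a nonzero best risk L* slows
# the rate from O(1/n) to O(1/sqrt(n)).
R, H, L_star = 1.0, 1.0, 0.1
for n in (100, 10_000):
    separable = R * H / n
    general = R * H / n + math.sqrt(L_star * R * H / n)
    print(f"n={n:>6}: separable rate ~ {separable:.1e}, "
          f"L*={L_star} rate ~ {general:.1e}")
```

Growing n by 100x shrinks the separable rate by 100x, but shrinks the non-separable rate by only about 10x, which is the fast-rate vs. slow-rate gap the bound captures.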
4 0.58245468 282 nips-2010-Variable margin losses for classifier design
Author: Hamed Masnadi-shirazi, Nuno Vasconcelos
Abstract: The problem of controlling the margin of a classifier is studied. A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss and its margin-enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes-consistent loss functions, of variable margin, are derived. These families are then used to design boosting-style algorithms with explicit control of the classification margin. The new algorithms generalize well-established approaches, such as LogitBoost. Experimental results show that the proposed variable-margin losses outperform the fixed-margin counterparts used by existing algorithms. Finally, it is shown that the best performance is achieved by cross-validating the margin parameter. 1
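One concrete way to obtain a loss family governed by a single margin parameter (a hypothetical parameterization for illustration only, not the families derived in the paper) is to rescale the logistic loss:

```python
import math

def scaled_logistic_loss(v, a=1.0):
    """Margin-scaled logistic-style loss: log(1 + exp(-a*v)) / a.

    At a = 1 this is the ordinary logistic loss; as a grows, the loss
    becomes a sharper approximation of max(0, -v), changing how strongly
    margins near the decision boundary are penalized.
    """
    return math.log1p(math.exp(-a * v)) / a

for a in (0.5, 1.0, 4.0):
    print(f"a={a}: loss at margin v=1 is {scaled_logistic_loss(1.0, a):.3f}")
```

Varying the parameter a, in the spirit of the paper's single margin parameter, trades off how aggressively correctly classified points with small margins are still penalized.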
5 0.58137387 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers, and fat-shattering dimension from statistical learning theory. Relationships among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
6 0.58081347 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
7 0.57979381 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
8 0.57859665 63 nips-2010-Distributed Dual Averaging In Networks
9 0.57842535 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
10 0.57741642 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
11 0.57713932 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
12 0.57701492 265 nips-2010-The LASSO risk: asymptotic results and real world examples
13 0.5768432 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
14 0.57670963 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers
15 0.57665533 250 nips-2010-Spectral Regularization for Support Estimation
16 0.57624245 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
17 0.57596797 290 nips-2010-t-logistic regression
18 0.57573891 222 nips-2010-Random Walk Approach to Regret Minimization
19 0.57398444 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
20 0.57366699 27 nips-2010-Agnostic Active Learning Without Constraints