jmlr jmlr2010 jmlr2010-95 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vladimir Koltchinskii
Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient
Reference: text
sentIndex sentText sentNum sentScore
1 School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30332-0160, USA. Editor: Gabor Lugosi. Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. [sent-3, score-0.483]
2 Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. [sent-5, score-0.369]
3 Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient. [sent-6, score-0.597]
4 The quantity E_P(ℓ • g) := P(ℓ • g) − inf_{g∈G} P(ℓ • g) is called the excess risk of g. [sent-25, score-0.491]
5 The question is whether there are such active learning algorithms for which the excess risk after n iterations is provably smaller than for the passive prediction rules based on the same number n of training examples. [sent-35, score-0.755]
6 In some classification problems, the excess risk of active learning algorithms can converge to zero with an exponential rate as n → ∞ (compared with the rate O(n^{−1}) in the case of passive learning). [sent-40, score-0.755]
7 Castro and Nowak (2008) studied several examples of binary classification problems in which the active learning approach is beneficial and suggested nice active learning algorithms in these problems. [sent-41, score-0.571]
8 There has been progress in the design of active learning algorithms that possess some degree of adaptivity; in particular, see Dasgupta et al. [sent-44, score-0.279]
9 Hanneke showed that incorporating Rademacher complexities in active learning algorithms allows one to develop rather general versions of such algorithms that are adaptive under broad assumptions on the underlying distributions. [sent-53, score-0.348]
10 The question is how many “good” examples are needed for the excess risk E_P(ℓ • ĝ) of the resulting classifier ĝ to become smaller than δ with a guaranteed probability at least 1 − α. [sent-65, score-0.46]
11 To be more specific, suppose that we are dealing with a binary classification problem and that the empirical risk (with respect to the binary loss) is being minimized over a class G of binary functions. [sent-68, score-0.357]
12 These sets can be estimated based on the empirical data and Rademacher complexities of such estimated sets for small enough values of δ are used to define reasonably tight bounds on the excess risk. [sent-70, score-0.39]
13 Moreover, it was used by Giné and Koltchinskii (2006) to obtain refined excess risk bounds in binary classification (in the passive learning case). [sent-82, score-0.567]
14 Thus, there is a possibility to come up with an active learning strategy that, at each iteration, computes an estimate Â of the disagreement set and determines the required sample size, and then samples the required number of design points from the conditional distribution Π(·|x ∈ Â). [sent-88, score-0.361]
15 To give a precise description of a version of the active learning method considered in this paper and to study its statistical properties, several facts from the general theory of empirical risk minimization will be needed. [sent-100, score-0.509]
16 In particular, in Section 2, we describe a construction of distribution dependent and data dependent bounds on the excess risk based on localized Rademacher complexities, see Koltchinskii (2006, 2008). [sent-101, score-0.561]
17 We prove several bounds on the number of active examples needed to achieve this goal with a specified probability. [sent-104, score-0.309]
18 However, we do not pursue this possibility here and, instead, we concentrate in Section 4 on binary classification problems, which still remain the most interesting class of learning problems where the active learning approach leads to faster convergence rates. [sent-108, score-0.294]
19 The optimization problem P f −→ min, f ∈ F, is interpreted as a “risk minimization” problem, and the quantity E_P(f) := P f − inf_{g∈F} P g is called the excess risk of f. [sent-119, score-0.491]
20 The empirical measure based on a sample (X_1, . . . , X_n) of size n is defined as P_n := n^{−1} ∑_{j=1}^{n} δ_{X_j}, and the problem of risk minimization is replaced by the “empirical risk minimization”: P_n f −→ min, f ∈ F. [sent-125, score-0.406]
21 Given a solution f̂ of the empirical risk minimization problem (1), a basic question is to provide reasonably tight upper confidence bounds on the excess risk E_P(f̂) that depend on complexity characteristics of the class F. [sent-127, score-0.705]
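As a purely illustrative reading of these definitions (not taken from the paper), the following sketch performs empirical risk minimization over a toy finite class of threshold classifiers and approximates the excess risk of the empirical minimizer by Monte Carlo; the class, the distribution and all constants are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite class F of threshold classifiers f_t(x) = 1{x > t} on [0, 1].
thresholds = np.linspace(0.0, 1.0, 21)


def risk(t, X, Y):
    """0-1 risk of f_t under the empirical measure of (X, Y)."""
    return np.mean((X > t).astype(int) != Y)


# "Population" P: X ~ Uniform[0, 1], Y = 1{X > 0.3} flipped with probability 0.1.
X_pop = rng.uniform(size=200_000)
Y_pop = ((X_pop > 0.3) ^ (rng.uniform(size=X_pop.size) < 0.1)).astype(int)

# Training sample of size n and empirical risk minimization over F.
n = 200
X, Y = X_pop[:n], Y_pop[:n]
t_hat = thresholds[np.argmin([risk(t, X, Y) for t in thresholds])]

# Excess risk E_P(f_hat) = P f_hat - inf_{f in F} P f, approximated on the large sample.
pop_risks = np.array([risk(t, X_pop, Y_pop) for t in thresholds])
excess = risk(t_hat, X_pop, Y_pop) - pop_risks.min()
print(f"selected threshold {t_hat:.2f}, approximate excess risk {excess:.4f}")
```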
22 Given ψ : R_+ → R_+, define ψ♭(δ) := sup_{σ≥δ} ψ(σ)/σ and ψ♯(ε) := inf{δ > 0 : ψ♭(δ) ≤ ε}. [sent-138, score-0.315]
23 Thus, the quantity δ̄_n(F; P) is a distribution dependent upper bound on the excess risk E_P(f̂) that holds with a guaranteed probability. [sent-142, score-0.509]
24 This ratio bound for the excess risk immediately implies the following statement showing that for all the values of δ above a certain threshold the δ-minimal sets of the empirical risk provide estimates of the δ-minimal sets of the true risk. [sent-144, score-0.696]
25 Data dependent upper confidence bounds on the excess risk can be constructed using localized sup-norms of Rademacher processes that provide a way to estimate the size of the empirical process. [sent-147, score-0.557]
26 These quantities (with suprema over f, g ∈ F_{P_n}(δ)) are empirical versions of φ_n(δ) and D_P(δ), and they can be used to define an empirical version of the function Ū_n: Û_n(δ) := K̂ ∑_{j≥0} [ φ̂_n(ĉ δ_j) + D̂_n(ĉ δ_j) √(t_j/n) + t_j/n ] I_{(δ_{j+1}, δ_j]}(δ), where K̂, ĉ are numerical constants. [sent-154, score-0.278]
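A minimal Monte Carlo sketch of such an empirical quantity is given below: it estimates a localized Rademacher complexity, namely the expected supremum of |n^{-1} ∑_j ε_j (f − g)(X_j)| over pairs from a δ-minimal set of a finite class. The class, the loss values and the number of sign draws are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)


def delta_minimal_set(losses, delta):
    """Indices of functions whose empirical risk is within delta of the minimum.

    losses: (num_functions, n) array of loss values f(X_j) for each f in the class.
    """
    emp_risks = losses.mean(axis=1)
    return np.where(emp_risks <= emp_risks.min() + delta)[0]


def localized_rademacher(losses, delta, num_draws=200):
    """Monte Carlo estimate of E_eps sup_{f,g in F_hat(delta)} |n^{-1} sum_j eps_j (f - g)(X_j)|."""
    idx = delta_minimal_set(losses, delta)
    diffs = losses[idx][:, None, :] - losses[None, idx, :]   # pairwise f - g, shape (k, k, n)
    n = losses.shape[1]
    sups = []
    for _ in range(num_draws):
        eps = rng.choice([-1.0, 1.0], size=n)                # Rademacher signs
        sups.append(np.abs((diffs * eps).mean(axis=2)).max())
    return float(np.mean(sups))


# Toy class: 50 random {0,1}-valued loss vectors on n = 100 points.
losses = rng.integers(0, 2, size=(50, 100)).astype(float)
print(localized_rademacher(losses, delta=0.05))
```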
27 In what follows, it will be of interest to consider sequential learning algorithms in which the sample size is being gradually increased until the excess risk becomes smaller than a given level δ. [sent-168, score-0.421]
28 The sample size needed to achieve a given accuracy δ (i.e., to “learn” a function for which the excess risk is smaller than δ) can itself be learned from the data. [sent-196, score-0.421]
29 More precisely, the estimator n̂(δ) of the required sample size can be computed sequentially by increasing the sample size n gradually, computing for each n the data dependent excess risk bound δ̂_n and stopping as soon as δ̂_n ≤ δ. [sent-197, score-0.51]
30 At the same time, the sample size n̂(δ) is sufficient for estimating the σ-minimal sets of the true risk by the σ-minimal sets of the empirical risk for all σ ≥ δ (in the sense of the inclusions of Proposition 6). [sent-199, score-0.621]
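The stopping rule just described can be sketched as follows; the bound function and the sampling routine are placeholders standing in for the paper's data dependent constructions, so this is only a schematic illustration.

```python
import numpy as np


def estimated_sample_size(delta, sample_sizes, excess_risk_bound, draw_sample):
    """Smallest n on the grid for which the data dependent bound drops below delta.

    excess_risk_bound(X, Y) -> float stands in for a bound delta_hat_n (e.g., one built
    from localized Rademacher complexities); draw_sample(n) returns n labeled examples.
    """
    for n in sample_sizes:
        X, Y = draw_sample(n)
        if excess_risk_bound(X, Y) <= delta:
            return n
    return None   # the accuracy delta is not reachable within the grid


# Illustrative use with a crude bound of the form 3 / sqrt(n) (not the paper's bound).
rng = np.random.default_rng(2)
grid = [2 ** k for k in range(5, 15)]
n_hat = estimated_sample_size(
    delta=0.05,
    sample_sizes=grid,
    excess_risk_bound=lambda X, Y: 3.0 / np.sqrt(len(X)),
    draw_sample=lambda n: (rng.uniform(size=n), rng.integers(0, 2, size=n)),
)
print(n_hat)   # first grid point with 3 / sqrt(n) <= 0.05, i.e., n = 4096
```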
31 These facts will play a crucial role in our design of active learning methods in the next section. [sent-200, score-0.279]
32 Â_k := {x : sup_{f,g∈F̂_{k−1}} |f(x) − g(x)| > δ_k};  P̂_k := n̄_k^{−1} ∑_{j=1}^{n̄_k} I_{Â_k}(X_j) δ_{X_j};  F̂_k := F̂_{k−1} ∩ F_{P̂_k}(3δ_k); end. The set Â_k defined at each iteration of the algorithm is viewed as a set of “active examples” (or an “active set”). [sent-208, score-0.788]
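The iteration above can be rendered schematically for a finite class of [0, 1]-valued losses as in the sketch below; the trimming step mirrors F̂_k := F̂_{k−1} ∩ F_{P̂_k}(3δ_k), but the data, the names and the constants are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np


def active_iteration(F_values, delta_k):
    """One schematic iteration in the spirit of Algorithm 1 (illustrative only).

    F_values: (num_functions, n) array of values f(X_j) for the functions still in F_{k-1}.
    Returns a boolean mask of active points and a boolean mask of the functions kept.
    """
    # Active set A_k: points where some pair f, g in F_{k-1} differs by more than delta_k.
    spread = F_values.max(axis=0) - F_values.min(axis=0)
    active = spread > delta_k

    # Restricted empirical measure P_k: only active points contribute (others get weight 0).
    n = F_values.shape[1]
    restricted_risks = (F_values * active).sum(axis=1) / n

    # Trim the class: keep the 3*delta_k-minimal set, mimicking F_k = F_{k-1} ∩ F_{P_k}(3 delta_k).
    keep = restricted_risks <= restricted_risks.min() + 3 * delta_k
    return active, keep


# Toy usage: 30 random [0, 1]-valued "loss functions" evaluated on 500 design points.
rng = np.random.default_rng(3)
F_values = rng.uniform(size=(30, 500))
active, keep = active_iteration(F_values, delta_k=0.5)
print(int(active.sum()), "active points,", int(keep.sum()), "functions kept")
```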
33 Let ν(δ) denote the total number of active examples used by the algorithm in the first L iterations. [sent-218, score-0.277]
34 By the induction assumption, F_P(δ_k) ⊂ F_P(δ_{k−1}) ⊂ F̂_{k−1} and, by the definition of Â_k, we have for all f, g ∈ F̂_{k−1}: |P_{n̄_k}(f − g) − P̂_k(f − g)| = |n̄_k^{−1} ∑_{i=1}^{n̄_k} (f − g)(X_i) − n̄_k^{−1} ∑_{i: X_i∈Â_k} (f − g)(X_i)| = |n̄_k^{−1} ∑_{i: X_i∉Â_k} (f − g)(X_i)| ≤ δ_k. [sent-227, score-0.348]
35 Also, by the inclusions of Proposition 6 and the definition of n̄_k = n̄(δ_k), we have on the event H that F_P(δ_k) ⊂ F_{P_{n̄_k}}(2δ_k). [sent-229, score-0.602]
36 Thus, using again the inclusions of Proposition 6, we get F̂_k ⊂ F_{P_{n̄_k}}(4δ_k) ⊂ F_P(8δ_k), proving the inclusions (2). To prove the bound on ν(δ), note that on the event H, where the inclusions (2) hold for all k such that δ_k ≥ δ, we have Â_k ⊂ A(8δ_{k−1}). [sent-234, score-0.734]
37 Hence, on the event H, ν(δ) ≤ ∑_{δ_k≥δ} ν_k, where ν_k := ∑_{j=1}^{n̄_k} I_{A(8δ_{k−1})}(X_j). [sent-235, score-0.397]
38 Clearly, ν_k is a binomial random variable with parameters n̄_k and π(δ_{k−1}). [sent-236, score-0.328]
39 Taking s := e t n̄_k π(δ_{k−1}) yields P{ν_k ≥ e t n̄_k π(δ_{k−1})} ≤ exp{−n̄_k π(δ_{k−1}) t log t}. [sent-241, score-0.656]
40 Applying the union bound, we get P{ν(δ) ≥ e t ∑_{δ_k≥δ} n̄_k π(δ_{k−1})} ≤ P(H^c) + ∑_{δ_k≥δ} exp{−n̄_k π(δ_{k−1}) t log t}. [sent-242, score-0.328]
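A quick numerical sanity check of a binomial tail bound of this Poissonian type is shown below, with our own illustrative choices of n, π and t (and assuming t > 1 so that log t ≥ 0).

```python
import numpy as np
from scipy import stats

# Check numerically that P{ Bin(n, pi) >= e * t * n * pi } <= exp(-n * pi * t * log t) for t > 1.
n, pi = 500, 0.02
for t in [1.5, 2.0, 4.0, 8.0]:
    threshold = np.e * t * n * pi
    exact_tail = stats.binom.sf(np.ceil(threshold) - 1, n, pi)   # P{ nu >= threshold }
    chernoff = np.exp(-n * pi * t * np.log(t))
    print(f"t={t:3.1f}  exact={exact_tail:.2e}  bound={chernoff:.2e}  holds={exact_tail <= chernoff}")
```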
41 The simplest way to make the method data dependent is to replace in Algorithm 1 the sample sizes n̄_k = n̄(δ_k) by their estimates n̂_k := n̂(δ_k), k ≥ 1, and to redefine Â_k, P̂_k and F̂_k in the iterative procedure as follows: P̂_k := n̂_k^{−1} ∑_{j=1}^{n̂_k} I_{Â_k}(X_j) δ_{X_j}. [sent-244, score-1.043]
42 This modification of Algorithm 1 will be called Algorithm 2. [sent-245, score-0.328]
43 Recall the definition of the number of iterations L and also that ν(δ) denotes the number of active examples used by the algorithm in the first L iterations. [sent-247, score-0.277]
44 Note that in this version of the algorithm all the training examples X_j (not only the examples in the active sets Â_k) are used to determine the sample sizes n̂_k. [sent-250, score-0.644]
45 Thus, in the cases when sampling the design points is “cheap” and only assigning the labels to them is “expensive” (which is a common motivational assumption in the literature on active learning), the algorithms of this type make some sense (see Section 4 for more details). [sent-253, score-0.279]
46 Note that the following iterative relationships hold for the distribution dependent sample sizes n̄_k := n̄(δ_{k+1}) and ñ_k := ñ(δ_{k+1}): n̄_k = min{n ∈ M, n ≥ n̄_{k−1} : Ū_n(δ_k) ≤ (1/2)δ_{k+1}}, n̄_0 := inf M, and ñ_k = min{n ∈ M, n ≥ ñ_{k−1} : Ũ_n(δ_k) ≤ (1/2)δ_{k+1}}, ñ_0 := inf M. [sent-260, score-1.494]
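The nesting of these sample sizes can be mimicked by the following schematic recursion over a grid M; the function playing the role of U_n(δ) is a made-up placeholder, not the paper's construction.

```python
import numpy as np


def nested_sample_sizes(deltas, grid, bound_fn):
    """Iteratively computes n_k = min{ n in grid, n >= n_{k-1} : bound_fn(n, delta_k) <= delta_{k+1}/2 }.

    bound_fn(n, delta) is a placeholder for the function U_n(delta); only the nesting
    structure of the recursion is mimicked here.
    """
    sizes = []
    prev = min(grid)
    for k in range(len(deltas) - 1):
        candidates = [n for n in grid if n >= prev and bound_fn(n, deltas[k]) <= deltas[k + 1] / 2]
        if not candidates:
            break
        prev = min(candidates)
        sizes.append(prev)
    return sizes


# Illustrative run with a made-up bound shape U_n(delta) ~ sqrt(1 / (n * delta)).
grid = [2 ** j for j in range(4, 22)]
deltas = [2.0 ** -k for k in range(0, 10)]
print(nested_sample_sizes(deltas, grid, lambda n, d: np.sqrt(1.0 / (n * d))))
```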
47 Â_k := {x : sup_{f,g∈F̂_{k−1}} |f(x) − g(x)| > ĉδ_k};  n̂_k := min{n ∈ M, n ≥ n̂_{k−1} : Û_n^{(k)} ≤ (1/2)δ_{k+1}};  F̂_k := F̂_{k−1} ∩ F_{P̂_k}(3δ_k); end. As before, we define A(δ) := {x : sup_{f,g∈F_P(8δ)} |f(x) − g(x)| > ĉδ} and π(δ) := P(A(δ)). [sent-267, score-0.592]
48 Theorem 9 There exist numerical constants ĉ in the definition of the active sets Â_k, K̂ in the definition of Û_n^{(k)}, and K̃, c̃ in the definition of the function Ũ_n such that the following holds. [sent-269, score-0.284]
49 With probability at least 1 − 3 ∑_{j≥0} ∑_{n∈M} e^{−t_j^{(n)}}, the following inequalities and inclusions hold for all k ≥ 0: n̄_k ≤ n̂_k ≤ ñ_k, F_P(δ_k) ⊂ F̂_k ⊂ F_P(8δ_k). [sent-270, score-1.229]
50 Then define E_{k,n} := { K̂ [ sup_{f,g∈F_P(8δ_{k−1})} R_n(f − g) + D_{P_n}(F_P(8δ_{k−1})) √(t_k^{(n)}/n) + t_k^{(n)}/n ] ≤ (1/2) Ũ_n(δ_k) } and E′_{k,n} := { K̂ [ sup_{f,g∈F_P(δ_{k−1})} R_n(f − g) + D_{P_n}(F_P(δ_{k−1})) √(t_k^{(n)}/n) + t_k^{(n)}/n ] ≥ 2 Ū_n(δ_k) }. [sent-276, score-0.426]
51 n̄_k ≤ n̂_k ≤ ñ_k, F_P(δ_k) ⊂ F̂_k ⊂ F_P(8δ_k), and also for k = 1, 2, . . . [sent-283, score-0.984]
52 By this induction assumption, we have F̂_{k−1} ⊂ F_P(8δ_{k−1}) and, by the definition of the set Â_k, sup_{f,g∈F̂_{k−1}} R̂_n^{(k)}(f − g) ≤ sup_{f,g∈F̂_{k−1}} R_n(f − g) + ĉδ_k and D²_{P̂_n^{(k)}}(F̂_{k−1}) ≤ D²_{P_n}(F̂_{k−1}) + ĉ²δ_k². [sent-289, score-0.284]
53 This implies the following upper bound on Û_n^{(k)}: Û_n^{(k)} ≤ K̂ [ sup_{f,g∈F_P(8δ_{k−1})} R_n(f − g) + D_{P_n}(F_P(8δ_{k−1})) √(t_k^{(n)}/n) + t_k^{(n)}/n + ĉδ_k + ĉδ_k √(t_k^{(n)}/n) ]. [sent-290, score-0.323]
54 Applying (6) to n = n̂_k, we get 2Ū_{n̂_k}(δ_k) − δ_{k+1}/2 ≤ Û_{n̂_k}^{(k)} ≤ δ_{k+1}/2, which yields Ū_{n̂_k}(δ_k) ≤ δ_{k+1}/2. [sent-294, score-0.328]
55 We also have n̂_k ≥ n̂_{k−1} ≥ n̄_{k−1} (by the induction assumption). [sent-295, score-0.348]
56 By the definition of n̄_k, this implies that n̂_k ≥ n̄_k. [sent-296, score-0.984]
57 On the other hand, denote by n̂_k^− the element of the ordered set M preceding n̂_k. [sent-297, score-0.328]
58 (7) If it happened that n̂_k^− < n̂_{k−1}, then we must have n̂_k = n̂_{k−1}, which, by the induction assumption, implies that n̂_k = n̂_{k−1} ≤ ñ_{k−1} ≤ ñ_k. [sent-300, score-1.004]
59 If n̂_k^− ≥ n̂_{k−1}, then the definition of n̂_k implies that Û_{n̂_k^−}^{(k)} > δ_{k+1}/2, which together with (7) implies that Ũ_{n̂_k^−}(δ_k) > δ_{k+1}/2. [sent-301, score-0.328]
60 But, if n̂_k > ñ_k, then n̂_k^− ≥ ñ_k, which would imply that Ũ_{ñ_k}(δ_k) > δ_{k+1}/2 (since for all δ, Ũ_n(δ) is a nonincreasing function of n). [sent-302, score-1.016]
61 The last inequality contradicts the definition of ñ_k, implying that n̂_k ≤ ñ_k. [sent-303, score-0.984]
62 As soon as π(δ) → 0 as δ → 0, the upper bounds on ν(δ) show that, in the case of active learning, there is a reduction of the number of training examples needed to achieve a desired accuracy of learning compared with passive learning. [sent-306, score-0.406]
63 For a binary classifier g, define its excess risk as E_P(ℓ • g) := P(ℓ • g) − P(ℓ • g_∗). [sent-321, score-0.458]
64 It is natural to characterize the quality of the classifier ĝ in terms of its excess risk E_P(ℓ • ĝ) and to study how it depends on the complexity of the class G as well as on the complexity of the classification problem itself. [sent-336, score-0.421]
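For the 0-1 loss, the excess risk admits the standard representation E_P(ℓ • g) = E[|2η(X) − 1| 1{g(X) ≠ g_∗(X)}], where η(x) = P(Y = 1 | X = x) and g_∗ is the Bayes classifier (a well known fact, not specific to this paper). The sketch below evaluates it by Monte Carlo for a toy example.

```python
import numpy as np

rng = np.random.default_rng(4)


def excess_risk_01(g, eta, X):
    """Monte Carlo estimate of E_P(l . g) = E[ |2*eta(X) - 1| * 1{g(X) != g_*(X)} ].

    eta(x) = P(Y = 1 | X = x) and g_*(x) = 1{eta(x) >= 1/2} is the Bayes classifier.
    """
    eta_x = eta(X)
    bayes = (eta_x >= 0.5).astype(int)
    return np.mean(np.abs(2 * eta_x - 1) * (g(X) != bayes))


# Toy example: eta(x) = x on [0, 1]; a threshold classifier at 0.6 instead of the Bayes 0.5.
X = rng.uniform(size=100_000)
print(excess_risk_01(lambda x: (x > 0.6).astype(int), lambda x: x, X))   # about 0.01 = integral of (2x-1) over [0.5, 0.6]
```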
65 The following theorem is a version of the result proved by Massart and Nédélec (2006): Theorem 10 There exists a constant K > 0 such that, for all t > 0, with probability at least 1 − e^{−t}, E_P(ℓ • ĝ) ≤ K [ ( (V/(nh)) log(nh²/V) + t/(nh) ) ∧ ( √(V/n) + √(t/n) ) ]. [sent-343, score-0.302]
66 This upper bound on the excess risk is optimal in a minimax sense (as was also shown by Massart and Nédélec, 2006). [sent-344, score-0.472]
67 Theorem 11 There exists a constant K > 0 such that, for all t > 0, with probability at least 1 − e^{−t}, E_P(ℓ • ĝ) ≤ K [ ( (V/(nh)) log τ(V/(nh²)) + t/(nh) ) ∧ ( √(V/n) + √(t/n) ) ]. [sent-354, score-0.338]
68 The proof is based on applying subtle bounds for empirical processes (see Giné and Koltchinskii, 2006) to compute the excess risk bound δ̄_n of Section 2. [sent-356, score-0.518]
69 In this case, with probability at least 1 − e^{−t}, E_P(ℓ • ĝ) ≤ K [ (V/(nh)) log τ_0 + t/(nh) ], so the main term of the bound is of the order O(V/(nh)) and it does not contain logarithmic factors depending on n and h. [sent-359, score-0.367]
70 The quantity n(δ, α) shows how many training examples are needed to make the excess risk of the classifier g smaller than δ with a guaranteed probability of at least 1 − α. [sent-363, score-0.479]
71 In the case of empirical risk minimization over a VC-class with a bounded capacity function τ, the sample complexity is of the order O(V/(hδ)). [sent-365, score-0.279]
72 The role of the capacity function is rather modest in the case of passive learning since it only allows one to refine the excess risk and the sample complexity bounds by making the logarithmic factors more precise. [sent-366, score-0.557]
73 However, the capacity function τ turned out to be of crucial importance in the analysis of active learning methods of binary classification. [sent-367, score-0.321]
74 This function was rediscovered in the active learning literature and its supremum is used there under the name of disagreement coefficient; see, for example, Hanneke (2009a,b) and references therein. [sent-368, score-0.339]
75 and also a nondecreasing data dependent sequence of estimated sample sizes n̂_k. [sent-377, score-0.387]
76 This set will be used as a set of active design points at the k-th iteration. [sent-383, score-0.279]
77 Next define active empirical distributions based on the unlabeled examples {X_j} and on the labeled examples {(X_j, Y_j)}: Π̂_n^{(k)} := n^{−1} ∑_{j=1}^{n} I_{Â_k}(X_j) δ_{X_j} and P̂_n^{(k)} := n^{−1} ∑_{j=1}^{n} I_{Â_k}(X_j) δ_{(X_j,Y_j)}. For simplicity, we will also use the notation P̂_k := P̂_{n̂_k}^{(k)}. [sent-384, score-0.333]
78 This simple observation allows one to replace the Rademacher complexities for the loss class F = ℓ • G by the Rademacher complexities for the class G itself (and the proofs of the excess risk bounds and other results cited in Section 2 go through with no changes). [sent-408, score-0.767]
79 Â_k := {x : ∃ g_1, g_2 ∈ Ĝ_{k−1}, g_1(x) ≠ g_2(x)};  n̂_k := min{n ∈ M, n ≥ n̂_{k−1} : Û_n^{(k)} ≤ (1/2)δ_{k+1}};  Ĝ_k := Ĝ_{k−1} ∩ G_{P̂_k}(3δ_k); end. Remark 12 One can also use in Algorithm 4 the Rademacher complexities defined as follows: R̂_n^{(k)} := sup_{g_1,g_2∈Ĝ_{k−1}} | n^{−1} ∑_{j=1}^{n} ε_j (g_1 − g_2)(X_j) |. [sent-416, score-0.551]
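A compact, illustrative rendering of this disagreement-based iteration for a finite class of binary classifiers is sketched below; the label-budget rule and the trimming constant are placeholders rather than the paper's Û_n^{(k)}-based sample size choices.

```python
import numpy as np


def disagreement_active_learning(G_preds, Y, deltas, label_budget_per_round):
    """Schematic disagreement-based active learning over a finite class of classifiers.

    G_preds: (num_classifiers, n) array of {0,1} predictions on the n design points.
    Y:       labels; Y[j] is looked at only if point j lies in the current disagreement
             region and is selected for labelling.
    deltas:  decreasing sequence of target accuracy levels.
    """
    num_g, n = G_preds.shape
    alive = np.ones(num_g, dtype=bool)      # current class G_k
    labeled = np.zeros(n, dtype=bool)
    queried = 0

    for delta_k in deltas:
        # Disagreement set A_k: points where two surviving classifiers disagree.
        preds = G_preds[alive]
        disagree = preds.min(axis=0) != preds.max(axis=0)

        # Query labels for a budget of still-unlabeled points in the disagreement set.
        candidates = np.where(disagree & ~labeled)[0]
        take = candidates[:label_budget_per_round]
        labeled[take] = True
        queried += take.size

        # Empirical 0-1 risks on the actively labeled disagreement points (all surviving
        # classifiers agree outside A_k, so those points do not separate them).
        pts = np.where(labeled & disagree)[0]
        if pts.size == 0:
            continue
        risks = (G_preds[alive][:, pts] != Y[pts]).mean(axis=1)

        # Trim the class: keep classifiers within 3*delta_k of the best, in the spirit
        # of G_k = G_{k-1} ∩ G_{P_k}(3*delta_k).
        keep = risks <= risks.min() + 3 * delta_k
        alive[np.where(alive)[0][~keep]] = False

    return alive, queried


# Toy usage: threshold classifiers on [0, 1]; Bayes threshold 0.3 with 10% label noise.
rng = np.random.default_rng(5)
X = rng.uniform(size=2000)
Y = ((X > 0.3) ^ (rng.uniform(size=X.size) < 0.1)).astype(int)
thresholds = np.linspace(0.0, 1.0, 41)
G_preds = (X[None, :] > thresholds[:, None]).astype(int)
alive, queried = disagreement_active_learning(
    G_preds, Y, deltas=2.0 ** -np.arange(1, 8), label_budget_per_round=100)
print("surviving thresholds:", np.round(thresholds[alive], 3), "labels queried:", queried)
```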
80 In this case, not only the active design points, but all the design points X_j are used to compute the Rademacher complexities and to estimate the sample sizes n̂_k. [sent-417, score-0.739]
81 Then there exists an event of probability at least 1 − α such that the following inclusions hold for the classes Ĝ_k output by Algorithm 4: for all k ≥ 0, G_P(δ_k) ⊂ Ĝ_k ⊂ G_P(8δ_k). [sent-426, score-0.314]
82 (10) Also, with probability at least 1 − α, the following bound on the number ν(δ) of active training examples used by Algorithm 4 in the first L = log_2(1/δ) iterations holds with some numerical constant C > 0: ν(δ) ≤ C τ_0 log(1/δ) [ V log τ_0 + log(1/α) + log log(1/δ) + log log(1/h) ]. [sent-427, score-0.49]
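For intuition only, the sketch below compares orders of magnitude: a passive sample complexity of order V/(hδ) (as discussed above for a bounded capacity function) against the active label complexity of order τ_0 log(1/δ)(V log τ_0 + log(1/α)) suggested by the bound just stated. The constants are arbitrary and the expressions are schematic stand-ins, not exact statements of the bounds.

```python
import numpy as np

# Schematic comparison only: constants are arbitrary and the formulas are stand-ins
# for the orders of magnitude suggested by the passive and active bounds above.
V, h, tau0, alpha = 10, 0.2, 5.0, 0.05

def passive_examples(delta):
    return V / (h * delta)

def active_labels(delta):
    return tau0 * np.log(1 / delta) * (V * np.log(tau0) + np.log(1 / alpha))

for delta in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(f"delta={delta:7.0e}   passive ~ {passive_examples(delta):12.0f}"
          f"   active ~ {active_labels(delta):10.0f}")
```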
83 Namely, with some constant C_1, δ̄_n ≤ C_1 [ (V/(nh)) log τ(V/(nh²)) + (log(1/α) + log log n)/(nh) ]. [sent-434, score-0.539]
84 Note that, under Massart's low noise assumption, formula (8) for the excess risk implies that for all binary classifiers g, E(ℓ • g) ≥ h Π{x : g(x) ≠ g_∗(x)}. [sent-436, score-0.458]
85 This gives ν(δ) ≤ e² ∑_{δ_j≥δ} K_1 (8τ_0 δ_{j−1}/h) [ V log τ_0 + log(1/α) + log log_2(1/δ) + log log(1/h) ] / (δ_{j+1} h), which is bounded from above by C τ_0 log(1/δ) [ V log τ_0 + log(1/α) + log log(1/δ) + log log(1/h) ]. [sent-442, score-0.33]
86 It is well known that under this assumption the following bound on the excess risk holds for an arbitrary classifier g: E_P(ℓ • g) ≥ c Π^κ({g ≠ g_∗}), where κ = (1+γ)/γ and c is a constant that depends on B, κ. [sent-447, score-0.45]
87 Then, the following upper bound on the excess risk of an empirical risk minimizer ĝ holds with probability at least 1 − e^{−t}: E_P(ℓ • ĝ) ≤ K [ n^{−κ/(2κ+ρ−1)} + (t/n)^{κ/(2κ−1)} ], where K is a constant depending on κ, ρ, A, B. [sent-451, score-0.695]
88 It easily follows from this bound that in order to achieve an excess risk of order δ one needs O(δ^{−2+(1−ρ)/κ}) training examples. [sent-453, score-0.45]
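A quick symbolic check (ours, purely illustrative) that inverting the main term n^{−κ/(2κ+ρ−1)} at level δ gives the stated sample complexity exponent −2 + (1 − ρ)/κ:

```python
import sympy as sp

kappa, rho, delta = sp.symbols("kappa rho delta", positive=True)

# Setting delta = n**(-kappa/(2*kappa + rho - 1)) and solving for n gives
# n = delta**(-(2*kappa + rho - 1)/kappa); compare its exponent with the stated one.
exponent_from_rate = -(2 * kappa + rho - 1) / kappa
stated_exponent = -2 + (1 - rho) / kappa
print(sp.simplify(exponent_from_rate - stated_exponent))   # prints 0
```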
89 Then there exists an event of probability at least 1 − α such that the following inclusions hold for the classes Ĝ_k output by Algorithm 4: for all k with δ_k ≥ δ, G_P(δ_k) ⊂ Ĝ_k ⊂ G_P(8δ_k). [sent-460, score-0.314]
90 Also, with probability at least 1 − α, the following bound on the number ν(δ) of active training examples used by Algorithm 4 holds with some constant C > 0 depending on κ, ρ, A, B: ν(δ) ≤ C τ_0 [ δ^{−2+(2−ρ)/κ} + δ^{−2+2/κ} (log(1/α) + log log(1/δ)) ]. [sent-461, score-0.38]
91 Remark 15 Alternatively, one can assume that the active learning algorithm stops as soon as the specified number of active examples, say, n̂, has been achieved. [sent-464, score-0.561]
92 If L̂ denotes the number of iterations needed to achieve this target, then 8δ_{L̂} is an upper bound on the excess risk of the classifiers from the set Ĝ_{L̂}. [sent-465, score-0.45]
93 Thus, the excess risk of such classifiers tends to zero exponentially fast as n̂ → ∞. [sent-467, score-0.421]
94 This is the form in which the excess risk bounds in active learning are usually stated in the literature (see, e. [sent-468, score-0.71]
95 In fact, this is a refinement of the bounds of Hanneke that were proved for somewhat different active learning algorithms (see Hanneke, 2009b, Theorems 4, 5). [sent-471, score-0.312]
96 Remark 16 Although we concentrated in this section only on binary classification problems, the active learning algorithms described in Section 3 can also be used in the context of multiclass classification and some other problems (e. [sent-474, score-0.294]
97 (2009), one can now replace the disagreement set Â_k for the class F̂_{k−1} = ℓ • Ĝ_{k−1} involved in Algorithm 3 by a larger set Â_k^+ := {(x, y) : ∃ g_1, g_2 ∈ Ĝ_{k−1}, sup_{y′∈T} |ℓ(y′, g_1(x)) − ℓ(y′, g_2(x))| > ĉδ_k} = {x : ∃ g_1, g_2 ∈ Ĝ_{k−1}, sup_{y∈T} |ℓ(y, g_1(x)) − ℓ(y, g_2(x))| > ĉδ_k} × T. [sent-479, score-0.346]
98 n̂_k := min{n ∈ M, n ≥ n̂_{k−1} : Û_n^{(k)} ≤ (1/2)δ_{k+1}};  Ĝ_k := Ĝ_{k−1} ∩ G_{P̂_k}(3δ_k); end. The Rademacher complexities of the classes F̂_k = ℓ • Ĝ_k (the quantities Û_n^{(k)}) as well as the active empirical measures P̂_k involved in this algorithm are now based on the active sets Â_k^+. [sent-484, score-0.969]
99 Clearly, only the labels of active examples are used in this version of the algorithm. [sent-485, score-0.277]
100 Local Rademacher complexities and oracle inequalities in risk minimization. [sent-576, score-0.312]
wordName wordTfidf (topN-words)
[('fp', 0.333), ('un', 0.328), ('nk', 0.328), ('active', 0.257), ('excess', 0.231), ('koltchinskii', 0.224), ('fk', 0.214), ('inclusions', 0.205), ('risk', 0.19), ('rademacher', 0.175), ('xcess', 0.169), ('gk', 0.164), ('fpn', 0.157), ('ep', 0.15), ('isk', 0.144), ('sup', 0.132), ('hanneke', 0.121), ('ak', 0.095), ('complexities', 0.091), ('nh', 0.088), ('gin', 0.083), ('pn', 0.083), ('disagreement', 0.082), ('tk', 0.081), ('passive', 0.077), ('dpn', 0.072), ('gp', 0.071), ('event', 0.069), ('rn', 0.068), ('massart', 0.065), ('balcan', 0.065), ('epn', 0.06), ('iak', 0.06), ('earning', 0.058), ('pk', 0.056), ('log', 0.055), ('inf', 0.051), ('epk', 0.048), ('fpk', 0.048), ('tu', 0.048), ('castro', 0.046), ('logt', 0.046), ('dasgupta', 0.045), ('proposition', 0.044), ('tj', 0.043), ('talagrand', 0.043), ('fq', 0.043), ('classi', 0.042), ('dependent', 0.04), ('nitions', 0.037), ('wellner', 0.037), ('binary', 0.037), ('argming', 0.036), ('nedelec', 0.036), ('pnk', 0.036), ('empirical', 0.036), ('beygelzimer', 0.034), ('bounds', 0.032), ('nonincreasing', 0.032), ('oracle', 0.031), ('jy', 0.031), ('bound', 0.029), ('theorem', 0.029), ('localized', 0.028), ('concentration', 0.028), ('notations', 0.028), ('stops', 0.027), ('vaart', 0.027), ('constants', 0.027), ('capacity', 0.027), ('measurable', 0.026), ('minimization', 0.026), ('dp', 0.026), ('py', 0.026), ('unk', 0.026), ('gpk', 0.024), ('ical', 0.024), ('vcclass', 0.024), ('proved', 0.023), ('montreal', 0.023), ('agnostic', 0.023), ('sampled', 0.022), ('design', 0.022), ('minimax', 0.022), ('ers', 0.021), ('er', 0.021), ('hold', 0.021), ('tsybakov', 0.02), ('induction', 0.02), ('suppose', 0.02), ('statement', 0.02), ('dn', 0.02), ('soon', 0.02), ('examples', 0.02), ('least', 0.019), ('sizes', 0.019), ('couple', 0.019), ('corollary', 0.019), ('quantity', 0.019), ('ia', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
Author: Vladimir Koltchinskii
Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient
2 0.12783651 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
3 0.11364057 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
4 0.10787717 69 jmlr-2010-Lp-Nested Symmetric Distributions
Author: Fabian Sinz, Matthias Bethge
Abstract: In this paper, we introduce a new family of probability densities called L p -nested symmetric distributions. The common property, shared by all members of the new class, is the same functional form ˜ x x ρ(x ) = ρ( f (x )), where f is a nested cascade of L p -norms x p = (∑ |xi | p )1/p . L p -nested symmetric distributions thereby are a special case of ν-spherical distributions for which f is only required to be positively homogeneous of degree one. While both, ν-spherical and L p -nested symmetric distributions, contain many widely used families of probability models such as the Gaussian, spherically and elliptically symmetric distributions, L p -spherically symmetric distributions, and certain types of independent component analysis (ICA) and independent subspace analysis (ISA) models, ν-spherical distributions are usually computationally intractable. Here we demonstrate that L p nested symmetric distributions are still computationally feasible by deriving an analytic expression for its normalization constant, gradients for maximum likelihood estimation, analytic expressions for certain types of marginals, as well as an exact and efficient sampling algorithm. We discuss the tight links of L p -nested symmetric distributions to well known machine learning methods such as ICA, ISA and mixed norm regularizers, and introduce the nested radial factorization algorithm (NRF), which is a form of non-linear ICA that transforms any linearly mixed, non-factorial L p nested symmetric source into statistically independent signals. As a corollary, we also introduce the uniform distribution on the L p -nested unit sphere. Keywords: parametric density model, symmetric distribution, ν-spherical distributions, non-linear independent component analysis, independent subspace analysis, robust Bayesian inference, mixed norm density model, uniform distributions on mixed norm spheres, nested radial factorization
5 0.08822725 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified γ analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + K , INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. γ η q On the other hand with ψ(x) = −x + K , which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
6 0.082658701 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.082236975 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
8 0.072524697 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
9 0.069220006 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
10 0.066169649 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
11 0.063453369 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
12 0.060557298 22 jmlr-2010-Classification Using Geometric Level Sets
13 0.058088318 102 jmlr-2010-Semi-Supervised Novelty Detection
14 0.057399511 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
15 0.055053744 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
16 0.054865614 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
17 0.052063145 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
18 0.047399953 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
19 0.046240769 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
20 0.04448488 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
topicId topicWeight
[(0, -0.208), (1, -0.096), (2, 0.062), (3, 0.03), (4, -0.101), (5, 0.003), (6, 0.052), (7, 0.233), (8, -0.029), (9, 0.016), (10, 0.053), (11, -0.179), (12, 0.008), (13, -0.173), (14, -0.066), (15, 0.229), (16, -0.228), (17, -0.041), (18, -0.068), (19, -0.072), (20, -0.023), (21, -0.181), (22, -0.038), (23, 0.036), (24, 0.064), (25, -0.067), (26, 0.094), (27, 0.029), (28, -0.078), (29, -0.193), (30, 0.067), (31, 0.116), (32, 0.075), (33, 0.025), (34, 0.185), (35, -0.05), (36, 0.001), (37, 0.029), (38, 0.009), (39, 0.014), (40, 0.059), (41, 0.068), (42, 0.163), (43, 0.03), (44, 0.151), (45, 0.111), (46, 0.09), (47, 0.007), (48, 0.065), (49, -0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.95171696 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
Author: Vladimir Koltchinskii
Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient
2 0.51915562 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
3 0.46224236 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
Author: Antti Honkela, Tapani Raiko, Mikael Kuusela, Matti Tornio, Juha Karhunen
Abstract: Variational Bayesian (VB) methods are typically only applied to models in the conjugate-exponential family using the variational Bayesian expectation maximisation (VB EM) algorithm or one of its variants. In this paper we present an efficient algorithm for applying VB to more general models. The method is based on specifying the functional form of the approximation, such as multivariate Gaussian. The parameters of the approximation are optimised using a conjugate gradient algorithm that utilises the Riemannian geometry of the space of the approximations. This leads to a very efficient algorithm for suitably structured approximations. It is shown empirically that the proposed method is comparable or superior in efficiency to the VB EM in a case where both are applicable. We also apply the algorithm to learning a nonlinear state-space model and a nonlinear factor analysis model for which the VB EM is not applicable. For these models, the proposed algorithm outperforms alternative gradient-based methods by a significant margin. Keywords: variational inference, approximate Riemannian conjugate gradient, fixed-form approximation, Gaussian approximation
4 0.42603958 69 jmlr-2010-Lp-Nested Symmetric Distributions
Author: Fabian Sinz, Matthias Bethge
Abstract: In this paper, we introduce a new family of probability densities called L p -nested symmetric distributions. The common property, shared by all members of the new class, is the same functional form ˜ x x ρ(x ) = ρ( f (x )), where f is a nested cascade of L p -norms x p = (∑ |xi | p )1/p . L p -nested symmetric distributions thereby are a special case of ν-spherical distributions for which f is only required to be positively homogeneous of degree one. While both, ν-spherical and L p -nested symmetric distributions, contain many widely used families of probability models such as the Gaussian, spherically and elliptically symmetric distributions, L p -spherically symmetric distributions, and certain types of independent component analysis (ICA) and independent subspace analysis (ISA) models, ν-spherical distributions are usually computationally intractable. Here we demonstrate that L p nested symmetric distributions are still computationally feasible by deriving an analytic expression for its normalization constant, gradients for maximum likelihood estimation, analytic expressions for certain types of marginals, as well as an exact and efficient sampling algorithm. We discuss the tight links of L p -nested symmetric distributions to well known machine learning methods such as ICA, ISA and mixed norm regularizers, and introduce the nested radial factorization algorithm (NRF), which is a form of non-linear ICA that transforms any linearly mixed, non-factorial L p nested symmetric source into statistically independent signals. As a corollary, we also introduce the uniform distribution on the L p -nested unit sphere. Keywords: parametric density model, symmetric distribution, ν-spherical distributions, non-linear independent component analysis, independent subspace analysis, robust Bayesian inference, mixed norm density model, uniform distributions on mixed norm spheres, nested radial factorization
5 0.41423911 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified γ analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + K , INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. γ η q On the other hand with ψ(x) = −x + K , which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
6 0.36832368 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
7 0.36437929 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
8 0.34347185 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
9 0.33871511 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
10 0.29374558 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
11 0.27490142 102 jmlr-2010-Semi-Supervised Novelty Detection
12 0.27427799 22 jmlr-2010-Classification Using Geometric Level Sets
13 0.26453021 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
14 0.25941789 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure
15 0.24789435 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
16 0.22837484 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
17 0.22385685 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
18 0.22307602 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
19 0.22190826 25 jmlr-2010-Composite Binary Losses
20 0.21823603 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
topicId topicWeight
[(1, 0.017), (3, 0.053), (8, 0.026), (15, 0.014), (21, 0.021), (32, 0.073), (36, 0.029), (37, 0.088), (75, 0.107), (81, 0.011), (85, 0.059), (87, 0.39), (96, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.66899723 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
Author: Vladimir Koltchinskii
Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient
2 0.38668889 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
3 0.38128471 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
4 0.37827346 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
Author: Gideon S. Mann, Andrew McCallum
Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other stateof-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields
5 0.37553114 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
Author: Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Aníbal R. Figueiras-Vidal
Abstract: We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the spectral representation of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class. Keywords: Gaussian process, probabilistic regression, sparse approximation, power spectrum, computational efficiency
6 0.37450778 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
7 0.37295887 102 jmlr-2010-Semi-Supervised Novelty Detection
8 0.37283319 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
9 0.37152714 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
10 0.37129682 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
11 0.37124965 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
12 0.37062365 109 jmlr-2010-Stochastic Composite Likelihood
13 0.36837828 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
14 0.36832535 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
15 0.36697873 60 jmlr-2010-Learnability, Stability and Uniform Convergence
16 0.36680347 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
17 0.3667486 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
18 0.36184394 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
19 0.35884425 25 jmlr-2010-Composite Binary Losses