nips nips2004 nips2004-72 knowledge-graph by maker-knowledge-mining

72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting


Source: pdf

Author: Balázs Kégl

Abstract: We have recently proposed an extension of AdaBoost to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for AdaBoost to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We have recently proposed an extension of AdaBoost to regression that uses the median of the base regressors as the final regressor. [sent-4, score-0.841]

2 In this paper we extend theoretical results obtained for AdaBoost to median boosting and to its localized variant. [sent-5, score-0.316]

3 First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. [sent-6, score-0.412]

4 Then we provide confidence-interval-type bounds on the generalization error. [sent-7, score-0.118]

5 1 Introduction: In a recent paper [1] we introduced MedBoost, a boosting algorithm that trains base regressors and returns their weighted median as the final regressor. [sent-8, score-1.026]

6 In another line of research, [2, 3] extended AdaBoost to boost localized or confidence-rated experts with input-dependent weighting of the base classifiers. [sent-9, score-0.651]

7 In this paper we analyze the algorithmic convergence of MedBoost and LocMedBoost, and provide bounds on the generalization error. [sent-11, score-0.197]

8 We start by describing the algorithm in its most general form, and extend the result of [1] on the convergence of the robust (marginal) training error (Section 2). [sent-12, score-0.182]

9 The robustness of the regressor is measured in terms of the dispersion of the expert population, and with respect to the underlying average confidence estimate. [sent-13, score-0.419]

10 In particular, we extend recent results [5] on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. [sent-15, score-0.412]

11 In Section 4, we provide confidence-interval-type bounds on the generalization error by generalizing results obtained for AdaBoost [6, 2, 3]. [sent-16, score-0.169]

12 As in the case of AdaBoost, the bounds justify the algorithmic objective of minimizing the robust training error. [sent-17, score-0.237]

13 2 The LocMedBoost algorithm and the convergence result: For the formal description, let the training data be Dn = (x_1, y_1), . [sent-19, score-0.113]

14 , (x_n, y_n), where data points (x_i, y_i) are from the set R^d × R. [sent-22, score-0.114]

15 (Pseudocode of Figure 1:) w ← (1/n, …, 1/n); for t ← 1 to T: (h^(t), κ^(t)) ← BASE(Dn, w); for i ← 1 to n: θ_i ← 1 − 2C(h^(t)(x_i), y_i) (base rewards), κ_i ← κ^(t)(x_i) (base confidences); α^(t) ← argmin_α e^{αε} Σ_{i=1}^n w_i e^{−α κ_i θ_i} (see (1)); if α^(t) = ∞ (that is, κ_i θ_i ≥ ε for all i = 1, …, n) return f. [sent-30, score-1.416]
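To make the recovered pseudocode easier to follow, here is a minimal Python sketch of the main loop. It is an illustration, not the paper's implementation: the interface of BASE (a callable returning the pair (h^(t), κ^(t)) as functions), the cost C(y', y) = I{|y'−y|>ε}, the weight update of line 13, and the bounded scalar minimizer used for line 7 are assumptions filled in to make the example runnable.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def loc_med_boost(X, y, base_learner, eps, T):
    """Sketch of the LocMedBoost loop of Figure 1 (interfaces and update rule are assumed)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # line 1: uniform point weights
    hyps, confs, alphas = [], [], []
    for _ in range(T):
        h, kappa = base_learner(X, y, w)           # line 2: (h^(t), kappa^(t)) <- BASE(Dn, w)
        theta = 1.0 - 2.0 * (np.abs(h(X) - y) > eps)   # base rewards with C = I{|h(x)-y| > eps}
        kap = kappa(X)                             # base confidences kappa_i
        if np.all(kap * theta >= eps):             # line 9: exponential loss can be driven to 0
            return hyps + [h], confs + [kappa], alphas + [np.inf]
        # line 7: alpha^(t) = argmin_alpha sum_i w_i exp(alpha * (eps - kappa_i * theta_i))
        E = lambda a: np.sum(w * np.exp(a * (eps - kap * theta)))
        alpha = minimize_scalar(E, bounds=(0.0, 50.0), method="bounded").x
        if E(alpha) >= 1.0:                        # lines 10-11: no improvement, stop
            break
        hyps.append(h); confs.append(kappa); alphas.append(float(alpha))
        w = w * np.exp(-alpha * kap * theta)       # assumed weight update of line 13
        w /= w.sum()                               # renormalize (the e^{alpha*eps} factor cancels)
    return hyps, confs, alphas
```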

16 Dn is the training data, C(y', y) ≥ I{|y−y'|>ε} is the cost function, BASE(Dn, w) is the base regression algorithm, ε is the robustness parameter, and T is the number of iterations. [sent-34, score-0.636]

17 The weights are initialized to 1/n in line 1, and are updated in each iteration in line 13 (Figure 1). [sent-35, score-0.15]

18 We suppose that we are given a base learner algorithm BASE(Dn , w) that, in each iteration t, returns a base hypothesis that consists of a real-valued base regressor h(t) ∈ H and a non-negative base confidence function κ(t) ∈ K. [sent-36, score-2.704]

19 The first term is a weighted regression loss where the weight of a point x_i is the product of its “constant” weight w_i^(t) and the confidence κ^(t)(x_i) of the base hypothesis. [sent-39, score-0.805]

20 The minimization of the first term means to place the high-confidence region of the base regressor into areas where the regression error is small. [sent-41, score-0.99]

21 On the other hand, the minimization of the second term drives the high-confidence region of the base regressor into dense areas. [sent-42, score-0.947]

22 After Theorem 1, we will explain the derivation of the base objective (1). [sent-43, score-0.562]

23 To simplify the notation in Figure 1 and in Theorem 1 below, we define the base rewards θ_i^(t) and the base confidences κ_i^(t) for each training point (x_i, y_i), i = 1, . [sent-44, score-1.296]

24 , n, base regressor h^(t), and base confidence function κ^(t), t = 1, . [sent-47, score-1.068]

25 If κ_i^(t) θ_i^(t) ≥ ε for all training points, then α^(t) = ∞ and E(α^(t)) = 0, so the algorithm returns the actual regressor (line 9). [sent-52, score-0.502]

26 Intuitively, this means that the capacity of the set of base hypotheses is too large, so we are overfitting. [sent-53, score-0.63]

27 If α(t) < 0, the algorithm returns the regressor up to the last iteration (line 11). [sent-54, score-0.5]

28 Intuitively, this means that the capacity of the set of base hypotheses is too small, so we cannot find a new base regressor that would decrease the training loss. [sent-55, score-1.607]

29 In lines 9, 11, or 14, the algorithm returns the weighted median of the base regressors. [sent-58, score-0.756]

30 For the analysis of the algorithm, we formally define the final regressor in a more general manner. [sent-59, score-0.414]

31 First, let ᾱ^(t) = α^(t) / Σ_{j=1}^T α^(j) be the normalized coefficient of the base hypothesis (h^(t), κ^(t)), and let c^(T)(x) = Σ_{t=1}^T ᾱ^(t) κ^(t)(x) = Σ_{t=1}^T α^(t) κ^(t)(x) / Σ_{t=1}^T α^(t)  (6)  be the average confidence function after the T-th iteration. [sent-60, score-0.627]

32 Let f_ρ+^(T)(x) and f_ρ−^(T)(x) be the weighted (1 + ρ/c^(T)(x))/2- and (1 − ρ/c^(T)(x))/2-quantiles, respectively, of the base regressors h^(1)(x), . [sent-61, score-0.729]

33 Then the weighted median is defined as f^(T)(·) = med_{α,κ(·)} h(·) = f_{0+}(·). [sent-69, score-0.144]
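To make the quantile notation concrete, here is a small numeric sketch of f_ρ+^(T)(x), f_ρ−^(T)(x) and the weighted median for one input x. It is an illustration only: it assumes that base regressor t enters the quantile with weight α^(t) κ^(t)(x), normalized to sum to one, which is one natural reading of med_{α,κ(·)}; the function names are made up.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """q-quantile (q in [0, 1]) of `values` under nonnegative `weights`."""
    order = np.argsort(values)
    v, w = np.asarray(values, float)[order], np.asarray(weights, float)[order]
    cum = np.cumsum(w) / np.sum(w)
    idx = min(int(np.searchsorted(cum, q)), len(v) - 1)
    return v[idx]

def med_boost_predict(x, hyps, confs, alphas, rho=0.0):
    """f_{rho+}^(T)(x) and f_{rho-}^(T)(x); rho = 0 gives the weighted median twice."""
    h_vals = np.array([h(x) for h in hyps])        # base predictions h^(t)(x)
    kap = np.array([k(x) for k in confs])          # base confidences kappa^(t)(x)
    a = np.array(alphas, float)
    weights = a * kap                              # assumed per-regressor weight alpha^(t)*kappa^(t)(x)
    c = weights.sum() / a.sum()                    # average confidence c^(T)(x) of eq. (6)
    f_plus = weighted_quantile(h_vals, weights, (1 + rho / c) / 2)
    f_minus = weighted_quantile(h_vals, weights, (1 - rho / c) / 2)
    return f_plus, f_minus
```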

34 Not to be confused with κ(t) in (3) which is the average base confidence over the training data. [sent-71, score-0.582]

35 To assess the final regressor f^(T)(·), we say that f^(T)(·) is ρ-robust ε-precise on (x_i, y_i) if and only if f_ρ+^(T)(x_i) ≤ y_i + ε and f_ρ−^(T)(x_i) ≥ y_i − ε. [sent-76, score-0.632]

36 For ρ ≥ 0, this condition is equivalent to both quantiles being in the “ε-tube” around y_i (Figure 2(b)). [sent-77, score-0.079]

37 In the rest of this section we show that the algorithm minimizes the relative frequency of training points on which f^(T)(·) is not ρ-robust ε-precise. [sent-78, score-0.101]

38 Formally, let the ρ-robust ε-precise training error of f^(T) be defined as L^(ρ)(f^(T)) = (1/n) Σ_{i=1}^n I{ f_ρ+^(T)(x_i) > y_i + ε ∨ f_ρ−^(T)(x_i) < y_i − ε }.  (9) [sent-79, score-0.27]
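Definition (9) is easy to transcribe directly; the short sketch below assumes the arrays fplus and fminus already hold the quantiles f_ρ+^(T)(x_i) and f_ρ−^(T)(x_i), for example computed with the quantile sketch above.

```python
import numpy as np

def robust_training_error(fplus, fminus, y, eps):
    """Empirical rho-robust eps-precise error L^(rho)(f^(T)) of eq. (9)."""
    fplus, fminus, y = (np.asarray(a, float) for a in (fplus, fminus, y))
    outside_tube = (fplus > y + eps) | (fminus < y - eps)   # either quantile leaves the eps-tube
    return outside_tube.mean()
```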

39 If ρ = 0, L^(0)(f^(T)) gives the relative frequency of training points on which the regressor f^(T) has a larger L1 error than ε. [sent-80, score-0.527]

40 If we have equality in (2), this is exactly the average loss of the regressor f (T ) on the training data. [sent-81, score-0.465]

41 For classification with bi-valued base classifiers h : R^d → {−1, 1}, the definition (9) (with ε = 1) recovers the traditional notion of robust training error, that is, L^(ρ)(f^(T)) is the relative frequency of data points with margin smaller than ρ. [sent-83, score-0.765]

42 The following theorem upper bounds the ρ-robust ε-precise training error L^(ρ) of the regressor f^(T) output by LocMedBoost. [sent-84, score-0.619]

43 Define the base rewards θi and the base confidences κi as in (4). [sent-86, score-1.147]

44 Let w_i^(t) be the weight of training point x_i after the t-th iteration (updated in line 13 in Figure 1), and let α^(t) be the weight of the base regressor h^(t)(·) (computed in line 7 in Figure 1). [sent-87, score-1.409]

45 For the sake of simplicity, in the notation we suppress the fact that L^(ρ) depends on the whole sequence of base regressors, base confidences, and weights, not only on the final regressor f^(T). [sent-89, score-1.567]

46 The proof is based on the observation that if the median of the base regressors goes further than ε from the real response y_i at training point x_i, then most of the base regressors must also be far from y_i, giving small base rewards to this point. [sent-90, score-2.384]

47 To derive the base objective (1), we follow the two-step functional gradient descent procedure [7]: first we maximize the negative gradient −E′(α) at α = 0, then we do a line search to determine α^(t). [sent-92, score-0.654]

48 Using this approach, the base objective becomes e1(Dn) = −Σ_{i=1}^n w_i^(t) κ_i^(t) θ_i^(t), which is identical to (1). [sent-93, score-0.654]

49 Note that since E(α) is convex and E(0) = 1, a positive α^(t) means that min_α E(α) = E(α^(t)) < 1, so the condition in line 10 in Figure 1 guarantees that the upper bound of (10) decreases in each step. [sent-94, score-0.119]
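As a numeric companion to this step, under the same assumption used in the loop sketch above, E(α) = Σ_i w_i e^{α(ε − κ_iθ_i)}: the negative gradient at zero is −E′(0) = Σ_i w_i κ_iθ_i − ε when Σ_i w_i = 1, so maximizing it over base hypotheses amounts to minimizing e1(Dn) = −Σ_i w_i κ_iθ_i, and the subsequent line search is a one-dimensional convex minimization. A hypothetical helper:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def line_search_alpha(w, kap, theta, eps, alpha_max=50.0):
    """Return (alpha^(t), edge) for the assumed loss E(a) = sum_i w_i exp(a*(eps - kap_i*theta_i))."""
    edge = float(np.sum(w * kap * theta))          # gamma^(t) = -e1(Dn), the weighted edge
    E = lambda a: np.sum(w * np.exp(a * (eps - kap * theta)))
    if edge <= eps:                                # E'(0) >= 0: the convex E has its minimum at a <= 0
        return 0.0, edge                           # boosting would stop (line 11)
    alpha = minimize_scalar(E, bounds=(0.0, alpha_max), method="bounded").x
    return float(alpha), edge
```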

50 3 Setting ε and maximizing the minimum margin: In practice, AdaBoost works well with ε = 0, so setting ε to a positive value is only an alternative regularization option to early stopping. [sent-95, score-0.169]

51 Too small an ε means that the algorithm can overfit and stop in line 9. [sent-97, score-0.073]

52 In binary classification this is an unrealistic situation: it means that there is a base classifier that correctly classifies all data points. [sent-98, score-0.534]

53 In this case, a base classifier can correctly classify (or a base regressor can give positive base rewards θ_i to) all data points on which it does not abstain, so if ε = 0, the algorithm stops in line 9. [sent-100, score-2.192]

54 At the other end of the spectrum, a large ε can make the algorithm underfit and stop in line 11, so one needs to set ε carefully in order to avoid early stopping in lines 9 or 11. [sent-101, score-0.113]

55 From another aspect, a larger ε decreases the effective capacity of the class of base hypotheses by restricting the set of admissible base hypotheses to those having small errors. [sent-104, score-1.254]

56 In general, ε has a potential role in balancing between over- and underfitting, so in practice we suggest that it be validated together with the number of iterations T and other possible complexity parameters of the base hypotheses. [sent-105, score-0.534]

57 In the rest of this section, we extend the analysis of marginal boosting [5] to this general case. [sent-107, score-0.166]

58 Although the aggressive maximization of the minimum margin can lead to overfitting, the analysis can provide valuable insight into the understanding of LocMedBoost and so it can guide the setting of ε in practice. [sent-108, score-0.138]

59 For the sake of simplicity, let us assume that base hypotheses (h, κ) come from a finite set H_N with cardinality N, and let H^(t) = (h^(1), κ^(1)), . [sent-109, score-0.708]

60 , (h^(t), κ^(t)) be the set of base hypotheses after the t-th iteration. [sent-112, score-0.664]

61 Let us define the edge of the base hypothesis (h, κ) ∈ H_N as γ_(h,κ)(w) = Σ_{i=1}^n w_i κ_i θ_i = Σ_{i=1}^n w_i κ(x_i)(1 − 2C(h(x_i), y_i)), and the maximum edge in the t-th iteration as γ*^(t) = max_{(h,κ)∈H_N} γ_(h,κ)(w^(t)). [sent-113, score-0.923]

62 Note that γ_(h,κ)(w) = −e1(Dn), so with this terminology, the objective of the base learner is to maximize the edge. (The analysis can be extended to infinite base sets along the lines of [5].) [sent-114, score-1.156]

63 For the sake of simplicity, in the notation we suppress the dependence of γ (h,κ) on Dn . [sent-115, score-0.104]

64 to maximize the edge γ^(t) = γ_(h^(t),κ^(t))(w^(t)) (if the maximum is achieved, then γ^(t) = γ*^(t)), and the algorithm stops in line 11 if the edge γ^(t) is less than ε. [sent-116, score-0.165]

65 On the other hand, let us define the margin on a point (x, y) as the average reward ρ_(x,y)(α) = Σ_{j=1}^N α^(j) κ^(j) θ^(j) = Σ_{j=1}^N α^(j) κ^(j)(x)(1 − 2C(h^(j)(x), y)). [sent-117, score-0.133]

66 Let us denote the minimum margin over the data points in the t-th iteration by ρ*^(t) = min_{(x,y)∈Dn} ρ_(x,y)(α^(t−1)),  (11)  where α^(t−1) = α^(1), . [sent-118, score-0.354]

67 , α(t−1) is the vector of base hypothesis coefficients up to the (t − 1)th iteration. [sent-121, score-0.561]

68 It is easy to see that in each iteration, the maximum edge over the base hypotheses is at least the minimum margin over the training points: γ*^(t) = max_{(h,κ)∈H_N} γ_(h,κ)(w^(t)) ≥ min_{(x,y)∈Dn} ρ_(x,y)(α^(t−1)) = ρ*^(t). [sent-122, score-0.939]
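The inequality γ*^(t) ≥ ρ*^(t) is the usual max-min argument: averaging the edges with the normalized coefficients ᾱ equals averaging the margins with the point weights w. A tiny synthetic check (the reward matrix M below is made-up data, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 8, 50                                  # base hypotheses, training points
M = rng.uniform(-1.0, 1.0, size=(N, n))       # M[j, i] = kappa^(j)(x_i) * theta_i^(j)

w = rng.dirichlet(np.ones(n))                 # point weights w^(t) (a distribution)
alpha_bar = rng.dirichlet(np.ones(N))         # normalized hypothesis coefficients

gamma = M @ w                                 # edge of each base hypothesis
rho = alpha_bar @ M                           # margin of each training point
assert gamma.max() >= rho.min()               # gamma*(t) >= rho*(t), as stated in the text
```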

69 To converge to ρ* within a factor ν in finite time, [5] sets ε_RW^(t) = min_{j=1,…} γ^(j) − ν. [sent-126, score-0.099]

70 First we define the minimum and maximum achievable base rewards by ρ_min = min_{(h,κ)∈H_N} min_{(x,y)∈Dn} κ(x)(1 − 2C(h(x), y))  (12)  and ρ_max = max_{(h,κ)∈H_N} max_{(x,y)∈Dn} κ(x)(1 − 2C(h(x), y)),  (13)  respectively. [sent-131, score-0.922]

71 Let ρ = … ; the minimum margin exceeds ρ after at most T = A² log n / (2ν²) iterations (the omitted definition of ρ involves Σ_{t=1}^T log(·) and ρ_min). [sent-134, score-0.054]

72 For the sake of simplicity, in the notation we suppress the dependence of ρ (x,y) on HN . [sent-140, score-0.104]

73 Corollary 1 (Generalization of Corollary 4 in [5]) Assume that the weak learner always achieves an edge γ (t) ≥ ρ∗ . [sent-143, score-0.083]

74 The second corollary shows that if ε is set adaptively to ε_RW^(t), then the minimum margin ρ*^(t) will converge to ρ* within a precision ν in a finite number of steps. [sent-145, score-0.228]

75 Corollary 2 (Generalization of Theorem 6 in [5]) Assume that the weak learner always achieves an edge γ (t) ≥ ρ∗ . [sent-146, score-0.083]

76 4 The generalization error: In this section we extend probabilistic bounds on the generalization error obtained for AdaBoost [6], confidence-rated AdaBoost [2], and localized boosting [3]. [sent-148, score-0.444]

77 The results provide bounds on the confidence-interval-type error L(f^(T)) = P_D(|f^(T)(X) − Y| > ε), where (X, Y) is a random point generated according to D independently from the points in Dn. [sent-150, score-0.15]

78 The bounds state that with a large probability, L(f (T ) ) < L(ρ) (f (T ) ) + C(n, ρ, H, K), where the complexity term C depends on the size or the pseudo-dimension of the base regressor set H, and the smoothness of the base confidence functions in K. [sent-151, score-1.52]

79 As in the case of AdaBoost, these bounds qualitatively justify the minimization of the robust training error L^(ρ)(f^(T)). [sent-152, score-0.211]

80 Let C be the set of combined regressors obtained as a weighted median of base regressors from H, that is, C = { f(·) = med_{α,κ(·)} h(·) : h ∈ H^N, α ∈ R_+^N, κ ∈ K^N, N ∈ Z_+ }. [sent-153, score-1.006]

81 In the simplest case, we assume that H is finite and base coefficients are constant. [sent-154, score-0.534]

82 Theorem 3 (Generalization of Theorem 1 in [6]) Let D be a distribution over R^d × R, and let Dn be a sample of n points generated independently at random according to D. [sent-155, score-0.095]

83 Assume that the base regressor set H is finite, and K contains only κ(x) ≡ 1. [sent-156, score-0.929]

84 Then with probability 1 − δ over the random choice of the training set Dn, any f ∈ C satisfies the following bound for all ρ > 0: L(f) < L^(ρ)(f) + O( (1/√n) ( (log n · log|H|)/ρ² + log(1/δ) )^{1/2} ). [sent-157, score-0.156]

85 Similarly to the proof of Theorem 1 in [6], we construct a set C_N that contains unweighted medians of N base functions from H, then approximate f by g(·) = med_1(h_1(·), . [sent-158, score-0.534]

86 , h_N(·)) ∈ C_N, where the base functions h_i are selected randomly according to the coefficient distribution α. [sent-161, score-0.534]

87 We then separate the one-sided error into two terms by P_D(f(X) > Y + ε) ≤ P_D(g_{ρ/2+}(X) > Y + ε) + P_D(g_{ρ/2+}(X) ≤ Y + ε ∧ f(X) > Y + ε), and then upper bound the two terms as in [6]. [sent-162, score-0.077]

88 The second theorem extends the first to the case of infinite base regressor sets. [sent-163, score-0.998]

89 Theorem 4 (Generalization of Theorem 2 of [6]) Let D be a distribution over R^d × R, and let Dn be a sample of n points generated independently at random according to D. [sent-164, score-0.095]

90 Assume that the base regressor set H has pseudodimension p, and K contains only κ(x) ≡ 1. [sent-165, score-0.983]

91 Then with probability 1 − δ over the random choice of the training set Dn, any f ∈ C satisfies the following bound for all ρ > 0: L(f) < L^(ρ)(f) + O( (1/√n) ( (p log²(n/p))/ρ² + log(1/δ) )^{1/2} ). [sent-166, score-0.102]

92 , 2N, by bounding the shattering coefficient of the set A by N²(N/2 + 1)(en/p)^{pN}, where p is the pseudodimension of H (or the VC dimension of H_+ = {{(x, y) : h(x) > y} : h ∈ H}). [sent-170, score-0.054]

93 Theorem 5 (Generalization of Theorem 1 of [3]) Let D be a distribution over R^d × R, and let Dn be a sample of n points generated independently at random according to D. [sent-172, score-0.095]

94 Assume that the base regressor set H has pseudodimension p, and K contains functions κ(x) which are lower bounded by a constant a, and which satisfy, for all x, x′ ∈ R^d, the Lipschitz condition |κ(x) − κ(x′)| ≤ L‖x − x′‖_∞. [sent-173, score-0.983]

95 Then with probability 1 − δ over the random choice of the training set Dn, any f ∈ C satisfies the following bound for all ρ > 0: L(f) < L^(ρ)(f) + O( (1/√n) ( ((L/(aρ))^d p log²(n/p))/ρ² + log(1/δ) )^{1/2} ). [sent-174, score-0.102]

96 5 Conclusion: In this paper we have analyzed the algorithmic convergence of LocMedBoost by generalizing recent results on efficient margin maximization, and provided bounds on the generalization error by extending similar bounds obtained for AdaBoost. [sent-175, score-0.405]

97 Kégl, “Robust regression by boosting the median,” in Proceedings of the 16th Conference on Computational Learning Theory, Washington, D. [sent-177, score-0.155]

98 Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. [sent-184, score-0.125]

99 Kégl, “Confidence-rated regression by boosting the median,” Tech. [sent-194, score-0.155]

100 Warmuth, “Efficient margin maximizing with boosting,” Journal of Machine Learning Research (submitted), 2003. [sent-200, score-0.131]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('base', 0.534), ('oost', 0.413), ('regressor', 0.395), ('dn', 0.247), ('oc', 0.198), ('regressors', 0.164), ('da', 0.151), ('boosting', 0.125), ('hn', 0.118), ('median', 0.113), ('med', 0.108), ('margin', 0.1), ('wi', 0.092), ('dences', 0.09), ('dence', 0.09), ('rewards', 0.079), ('yi', 0.079), ('con', 0.075), ('min', 0.074), ('hypotheses', 0.069), ('theorem', 0.069), ('generalization', 0.061), ('tth', 0.061), ('returns', 0.059), ('pd', 0.058), ('bounds', 0.057), ('xi', 0.056), ('achievable', 0.055), ('pseudodimension', 0.054), ('rw', 0.054), ('line', 0.052), ('training', 0.048), ('algorithmic', 0.047), ('iteration', 0.046), ('suppress', 0.043), ('edge', 0.042), ('learner', 0.041), ('extend', 0.041), ('corollary', 0.041), ('sake', 0.039), ('minimum', 0.038), ('gl', 0.038), ('localized', 0.037), ('preset', 0.036), ('points', 0.035), ('max', 0.034), ('let', 0.033), ('convergence', 0.032), ('maximizing', 0.031), ('error', 0.031), ('weighted', 0.031), ('montreal', 0.031), ('regression', 0.03), ('robust', 0.03), ('nite', 0.029), ('stops', 0.029), ('neumann', 0.029), ('objective', 0.028), ('weighting', 0.028), ('hypothesis', 0.027), ('independently', 0.027), ('psfrag', 0.027), ('justify', 0.027), ('von', 0.027), ('log', 0.027), ('capacity', 0.027), ('bound', 0.027), ('replacements', 0.026), ('cn', 0.025), ('rd', 0.025), ('converge', 0.025), ('coef', 0.024), ('return', 0.024), ('robustness', 0.024), ('precision', 0.024), ('nal', 0.023), ('ed', 0.023), ('loss', 0.022), ('classi', 0.022), ('notation', 0.022), ('stop', 0.021), ('schapire', 0.021), ('simplicity', 0.021), ('carefully', 0.021), ('decreases', 0.021), ('generalizing', 0.02), ('operating', 0.02), ('gradient', 0.02), ('weight', 0.02), ('upper', 0.019), ('formally', 0.019), ('bartlett', 0.019), ('lines', 0.019), ('weights', 0.019), ('tting', 0.019), ('minimization', 0.018), ('frequency', 0.018), ('baxter', 0.018), ('kegl', 0.018), ('mason', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of AdaBoost to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for AdaBoost to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1

2 0.63732326 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

Author: Ligen Wang, Balázs Kégl

Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve AdaBoost by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use AdaBoost’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1

3 0.15558486 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse

Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1

4 0.10177311 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

Author: Nathan Srebro, Noga Alon, Tommi S. Jaakkola

Abstract: We prove generalization error bounds for predicting entries in a partially observed matrix by fitting the observed entries with a low-rank matrix. In justifying the analysis approach we take to obtain the bounds, we present an example of a class of functions of finite pseudodimension such that the sums of functions from this class have unbounded pseudodimension. 1

5 0.091431491 19 nips-2004-An Application of Boosting to Graph Classification

Author: Taku Kudo, Eisaku Maeda, Yuji Matsumoto

Abstract: This paper presents an application of Boosting for classifying labeled graphs, general structures for modeling a number of real-world data, such as chemical compounds, natural language texts, and bio sequences. The proposal consists of i) decision stumps that use subgraph as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. We also discuss the relation between our algorithm and SVMs with convolution kernels. Two experiments using natural language data and chemical compounds show that our method achieves comparable or even better performance than SVMs with convolution kernels as well as improves the testing efficiency. 1

6 0.086669438 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

7 0.07605125 178 nips-2004-Support Vector Classification with Input Data Uncertainty

8 0.070461966 103 nips-2004-Limits of Spectral Clustering

9 0.065509647 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

10 0.064342603 168 nips-2004-Semigroup Kernels on Finite Sets

11 0.064276606 23 nips-2004-Analysis of a greedy active learning strategy

12 0.062044371 115 nips-2004-Maximum Margin Clustering

13 0.056896869 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

14 0.056365475 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

15 0.055677872 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

16 0.053914778 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

17 0.050813697 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

18 0.050628822 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

19 0.050468057 70 nips-2004-Following Curved Regularized Optimization Solution Paths

20 0.048744306 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.174), (1, 0.083), (2, 0.024), (3, 0.206), (4, 0.018), (5, 0.054), (6, 0.18), (7, 0.178), (8, -0.282), (9, 0.459), (10, -0.227), (11, -0.303), (12, 0.135), (13, 0.027), (14, 0.192), (15, -0.024), (16, 0.09), (17, -0.034), (18, 0.116), (19, -0.088), (20, -0.131), (21, 0.004), (22, 0.058), (23, 0.008), (24, 0.095), (25, -0.154), (26, -0.043), (27, 0.032), (28, 0.061), (29, 0.078), (30, 0.059), (31, 0.01), (32, 0.029), (33, 0.015), (34, 0.005), (35, -0.045), (36, -0.006), (37, 0.096), (38, -0.091), (39, 0.02), (40, -0.016), (41, -0.036), (42, -0.018), (43, -0.024), (44, 0.002), (45, -0.004), (46, -0.022), (47, -0.018), (48, -0.01), (49, 0.071)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96838695 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of AdaBoost to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for AdaBoost to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1

2 0.91641694 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

Author: Ligen Wang, Balázs Kégl

Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve AdaBoost by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use AdaBoost’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1

3 0.44947088 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse

Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1

4 0.28525928 19 nips-2004-An Application of Boosting to Graph Classification

Author: Taku Kudo, Eisaku Maeda, Yuji Matsumoto

Abstract: This paper presents an application of Boosting for classifying labeled graphs, general structures for modeling a number of real-world data, such as chemical compounds, natural language texts, and bio sequences. The proposal consists of i) decision stumps that use subgraph as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. We also discuss the relation between our algorithm and SVMs with convolution kernels. Two experiments using natural language data and chemical compounds show that our method achieves comparable or even better performance than SVMs with convolution kernels as well as improves the testing efficiency. 1

5 0.25020593 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

Author: Omid Madani, David M. Pennock, Gary W. Flake

Abstract: In the context of binary classification, we define disagreement as a measure of how often two independently-trained models differ in their classification of unlabeled data. We explore the use of disagreement for error estimation and model selection. We call the procedure co-validation, since the two models effectively (in)validate one another by comparing results on unlabeled data, which we assume is relatively cheap and plentiful compared to labeled data. We show that per-instance disagreement is an unbiased estimate of the variance of error for that instance. We also show that disagreement provides a lower bound on the prediction (generalization) error, and a tight upper bound on the “variance of prediction error”, or the variance of the average error across instances, where variance is measured across training sets. We present experimental results on several data sets exploring co-validation for error estimation and model selection. The procedure is especially effective in active learning settings, where training sets are not drawn at random and cross validation overestimates error. 1

6 0.23684442 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

7 0.20307869 8 nips-2004-A Machine Learning Approach to Conjoint Analysis

8 0.19260566 82 nips-2004-Incremental Algorithms for Hierarchical Classification

9 0.19012596 103 nips-2004-Limits of Spectral Clustering

10 0.18745904 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

11 0.18722869 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

12 0.17901611 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

13 0.17714235 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

14 0.17509463 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

15 0.16473819 136 nips-2004-On Semi-Supervised Classification

16 0.16182865 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

17 0.16116293 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning

18 0.16048905 168 nips-2004-Semigroup Kernels on Finite Sets

19 0.16041577 70 nips-2004-Following Curved Regularized Optimization Solution Paths

20 0.15951163 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(9, 0.125), (13, 0.083), (15, 0.099), (26, 0.089), (31, 0.021), (32, 0.021), (33, 0.234), (35, 0.013), (39, 0.024), (50, 0.021), (63, 0.137), (76, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91183293 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of AdaBoost to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for AdaBoost to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1

2 0.89875287 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces

Author: N. J. Hill, Thomas N. Lal, Karin Bierig, Niels Birbaumer, Bernhard Schölkopf

Abstract: Motivated by the particular problems involved in communicating with “locked-in” paralysed patients, we aim to develop a braincomputer interface that uses auditory stimuli. We describe a paradigm that allows a user to make a binary decision by focusing attention on one of two concurrent auditory stimulus sequences. Using Support Vector Machine classification and Recursive Channel Elimination on the independent components of averaged eventrelated potentials, we show that an untrained user’s EEG data can be classified with an encouragingly high level of accuracy. This suggests that it is possible for users to modulate EEG signals in a single trial by the conscious direction of attention, well enough to be useful in BCI. 1

3 0.89179462 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

4 0.89133894 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

Author: Philip M. Long, Xinyu Wu

Abstract: We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. 1

5 0.884848 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

Author: Ligen Wang, Balázs Kégl

Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve AdaBoost by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use AdaBoost’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1

6 0.88214922 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

7 0.85732579 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

8 0.82329255 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

9 0.81760269 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

10 0.81714529 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

11 0.81635338 161 nips-2004-Self-Tuning Spectral Clustering

12 0.8157835 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

13 0.81476408 145 nips-2004-Parametric Embedding for Class Visualization

14 0.81413066 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

15 0.81385773 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

16 0.81368309 69 nips-2004-Fast Rates to Bayes for Kernel Machines

17 0.81330776 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

18 0.81194133 44 nips-2004-Conditional Random Fields for Object Recognition

19 0.81181788 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

20 0.8109746 64 nips-2004-Experts in a Markov Decision Process