nips nips2004 nips2004-119 knowledge-graph by maker-knowledge-mining

119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination


Source: pdf

Author: Philip M. Long, Xinyu Wu

Abstract: We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. [sent-7, score-0.624]

2 The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. [sent-8, score-0.194]

3 We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. [sent-9, score-0.366]

4 We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. [sent-10, score-0.415]

5 The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. [sent-11, score-0.169]

6 1 Introduction In this paper, we analyze a maximum-entropy procedure for ensemble learning in the online learning model. [sent-12, score-0.058]

7 During the tth trial, the algorithm (1) receives xt ∈ {0, 1}n (interpreted in this work as a vector of base classifier predictions), (2) predicts a class ŷt ∈ {0, 1}, and (3) discovers the correct class yt . [sent-14, score-0.546]

8 During trial t, the algorithm has access only to information from previous trials. [sent-15, score-0.117]

9 The first algorithm we will analyze for this problem was proposed by Jaakkola, Meila and Jebara [14]. [sent-16, score-0.041]

10 The algorithm, at each trial t, makes its prediction by taking a weighted vote over the predictions of the base classifiers. [sent-17, score-0.158]

11 The weight vector pt is the probability distribution over the n base classifiers that maximizes the entropy, subject to the constraint that pt correctly classifies all patterns seen in previous trials with a given margin γ. [sent-18, score-1.863]

12 That is, it maximizes the entropy of pt subject to the constraints that pt · xs ≥ 1/2 + γ whenever ys = 1 for s < t, and pt · xs ≤ 1/2 − γ whenever ys = 0 for s < t. [sent-19, score-2.998]
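The following is a minimal sketch (ours, not the authors' implementation) of one round of this optimization: maximize the entropy of p over the probability simplex subject to the accumulated margin constraints, then predict by a weighted vote at threshold 1/2. The helper names and the choice of a generic SLSQP solver are our own illustration.

```python
# Illustrative sketch of one OME round: maximize H(p) subject to the margin
# constraints p . x_s >= 1/2 + gamma (y_s = 1) and p . x_s <= 1/2 - gamma (y_s = 0).
import numpy as np
from scipy.optimize import minimize


def ome_weights(X_seen, y_seen, gamma, eps=1e-12):
    """Max-entropy distribution consistent with the (x, y) pairs seen so far.

    X_seen: 2-D 0/1 array, one row per past trial; y_seen: 0/1 labels."""
    n = X_seen.shape[1]

    def neg_entropy(p):
        return np.sum(p * np.log(p + eps))          # minimizing -H(p) maximizes H(p)

    cons = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
    for x, y in zip(X_seen, y_seen):
        if y == 1:                                   # p . x - (1/2 + gamma) >= 0
            cons.append({"type": "ineq", "fun": lambda p, x=x: p @ x - (0.5 + gamma)})
        else:                                        # (1/2 - gamma) - p . x >= 0
            cons.append({"type": "ineq", "fun": lambda p, x=x: (0.5 - gamma) - p @ x})

    p0 = np.full(n, 1.0 / n)                         # start from the uniform distribution
    res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * n,
                   constraints=cons, method="SLSQP")
    return res.x


def ome_predict(p, x):
    """Weighted vote of the base classifiers: predict 1 iff p . x > 1/2."""
    return int(p @ x > 0.5)
```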

13 We show that, if there is a weighting p∗ , determined with benefit of hindsight, that achieves margin γ on all trials, then this on-line maximum entropy procedure makes at most ln n/(2γ²) mistakes. [sent-20, score-0.487]
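To give a sense of scale (our own illustration, not a figure from the paper): with n = 1000 base classifiers and margin γ = 0.1, the bound is ln(1000)/(2 · 0.01) ≈ 345 mistakes, regardless of how many trials the sequence contains.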

14 Littlestone [19] proved the same bound for the Weighted Majority Algorithm [21], and a similar bound for the Balanced Winnow Algorithm [19]. [sent-21, score-0.151]

15 The original Winnow algorithm was designed to solve the problem of learning a hidden disjunction of a small number k out of a possible n boolean variables. [sent-22, score-0.062]

16 When this problem is reduced to our general setting in the most natural way, the resulting bound is Θ(k² log n), whereas Littlestone proved a bound of ek ln n for Winnow. [sent-23, score-0.352]

17 We prove more refined bounds for a wider family of maximum-entropy algorithms, which use thresholds different than 1/2 (as proposed in [14]) and class-sensitive margins. [sent-24, score-0.058]

18 A mistake bound of ek ln n for learning disjunctions is a consequence of this more refined analysis. [sent-25, score-0.521]

19 The optimization needed at each round can be cast as minimizing a convex function subject to convex constraints, and thus can be solved in polynomial time [25]. [sent-26, score-0.155]

20 However, the same mistake bounds hold for a similar, albeit linear-time, algorithm. [sent-27, score-0.212]

21 This algorithm, after each trial, replaces all constraints from previous trials with a single linear inequality. [sent-28, score-0.097]

22 Littlestone [19] analyzed some variants of Winnow by showing that mistakes cause a reduction in the relative entropy between the learning algorithm’s weight vector, and that of the target function. [sent-31, score-0.322]

23 Kivinen and Warmuth [16] showed that an algorithm related to Winnow trades off optimally, in a sense, between accommodating the information from new data and keeping the relative entropy between the new and old weight vectors small. [sent-32, score-0.307]

24 As in related analyses like mistake bounds for the perceptron algorithm [22], Winnow [19] and the Weighted Majority Algorithm [19], our bound holds for any sequence of (xt , yt ) pairs satisfying the separation condition; in particular no independence assumptions are needed. [sent-36, score-0.428]

25 The proof of the main result proceeds roughly as follows. [sent-40, score-0.064]

26 If there is a mistake on trial t, it is corrected with a large margin by pt+1 . [sent-41, score-0.358]

27 Thus pt+1 must assign a significantly different probability to the voters predicting 1 on trial t than pt does. [sent-42, score-0.9]

28 Applying Pinsker’s inequality, this means that the relative entropy D(pt+1 ||pt ) is large. [sent-43, score-1.038]

29 Next, we exploit the fact that the constraints satisfied by pt , and therefore by pt+1 , are convex to show that moving from pt to pt+1 must take you away from the uniform distribution, thus decreasing the entropy. [sent-44, score-1.662]

30 The theorem then follows from the fact that the entropy can only be reduced by a total of ln n. [sent-45, score-0.411]

31 The refinement leading to a ek ln n bound for disjunctions arises from the observation that Pinsker’s inequality can be strengthened when the probabilities being compared are small. [sent-46, score-0.374]

32 The analysis of this paper lends support to a view of Winnow as a fast, incremental approximation to the maximum entropy discrimination approach, and suggests a variant of Winnow that corresponds more closely to the inductive bias of maximum entropy. [sent-47, score-0.301]

33 For a feature vector x ∈ {0, 1}n and a class designation y ∈ {0, 1}, say that a probability distribution p is correct with margin γ if σ(p · x) = y, and |p · x − 1/2| ≥ γ. [sent-55, score-0.178]

34 If x and y were encountered in a trial of a learning algorithm, we say that p is correct with margin γ on that trial. [sent-56, score-0.229]

35 2 Entropy, relative entropy, and variation Recall that, for probability distributions p = (p1 , . . . [sent-58, score-0.055]

36 Relative entropy and variation distance are related by Pinsker’s inequality. [sent-65, score-0.26]
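A small numeric sanity check (ours, not from the paper) of the quantities in this section: entropy H, relative entropy D, and variation distance V, together with Pinsker's inequality D(p||q) ≥ 2 V(p, q)², using natural logarithms and the convention V(p, q) = (1/2) Σ_i |p_i − q_i|; the exact constant in the paper's Lemma 1 may be stated differently.

```python
# Numeric check of Pinsker's inequality D(p||q) >= 2 * V(p,q)^2 (natural logs,
# V(p,q) = (1/2) * sum_i |p_i - q_i|) on random distributions.
import numpy as np


def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))


def relative_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q, where=p > 0, out=np.zeros_like(p)))


def variation(p, q):
    return 0.5 * np.sum(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for _ in range(5):
        p = rng.dirichlet(np.ones(10))
        q = rng.dirichlet(np.ones(10))
        assert relative_entropy(p, q) >= 2 * variation(p, q) ** 2 - 1e-9
    print("Pinsker's inequality held on all random trials.")
```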

37 3 Information geometry Relative entropy obeys something like the Pythagorean Theorem. [sent-68, score-0.262]

38 Lemma 2 ([9]) Suppose q is a probability distribution, C is a convex set of probability distributions, and r is the element of C that minimizes D(r||q). [sent-69, score-0.086]
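A small numeric illustration (ours) of Lemma 2: take q uniform, let C be the subset of the simplex cut out by a single margin constraint, compute the I-projection r of q onto C with a generic solver, and check that D(p||q) ≥ D(p||r) + D(r||q) for random members p of C. The solver choice and all names are our own, and the check only holds up to solver tolerance.

```python
# Check of the Pythagorean-like property of relative entropy (Lemma 2) on a toy
# convex set C = {p on the simplex : p . x >= 1/2 + gamma}, with q uniform and
# r the I-projection of q onto C (minimizer of D(r||q) over C).
import numpy as np
from scipy.optimize import minimize


def D(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


n, gamma = 8, 0.1
x = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)   # single constraint p . x >= 0.6
q = np.full(n, 1.0 / n)

cons = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
        {"type": "ineq", "fun": lambda p: p @ x - (0.5 + gamma)}]
r = minimize(lambda p: D(p, q), q, bounds=[(0.0, 1.0)] * n,
             constraints=cons, method="SLSQP").x

rng = np.random.default_rng(2)
tested = 0
for _ in range(200):
    p = rng.dirichlet(np.ones(n))
    if p @ x >= 0.5 + gamma:                           # only test members of C
        tested += 1
        assert D(p, q) >= D(p, r) + D(r, q) - 1e-3     # tolerance for the approximate r
print(f"Lemma 2 inequality held on {tested} sampled members of C.")
```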

39 In our analysis, we will assume that there is always a feasible pt . [sent-73, score-0.789]

40 Theorem 3 If there is a fixed probability distribution p∗ that is correct with margin γ on all trials, OMEγ makes at most ln n/(2γ²) mistakes. [sent-75, score-0.278]

41 Proof: We will show that a mistake causes the entropy of the hypothesis to drop by at least 2γ². [sent-76, score-0.424]

42 Since the constraints only become more restrictive, the entropy never increases, and so the fact that the entropy lies between 0 and ln n will complete the proof. [sent-77, score-0.608]

43 The definition of pt+1 ensures that pt+1 · xt is on the correct side of 1/2 by at least γ. [sent-79, score-0.204]

44 Either pt+1 · xt − pt · xt ≥ γ, or the bitwise complement c(xt ) of xt satisfies pt+1 · c(xt ) − pt · c(xt ) ≥ γ. [sent-82, score-2.07]

45 Therefore, Pinsker’s Inequality (Lemma 1) implies that D(pt+1 ||pt ) ≥ 2γ². (2) [sent-84, score-0.037]

46 Let Ct be the set of all probability distributions that satisfy the constraints in effect when pt was chosen, and let u = (1/n, . . . [sent-85, score-0.87]

47 Since pt+1 is in Ct (it must satisfy the constraints that pt did), Lemma 2 implies D(pt+1 ||u) ≥ D(pt+1 ||pt ) + D(pt ||u) and thus D(pt+1 ||u) − D(pt ||u) ≥ D(pt+1 ||pt ) which, since D(p||u) = (ln n) − H(p) for all p, implies H(pt )−H(pt+1 ) ≥ D(pt+1 ||pt ). [sent-89, score-0.929]

48 Because H(pt ) is always at least H(p∗ ), the same analysis leads to a mistake bound of (ln n − H(p∗ ))/(2γ²). [sent-92, score-0.235]

49 Further, a nearly identical proof establishes the following (details are omitted from this abstract). [sent-93, score-0.048]

50 Theorem 4 Suppose OMEγ is modified so that p1 is set to be something other than the uniform distribution, and each pt minimizes D(pt ||p1 ) subject to the same constraints. [sent-94, score-0.877]

51 If there is a fixed p∗ that is correct with margin γ on all trials, the modified algorithm makes at most D(p∗ ||p1 )/(2γ²) mistakes. [sent-95, score-0.173]
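In code terms, the Theorem 4 variant changes only the objective of the earlier ome_weights sketch: minimize D(p||p1) for a chosen prior p1 instead of maximizing entropy, and start the solver from p1. A hedged sketch (ours) of the replacement objective:

```python
# Sketch (ours) of the Theorem 4 variant: each p_t minimizes D(p || p1) for an
# arbitrary prior p1 instead of maximizing entropy, subject to the same margin
# constraints.  With the ome_weights sketch above, only the objective changes:
# pass kl_to_prior(p1) in place of neg_entropy and start the solver from p1.
import numpy as np


def kl_to_prior(p1, eps=1e-12):
    """Return f(p) = D(p || p1), to be minimized in place of -H(p)."""
    p1 = np.asarray(p1, dtype=float)
    return lambda p: np.sum(p * (np.log(p + eps) - np.log(p1 + eps)))
```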

52 4 Maximum Entropy for Learning Disjunctions In this section, we show how the maximum entropy principle can be used to efficiently learn disjunctions. [sent-96, score-0.264]

53 For a feature vector x ∈ {0, 1}n and a class designation y ∈ {0, 1}, say that p is correct at threshold b with margin γ if σb (p · x) = y, and |p · x − b| ≥ γ. [sent-98, score-0.188]

54 Theorem 6 Suppose there is a probability distribution p∗ that is correct at threshold b, with a margin γ+ on all trials t with yt = 1, and with margin γ− on all trials with yt = 0. [sent-106, score-0.602]

55 Proof: The outline of the proof is similar to the proof of Theorem 3. [sent-108, score-0.096]

56 We will show that mistakes cause the entropy of the algorithm’s hypothesis to decrease. [sent-109, score-0.286]

57 Arguing as in the proof of Theorem 3, H(pt+1 ) ≤ H(pt ) − D(pt+1 ||pt ). [sent-110, score-0.048]

58 Lemma 5 then implies that H(pt+1 ) ≤ H(pt ) − d(pt+1 · xt ||pt · xt ). [sent-111, score-0.365]

59 (3) If there was a mistake on trial t for which yt = 1, then pt · xt ≤ b, and pt+1 · xt ≥ b + γ+ . [sent-112, score-1.496]

60 Thus in this case d(pt+1 · xt ||pt · xt ) ≥ d(b + γ+ ||b). [sent-113, score-0.328]

61 Similarly, if there was a mistake on trial t for which yt = 0, then d(pt+1 · xt ||pt · xt ) ≥ d(b − γ− ||b). [sent-114, score-0.707]

62 Once again, these two bounds on d(pt+1 · xt ||pt · xt ), together with (3) and the fact that the entropy is between 0 and ln n, complete the proof. [sent-115, score-0.709]

63 The analysis of Theorem 6 can also be used to prove bounds for the case in which mistakes of different types have different costs, as considered in [12]. [sent-116, score-0.089]

64 For example, if γ = 1/4, Theorem 6 gives a bound of 7. [sent-118, score-0.066]

65 Corollary 7 If there are k of the n features, such that each yt is the disjunction of those features in xt , then algorithm OME1/(ek),1/k−1/(ek),1/(ek) makes at most ek ln n mistakes. [sent-121, score-0.56]

66 Proof Sketch: If the target weight vector p∗ assigns equal weight to each of the variables in the disjunction, then when y = 1, the weight of variables evaluating to 1 is at least 1/k, and when y = 0, it is 0. [sent-122, score-0.083]
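The following is an illustrative check (ours, not from the paper) of this reduction: for a hidden disjunction over k of n Boolean variables, the hindsight weighting p∗ that puts weight 1/k on each relevant variable satisfies p∗·x ≥ 1/k = b + γ₊ when y = 1 and p∗·x = 0 = b − γ₋ when y = 0, with b = 1/(ek), γ₊ = 1/k − 1/(ek) and γ₋ = 1/(ek) as in Corollary 7. All variable names and the sampling setup are our own.

```python
# Verify numerically that the uniform weighting over the k relevant variables
# is correct at threshold b = 1/(e*k) with the margins used in Corollary 7.
import math
import numpy as np

n, k = 50, 5
rng = np.random.default_rng(1)
relevant = rng.choice(n, size=k, replace=False)

p_star = np.zeros(n)
p_star[relevant] = 1.0 / k                  # hindsight weighting from the Proof Sketch

b = 1.0 / (math.e * k)
gamma_plus = 1.0 / k - 1.0 / (math.e * k)
gamma_minus = 1.0 / (math.e * k)

for _ in range(1000):
    x = rng.integers(0, 2, size=n)
    y = int(x[relevant].any())              # label = disjunction of the k relevant bits
    if y == 1:
        assert p_star @ x >= b + gamma_plus - 1e-12
    else:
        assert p_star @ x <= b - gamma_minus + 1e-12
print("p* is correct at threshold b with margins (gamma_plus, gamma_minus) on all samples.")
```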

67 Plugging into Theorem 6, simplifying and overapproximating completes the proof. [sent-124, score-0.041]

68 Note that in the case of disjunctions, it leads to a weaker 6k ln n bound. [sent-130, score-0.126]

69 Theorem 9 If there is a probability distribution p∗ that is correct at threshold b with a margin γ on all trials, then OMEb,γ,γ makes at most 3b ln n/γ² mistakes. [sent-131, score-0.192]

70 5 Relaxed on-line maximum entropy algorithms Let us refer to the halfspace of probability distributions that satisfy the constraint of trial t as Tt and to the associated separating hyperplane as Jt . [sent-132, score-0.536]

71 Recall that Ct is the set of feasible solutions to all the constraints in effect when pt is chosen. [sent-133, score-0.178]

72 (Figure 1: In ROME, the constraints Ct in effect before the tth round are replaced by the halfspace St .) [sent-134, score-0.832]

73 So pt+1 maximizes entropy subject to membership in Ct+1 = Tt ∩ Ct . [sent-135, score-0.351]

74 Our proofs only used the following facts about the OME algorithm: (a) pt+1 ∈ Tt , (b) pt is the maximum entropy member of Ct , and (c) pt+1 ∈ Ct . [sent-136, score-1.121]

75 Suppose At is the set of weight vectors with entropy at least that of pt . [sent-137, score-1.038]

76 Finally, let St be the halfspace with boundary Ht containing pt+1 . [sent-139, score-0.051]

77 (The least obvious is (b), which follows since Ht is tangent to At at pt , and the entropy function is strictly concave.) [sent-142, score-1.036]

78 Also, as previously observed by Littlestone [19], the algorithm might just as well not respond to trials in which there is no mistake. [sent-143, score-0.075]

79 Proposition 10 If trial t is a mistake, and q maximizes entropy subject to membership in St ∩ Tt , then it is on the separating hyperplane for Tt . [sent-148, score-0.534]

80 Proof: Because q and p both satisfy St , any convex combination of the two satisfies St . [sent-149, score-0.064]

81 Thus, if q was on the interior of Tt , we could find a probability distribution with higher entropy that still satisfies both St and Tt by taking a tiny step from q toward p. [sent-150, score-0.242]

82 This would contradict the assumption that q is the maximum entropy member of St ∩ Tt . [sent-151, score-0.301]

83 This implies that the next hypothesis of a ROME algorithm is either on Jt (the separating hyperplane of Tt ) only, or on both Jt and Ht (the separating hyperplane of St ). [sent-152, score-0.26]

84 The following theorem will enable us to obtain a formula in either case. [sent-153, score-0.073]

85 Lemma 11 Suppose q is a probability distribution, and C is a set defined by linear constraints as follows: for an m × n real matrix A and an m-dimensional column vector b, C = {r : Ar = b}. [sent-155, score-0.058]

86 Then, if r is the member of C minimizing D(r||q), there are scalar constants Z, c1 , . [sent-156, score-0.037]

87 If the next hypothesis pt+1 of a ROME algorithm is on Ht , then by Lemma 2, it and all other members of Ht satisfy D(pt+1 ||u) = D(pt+1 ||pt ) + D(pt ||u). [sent-163, score-0.09]

88 Thus, in this case, pt+1 also minimizes D(q||pt ) from among the members q of Ht ∩ Jt . [sent-164, score-0.033]

89 Thus, Lemma 11 implies that pt+1,i /pt,i is the same for all i with xt,i = 1, and the same for all i with xt,i = 0. [sent-165, score-0.037]

90 This implies that, for ROMEb,γ+,γ− , if there was a mistake on a trial t,

pt+1,i =
  (b+γ+) pt,i / (pt · xt)             if xt,i = 1 and yt = 1
  (1−(b+γ+)) pt,i / (1 − pt · xt)     if xt,i = 0 and yt = 1
  (b−γ−) pt,i / (pt · xt)             if xt,i = 1 and yt = 0
  (1−(b−γ−)) pt,i / (1 − pt · xt)     if xt,i = 0 and yt = 0.    (4)

[sent-166, score-2.222]

91 If pt+1 is not on the separating hyperplane for St , then it must maximize entropy subject to membership in Tt alone, and therefore subject to membership in Jt . [sent-168, score-0.534]

92 In this case, Lemma 11 implies the following. [sent-169, score-0.265]

93 pt+1,i =
  (b+γ+) / |{j : xt,j = 1}|           if xt,i = 1 and yt = 1
  (1−(b+γ+)) / |{j : xt,j = 0}|       if xt,i = 0 and yt = 1
  (b−γ−) / |{j : xt,j = 1}|           if xt,i = 1 and yt = 0
  (1−(b−γ−)) / |{j : xt,j = 0}|       if xt,i = 0 and yt = 0.    (5)
[sent-170, score-0.228]

94 If this is the case, then pt+1 defined as in (5) should be a member of St . [sent-171, score-0.037]

95 Evaluating the gradient of H at pt , and simplifying a bit, we can see that St = { q : Σ_{i=1}^{n} qi ln(1/pt,i) ≤ H(pt) }.

96 Summing up, a way to implement a ROME algorithm with the same mistake bound as the corresponding OME algorithm is to • try defining pt+1 as in (5), and check whether the resulting pt+1 ∈ St ; if so, use it, and • if not, then define pt+1 as in (4) instead. [sent-174, score-0.277]
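A minimal sketch (ours, not the authors' code) of this two-step recipe, using the reconstructed updates (4) and (5) and the entropy-halfspace test for St. It assumes xt contains both ones and zeros and that pt · xt lies strictly between 0 and 1; the prediction tie convention and all parameter names are our own.

```python
# ROME_{b, gp, gm}: predict by thresholding p . x at b; on a mistake, try the
# flat update (5), keep it if it lies in S_t = {q : sum_i q_i ln(1/p_i) <= H(p)},
# otherwise fall back to the Winnow-like multiplicative update (4).
import numpy as np


def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))


def update_eq4(p, x, y, b, gp, gm):
    """Multiplicative update (4): rescale active and inactive coordinates so
    that the new weight vector puts exactly b+gp (or b-gm) on the active ones."""
    target = (b + gp) if y == 1 else (b - gm)
    return np.where(x == 1, target * p / (p @ x), (1 - target) * p / (1 - p @ x))


def update_eq5(p, x, y, b, gp, gm):
    """Update (5): spread the target mass uniformly over the active coordinates
    (assumes x has at least one 1 and at least one 0)."""
    target = (b + gp) if y == 1 else (b - gm)
    n_on, n_off = np.sum(x == 1), np.sum(x == 0)
    return np.where(x == 1, target / n_on, (1 - target) / n_off)


def rome_step(p, x, y, b, gp, gm, eps=1e-12):
    """One trial of ROME: predict, and update the weights only on a mistake."""
    y_hat = int(p @ x > b)
    if y_hat == y:
        return p, y_hat
    q = update_eq5(p, x, y, b, gp, gm)
    in_St = np.sum(q * np.log(1.0 / (p + eps))) <= entropy(p, eps) + 1e-9
    return (q if in_St else update_eq4(p, x, y, b, gp, gm)), y_hat
```

For the disjunction setting of Corollary 7, one would run rome_step with b = 1/(ek), gp = 1/k − 1/(ek) and gm = 1/(ek), starting from the uniform distribution over the n features.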

97 Acknowledgements We are grateful to Tony Jebara and Tong Zhang for helpful conversations, and an anonymous referee for suggesting a simplification of the proof of Theorem 3. [sent-175, score-0.048]

98 An analog of the minimax theorem for vector payoffs. [sent-191, score-0.073]

99 A new algorithm for minimizing convex functions over convex sets. [sent-324, score-0.103]

100 A sequential approximation bound for some sample-dependent convex optimization problems with applications in learning. [sent-332, score-0.107]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pt', 0.789), ('entropy', 0.227), ('mistake', 0.169), ('xt', 0.164), ('winnow', 0.149), ('yt', 0.114), ('tt', 0.113), ('ln', 0.111), ('ome', 0.102), ('trial', 0.096), ('st', 0.094), ('margin', 0.093), ('ek', 0.09), ('littlestone', 0.088), ('ys', 0.088), ('disjunctions', 0.085), ('pinsker', 0.085), ('rome', 0.085), ('ht', 0.079), ('lemma', 0.074), ('theorem', 0.073), ('ct', 0.07), ('bound', 0.066), ('trials', 0.054), ('subject', 0.054), ('jt', 0.054), ('xs', 0.054), ('halfspace', 0.051), ('omeb', 0.051), ('romma', 0.051), ('tth', 0.05), ('hyperplane', 0.049), ('proof', 0.048), ('membership', 0.047), ('constraints', 0.043), ('bounds', 0.043), ('pi', 0.042), ('convex', 0.041), ('disjunction', 0.041), ('kivinen', 0.041), ('correct', 0.04), ('separating', 0.038), ('member', 0.037), ('maximum', 0.037), ('implies', 0.037), ('qi', 0.036), ('angluin', 0.034), ('della', 0.034), ('jebara', 0.034), ('proofs', 0.031), ('mistakes', 0.031), ('designation', 0.03), ('hypothesis', 0.028), ('relaxed', 0.027), ('suppose', 0.027), ('majority', 0.026), ('colt', 0.026), ('threshold', 0.025), ('meila', 0.025), ('base', 0.024), ('langford', 0.024), ('online', 0.023), ('satisfy', 0.023), ('maximizes', 0.023), ('relative', 0.022), ('satis', 0.022), ('inequality', 0.022), ('blum', 0.022), ('weight', 0.022), ('algorithm', 0.021), ('simplifying', 0.021), ('seeger', 0.021), ('analyze', 0.02), ('analyzed', 0.02), ('gordon', 0.02), ('tangent', 0.02), ('completes', 0.02), ('proved', 0.019), ('weighted', 0.019), ('makes', 0.019), ('predicts', 0.019), ('crammer', 0.019), ('round', 0.019), ('something', 0.019), ('variation', 0.018), ('jaakkola', 0.018), ('members', 0.018), ('maximize', 0.018), ('evaluating', 0.017), ('simpli', 0.017), ('proceeds', 0.016), ('geometry', 0.016), ('minimizes', 0.015), ('prove', 0.015), ('replaced', 0.015), ('weaker', 0.015), ('ensemble', 0.015), ('pages', 0.015), ('probability', 0.015), ('related', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

Author: Philip M. Long, Xinyu Wu

Abstract: We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. 1

2 0.31659204 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

Author: Tzu-kuo Huang, Chih-jen Lin, Ruby C. Weng

Abstract: The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated. 1

3 0.27384126 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

4 0.1999923 138 nips-2004-Online Bounds for Bayesian Algorithms

Author: Sham M. Kakade, Andrew Y. Ng

Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. 1

5 0.18032306 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1

6 0.097762868 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

7 0.090215124 164 nips-2004-Semi-supervised Learning by Entropy Minimization

8 0.089677162 102 nips-2004-Learning first-order Markov models for control

9 0.077223383 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

10 0.07036601 64 nips-2004-Experts in a Markov Decision Process

11 0.066854164 136 nips-2004-On Semi-Supervised Classification

12 0.066529296 168 nips-2004-Semigroup Kernels on Finite Sets

13 0.064656772 115 nips-2004-Maximum Margin Clustering

14 0.063663185 116 nips-2004-Message Errors in Belief Propagation

15 0.062519461 48 nips-2004-Convergence and No-Regret in Multiagent Learning

16 0.061455429 175 nips-2004-Stable adaptive control with online learning

17 0.060192253 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

18 0.058191627 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images

19 0.057139684 82 nips-2004-Incremental Algorithms for Hierarchical Classification

20 0.056896869 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.172), (1, 0.013), (2, 0.248), (3, 0.113), (4, 0.145), (5, -0.205), (6, 0.004), (7, 0.089), (8, 0.083), (9, 0.144), (10, -0.018), (11, 0.065), (12, 0.112), (13, -0.106), (14, -0.011), (15, -0.107), (16, -0.065), (17, -0.219), (18, 0.032), (19, 0.154), (20, 0.208), (21, 0.105), (22, -0.167), (23, -0.139), (24, -0.112), (25, -0.109), (26, -0.091), (27, 0.111), (28, -0.148), (29, -0.025), (30, 0.109), (31, 0.112), (32, -0.098), (33, -0.13), (34, 0.039), (35, 0.017), (36, 0.003), (37, 0.116), (38, 0.009), (39, -0.081), (40, -0.056), (41, -0.071), (42, 0.011), (43, 0.036), (44, -0.101), (45, -0.025), (46, 0.016), (47, 0.072), (48, 0.024), (49, -0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98507792 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

Author: Philip M. Long, Xinyu Wu

Abstract: We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. 1

2 0.83490562 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

Author: Tzu-kuo Huang, Chih-jen Lin, Ruby C. Weng

Abstract: The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated. 1

3 0.6000908 138 nips-2004-Online Bounds for Bayesian Algorithms

Author: Sham M. Kakade, Andrew Y. Ng

Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. 1

4 0.5818907 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

5 0.38870895 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1

6 0.29294473 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

7 0.27443328 175 nips-2004-Stable adaptive control with online learning

8 0.25844485 102 nips-2004-Learning first-order Markov models for control

9 0.24681969 136 nips-2004-On Semi-Supervised Classification

10 0.24088737 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

11 0.23588076 130 nips-2004-Newscast EM

12 0.23153843 116 nips-2004-Message Errors in Belief Propagation

13 0.22662146 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

14 0.21506402 82 nips-2004-Incremental Algorithms for Hierarchical Classification

15 0.20418297 74 nips-2004-Harmonising Chorales by Probabilistic Inference

16 0.20291771 28 nips-2004-Bayesian inference in spiking neurons

17 0.19771864 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem

18 0.19512501 115 nips-2004-Maximum Margin Clustering

19 0.19455503 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification

20 0.19298035 164 nips-2004-Semi-supervised Learning by Entropy Minimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.122), (15, 0.09), (26, 0.068), (31, 0.047), (32, 0.016), (33, 0.175), (35, 0.015), (39, 0.015), (50, 0.048), (63, 0.298)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82829487 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces

Author: N. J. Hill, Thomas N. Lal, Karin Bierig, Niels Birbaumer, Bernhard Schölkopf

Abstract: Motivated by the particular problems involved in communicating with “locked-in” paralysed patients, we aim to develop a braincomputer interface that uses auditory stimuli. We describe a paradigm that allows a user to make a binary decision by focusing attention on one of two concurrent auditory stimulus sequences. Using Support Vector Machine classification and Recursive Channel Elimination on the independent components of averaged eventrelated potentials, we show that an untrained user’s EEG data can be classified with an encouragingly high level of accuracy. This suggests that it is possible for users to modulate EEG signals in a single trial by the conscious direction of attention, well enough to be useful in BCI. 1

same-paper 2 0.80632055 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

Author: Philip M. Long, Xinyu Wu

Abstract: We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. 1

3 0.67630309 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of A DA B OOST to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for A DA B OOST to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1

4 0.63005763 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

Author: Liam Paninski

Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximationtheoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.

5 0.62825102 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1

6 0.62728822 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

7 0.62485415 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

8 0.62468565 102 nips-2004-Learning first-order Markov models for control

9 0.62392813 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

10 0.62333202 131 nips-2004-Non-Local Manifold Tangent Learning

11 0.62072092 116 nips-2004-Message Errors in Belief Propagation

12 0.62034512 163 nips-2004-Semi-parametric Exponential Family PCA

13 0.61922336 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

14 0.61767197 70 nips-2004-Following Curved Regularized Optimization Solution Paths

15 0.61692983 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

16 0.61680162 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

17 0.61668515 64 nips-2004-Experts in a Markov Decision Process

18 0.61651295 23 nips-2004-Analysis of a greedy active learning strategy

19 0.61525756 124 nips-2004-Multiple Alignment of Continuous Time Series

20 0.61488205 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification