nips nips2008 nips2008-15 knowledge-graph by maker-knowledge-mining

15 nips-2008-Adaptive Martingale Boosting


Source: pdf

Author: Phil Long, Rocco Servedio

Abstract: In recent work Long and Servedio [LS05] presented a “martingale boosting” algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e. it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. We present an adaptive variant of the martingale boosting algorithm. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original [LS05] algorithm, such as random classification noise tolerance, and has other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidencerated weak hypotheses that output real values rather than Boolean predictions.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: In recent work Long and Servedio [LS05] presented a “martingale boosting” algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [sent-6, score-1.022]

2 [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i. [sent-7, score-0.924]

3 it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. [sent-9, score-0.48]

4 We present an adaptive variant of the martingale boosting algorithm. [sent-10, score-0.529]

5 This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. [sent-11, score-0.865]

6 A rich theory of boosting has been developed over the past two decades; see [Sch03, MR03] for some overviews. [sent-14, score-0.29]

7 Two important issues for boosting algorithms which are relevant to the current work are adaptiveness and noise-tolerance; we briefly discuss each of these issues before describing the contributions of this paper. [sent-15, score-0.477]

8 “Adaptiveness” refers to the ability of boosting algorithms to adjust to different accuracy levels in the sequence of weak hypotheses that they are given. [sent-17, score-0.89]

9 The first generation of boosting algorithms [Sch90, Fre95] required the user to input an “advantage” parameter γ such that the weak learner was guaranteed to always output a weak hypothesis with accuracy at least 1/2 + γ. [sent-18, score-1.501]

10 Adaptiveness is an important property since it is often the case that the advantage of successive weak classifiers grows smaller and smaller as boosting proceeds. [sent-20, score-0.792]

11 AdaBoost does not require a lower bound γ on the minimum advantage, and the error rate of its final hypothesis depends favorably on the different advantages of the different weak classifiers in the sequence. [sent-22, score-0.619]

12 More precisely, if the accuracy of the t-th weak classifier is $\frac{1}{2} + \gamma_t$, then the AdaBoost final hypothesis has error at most $\prod_{t=0}^{T-1} \sqrt{1 - 4\gamma_t^2}$. [sent-23, score-0.578]
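
As a quick sanity check (ours, not taken from the excerpt), specializing the AdaBoost bound above to a constant advantage $\gamma_t = \gamma$ and using $1-x \le e^{-x}$ gives:

```latex
\prod_{t=0}^{T-1}\sqrt{1-4\gamma^2}
  \;=\; \left(1-4\gamma^2\right)^{T/2}
  \;\le\; \exp\!\left(-2\gamma^2 T\right),
\qquad\text{so } T = O\!\left(\log(1/\epsilon)/\gamma^2\right)
\text{ rounds suffice for error } \epsilon .
```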

13 One drawback of many standard boosting techniques, including AdaBoost, is that they can perform poorly when run on noisy data [FS96, MO97, Die00, LS08]. [sent-26, score-0.311]

14 In particular, the algorithms of [KS05, LS05] have been shown to boost to optimally high accuracy in the presence of random classification noise when run with a random classification noise tolerant weak learner. [sent-28, score-0.616]

15 While the noise tolerance of the boosters [KS05, LS05] is an attractive feature, a drawback of these algorithms is that they do not enjoy the adaptiveness of algorithms like AdaBoost. [sent-33, score-0.313]

16 The MMM booster of [KS05] is not known to have any adaptiveness at all, and the “martingale boosting” algorithm of [LS05] only has the following limited type of adaptiveness. [sent-34, score-0.359]

17 where in the t-th stage a collection of t + 1 weak hypotheses are obtained; let γt denote the minimum advantage of these t + 1 hypotheses obtained in stage t. [sent-38, score-0.844]

18 [LS05] shows that the final hypothesis constructed by martingale boosting has error at most $\exp\!\left(-\frac{\left(\sum_{t=0}^{T-1}\gamma_t\right)^{2}}{2T}\right)$. [sent-39, score-0.657]

19 Consider, for example, a sequence of weak classifiers in which the advantages decrease as $\gamma_t = 1/\sqrt{t+1}$ (this is in line with the oft-occurring situation, mentioned above, that advantages grow smaller and smaller as boosting progresses). [sent-41, score-0.788]

20 We give the first boosting algorithm that is both adaptive enough to satisfy a bound of $\exp\!\left(-\Omega\!\left(\sum_{t=0}^{T-1}\gamma_t^{2}\right)\right)$ and is provably tolerant to random classification noise. [sent-45, score-0.459]
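
A short back-of-the-envelope comparison (ours, continuing the $\gamma_t = 1/\sqrt{t+1}$ example above) shows why replacing $(\sum_t \gamma_t)^2/(2T)$ by $\sum_t \gamma_t^2$ in the exponent matters:

```latex
% Old exponent stays bounded, so the error bound never drops below a constant:
\frac{\left(\sum_{t=0}^{T-1} 1/\sqrt{t+1}\right)^{2}}{2T}
  \;\approx\; \frac{(2\sqrt{T})^{2}}{2T} \;=\; 2,
\qquad
% New exponent grows with T, so the error bound does go to 0:
\sum_{t=0}^{T-1} \frac{1}{t+1} \;\approx\; \ln T .
```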

21 We do this by modifying the martingale boosting algorithm of [LS05] to make it adaptive; the modification inherits the noise-tolerance of the original [LS05] algorithm. [sent-46, score-0.549]

22 (The quantity tracked during the random walk is the difference between the number of positive predictions and the number of negative predictions made by base classifiers encountered in the branching program up to a given point in time. [sent-51, score-0.265]

23 This is a natural way to exploit the fact that examples reaching such a large-advantage node usually tend to walk in the right direction. [sent-55, score-0.288]

24 The idea extends straightforwardly to let us handle confidence-rated weak hypotheses (see [SS99]) whose predictions are real values in [−1, 1] as opposed to Boolean values from {−1, 1}. [sent-56, score-0.554]

25 This is done simply by scaling the step size for a given example x from a given node according to the numerical value h(x) that the confidence-rated weak hypothesis h at that node assigns to example x. [sent-57, score-0.858]

26 In particular, if a branching program is constructed naively based on this approach, it is possible for the number of nodes to increase exponentially with the depth. [sent-59, score-0.665]

27 To avoid this, we use a randomized rounding scheme together with the variable-step random walk to ensure that the number of nodes in the branching program grows polynomially rather than exponentially in the number of stages in the random walk (i. [sent-60, score-0.944]

28 In fact, we actually improve on the efficiency of the original martingale boosting algorithm of [LS05] by a polynomial factor, by truncating “extreme” nodes in the branching program that are “far” from the origin. [sent-63, score-1.153]

29 Our analysis shows that this truncation has only a small effect on the accuracy of the final classifier, while giving a significant asymptotic savings in the size of the final branching program (roughly $1/\gamma^{3}$ nodes as opposed to the $1/\gamma^{4}$ nodes of [KS05, LS05]). [sent-64, score-0.769]

30 As usual, our boosting algorithms work by repeatedly passing a distribution D′ derived from D to a weak learner, which outputs a classifier h. [sent-71, score-0.728]

31 We briefly recall some key aspects of the martingale boosting algorithm of [LS05] which are shared by our algorithm (and note some differences). [sent-76, score-0.519]

32 Both boosters work by constructing a leveled branching program. [sent-77, score-0.482]

33 Each node in the branching program has a location; this is a pair (β, t) where β is a real value (a location on the line) and t ≥ 0 is an integer (the level of the node; each level corresponds to a distinct stage of boosting). [sent-78, score-0.977]

34 the booster constructs nodes in the branching program at levels 0, 1, 2, . [sent-83, score-0.877]

35 For a location (β, t) where the branching program has a node, let Dβ,t be the distribution D conditioned on reaching the node at (β, t). [sent-87, score-0.751]

36 We sometimes refer to this distribution Dβ,t as the distribution induced by node (β, t). [sent-88, score-0.143]

37 As boosting proceeds, in stage t, each node (β, t) at level t is assigned a hypothesis which we call hβ,t . [sent-89, score-0.69]

38 Unlike [LS05] we shall allow confidence-rated hypotheses, so each weak hypothesis is a mapping from X to [−1, 1]. [sent-90, score-0.551]

39 Once the hypothesis hβ,t has been obtained, out-edges are constructed from (β, t) to its child nodes at level t + 1. [sent-91, score-0.377]

40 To fully specify our new boosting algorithm we must describe: (1) How the weak learner is run at each node (β, t) to obtain a weak classifier. [sent-94, score-1.524]

41 This is straightforward for the basic case of “two-sided” weak learners that we describe in Section 3 and somewhat less straightforward in the usual (non-two-sided) weak learner setting. [sent-95, score-1.106]

42 In Section 5.1 we describe how to use a standard weak learner, and how to handle noise – both extensions borrow heavily from earlier work [LS05, KS05]. [sent-97, score-0.501]

43 (2) What function is used to label the node (β, t), i. [sent-98, score-0.143]

44 It turns out that this function is a randomized version of the weak classifier mentioned in point (1) above. [sent-101, score-0.46]

45 (3) Where to place the child nodes at level t + 1; this is closely connected with (2) above. [sent-102, score-0.227]

46 As in [LS05], once the branching program has been fully constructed down through some level T the final hypothesis it computes is very simple. [sent-103, score-0.742]

47 Given an input example x, the output of the final hypothesis on x is sgn(β) where (β, T ) is the location in level T to which x is ultimately routed as it passes through the branching program. [sent-104, score-0.695]
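
To make the evaluation procedure concrete, here is a minimal Python sketch (our illustration, not the authors' code) of how such a leveled branching program could classify an example once training is done. The `levels` and `gammas` containers are hypothetical names, and a real implementation would index nodes by integer grid positions rather than float keys.

```python
import random

def classify(x, levels, gammas, T):
    """Evaluate a trained leveled branching program of the kind described above.

    levels[t] maps a node location beta at level t to its confidence-rated weak
    hypothesis h_{beta,t} : X -> [-1, 1]; gammas[t] is the advantage at level t.
    (Illustrative data layout only; the paper does not prescribe one.)
    """
    beta = 0.0                                # start at the single node (0, 0)
    for t in range(T):
        h = levels[t][beta]                   # weak hypothesis at node (beta, t)
        target = beta + gammas[t] * h(x)      # ideal real-valued next location
        grid = gammas[t] / 2.0                # children live on multiples of gamma_t / 2
        i, rho = divmod(target / grid, 1.0)   # target = (i + rho) * grid, 0 <= rho < 1
        # randomized rounding: expected landing spot equals the ideal target
        beta = (i + 1) * grid if random.random() < rho else i * grid
    return 1 if beta >= 0 else -1             # final hypothesis is sgn(beta); tie at 0 broken arbitrarily
```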

48 3 Boosting a two-sided weak learner. In this section we assume that we have a two-sided weak learner. [sent-105, score-1.071]

49 This is an algorithm which, given a distribution D, can always obtain hypotheses that have two-sided advantage as defined below: Definition 1. A hypothesis $h : X \to [-1, 1]$ has two-sided advantage $\gamma$ with respect to D if it satisfies both $E_{x\in D^{+}}[h(x)] \ge \gamma$ and $E_{x\in D^{-}}[h(x)] \le -\gamma$. [sent-106, score-0.333]
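
Definition 1 suggests a direct empirical check; the sketch below (ours, with hypothetical sample lists `pos_examples` and `neg_examples`) estimates the two-sided advantage of a candidate hypothesis from labeled samples.

```python
def estimated_two_sided_advantage(h, pos_examples, neg_examples):
    """Estimate min( E_{x in D+}[h(x)], -E_{x in D-}[h(x)] ).

    h has two-sided advantage gamma (Definition 1) exactly when this quantity
    is at least gamma; here the expectations are replaced by sample averages.
    """
    avg_pos = sum(h(x) for x in pos_examples) / len(pos_examples)
    avg_neg = sum(h(x) for x in neg_examples) / len(neg_examples)
    return min(avg_pos, -avg_neg)
```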

50 In Section 5.1 we may apply methods of [LS05] to reduce the typical case, in which we only receive “normal” weak hypotheses rather than two-sided weak hypotheses, to this case. [sent-108, score-0.992]

51 The branching program starts off with a single node at location (0, 0). [sent-109, score-0.724]

52 Assuming the branching program has been constructed up through level t, we now explain how it is extended in the t-th stage up through level t + 1. [sent-110, score-0.773]

53 There are two basic steps in each stage: weak training and branching. [sent-111, score-0.438]

54 Consider a given node at location (β, t) in the branching program. [sent-113, score-0.611]

55 As in [LS05] we construct a weak hypothesis hβ,t simply by running the two-sided weak learner on examples drawn from Dβ,t and letting hβ,t be the hypothesis it generates. [sent-114, score-1.337]

56 Now we define the advantage at level t to be $\gamma_t \stackrel{\mathrm{def}}{=} \min_{\beta} \gamma_{\beta,t}$. [sent-118, score-0.141]

57 Intuitively, we would like to use γt as a scaling factor for the “step size” of the random walk at level t. [sent-120, score-0.178]

58 Since we are using confidence-rated weak hypotheses, it is also natural to have the step that example x takes at a given node be proportional to the value of the confidence-rated hypothesis at that node on x. [sent-121, score-0.858]

59 The most direct way to do this would be to label the node (β, t) with the weak classifier hβ,t and to route each example x to a node at location (β + γt hβ,t (x), t + 1). [sent-122, score-0.835]

60 However, there are obvious difficulties with this approach; for one thing a single node at (β, t) could give rise to arbitrarily many (infinitely many, if |X| = ∞) nodes at level t+1. [sent-123, score-0.335]

61 Even if the hypotheses hβ,t were all guaranteed to be {−1, 1}-valued, if we were to construct a branching program in this way then it could be the case that by the T-th stage there are $2^{T-1}$ distinct nodes at level T. [sent-124, score-0.888]

62 We get around this problem by creating nodes at level t + 1 only at integer multiples of γt . [sent-125, score-0.25]

63 This keeps us from having too many nodes in the branching program at level t + 1. [sent-127, score-0.706]

64 Of course, we only actually create those nodes in the branching program that have an incoming edge as described below (later we will give an analysis to bound the number of such nodes). [sent-128, score-0.666]

65 To simulate routing an example x to (β + γt hβ,t (x), t + 1), the branching program routes x randomly along one of these two edges so that the expected location at which x ends up is (β + γt hβ,t (x), t + 1). [sent-130, score-0.607]

66 More precisely, if β + γt hβ,t (x) = (i + ρ) · γt /2 where 0 ≤ ρ < 1, then the rule used at node (β, t) to route an example x is “with probability ρ send x to ((i + 1) · γt /2, t + 1) and with probability (1 − ρ) send x to (i · γt /2, t + 1). [sent-131, score-0.233]
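
One can verify directly that this randomized rule is unbiased, which is what lets the rounding preserve the random-walk analysis:

```latex
% Expected landing position when \beta + \gamma_t h_{\beta,t}(x) = (i+\rho)\,\gamma_t/2,\; 0 \le \rho < 1:
\rho\cdot\frac{(i+1)\,\gamma_t}{2} \;+\; (1-\rho)\cdot\frac{i\,\gamma_t}{2}
 \;=\; \frac{(i+\rho)\,\gamma_t}{2}
 \;=\; \beta + \gamma_t\, h_{\beta,t}(x).
```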

67 ” Since |hβ,t (x)| ≤ 1 for all x by assumption, it is easy to see that at most eight outgoing edges are required from each node (β, t). [sent-132, score-0.188]

68 Thus the branching program that the booster constructs uses a randomized variant of each weak hypothesis hβ,t to route examples along one of (at most) eight outgoing edges. [sent-133, score-1.466]

69 4 Proof of correctness for boosting a two-sided weak learner. The following theorem shows that the algorithm described above is an effective adaptive booster for two-sided weak learners: Theorem 2. Consider running the above booster for T stages. [sent-134, score-1.803]

70 , γT −1 > 0 be defined as described above, so each invocation of the two-sided weak learner on distribution Dβ,t yields a hypothesis hβ,t that has γβ,t ≥ γt . [sent-141, score-0.777]

71 Then the final hypothesis h constructed by the booster satisfies $\Pr_{x\in D}[h(x) \neq c(x)] \le \exp\!\left(-\frac{1}{8}\sum_{t=0}^{T-1}\gamma_t^{2}\right)$. The algorithm makes at most $M \le O(1)\cdot\sum_{t=0}^{T-1}\frac{1}{\gamma_t}\sum_{j=0}^{t-1}\gamma_j$ calls to the weak learner (and constructs a branching program with at most M nodes). [sent-142, score-0.908]

72 , T we define the random variable At as follows: given a draw of x from D+ (the original distribution D restricted to positive examples), the value of At is γt−1 hβ,t−1 (x), where (β, t − 1) is the location of the node that x reaches at level t of the branching program. [sent-150, score-0.758]

73 Intuitively At captures the direction and size of the move that we would like x to make during the branching step that brings it to level t. [sent-151, score-0.521]

74 We define Bt to be the random variable that captures the direction and size of the move that x actually makes during the branching step that brings it to level t. [sent-152, score-0.543]

75 Let $X_t$ denote $\sum_{i=1}^{t} B_i$, so the value of $X_t$ is the actual location on the real line where x ends up at level t. [sent-156, score-0.145]

76 on x reaching any particular location (β, t − 1)), we have that x is distributed according to $(D_{\beta,t-1})^{+}$, and thus we have $E[X_t \mid X_{t-1}] = X_{t-1} + E_{x\in(D_{\beta,t-1})^{+}}[\gamma_{t-1} h_{\beta,t-1}(x)] \ge X_{t-1} + \gamma_{t-1}\gamma_{\beta,t-1} \ge X_{t-1} + \gamma_{t-1}^{2}$, (5) where the first inequality follows from the two-sided advantage of $h_{\beta,t-1}$. [sent-160, score-0.136]
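
To see how (5) leads to an error bound of the advertised form, the usual argument (sketched here by us; the paper's own proof may differ in its details) centers the walk and applies an Azuma–Hoeffding bound, using the fact noted later that each step moves a node by at most roughly $2\gamma_{t-1}$:

```latex
% Y_t := X_t - \sum_{i=0}^{t-1}\gamma_i^2 is a submartingale by (5), so for a positive example
\Pr_{x\in D^{+}}\!\left[X_T < 0\right]
 \;\le\; \Pr\!\left[\,Y_T < -\sum_{t=0}^{T-1}\gamma_t^{2}\right]
 \;\le\; \exp\!\left(-\frac{\bigl(\sum_{t=0}^{T-1}\gamma_t^{2}\bigr)^{2}}
                          {2\sum_{t=0}^{T-1}\bigl(2\gamma_t\bigr)^{2}}\right)
 \;=\; \exp\!\left(-\frac{1}{8}\sum_{t=0}^{T-1}\gamma_t^{2}\right).
```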

77 So we have established (4); it remains to bound the number of nodes constructed in the branching program. [sent-185, score-0.59]

78 Let us write $M_t$ to denote the number of nodes at level t, so $M = \sum_{t=0}^{T-1} M_t$. [sent-186, score-0.192]

79 The t-th level of boosting can cause the rightmost (leftmost) node to be at most 2γt−1 distance farther away from the origin than the rightmost (leftmost) node at the (t − 1)-st level. [sent-187, score-0.721]

80 This means that at level t, every node is at a position (β, t) with $|\beta| \le 2\sum_{j=0}^{t-1}\gamma_j$. [sent-188, score-0.221]

81 Since nodes are placed at integer multiples of $\gamma_t/2$, we have that $M = \sum_{t=0}^{T-1} M_t \le O(1)\cdot\sum_{t=0}^{T-1}\frac{1}{\gamma_t}\sum_{j=0}^{t-1}\gamma_j$. [sent-189, score-0.172]

82 Consider the case in which each advantage γt is just γ and we are boosting to accuracy ǫ. [sent-191, score-0.359]
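
Plugging constants into the node bound above (our quick calculation) recovers the $1/\gamma^{4}$ scaling mentioned earlier:

```latex
% With \gamma_t = \gamma for all t and T = O(\log(1/\epsilon)/\gamma^{2}) stages:
M \;\le\; O(1)\cdot\sum_{t=0}^{T-1}\frac{1}{\gamma}\sum_{j=0}^{t-1}\gamma
  \;=\; O(1)\cdot\sum_{t=0}^{T-1} t
  \;=\; O\!\left(T^{2}\right)
  \;=\; O\!\left(\frac{\log^{2}(1/\epsilon)}{\gamma^{4}}\right).
```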

83 4.1 Improving efficiency by freezing extreme nodes. Here we describe a variant of the algorithm from the previous section that constructs a branching program with fewer nodes. [sent-196, score-0.771]

84 For t ≥ 1, after the execution of step t − 1 of boosting, when all nodes at level t have been created, each node (α, t) with $|\alpha| > \sqrt{8\left(\sum_{s=0}^{t-1}\gamma_s^{2}\right)\left(2\ln t + \ln\frac{4}{\epsilon}\right)}$ is “frozen.”

85 The algorithm commits to classifying any test examples routed to any such nodes according to sgn(α), and these nodes are not used to generate weak hypotheses during the next round of training. [sent-199, score-0.91]
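
A minimal Python sketch (ours; `gammas` and `epsilon` are assumed inputs, and the threshold mirrors the freezing rule as reconstructed above) of the test applied when level t is created:

```python
import math

def freeze_threshold(t, gammas, epsilon):
    """Distance from the origin beyond which nodes at level t >= 1 are frozen:
    sqrt( 8 * sum_{s < t} gamma_s^2 * (2 ln t + ln(4 / epsilon)) )."""
    variance = sum(g * g for g in gammas[:t])
    return math.sqrt(8.0 * variance * (2.0 * math.log(t) + math.log(4.0 / epsilon)))

def is_frozen(alpha, t, gammas, epsilon):
    # Frozen nodes are labeled sgn(alpha) and never passed to the weak learner again.
    return abs(alpha) > freeze_threshold(t, gammas, epsilon)
```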

86 We have the following theorem about the performance of this algorithm: Theorem 3 Consider running the modified booster for T stages. [sent-200, score-0.196]

87 , γT > 0 be defined as described above, so each invocation of the weak learner on distribution Dβ,t yields a hypothesis hβ,t that has γβ,t ≥ γt . [sent-207, score-0.777]

88 Then the final output hypothesis h of the booster satisfies $\Pr_{x\in D}[h(x) \neq c(x)] \le \frac{\epsilon}{2} + \exp\!\left(-\frac{1}{8}\sum_{t=0}^{T-1}\gamma_t^{2}\right)$. (7)

89 The algorithm makes $O\!\left(\sqrt{\left(\sum_{t=0}^{T-1}\gamma_t^{2}\right)\left(\ln T + \ln\frac{1}{\epsilon}\right)}\cdot\sum_{t=0}^{T-1}\frac{1}{\gamma_t}\right)$ calls to the weak learner. [sent-209, score-0.636]

90 Let At be the distance from the origin t−1 2 past which examples are frozen in round t; i. [sent-213, score-0.173]

91 Nearly exactly the same analysis as proves (6) can be used here: for a positive example x to be incorrectly frozen in round t, it must be the case that $X_t < -A_t$, or equivalently $Y_t < -A_t - \sum_{i=0}^{t-1}\gamma_i^{2}$; the definition of $A_t$ then gives us that $\Pr_{x\in D^{+}}[x \text{ incorrectly frozen in round } t]$ is at most $\Pr\!\left[Y_t \le -A_t - \sum_{i=0}^{t-1}\gamma_i^{2}\right]$.

92 The bound on the number of calls to the weak learner follows from the fact that there are O(At /γt ) such calls in each stage of boosting, and the fact that At ≤ (8 T −1 s=0 2 γs )(2 ln T + ln 4 ) for all t. [sent-219, score-0.979]

93 ǫ It is easy to check that if γt = γ for all t, taking T = O(log(1/ǫ)/γ 2) the algorithm in this section will construct an ǫ-accurate hypothesis that is an O(log2 (1/ǫ)/γ 3 )-node branching program. [sent-220, score-0.534]
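
The same constant-$\gamma$ bookkeeping (again ours, hiding logarithmic factors inside $\tilde O$) shows where the $1/\gamma^{3}$ comes from:

```latex
% With \gamma_t = \gamma and T = O(\log(1/\epsilon)/\gamma^{2}), the freezing radius satisfies
A_t \;=\; O\!\left(\sqrt{\gamma^{2} T \,\log\tfrac{T}{\epsilon}}\right)
    \;=\; \tilde O\!\left(\log\tfrac{1}{\epsilon}\right),
\qquad
M \;=\; \sum_{t=0}^{T-1} O\!\left(\frac{A_t}{\gamma}\right)
  \;=\; T \cdot \tilde O\!\left(\frac{\log(1/\epsilon)}{\gamma}\right)
  \;=\; \tilde O\!\left(\frac{\log^{2}(1/\epsilon)}{\gamma^{3}}\right).
```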

94 1 Standard weak learners In Sections 3 and 4, we assumed that the boosting algorithm had access to a two-sided weak learner, which is more accurate than random guessing on both the positive and the negative examples separately. [sent-222, score-1.373]

95 To make use of a standard weak learner, which is merely more accurate than random guessing on average, we can borrow ideas from [LS05]. [sent-223, score-0.523]

96 Ex∈D [h(x)c(x)] Ex∈D [h(x)]+1 , so the We will use a standard weak learner to simulate a two-sided weak learner as follows. [sent-236, score-1.292]

97 Given a distribution D, the two-sided weak learner will pass D to the standard weak learner, take its output g, and return h = g . [sent-237, score-1.071]
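
The precise reduction is hard to read off from this excerpt; as a hedged illustration of the balancing idea it relies on (Section 5.2 below refers to "balancing the distribution before passing it to the weak learner"), one way to reweight a sample so that positives and negatives carry equal mass is sketched here. The function name and inputs are our own, and the exact standard-to-two-sided reduction in [LS05] may differ.

```python
def balanced_weights(labels):
    """Reweight a sample so positive and negative examples each get total mass 1/2.

    Illustration only; assumes both classes are present.  labels[i] is +1 or -1.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return [0.5 / n_pos if y == 1 else 0.5 / n_neg for y in labels]
```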

98 Lemma 7 is easily seen to imply counterparts of Theorems 2 and 3 in which the requirement of a two-sided weak learner is weakened to require only standard weak learning, but each γt is replaced with γt /2. [sent-245, score-1.071]

99 2 Tolerating random classification noise As in [LS05], noise tolerance is facilitated by the fact that the path through the network is not affected by altering the label of an example. [sent-247, score-0.133]

100 On the other hand, balancing the distribution before passing it to the weak learner, which was needed to use a standard weak learner, may disturb the independence between the event that an example is noisy, and the random draw of x. [sent-248, score-0.934]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('weak', 0.438), ('branching', 0.401), ('ex', 0.353), ('boosting', 0.29), ('booster', 0.196), ('learner', 0.195), ('yt', 0.193), ('martingale', 0.189), ('prx', 0.171), ('node', 0.143), ('adaptiveness', 0.143), ('hypotheses', 0.116), ('nodes', 0.114), ('program', 0.113), ('hypothesis', 0.113), ('xt', 0.107), ('walk', 0.078), ('level', 0.078), ('bt', 0.076), ('frozen', 0.076), ('classi', 0.075), ('location', 0.067), ('stage', 0.066), ('calls', 0.064), ('ln', 0.057), ('adaboost', 0.056), ('boosters', 0.053), ('er', 0.05), ('ers', 0.047), ('freezing', 0.047), ('route', 0.044), ('servedio', 0.043), ('advantage', 0.042), ('examples', 0.04), ('bound', 0.038), ('noise', 0.038), ('guessing', 0.038), ('nal', 0.037), ('constructed', 0.037), ('balancing', 0.036), ('polynomially', 0.036), ('jcss', 0.036), ('rocco', 0.036), ('routed', 0.036), ('learners', 0.035), ('child', 0.035), ('tolerance', 0.035), ('balanced', 0.034), ('constructs', 0.034), ('pr', 0.033), ('stages', 0.033), ('round', 0.032), ('integer', 0.031), ('bagging', 0.031), ('invocation', 0.031), ('tolerant', 0.031), ('negative', 0.031), ('advantages', 0.03), ('mt', 0.03), ('adaptive', 0.03), ('prd', 0.029), ('constructing', 0.028), ('exp', 0.028), ('incorrectly', 0.028), ('lemma', 0.028), ('con', 0.027), ('multiples', 0.027), ('philip', 0.027), ('progresses', 0.027), ('reaching', 0.027), ('accuracy', 0.027), ('original', 0.026), ('simulate', 0.026), ('origin', 0.025), ('borrow', 0.025), ('kalai', 0.025), ('rounding', 0.025), ('eight', 0.024), ('inherits', 0.024), ('enjoy', 0.023), ('send', 0.023), ('successive', 0.022), ('random', 0.022), ('randomized', 0.022), ('fewer', 0.022), ('issues', 0.022), ('outgoing', 0.021), ('positive', 0.021), ('drawback', 0.021), ('brings', 0.021), ('def', 0.021), ('rightmost', 0.021), ('step', 0.021), ('variant', 0.02), ('algorithm', 0.02), ('leftmost', 0.02), ('boolean', 0.02), ('levels', 0.019), ('cation', 0.019), ('sgn', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 15 nips-2008-Adaptive Martingale Boosting

Author: Phil Long, Rocco Servedio

Abstract: In recent work Long and Servedio [LS05] presented a “martingale boosting” algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e. it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. We present an adaptive variant of the martingale boosting algorithm. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original [LS05] algorithm, such as random classification noise tolerance, and has other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidencerated weak hypotheses that output real values rather than Boolean predictions.

2 0.13301592 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features

Author: Tae-kyun Kim, Roberto Cipolla

Abstract: We present a new co-clustering problem of images and visual features. The problem involves a set of non-object images in addition to a set of object images and features to be co-clustered. Co-clustering is performed in a way that maximises discrimination of object images from non-object images, thus emphasizing discriminative features. This provides a way of obtaining perceptual joint-clusters of object images and features. We tackle the problem by simultaneously boosting multiple strong classifiers which compete for images by their expertise. Each boosting classifier is an aggregation of weak-learners, i.e. simple visual features. The obtained classifiers are useful for object detection tasks which exhibit multimodalities, e.g. multi-category and multi-view object detection tasks. Experiments on a set of pedestrian images and a face data set demonstrate that the method yields intuitive image clusters with associated features and is much superior to conventional boosting classifiers in object detection tasks. 1

3 0.13263108 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces

Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu

Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1

4 0.13174795 112 nips-2008-Kernel Measures of Independence for non-iid Data

Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola

Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1

5 0.1266965 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

Author: Giovanni Cavallanti, Nicolò Cesa-bianchi, Claudio Gentile

Abstract: We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N −(1+α)(2+α)/2(3+α) where N denotes the number of √ queried labels, and α > 0 is the exponent in the low noise condition. For all α > 3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N −(1+α)/(2+α) achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors. 1

6 0.1165188 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

7 0.10641775 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning

8 0.099577151 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

9 0.099465646 170 nips-2008-Online Optimization in X-Armed Bandits

10 0.092761181 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

11 0.092758328 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

12 0.089312255 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

13 0.076964416 65 nips-2008-Domain Adaptation with Multiple Sources

14 0.07050965 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing

15 0.066764437 168 nips-2008-Online Metric Learning and Fast Similarity Search

16 0.065204903 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

17 0.064034812 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates

18 0.06157263 219 nips-2008-Spectral Hashing

19 0.06089396 206 nips-2008-Sequential effects: Superstition or rational behavior?

20 0.055989619 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.17), (1, 0.014), (2, -0.097), (3, -0.044), (4, -0.145), (5, 0.13), (6, -0.0), (7, 0.034), (8, 0.162), (9, -0.104), (10, -0.035), (11, 0.052), (12, 0.039), (13, -0.064), (14, -0.003), (15, 0.042), (16, -0.005), (17, -0.103), (18, 0.05), (19, -0.043), (20, 0.058), (21, -0.01), (22, 0.044), (23, 0.054), (24, 0.001), (25, 0.055), (26, -0.034), (27, -0.105), (28, -0.004), (29, -0.043), (30, 0.164), (31, 0.02), (32, -0.006), (33, -0.053), (34, 0.098), (35, -0.087), (36, -0.007), (37, 0.059), (38, 0.079), (39, -0.016), (40, 0.047), (41, -0.126), (42, -0.123), (43, -0.141), (44, 0.016), (45, -0.013), (46, 0.014), (47, -0.111), (48, 0.013), (49, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96675241 15 nips-2008-Adaptive Martingale Boosting

Author: Phil Long, Rocco Servedio

Abstract: In recent work Long and Servedio [LS05] presented a “martingale boosting” algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e. it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. We present an adaptive variant of the martingale boosting algorithm. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original [LS05] algorithm, such as random classification noise tolerance, and has other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidencerated weak hypotheses that output real values rather than Boolean predictions.

2 0.54102319 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

Author: Giovanni Cavallanti, Nicolò Cesa-bianchi, Claudio Gentile

Abstract: We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N −(1+α)(2+α)/2(3+α) where N denotes the number of √ queried labels, and α > 0 is the exponent in the low noise condition. For all α > 3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N −(1+α)/(2+α) achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors. 1

3 0.52895194 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates

Author: Richard Nock, Frank Nielsen

Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1

4 0.52454734 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features

Author: Tae-kyun Kim, Roberto Cipolla

Abstract: We present a new co-clustering problem of images and visual features. The problem involves a set of non-object images in addition to a set of object images and features to be co-clustered. Co-clustering is performed in a way that maximises discrimination of object images from non-object images, thus emphasizing discriminative features. This provides a way of obtaining perceptual joint-clusters of object images and features. We tackle the problem by simultaneously boosting multiple strong classifiers which compete for images by their expertise. Each boosting classifier is an aggregation of weak-learners, i.e. simple visual features. The obtained classifiers are useful for object detection tasks which exhibit multimodalities, e.g. multi-category and multi-view object detection tasks. Experiments on a set of pedestrian images and a face data set demonstrate that the method yields intuitive image clusters with associated features and is much superior to conventional boosting classifiers in object detection tasks. 1

5 0.52441925 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree

Author: Daphna Weinshall, Hynek Hermansky, Alon Zweig, Jie Luo, Holly Jimison, Frank Ohl, Misha Pavel

Abstract: Unexpected stimuli are a challenge to any machine learning algorithm. Here we identify distinct types of unexpected events, focusing on ’incongruent events’ when ’general level’ and ’specific level’ classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of label hierarchy, we show how partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy. An incongruent event is an event where the probability computed based on some more specific level (in accordance with the partial order) is much smaller than the probability computed based on some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership. Respectively, we show promising results with real data on two specific problems: Out Of Vocabulary words in speech recognition, and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition.

6 0.51710719 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces

7 0.50701666 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

8 0.45686492 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning

9 0.4477717 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning

10 0.4345392 112 nips-2008-Kernel Measures of Independence for non-iid Data

11 0.42777362 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

12 0.39612707 168 nips-2008-Online Metric Learning and Fast Similarity Search

13 0.37080181 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

14 0.36808065 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition

15 0.36365071 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

16 0.3530125 170 nips-2008-Online Optimization in X-Armed Bandits

17 0.35040447 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime

18 0.34808022 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss

19 0.34741843 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

20 0.34437245 206 nips-2008-Sequential effects: Superstition or rational behavior?


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.122), (7, 0.047), (12, 0.021), (15, 0.014), (28, 0.19), (57, 0.069), (59, 0.014), (63, 0.028), (71, 0.028), (77, 0.039), (78, 0.028), (83, 0.047), (96, 0.256)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83230078 15 nips-2008-Adaptive Martingale Boosting

Author: Phil Long, Rocco Servedio

Abstract: In recent work Long and Servedio [LS05] presented a “martingale boosting” algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e. it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. We present an adaptive variant of the martingale boosting algorithm. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original [LS05] algorithm, such as random classification noise tolerance, and has other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidencerated weak hypotheses that output real values rather than Boolean predictions.

2 0.68475592 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

Author: Hamed Masnadi-shirazi, Nuno Vasconcelos

Abstract: The machine learning problem of classifier design is studied from the perspective of probability elicitation, in statistics. This shows that the standard approach of proceeding from the specification of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the specification of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classification problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classifier design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations. 1

3 0.68295586 62 nips-2008-Differentiable Sparse Coding

Author: J. A. Bagnell, David M. Bradley

Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1

4 0.68271941 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

5 0.68029189 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

Author: Shai Shalev-shwartz, Sham M. Kakade

Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions. 1

6 0.68023056 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions

7 0.6763311 85 nips-2008-Fast Rates for Regularized Objectives

8 0.67569429 24 nips-2008-An improved estimator of Variance Explained in the presence of noise

9 0.67492843 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

10 0.67470407 75 nips-2008-Estimating vector fields using sparse basis field expansions

11 0.67441922 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

12 0.67367357 226 nips-2008-Supervised Dictionary Learning

13 0.67316723 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

14 0.67315942 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

15 0.67276782 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes

16 0.67134243 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features

17 0.67132962 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube

18 0.67036539 239 nips-2008-Tighter Bounds for Structured Estimation

19 0.66962296 196 nips-2008-Relative Margin Machines

20 0.66951972 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms