nips nips2009 nips2009-193 knowledge-graph by maker-knowledge-mining

193 nips-2009-Potential-Based Agnostic Boosting


Source: pdf

Author: Varun Kanade, Adam Kalai

Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 com Abstract We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). [sent-4, score-0.335]

2 Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. [sent-5, score-0.475]

3 A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. [sent-6, score-0.215]

4 Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). [sent-7, score-0.977]

5 Aggressive reweighting of data may lead to poor performance in the presence of certain types of noise [1]. [sent-10, score-0.102]

6 This has been addressed by a number of “robust” boosting algorithms, such as SmoothBoost [2, 3] and MadaBoost [4] as well as boosting by branching programs [5, 6]. [sent-11, score-0.705]

7 The present work gives a simple potential-based boosting algorithm with guarantees in the (arbitrary noise) agnostic learning setting [8, 9]. [sent-15, score-0.786]

8 A unique feature of our algorithm, illustrated in Figure 1, is that it does not alter the distribution on unlabeled examples but rather it alters the labels. [sent-16, score-0.09]

9 This enables us to prove a strong boosting theorem in which the weak learner need only succeed for one distribution on unlabeled examples. [sent-17, score-0.83]

10 To the best of our knowledge, earlier weak-to-strong boosting theorems have always relied on the ability of the weak learner to succeed under arbitrary distributions. [sent-18, score-0.762]

11 The utility of our boosting theorem is demonstrated by re-deriving two non-trivial results in computational learning theory, namely agnostically learning decision trees [10] and agnostically learning halfspaces [11], which were previously solved using very different techniques. [sent-19, score-0.977]

12 The main contributions of this paper are, first, giving the first provably noise-tolerant analysis of a potential-based boosting algorithm, and, second, giving a distribution-specific boosting theorem that does not require the weak learner to learn over all distributions on x ∈ X. [sent-20, score-1.095]

13 This is in contrast to recent work by Long and Servedio, showing that convex potential boosters cannot work in the presence of random classification noise [12]. [sent-21, score-0.143]

14 First, the algorithm has the possibility of negating the current hypothesis and hence is not technically a standard potential-based boosting algorithm. [sent-23, score-0.463]

15 Second, weak agnostic learning is more challenging than weak learning with random classification noise, in the sense that an algorithm which is a weak-learner in the random classification noise setting need not be a weak-learner in the agnostic setting. [sent-24, score-1.292]

16 There is a substantial literature on robust boosting algorithms, including the algorithms already mentioned, MadaBoost and SmoothBoost, as well as LogitBoost [13], BrownBoost [14], and others. (Figure 1: Simplified Boosting by Relabeling Procedure.) Inputs: (x_1, y_1), … [sent-26, score-0.335]

17 …, (x_m, y_m) ∈ X × {−1, 1}, T ≥ 1, and a weak learner W. [sent-29, score-0.392]

18 For t = 1, …, T: (a) for i = 1, …, m, set w_i^t = min{1, exp(−H^{t−1}(x_i) y_i)}; with probability w_i^t set ỹ_i^t = y_i, otherwise pick ỹ_i^t ∈ {−1, 1} uniformly at random; (b) g^t = W((x_1, ỹ_1^t), …). [sent-39, score-0.215]

19 (c) h^t = argmax over g ∈ {g^t, −sign(H^{t−1})} of Σ_i w_i^t y_i g(x_i) /* possibly take negated hypothesis */; (d) γ^t = (1/m) Σ_{i=1}^m w_i^t y_i h^t(x_i); (e) H^t(x) = H^{t−1}(x) + γ^t h^t(x). [sent-44, score-0.473]

20 Each epoch, the algorithm runs the weak learner on the relabeled data (x_i, ỹ_i^t)_{i=1}^m. [sent-47, score-0.534]
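
The relabeling loop described in the procedure above can be summarized in a few lines of NumPy. The sketch below is illustrative only: weak_learner(X, labels) is assumed to be a black box returning a {−1, +1}-valued classifier, and details such as tie-breaking in sign(·) are choices made here, not taken from the paper.

    import numpy as np

    def boost_by_relabeling(X, y, weak_learner, T, seed=0):
        # Sketch of the Figure 1 procedure: relabel examples instead of reweighting them.
        # X: (m, d) array; y: (m,) array with entries in {-1, +1}.
        # weak_learner(X, labels) -> callable h with h(Z) in {-1, +1}^len(Z).
        rng = np.random.default_rng(seed)
        m = y.shape[0]
        hyps, coefs = [], []          # weak hypotheses h^t and step sizes gamma^t

        def H(Z):                     # current real-valued combination H^{t-1}
            out = np.zeros(Z.shape[0])
            for c, h in zip(coefs, hyps):
                out = out + c * h(Z)
            return out

        for t in range(T):
            scores = H(X)
            # w_i^t = min{1, exp(-H^{t-1}(x_i) y_i)}, written to avoid overflow
            w = np.exp(-np.maximum(scores * y, 0.0))
            keep = rng.random(m) < w
            y_tilde = np.where(keep, y, rng.choice([-1, 1], size=m))   # random relabeling
            g = weak_learner(X, y_tilde)                               # g^t
            prev = list(zip(coefs, hyps))
            neg = lambda Z, s=prev: -np.sign(sum(c * h(Z) for c, h in s) + 1e-12)  # -sign(H^{t-1})
            # step (c): keep whichever candidate has larger weighted correlation
            h_t = max((g, neg), key=lambda cand: np.mean(w * y * cand(X)))
            gamma_t = float(np.mean(w * y * h_t(X)))                   # step (d)
            hyps.append(h_t)
            coefs.append(gamma_t)
        return lambda Z: np.sign(H(Z))

A decision stump fit to (X, y_tilde) is one natural choice of weak_learner here, matching the stumps used as weak learners in the experiments reported below.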

21 In traditional boosting, on each epoch, H^t is a linear combination of weak hypotheses. [sent-48, score-0.209]

22 For our agnostic analysis, we also need to include the negated current hypothesis, − sign(H t−1 ) : X → {−1, 1}, as a possible weak classifier. [sent-49, score-0.646]

23 In practice, to avoid adding noise, each example would be replaced with three weighted examples: (x_i, y_i) with weight w_i^t, and (x_i, ±1) each with weight (1 − w_i^t)/2. [sent-50, score-0.117]
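
A small sketch of that fractional variant, assuming the weak learner accepts per-example weights; the function name and array layout are illustrative.

    import numpy as np

    def fractional_relabel(X, y, w):
        # Replace each (x_i, y_i) by three weighted copies instead of flipping a coin:
        # (x_i, y_i) with weight w_i, and (x_i, +1), (x_i, -1) each with weight (1 - w_i) / 2.
        # The three weights sum to 1 for every example.
        X_rep = np.repeat(X, 3, axis=0)
        labels = np.column_stack([y, np.ones_like(y), -np.ones_like(y)]).ravel()
        weights = np.column_stack([w, (1.0 - w) / 2.0, (1.0 - w) / 2.0]).ravel()
        return X_rep, labels, weights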

24 These are all simple boosting algorithms whose output is a weighted majority of classifiers. [sent-52, score-0.335]

25 Many have been shown to have formal boosting properties (weak to strong PAC-learning) in a noiseless setting, or partial boosting properties in noisy settings. [sent-53, score-0.729]

26 There has also been a line of work on boosting algorithms that provably boost from weak to strong learners either under agnostic or random classification noise, using branching programs [17, 20, 5, 21, 6]. [sent-54, score-1.066]

27 Second, since we don’t change the distribution over unlabeled examples, we can boost distribution-specific weak learners. [sent-57, score-0.306]

28 In recent work, using a similar idea of relabeling, Kalai, Kanade and Mansour [22] proved that the class of DNFs is learnable in a one-sided error agnostic learning model. [sent-58, score-0.403]

29 The main differences are: (1) there is the possibility of using the negation of the current hypothesis at each step, (2) examples are relabeled rather than reweighted, and (3) the step size is slightly different. [sent-62, score-0.263]

30 (Preliminaries.) In the agnostic setting, we consider learning with respect to a distribution over X × Y. [sent-67, score-0.403]

31 The error and correlation of a classifier h : X → {−1, 1} with respect to D are err(h, D) = Pr_{(x,y)∼D}[h(x) ≠ y] and cor(h, D) = E_{(x,y)∼D}[h(x)y]; the correlation is the quantity typically referred to as the edge in the boosting literature. [sent-74, score-0.335]

32 A different definition of weakly accurate classifier is appropriate in the agnostic setting. [sent-78, score-0.403]

33 Namely, for some γ ∈ (0, 1), h : X → {−1, 1} is said to be γ-optimal for C (and D) if cor(h, D) ≥ γ · cor(C, D). Hence, if the labels are totally random then a weak hypothesis need not have any correlation over random guessing. [sent-79, score-0.371]
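
As a concrete reading of this definition, the empirical correlation (edge) and the γ-optimality test look as follows; best_cor stands in for cor(C, D) and is an assumed input, not something computed here.

    import numpy as np

    def empirical_cor(h_vals, y):
        # Empirical correlation (edge) of predictions h(x_i) in {-1, +1} with labels y_i.
        return float(np.mean(np.asarray(h_vals) * np.asarray(y)))

    def is_gamma_optimal(h_vals, y, best_cor, gamma):
        # h is gamma-optimal for C if cor(h, D) >= gamma * cor(C, D);
        # here best_cor plays the role of cor(C, D).
        return empirical_cor(h_vals, y) >= gamma * best_cor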

34 The goal is to boost from an algorithm capable of outputting γ-optimal hypotheses to one which outputs a nearly 1-optimal hypothesis, even for small γ. [sent-81, score-0.156]

35 We now define the distribution D relabeled by w, RD,w . [sent-84, score-0.096]

36 Procedurally, one can think of generating a sample from RD,w by drawing an example (x, y) from D, then with probability w(x, y), outputting (x, y) directly, and with probability 1 − w(x, y), outputting (x, y′) where y′ is uniformly random in {−1, 1}. [sent-85, score-0.12]

37 Formally, RD,w(x, y) = D(x, y)·(w(x, y) + (1 − w(x, y))/2) + D(x, −y)·(1 − w(x, −y))/2. Note that D and RD,w have the same marginal distributions over unlabeled examples x ∈ X. [sent-86, score-0.09]
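
A direct transcription of that sampling procedure; draw_from_D and w are assumed callables and the names are illustrative.

    import numpy as np

    def sample_relabeled(draw_from_D, w, n, seed=0):
        # Draw n examples from R_{D,w}: take (x, y) ~ D, keep y with probability w(x, y),
        # otherwise replace y by a uniformly random label in {-1, +1}.
        rng = np.random.default_rng(seed)
        sample = []
        for _ in range(n):
            x, y = draw_from_D(rng)
            if rng.random() >= w(x, y):
                y = int(rng.choice([-1, 1]))
            sample.append((x, y))
        return sample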

38 A general (m, q)-learning algorithm is given m unlabeled examples x1 , . [sent-90, score-0.116]

39 , xm , and may make q label queries to a query oracle L : X → {−1, 1}, and it outputs a classifier h : X → {−1, 1}. [sent-93, score-0.193]

40 The queries may be active, meaning that queries may only be made to training examples xi , or membership queries meaning that arbitrary examples x ∈ X may be queried. [sent-94, score-0.354]

41 Since our boosting procedure does not change the distribution over unlabeled examples, it offers two advantages: (1) Agnostic weak learning may be defined with respect to a single distribution µ over unlabeled examples, and (2) The weak learning algorithms may be active (or use membership queries). [sent-97, score-0.944]

42 In particular, the agnostic weak learning hypothesis for C and µ is that for any f : X → [−1, 1], given examples from D = ⟨µ, f⟩, the learner will output a γ-optimal classifier for C. [sent-98, score-0.899]

43 The advantages of this new definition are: (a) it is not with respect to every distribution on unlabeled examples (the algorithm may only have guarantees for certain distributions), and (b) it is more realistic as it does not assume noiseless data. [sent-99, score-0.174]

44 Finding such a weak learner may be quite challenging, since it has to succeed in the agnostic model (where no assumption is made on f); however, it may be a bit easier in the sense that the learning algorithm need only handle one particular µ. [sent-100, score-0.829]

45 (Formal boosting procedure and main results.) The formal boosting procedure we analyze is given in Figure 2. [sent-109, score-0.693]

46 …, x_{s+tm}, L^t); (c) let α^t = (1/s) Σ_{i=1}^s y_i g^t(x_i) w^t(x_i, y_i) and β^t = (1/s) Σ_{i=1}^s y_i (−sign(H^{t−1}(x_i))) w^t(x_i, y_i); (d) if α^t ≥ β^t, set h^t = g^t and γ^t = α^t; else set h^t = −sign(H^{t−1}) and γ^t = β^t; (e) H^t(x) = H^{t−1}(x) + γ^t h^t(x). [sent-131, score-0.479]
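
Steps (c)–(e) amount to comparing two weighted-correlation estimates on the fresh sample and keeping the better candidate. A sketch follows, where the exact form of the estimator (in particular the y_i factor) is an assumption on our part.

    import numpy as np

    def choose_round_hypothesis(g_vals, H_prev_vals, y, w):
        # g_vals, H_prev_vals: values of g^t and H^{t-1} on the fresh sample x_1, ..., x_s.
        # y: labels from the oracle; w: conservative weights w^t(x_i, y_i).
        neg_vals = -np.sign(H_prev_vals + 1e-12)        # -sign(H^{t-1}); sign(0) counted as +1
        alpha = float(np.mean(w * y * g_vals))          # estimate of cor(g^t, relabeled D)
        beta = float(np.mean(w * y * neg_vals))         # estimate of cor(-sign(H^{t-1}), relabeled D)
        if alpha >= beta:
            return g_vals, alpha                        # h^t = g^t, gamma^t = alpha^t
        return neg_vals, beta                           # h^t = -sign(H^{t-1}), gamma^t = beta^t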

47 If W is a (γ, ε₀, δ) weak learner with respect to C and µ, then for s = O((1/(γ²ε²)) log(1/δ)) and T = O(1/(γ²ε²)), Algorithm AGNOSTICBOOSTER (Figure 2) with probability at least 1 − 4δT outputs a hypothesis h satisfying cor(h, D) ≥ cor(C, D) − ε₀/γ − ε. Recall that ε₀ is intended to be very small. [sent-137, score-0.497]

48 Also note that the number of calls to the query oracle L is s plus T times the number of calls made by the weak learner (if the weak learner is active, then so is the boosting algorithm). [sent-140, score-1.173]

49 Agnostically learning decision trees and agnostically learning halfspaces follow as corollaries to Theorem 1. [sent-142, score-0.636]

50 Let C be the class of binary decision trees on {−1, 1}n with at most t leaves, and let U be the uniform distribution on {−1, 1}n . [sent-144, score-0.112]

51 There exists an algorithm that when given t, n, ε, δ > 0, and a label oracle for an arbitrary f : {−1, 1}^n → [−1, 1], makes q = poly(nt/(εδ)) membership queries and, with probability ≥ 1 − δ, outputs h : {−1, 1}^n → {−1, 1} such that for U_f = ⟨U, f⟩, err(h, U_f) ≤ err(C, U_f) + ε. [sent-145, score-0.224]

52 For any fixed ε > 0, there exists a univariate polynomial p such that the following holds: Let n ≥ 1, C be the class of halfspaces in n dimensions, let U be the uniform distribution on {−1, 1}^n, and f : {−1, 1}^n → [−1, 1] be an arbitrary function. [sent-147, score-0.143]

53 ) Note that a related theorem was shown for halfspaces over log-concave distributions over X = Rn . [sent-150, score-0.152]

54 The boosting approach here similarly generalizes to that case in a straightforward manner. [sent-151, score-0.335]

55 This illustrates how, from the point of view of designing provably efficient agnostic learning algorithms, the current boosting procedure may be useful. [sent-152, score-0.764]

56 As is standard, the boosting algorithm can be viewed as minimizing a convex potential function. [sent-154, score-0.405]

57 We show that for a conservative reweighting, either the weak learner will make progress, returning a hypothesis correlated with the relabeled distribution, or −sign(H^{t−1}) will be correlated with the relabeled distribution. [sent-158, score-0.709]

58 Second, if we find a hypothesis correlated with the relabeled distribution, then the potential on round t will be noticeably lower than that of round t − 1. [sent-159, score-0.4]

59 The potential function we consider, as in MadaBoost, is defined by φ : R → R with φ(z) = 1 − z if z ≤ 0 and φ(z) = e^{−z} if z > 0. Define the potential of a (real-valued) hypothesis H with respect to a distribution D over X × {−1, 1} as Φ(H, D) = E_{(x,y)∼D}[φ(yH(x))] (Eq. 2). Note that Φ(H^0, D) = Φ(0, D) = 1. [sent-163, score-0.19]

60 We will show that the potential decreases on every round of the algorithm. [sent-164, score-0.146]

61 Notice that the weights in the boosting algorithm correspond to the derivative of the potential, because −φ′(z) = min{1, exp(−z)} ∈ [0, 1]. [sent-165, score-0.361]
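
Translating the potential and its derivative into code is immediate; the max with 0 below is only there to avoid overflow and does not change the values.

    import numpy as np

    def phi(z):
        # phi(z) = 1 - z for z <= 0, exp(-z) for z > 0.
        z = np.asarray(z, dtype=float)
        return np.where(z <= 0, 1.0 - z, np.exp(-np.maximum(z, 0.0)))

    def weight(z):
        # -phi'(z) = min{1, exp(-z)}; exp(-max(z, 0)) is the same quantity, written to avoid overflow.
        z = np.asarray(z, dtype=float)
        return np.exp(-np.maximum(z, 0.0))

    def potential(H_vals, y):
        # Empirical estimate of Phi(H, D) = E[phi(y * H(x))] on a labeled sample.
        return float(np.mean(phi(np.asarray(y) * np.asarray(H_vals))))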

62 In other words, the weak learning step is essentially a gradient descent step. [sent-166, score-0.209]

63 We next state a key fact about agnostic learning in Lemma 1. [sent-167, score-0.403]

64 Then weighting function w : X × {−1, 1} → [0, 1] is called conservative for h if w(x, −h(x)) = 1 for all x ∈ X. [sent-170, score-0.09]

65 Note that, if the hypothesis is sign(H^t(x)), then a weighting function defined by −φ′(H^t(x)y) is conservative if and only if φ′(z) = −1 for all z < 0. [sent-171, score-0.192]

66 We first show that relabeling according to a conservative weighting function is good in the sense that, if h is far from optimal according to the original distribution, then after relabeling by w it is even further from optimal. [sent-172, score-0.394]

67 We will use Lemma 1 to show that the weak learner will return a useful hypothesis. [sent-182, score-0.383]

68 The case in which the weak learner may not return a useful hypothesis is when cor(C, RD,w) = 0, i.e., when the optimal classifier on the reweighted distribution has no correlation. [sent-183, score-0.527]

69 This can happen, but in this case it means that either our current hypothesis is close to optimal, or h = sign(H t−1 ) is even worse than random guessing, and hence we can use its negation as a weak agnostic learner. [sent-184, score-0.748]

70 Let ht : X → {−1, 1} be the weak hypothesis that the algorithm finds on round t. [sent-191, score-0.543]

71 This may either be the hypothesis returned by the weak learner W or − sign(H t−1 ). [sent-192, score-0.485]

72 The following lemma lower bounds the decrease in potential caused by adding γ t ht to H t−1 . [sent-193, score-0.235]

73 We will apply the following Lemma on each round of the algorithm to show that the potential decreases on each round, as long as the weak hypothesis ht has non-negligible correlation and γ t is suitably chosen. [sent-194, score-0.688]

74 Consider any function H : X → R, hypothesis h : X → [−1, 1], γ ∈ R, and distribution D over X × {−1, 1}. [sent-196, score-0.102]

75 Let D′ = RD,w be the distribution D relabeled by w(x, y) = −φ′(yH(x)). [sent-197, score-0.096]

76 Using all the above lemmas, we will show that the algorithm AGNOSTIC B OOSTER returns a hypothesis with correlation (or error) close to that of the best classifier from C. [sent-203, score-0.188]

77 Let g t be the hypothesis returned by the weak learner W . [sent-207, score-0.485]

78 Consider two cases: first, that cor(c, RD,wt) ≥ ε₀/γ + ε/2; in this case, by the weak learning assumption, cor(g^t, RD,wt) ≥ γε/2. [sent-211, score-0.209]

79 This guarantees that when the algorithm is run for T rounds, on some round t the hypothesis sign(H^t) will have correlation at least sup_{c∈C} cor(c, D) − ε₀/γ − 2ε/3. [sent-217, score-0.315]

80 For s = O((1/(γ²ε²)) log(1/δ)), the empirical estimate of the correlation of the constructed hypothesis on each round is within an additive ε/6 of its true correlation, allowing a further failure probability of δ each round. [sent-218, score-0.26]

81 Thus the final hypothesis H^τ, which has the highest empirical correlation, satisfies cor(H^τ, D) ≥ sup_{c∈C} cor(c, D) − ε₀/γ − ε. Since there is a failure probability of at most 4δ on each round, the algorithm succeeds with probability at least 1 − 4Tδ. [sent-219, score-0.214]

82 (Applications.) We show that recent agnostic learning analyses can be dramatically simplified using our boosting algorithm. [sent-220, score-0.738]

83 Both of the agnostic algorithms are distribution-specific, meaning that they only work on one (or a family) of distributions µ over unlabeled examples. [sent-221, score-0.462]

84 (Agnostically Learning Decision Trees.) Recent work has shown how to agnostically learn polynomial-sized decision trees using membership queries, by an L1 gradient-projection algorithm [10]. [sent-223, score-0.375]

85 Here, we show that learning decision trees is quite simple using our distribution-specific boosting theorem and the Kushilevitz-Mansour membership query parity learning algorithm as a weak learner [24]. [sent-224, score-0.998]

86 Running the KM algorithm, using q = poly(n, t, 1/ε₀) queries, and outputting the parity with largest magnitude of estimated Fourier coefficient, is a (γ = 1/t, ε₀) agnostic weak learner for size-t decision trees over the uniform distribution. [sent-226, score-0.979]

87 (Agnostically Learning Halfspaces.) In the case of learning halfspaces, the weak learner simply finds the degree-d term χS(x), with |S| ≤ d, that has the greatest empirical correlation (1/m) Σ_{i=1}^m χS(x_i) y_i on a data set (x_1, y_1), … [sent-230, score-0.423]
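
A brute-force version of that weak learner, enumerating every χS with |S| ≤ d rather than running KM; this is exponential in d (fine for small constant d) and purely illustrative.

    import numpy as np
    from itertools import combinations

    def best_low_degree_parity(X, y, d):
        # X: (m, n) array with entries in {-1, +1}; y: (m,) labels in {-1, +1}.
        # Returns a classifier z -> s * prod_{j in S} z_j maximizing the magnitude of the
        # empirical correlation (1/m) * sum_i chi_S(x_i) y_i over all |S| <= d.
        m, n = X.shape
        best_S, best_sign, best_abs = [], 1, -1.0
        for k in range(d + 1):
            for S in combinations(range(n), k):
                chi = X[:, list(S)].prod(axis=1)        # chi_S(x); the empty product is 1
                corr = float(np.mean(chi * y))
                if abs(corr) > best_abs:
                    best_S, best_sign, best_abs = list(S), (1 if corr >= 0 else -1), abs(corr)
        return lambda Z: best_sign * Z[:, best_S].prod(axis=1)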

88 Let n ≥ 1, C be the class of halfspaces in n dimensions, let U be the uniform distribution on {−1, 1}n , and f : {−1, 1}n → [−1, 1] be an arbitrary function. [sent-237, score-0.143]

89 (Experiments.) We performed preliminary experiments with the new boosting algorithm presented here on 8 datasets from the UCI repository [26]. [sent-240, score-0.405]

90 We converted multi-class problems into binary classification problems by arbitrarily grouping classes, and ran Adaboost, Madaboost and Agnostic Boost on these datasets, using stumps as weak learners. [sent-241, score-0.239]

91 Since stumps can accept weighted examples, we passed the exact weighted distribution to the weak learner. [sent-242, score-0.239]

92 Rather than keeping the label with probability wt (x, y) and making it completely random with the remaining probability, we added both (x, y) and (x, −y) with weights (1 + wt (x, y))/2 and (1 − wt (x, y))/2 respectively. [sent-244, score-0.213]

93 Experiments with random relabeling showed that random relabeling performs much worse than fractional relabeling. [sent-245, score-0.326]

94 We also added random classification noise of 5%, 10% and 20% to the datasets and ran the boosting algorithms on the modified dataset. [sent-250, score-0.4]

95 (Conclusion.) We show that potential-based agnostic boosting is possible in theory, and also that this may be done without changing the distribution over unlabeled examples. [sent-347, score-0.797]

96 We show that non-trivial agnostic learning results, for learning decision trees and halfspaces, can be viewed as simple applications of our boosting theorem combined with well-known weak learners. [sent-348, score-1.095]

97 Preliminary experiments show that the performance of our boosting algorithm is comparable to that of Madaboost and Adaboost. [sent-350, score-0.361]

98 A more thorough empirical evaluation of our boosting procedure using different weak learners is part of future research. [sent-351, score-0.564]

99 Improvement of boosting algorithm by modifying the weighting rule. [sent-442, score-0.399]

100 An empirical comparison of three boosting algorithms on real data sets with artificial class noise. [sent-467, score-0.335]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cor', 0.551), ('agnostic', 0.403), ('boosting', 0.335), ('weak', 0.209), ('agnostically', 0.189), ('madaboost', 0.172), ('learner', 0.154), ('relabeling', 0.152), ('ht', 0.127), ('uf', 0.12), ('err', 0.116), ('halfspaces', 0.116), ('hypothesis', 0.102), ('sign', 0.101), ('kalai', 0.101), ('relabeled', 0.096), ('round', 0.079), ('wt', 0.071), ('ada', 0.069), ('agn', 0.069), ('mada', 0.069), ('ooster', 0.069), ('lemma', 0.064), ('trees', 0.062), ('mansour', 0.061), ('outputting', 0.06), ('queries', 0.06), ('correlation', 0.06), ('unlabeled', 0.059), ('er', 0.054), ('conservative', 0.052), ('smoothboost', 0.052), ('classi', 0.05), ('decision', 0.05), ('yi', 0.049), ('membership', 0.048), ('kanade', 0.045), ('potential', 0.044), ('noise', 0.042), ('reweighted', 0.042), ('parity', 0.041), ('adaboost', 0.04), ('annual', 0.038), ('weighting', 0.038), ('boost', 0.038), ('query', 0.037), ('xs', 0.037), ('xi', 0.037), ('reweighting', 0.037), ('succeed', 0.037), ('theorem', 0.036), ('noiseless', 0.036), ('branching', 0.035), ('boosters', 0.034), ('domingo', 0.034), ('gopalan', 0.034), ('negated', 0.034), ('negation', 0.034), ('pendigits', 0.034), ('wi', 0.034), ('xm', 0.033), ('outputs', 0.032), ('examples', 0.031), ('oracle', 0.031), ('stumps', 0.03), ('klivans', 0.03), ('yw', 0.03), ('corollaries', 0.03), ('servedio', 0.03), ('kearns', 0.029), ('ym', 0.029), ('yh', 0.028), ('arbitrary', 0.027), ('algorithm', 0.026), ('sup', 0.026), ('provably', 0.026), ('pima', 0.026), ('ys', 0.026), ('active', 0.025), ('epoch', 0.024), ('presence', 0.023), ('datasets', 0.023), ('decreases', 0.023), ('formal', 0.023), ('pac', 0.022), ('guarantees', 0.022), ('poly', 0.022), ('fractional', 0.022), ('calls', 0.022), ('preliminary', 0.021), ('lt', 0.021), ('returned', 0.02), ('german', 0.02), ('return', 0.02), ('learners', 0.02), ('additive', 0.019), ('everywhere', 0.019), ('nds', 0.018), ('long', 0.018), ('theory', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 193 nips-2009-Potential-Based Agnostic Boosting

Author: Varun Kanade, Adam Kalai

Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1

2 0.13703313 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

Author: Liwei Wang

Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have advantage. In this paper, we propose intuitively reasonable sufficient conditions under which agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.

3 0.13445398 47 nips-2009-Boosting with Spatial Regularization

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

4 0.079084896 166 nips-2009-Noisy Generalized Binary Search

Author: Robert Nowak

Abstract: This paper addresses the problem of noisy Generalized Binary Search (GBS). GBS is a well-known greedy algorithm for determining a binary-valued hypothesis through a sequence of strategically selected queries. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. GBS is used in many applications, including fault testing, machine diagnostics, disease diagnosis, job scheduling, image processing, computer vision, and active learning. In most of these cases, the responses to queries can be noisy. Past work has provided a partial characterization of GBS, but existing noise-tolerant versions of GBS are suboptimal in terms of query complexity. This paper presents an optimal algorithm for noisy GBS and demonstrates its application to learning multidimensional threshold functions. 1

5 0.067848466 129 nips-2009-Learning a Small Mixture of Trees

Author: M. P. Kumar, Daphne Koller

Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.

6 0.063556761 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

7 0.060631584 76 nips-2009-Efficient Learning using Forward-Backward Splitting

8 0.054342169 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

9 0.050745614 14 nips-2009-A Parameter-free Hedging Algorithm

10 0.047843844 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

11 0.047722273 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

12 0.046477981 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases

13 0.043152187 24 nips-2009-Adapting to the Shifting Intent of Search Queries

14 0.042750634 72 nips-2009-Distribution Matching for Transduction

15 0.041138548 71 nips-2009-Distribution-Calibrated Hierarchical Classification

16 0.04066683 136 nips-2009-Learning to Rank by Optimizing NDCG Measure

17 0.039752305 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems

18 0.038470801 157 nips-2009-Multi-Label Prediction via Compressed Sensing

19 0.038318411 260 nips-2009-Zero-shot Learning with Semantic Output Codes

20 0.037909485 190 nips-2009-Polynomial Semantic Indexing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.129), (1, 0.077), (2, -0.003), (3, 0.028), (4, -0.011), (5, 0.017), (6, -0.058), (7, -0.051), (8, -0.036), (9, 0.031), (10, -0.06), (11, -0.013), (12, -0.032), (13, -0.047), (14, 0.043), (15, 0.014), (16, 0.054), (17, 0.014), (18, 0.062), (19, -0.039), (20, 0.102), (21, -0.074), (22, -0.084), (23, 0.005), (24, 0.026), (25, -0.088), (26, -0.039), (27, 0.004), (28, -0.011), (29, -0.06), (30, -0.138), (31, -0.158), (32, 0.101), (33, 0.048), (34, -0.024), (35, -0.028), (36, -0.052), (37, -0.146), (38, -0.024), (39, -0.041), (40, -0.027), (41, 0.102), (42, -0.09), (43, -0.15), (44, 0.045), (45, -0.17), (46, -0.005), (47, -0.004), (48, 0.035), (49, -0.118)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94096118 193 nips-2009-Potential-Based Agnostic Boosting

Author: Varun Kanade, Adam Kalai

Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1

2 0.74678373 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

Author: Liwei Wang

Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have advantage. In this paper, we propose intuitively reasonable sufficient conditions under which agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.

3 0.60336149 47 nips-2009-Boosting with Spatial Regularization

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

4 0.51053101 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

Author: Massih Amini, Nicolas Usunier, Cyril Goutte

Abstract: We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off. 1

5 0.48820519 24 nips-2009-Adapting to the Shifting Intent of Search Queries

Author: Umar Syed, Aleksandrs Slivkins, Nina Mishra

Abstract: Search engines today present results that are often oblivious to recent shifts in intent. For example, the meaning of the query ‘independence day’ shifts in early July to a US holiday and to a movie around the time of the box office release. While no studies exactly quantify the magnitude of intent-shifting traffic, studies suggest that news events, seasonal topics, pop culture, etc account for 1/2 the search queries. This paper shows that the signals a search engine receives can be used to both determine that a shift in intent happened, as well as find a result that is now more relevant. We present a meta-algorithm that marries a classifier with a bandit algorithm to achieve regret that depends logarithmically on the number of query impressions, under certain assumptions. We provide strong evidence that this regret is close to the best achievable. Finally, via a series of experiments, we demonstrate that our algorithm outperforms prior approaches, particularly as the amount of intent-shifting traffic increases. 1

6 0.45176768 71 nips-2009-Distribution-Calibrated Hierarchical Classification

7 0.44340706 166 nips-2009-Noisy Generalized Binary Search

8 0.43224639 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition

9 0.41919342 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

10 0.37894863 72 nips-2009-Distribution Matching for Transduction

11 0.37406424 14 nips-2009-A Parameter-free Hedging Algorithm

12 0.36982179 129 nips-2009-Learning a Small Mixture of Trees

13 0.3614983 234 nips-2009-Streaming k-means approximation

14 0.33553791 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

15 0.33473849 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

16 0.32546866 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

17 0.32495326 54 nips-2009-Compositionality of optimal control laws

18 0.30709797 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

19 0.30590788 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

20 0.29806679 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.011), (15, 0.296), (24, 0.094), (25, 0.054), (35, 0.032), (36, 0.145), (39, 0.037), (58, 0.073), (71, 0.046), (81, 0.024), (86, 0.061), (91, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77916485 193 nips-2009-Potential-Based Agnostic Boosting

Author: Varun Kanade, Adam Kalai

Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1

2 0.76475847 135 nips-2009-Learning to Hash with Binary Reconstructive Embeddings

Author: Brian Kulis, Trevor Darrell

Abstract: Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques. 1

3 0.57190186 129 nips-2009-Learning a Small Mixture of Trees

Author: M. P. Kumar, Daphne Koller

Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.

4 0.57156628 239 nips-2009-Submodularity Cuts and Applications

Author: Yoshinobu Kawahara, Kiyohito Nagano, Koji Tsuda, Jeff A. Bilmes

Abstract: Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization. We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint — the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. It is well known that this problem is NP-hard, and the approximation factor achieved by the greedy algorithm is the theoretical limit for polynomial time. As for (non-polynomial time) exact algorithms that perform reasonably in practice, there has been very little in the literature although the problem is quite important for many applications. Our algorithm is guaranteed to find the exact solution finitely many iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. Moreover, we also provide a method that produces successively decreasing upper-bounds of the optimal solution, while our algorithm provides successively increasing lower-bounds. Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met. We evaluate our algorithm on sensor placement and feature selection applications showing good performance. 1

5 0.56826866 72 nips-2009-Distribution Matching for Transduction

Author: Novi Quadrianto, James Petterson, Alex J. Smola

Abstract: Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach. 1

6 0.56720531 166 nips-2009-Noisy Generalized Binary Search

7 0.56425112 180 nips-2009-On the Convergence of the Concave-Convex Procedure

8 0.56337553 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

9 0.5633229 76 nips-2009-Efficient Learning using Forward-Backward Splitting

10 0.56285679 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

11 0.56239462 220 nips-2009-Slow Learners are Fast

12 0.5603689 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

13 0.55963737 122 nips-2009-Label Selection on Graphs

14 0.55881864 45 nips-2009-Beyond Convexity: Online Submodular Minimization

15 0.55836856 157 nips-2009-Multi-Label Prediction via Compressed Sensing

16 0.5578866 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

17 0.55769157 71 nips-2009-Distribution-Calibrated Hierarchical Classification

18 0.55759043 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

19 0.5546611 260 nips-2009-Zero-shot Learning with Semantic Output Codes

20 0.55446768 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization