nips nips2009 nips2009-193 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Varun Kanade, Adam Kalai
Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in the presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al., 2008) and agnostically learning halfspaces (Kalai et al., 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). [sent-4, score-0.335]
2 Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in the presence of arbitrary noise. [sent-5, score-0.475]
3 A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. [sent-6, score-0.215]
4 Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al., 2008) and agnostically learning halfspaces (Kalai et al., 2005). [sent-7, score-0.977]
5 Aggressive reweighting of data may lead to poor performance in the presence of certain types of noise [1]. [sent-10, score-0.102]
6 This has been addressed by a number of “robust” boosting algorithms, such as SmoothBoost [2, 3] and MadaBoost [4] as well as boosting by branching programs [5, 6]. [sent-11, score-0.705]
7 The present work gives a simple potential-based boosting algorithm with guarantees in the (arbitrary noise) agnostic learning setting [8, 9]. [sent-15, score-0.786]
8 A unique feature of our algorithm, illustrated in Figure 1, is that it does not alter the distribution on unlabeled examples but rather it alters the labels. [sent-16, score-0.09]
9 This enables us to prove a strong boosting theorem in which the weak learner need only succeed for one distribution on unlabeled examples. [sent-17, score-0.83]
10 To the best of our knowledge, earlier weak-to-strong boosting theorems have always relied on the ability of the weak learner to succeed under arbitrary distributions. [sent-18, score-0.762]
11 The utility of our boosting theorem is demonstrated by re-deriving two non-trivial results in computational learning theory, namely agnostically learning decision trees [10] and agnostically learning halfspaces [11], which were previously solved using very different techniques. [sent-19, score-0.977]
12 The main contributions of this paper are, first, giving the first provably noise-tolerant analysis of a potential-based boosting algorithm, and, second, giving a distribution-specific boosting theorem that does not require the weak learner to learn over all distributions on x ∈ X. [sent-20, score-1.095]
13 This is in contrast to recent work by Long and Servedio, showing that convex potential boosters cannot work in the presence of random classification noise [12]. [sent-21, score-0.143]
14 First, the algorithm has the possibility of negating the current hypothesis and hence is not technically a standard potential-based boosting algorithm. [sent-23, score-0.463]
15 Second, weak agnostic learning is more challenging than weak learning with random classification noise, in the sense that an algorithm which is a weak-learner in the random classification noise setting need not be a weak-learner in the agnostic setting. [sent-24, score-1.292]
16 There is a substantial literature on robust boosting algorithms, including the algorithms already mentioned, MadaBoost and SmoothBoost, as well as LogitBoost [13], BrownBoost [14], and NadaBoost. [sent-26, score-0.335]
[Figure 1: Simplified Boosting by Relabeling Procedure. Inputs: (x_1, y_1), ..., (x_m, y_m) ∈ X × {−1, 1}, T ≥ 1, and weak learner W. Set H^0 ≡ 0. For t = 1, ..., T: (a) for i = 1, ..., m, set w_i^t = min{1, exp(−H^{t−1}(x_i) y_i)}, and with probability w_i^t set ỹ_i^t = y_i, otherwise pick ỹ_i^t ∈ {−1, 1} uniformly at random; (b) g^t = W((x_1, ỹ_1^t), ..., (x_m, ỹ_m^t)); (c) h^t = argmax_{g ∈ {g^t, −sign(H^{t−1})}} (1/m) Σ_i w_i^t y_i g(x_i) /* possibly take negated hypothesis */; (d) γ^t = (1/m) Σ_{i=1}^m w_i^t y_i h^t(x_i); (e) H^t(x) = H^{t−1}(x) + γ^t h^t(x). Output a hypothesis of the form sign(H^t).]
20 Each epoch, the algorithm runs the weak learner on the relabeled data (x_i, ỹ_i^t), i = 1, ..., m. [sent-47, score-0.534]
21 In traditional boosting, on each epoch, H^t is a linear combination of weak hypotheses. [sent-48, score-0.209]
22 For our agnostic analysis, we also need to include the negated current hypothesis, −sign(H^{t−1}) : X → {−1, 1}, as a possible weak classifier. [sent-49, score-0.646]
23 ∗ In practice, to avoid adding noise, each example would be replaced with three weighted examples: (x_i, y_i) with weight w_i^t, and (x_i, ±1) each with weight (1 − w_i^t)/2. [sent-50, score-0.117]
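To make the relabeling step concrete, the following is a minimal Python sketch of the simplified loop of Figure 1; the function name, the weak_learner interface, and the sign(0) convention are our own assumptions, not part of the paper.

```python
import numpy as np

def boost_by_relabeling(X, y, weak_learner, T, seed=0):
    """Sketch of the simplified boosting-by-relabeling loop (Figure 1).

    X: (m, d) array of examples; y: (m,) array of labels in {-1, +1}.
    weak_learner: callable (X, labels) -> h, where h(X) returns values in {-1, +1}.
    Returns the in-sample scores H^T(x_i); a full implementation would also store
    the chosen hypotheses so that new points can be classified by sign(H^T).
    """
    rng = np.random.default_rng(seed)
    m = len(y)
    H = np.zeros(m)                                    # H^0 = 0 on the sample
    for t in range(T):
        # (a) conservative weights w_i^t = min{1, exp(-y_i H^{t-1}(x_i))}, then relabel
        w = np.exp(-np.clip(H * y, 0.0, None))
        keep = rng.random(m) < w
        y_tilde = np.where(keep, y, rng.choice([-1, 1], size=m))
        # (b) run the weak learner on the relabeled sample
        g = weak_learner(X, y_tilde)
        g_pred = np.asarray(g(X), dtype=float)
        # (c) also consider the negated current hypothesis -sign(H^{t-1})
        neg_pred = -np.sign(H)
        neg_pred[neg_pred == 0] = -1.0                 # arbitrary convention for sign(0)
        corr_g = np.mean(w * y * g_pred)
        corr_neg = np.mean(w * y * neg_pred)
        # (d)-(e) step size gamma^t = weighted correlation of the chosen hypothesis
        gamma, pred = (corr_g, g_pred) if corr_g >= corr_neg else (corr_neg, neg_pred)
        H = H + gamma * pred
    return H
```

The fractional variant mentioned in the footnote replaces the random relabeling with weighted copies of each example; a sketch of that appears in the experiments section below.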
24 These are all simple boosting algorithms whose output is a weighted majority of classifiers. [sent-52, score-0.335]
25 Many have been shown to have formal boosting properties (weak to strong PAC-learning) in a noiseless setting, or partial boosting properties in noisy settings. [sent-53, score-0.729]
26 There has also been a line of work on boosting algorithms that provably boost from weak to strong learners either under agnostic or random classification noise, using branching programs [17, 20, 5, 21, 6]. [sent-54, score-1.066]
27 Second, since we don’t change the distribution over unlabeled examples, we can boost distribution-specific weak learners. [sent-57, score-0.306]
28 In recent work, using a similar idea of relabeling, Kalai, Kanade and Mansour [22] proved that the class of DNFs is learnable in a one-sided error agnostic learning model. [sent-58, score-0.403]
29 The main differences are: (1) there is the possibility of using the negation of the current hypothesis at each step, (2) examples are relabeled rather than reweighted, and (3) the step size is slightly different. [sent-62, score-0.263]
30 1 Preliminaries In the agnostic setting, we consider learning with respect to a distribution over X × Y. [sent-67, score-0.403]
31 The error and correlation of a classifier h : X → {−1, 1} with respect to D are err(h, D) = Pr_{(x,y)∼D}[h(x) ≠ y] and cor(h, D) = E_{(x,y)∼D}[h(x) y]; the latter quantity is typically referred to as the edge in the boosting literature. [sent-74, score-0.335]
32 A different definition of a weakly accurate classifier is appropriate in the agnostic setting. [sent-78, score-0.403]
33 Namely, for some γ ∈ (0, 1), h : X → {−1, 1} is said to be γ-optimal for C (and D) if cor(h, D) ≥ γ · cor(C, D), where cor(C, D) = sup_{c∈C} cor(c, D). Hence, if the labels are totally random, then a weak hypothesis need not have any correlation over random guessing. [sent-79, score-0.371]
34 The goal is to boost from an algorithm capable of outputting γ-optimal hypotheses to one which outputs a nearly 1-optimal hypothesis, even for small γ. [sent-81, score-0.156]
35 We now define the distribution D relabeled by w, denoted R_{D,w}. [sent-84, score-0.096]
36 Procedurally, one can think of generating a sample from R_{D,w} by drawing an example (x, y) from D, then with probability w(x, y), outputting (x, y) directly, and with probability 1 − w(x, y), outputting (x, y′) where y′ is uniformly random in {−1, 1}. [sent-85, score-0.120]
37 Formally, R_{D,w}(x, y) = D(x, y) (w(x, y) + (1 − w(x, y))/2) + D(x, −y) (1 − w(x, −y))/2. Note that D and R_{D,w} have the same marginal distributions over unlabeled examples x ∈ X. [sent-86, score-0.090]
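The procedural description above translates directly into a sampler. Here is a small sketch under the assumption that D_sampler draws a single (x, y) pair from D and w is the weighting function; both names are ours.

```python
import numpy as np

def sample_relabeled(D_sampler, w, n, seed=0):
    """Draw n examples from R_{D,w}: keep the drawn label with probability w(x, y),
    otherwise replace it by a uniformly random sign.  The x-marginal is untouched,
    so D and R_{D,w} agree on unlabeled examples."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n):
        x, y = D_sampler(rng)
        if rng.random() >= w(x, y):
            y = rng.choice([-1, 1])
        out.append((x, y))
    return out
```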
38 A general (m, q)-learning algorithm is given m unlabeled examples x_1, ..., x_m, and may make q label queries to a query oracle L : X → {−1, 1}; it outputs a classifier h : X → {−1, 1}. [sent-93, score-0.193]
39 The queries may be active, meaning that queries may only be made to training examples x_i, or membership queries, meaning that arbitrary examples x ∈ X may be queried. [sent-94, score-0.354]
41 Since our boosting procedure does not change the distribution over unlabeled examples, it offers two advantages: (1) Agnostic weak learning may be defined with respect to a single distribution µ over unlabeled examples, and (2) The weak learning algorithms may be active (or use membership queries). [sent-97, score-0.944]
41 In particular, the agnostic weak learning hypothesis for C and µ is that for any f : X → [−1, 1], given examples from D = ⟨µ, f⟩, the learner will output a γ-optimal classifier for C. [sent-98, score-0.899]
43 The advantages of this new definition are: (a) it is not with respect to every distribution on unlabeled examples (the algorithm may only have guarantees for certain distributions), and (b) it is more realistic as it does not assume noiseless data. [sent-99, score-0.174]
43 Finding such a weak learner may be quite challenging since it has to succeed in the agnostic model (where no assumption is made on f); however, it may be a bit easier in the sense that the learning algorithm need only handle one particular µ. [sent-100, score-0.829]
45 2 Formal boosting procedure and main results The formal boosting procedure we analyze is given in Figure 2. [sent-109, score-0.693]
45 [Figure 2, algorithm AgnosticBooster, steps (b)-(e): (b) g^t = W(..., x_{s+tm}, L^t), i.e., the weak learner is run on a fresh batch of unlabeled examples with the label oracle L^t; (c) α^t = (1/s) Σ_{i=1}^s y_i g^t(x_i) w^t(x_i, y_i) and β^t = (1/s) Σ_{i=1}^s y_i (−sign(H^{t−1}(x_i))) w^t(x_i, y_i); (d) if α^t ≥ β^t, set h^t = g^t and γ^t = α^t, else set h^t = −sign(H^{t−1}) and γ^t = β^t; (e) H^t(x) = H^{t−1}(x) + γ^t h^t(x).] [sent-131, score-0.479]
46 If W is a (γ, ε₀, δ) weak learner with respect to C and µ, s = (200/(γ²ε²)) log(1/δ), and T = 9/(γ²ε²), then Algorithm AgnosticBooster (Figure 2), with probability at least 1 − 4δT, outputs a hypothesis h satisfying cor(h, D) ≥ cor(C, D) − ε − ε₀/γ. Recall that ε₀ is intended to be very small. [sent-137, score-0.497]
48 Also note that the number of calls to the query oracle L is s plus T times the number of calls made by the weak learner (if the weak learner is active, then so is the boosting algorithm). [sent-140, score-1.173]
49 Agnostically learning decision trees and agnostically learning halfspaces follow as corollaries to Theorem 1. [sent-142, score-0.636]
50 Let C be the class of binary decision trees on {−1, 1}^n with at most t leaves, and let U be the uniform distribution on {−1, 1}^n. [sent-144, score-0.112]
51 There exists an algorithm that, when given t, n, ε, δ > 0, and a label oracle for an arbitrary f : {−1, 1}^n → [−1, 1], makes q = poly(nt/(εδ)) membership queries and, with probability ≥ 1 − δ, outputs h : {−1, 1}^n → {−1, 1} such that for U_f = ⟨U, f⟩, err(h, U_f) ≤ err(C, U_f) + ε. [sent-145, score-0.224]
52 For any fixed ε > 0, there exists a univariate polynomial p such that the following holds: Let n ≥ 1, C be the class of halfspaces in n dimensions, let U be the uniform distribution on {−1, 1}^n, and f : {−1, 1}^n → [−1, 1] be an arbitrary function. [sent-147, score-0.143]
53 Note that a related theorem was shown for halfspaces over log-concave distributions over X = R^n. [sent-150, score-0.152]
54 The boosting approach here similarly generalizes to that case in a straightforward manner. [sent-151, score-0.335]
55 This illustrates how, from the point of view of designing provably efficient agnostic learning algorithms, the current boosting procedure may be useful. [sent-152, score-0.764]
56 As is standard, the boosting algorithm can be viewed as minimizing a convex potential function. [sent-154, score-0.405]
57 We show that for a conservative reweighting, either the weak learner will make progress, returning a hypothesis correlated with the relabeled distribution, or −sign(H^{t−1}) will be correlated with the relabeled distribution. [sent-158, score-0.709]
58 Second, if we find a hypothesis correlated with the relabeled distribution, then the potential on round t will be noticeably lower than that of round t − 1. [sent-159, score-0.4]
59 The potential function we consider, as in MadaBoost, is defined by φ : R → R, φ(z) = 1 − z if z ≤ 0 and φ(z) = e^{−z} if z > 0. Define the potential of a (real-valued) hypothesis H with respect to a distribution D over X × {−1, 1} as Φ(H, D) = E_{(x,y)∼D}[φ(yH(x))] (2). Note that Φ(H^0, D) = Φ(0, D) = 1. [sent-163, score-0.190]
60 We will show that the potential decreases on every round of the algorithm. [sent-164, score-0.146]
61 Notice that the weights in the boosting algorithm correspond to the (negated) derivative of the potential, because −φ′(z) = min{1, exp(−z)} ∈ [0, 1]. [sent-165, score-0.361]
62 In other words, the weak learning step is essentially a gradient descent step. [sent-166, score-0.209]
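To make the correspondence explicit, here is a small sketch of the potential and of its negated derivative, which is exactly the relabeling weight; the function names are ours.

```python
import numpy as np

def phi(z):
    """MadaBoost-style potential: phi(z) = 1 - z if z <= 0, exp(-z) if z > 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, 1.0 - z, np.exp(-np.clip(z, 0.0, None)))

def weight(z):
    """-phi'(z) = min{1, exp(-z)}; with z = y * H(x) this is the relabeling weight."""
    z = np.asarray(z, dtype=float)
    return np.exp(-np.clip(z, 0.0, None))   # equals 1 for z <= 0, exp(-z) for z > 0

# For a sample with margins z_i = y_i * H(x_i):
#   empirical potential  Phi(H, D) ~= phi(z).mean()
#   example weights      w_i = weight(z_i)
```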
63 We next state a key fact about agnostic learning in Lemma 1. [sent-167, score-0.403]
64 A weighting function w : X × {−1, 1} → [0, 1] is called conservative for a hypothesis h if w(x, −h(x)) = 1 for all x ∈ X. [sent-170, score-0.090]
65 Note that, if the hypothesis is sign(H^t(x)), then a weighting function defined by −φ′(H^t(x) y) is conservative if and only if φ′(z) = −1 for all z < 0. [sent-171, score-0.192]
66 We first show that relabeling according to a conservative weighting function is good in the sense that, if h is far from optimal according to the original distribution, then after relabeling by w it is even further from optimal. [sent-172, score-0.394]
67 We will use Lemma 1 to show that the weak learner will return a useful hypothesis. [sent-182, score-0.383]
68 The case in which the weak learner may not return a useful hypothesis is when cor(C, R_{D,w}) = 0, that is, when the optimal classifier on the reweighted distribution has no correlation. [sent-183, score-0.527]
69 This can happen, but in this case it means that either our current hypothesis is close to optimal, or h = sign(H^{t−1}) is even worse than random guessing, and hence we can use its negation as a weak agnostic learner. [sent-184, score-0.748]
70 Let h^t : X → {−1, 1} be the weak hypothesis that the algorithm finds on round t. [sent-191, score-0.543]
71 This may either be the hypothesis returned by the weak learner W or −sign(H^{t−1}). [sent-192, score-0.485]
72 The following lemma lower bounds the decrease in potential caused by adding γ^t h^t to H^{t−1}. [sent-193, score-0.235]
73 We will apply the following lemma on each round of the algorithm to show that the potential decreases on each round, as long as the weak hypothesis h^t has non-negligible correlation and γ^t is suitably chosen. [sent-194, score-0.688]
74 Consider any function H : X → R, hypothesis h : X → [−1, 1], γ ∈ R, and distribution D over X × {−1, 1}. [sent-196, score-0.102]
75 Let D′ = R_{D,w} be the distribution D relabeled by w(x, y) = −φ′(yH(x)). [sent-197, score-0.096]
76 Using all the above lemmas, we will show that the algorithm AgnosticBooster returns a hypothesis with correlation (or error) close to that of the best classifier from C. [sent-203, score-0.188]
77 Let g^t be the hypothesis returned by the weak learner W. [sent-207, score-0.485]
78 Consider two cases: first, that cor(c, R_{D,w^t}) ≥ ε₀/γ + ε/2; in this case, by the weak learning assumption, cor(g^t, R_{D,w^t}) ≥ γε/2. [sent-211, score-0.497]
79 This guarantees that when the algorithm is run for T rounds, on some round t the hypothesis sign(H^t) will have correlation at least sup_{c∈C} cor(c, D) − ε₀/γ − 2ε/3. [sent-217, score-0.315]
80 For s = (200/(γ²ε²)) log(1/δ), the empirical estimate of the correlation of the constructed hypothesis on each round is within an additive ε/6 of its true correlation, allowing a further failure probability of δ each round. [sent-218, score-0.260]
81 Thus the final hypothesis H^τ, which has the highest empirical correlation, satisfies cor(H^τ, D) ≥ sup_{c∈C} cor(c, D) − ε₀/γ − ε. Since there is a failure probability of at most 4δ on each round, the algorithm succeeds with probability at least 1 − 4Tδ. [sent-219, score-0.214]
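The final selection step can be sketched as follows; this simplification evaluates all rounds on a single evaluation sample, whereas the analysis above uses fresh examples and allows an additive ε/6 estimation error per round. The names are ours.

```python
import numpy as np

def best_round(H_per_round, y):
    """Return the round tau whose thresholded hypothesis sign(H^t) has the highest
    empirical correlation with the labels y (in {-1, +1}) on an evaluation sample.
    H_per_round[t] holds the values H^t(x_i) on that sample."""
    corrs = []
    for H in H_per_round:
        pred = np.sign(np.asarray(H, dtype=float))
        pred[pred == 0] = 1.0                 # arbitrary tie-break for sign(0)
        corrs.append(float(np.mean(pred * y)))
    tau = int(np.argmax(corrs))
    return tau, corrs[tau]
```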
82 4 Applications We show that recent agnostic learning analyses can be dramatically simplified using our boosting algorithm. [sent-220, score-0.738]
83 Both of the agnostic algorithms are distribution-specific, meaning that they only work for one distribution (or a family of distributions) µ over unlabeled examples. [sent-221, score-0.462]
84 1 Agnostically Learning Decision Trees Recent work has shown how to agnostically learn polynomial-sized decision trees using membership queries, by an L1 gradient-projection algorithm [10]. [sent-223, score-0.375]
85 Here, we show that learning decision trees is quite simple using our distribution-specific boosting theorem and the Kushilevitz-Mansour membership query parity learning algorithm as a weak learner [24]. [sent-224, score-0.998]
86 Running the KM algorithm, using q = poly(n, t, 1/ε₀) queries, and outputting the parity with the largest magnitude of estimated Fourier coefficient, is a (γ = 1/t, ε₀) agnostic weak learner for size-t decision trees over the uniform distribution. [sent-226, score-0.979]
87 2 Agnostically Learning Halfspaces In the case of learning halfspaces, the weak learner simply finds the degree-d term χ_S(x), with |S| ≤ d, with the greatest empirical correlation (1/m) Σ_{i=1}^m χ_S(x_i) y_i on a data set (x_1, y_1), ..., (x_m, y_m). [sent-230, score-0.423]
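A brute-force rendering of this weak learner simply enumerates all monomials of degree at most d; the sketch below is ours, and it negates the chosen parity when its correlation is negative so that the returned classifier has nonnegative empirical correlation.

```python
import numpy as np
from itertools import combinations

def best_low_degree_parity(X, y, d):
    """Find the parity chi_S(x) = prod_{j in S} x_j with |S| <= d whose empirical
    correlation (1/m) sum_i chi_S(x_i) y_i has the largest magnitude.
    X: (m, n) array with entries in {-1, +1}; y: (m,) array of +/-1 labels.
    Cost is O(n^d * m), so this is only practical for small d."""
    m, n = X.shape
    best_S, best_corr = [], float(np.mean(y))          # empty set S: chi_S = 1
    for size in range(1, d + 1):
        for S in combinations(range(n), size):
            corr = float(np.mean(np.prod(X[:, list(S)], axis=1) * y))
            if abs(corr) > abs(best_corr):
                best_S, best_corr = list(S), corr
    flip = 1.0 if best_corr >= 0 else -1.0
    return lambda Z: flip * np.prod(np.asarray(Z)[:, best_S], axis=1)
```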
88 Let n ≥ 1, C be the class of halfspaces in n dimensions, let U be the uniform distribution on {−1, 1}^n, and f : {−1, 1}^n → [−1, 1] be an arbitrary function. [sent-237, score-0.143]
89 5 Experiments We performed preliminary experiments with the new boosting algorithm presented here on 8 datasets from the UCI repository [26]. [sent-240, score-0.405]
90 We converted multi-class problems into binary classification problems by arbitrarily grouping classes, and ran Adaboost, Madaboost and Agnostic Boost on these datasets, using stumps as weak learners. [sent-241, score-0.239]
91 Since stumps can accept weighted examples, we passed the exact weighted distribution to the weak learner. [sent-242, score-0.239]
92 Rather than keeping the label with probability w^t(x, y) and making it completely random with the remaining probability, we added both (x, y) and (x, −y) with weights (1 + w^t(x, y))/2 and (1 − w^t(x, y))/2, respectively. [sent-244, score-0.213]
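This fractional relabeling amounts to duplicating the sample with suitable weights; a minimal sketch (names ours), usable with any weak learner that accepts per-example weights:

```python
import numpy as np

def fractional_relabel(X, y, w):
    """Replace each (x_i, y_i) by (x_i, +y_i) with weight (1 + w_i)/2 and
    (x_i, -y_i) with weight (1 - w_i)/2, where w_i = w^t(x_i, y_i)."""
    X2 = np.vstack([X, X])
    y2 = np.concatenate([y, -y])
    weights = np.concatenate([(1.0 + w) / 2.0, (1.0 - w) / 2.0])
    return X2, y2, weights
```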
93 Experiments showed that random relabeling performs much worse than this fractional relabeling. [sent-245, score-0.326]
94 We also added random classification noise of 5%, 10% and 20% to the datasets and ran the boosting algorithms on the modified datasets. [sent-250, score-0.400]
95 6 Conclusion We show that potential-based agnostic boosting is possible in theory, and also that this may be done without changing the distribution over unlabeled examples. [sent-347, score-0.797]
96 We show that non-trivial agnostic learning results, for learning decision trees and halfspaces, can be viewed as simple applications of our boosting theorem combined with well-known weak learners. [sent-348, score-1.095]
97 Preliminary experiments show that the performance of our boosting algorithm is comparable to that of Madaboost and Adaboost. [sent-350, score-0.361]
98 A more thorough empirical evaluation of our boosting procedure using different weak learners is part of future research. [sent-351, score-0.564]
99 Improvement of boosting algorithm by modifying the weighting rule. [sent-442, score-0.399]
100 An empirical comparison of three boosting algorithms on real data sets with artificial class noise. [sent-467, score-0.335]
wordName wordTfidf (topN-words)
[('cor', 0.551), ('agnostic', 0.403), ('boosting', 0.335), ('weak', 0.209), ('agnostically', 0.189), ('madaboost', 0.172), ('learner', 0.154), ('relabeling', 0.152), ('ht', 0.127), ('uf', 0.12), ('err', 0.116), ('halfspaces', 0.116), ('hypothesis', 0.102), ('sign', 0.101), ('kalai', 0.101), ('relabeled', 0.096), ('round', 0.079), ('wt', 0.071), ('ada', 0.069), ('agn', 0.069), ('mada', 0.069), ('ooster', 0.069), ('lemma', 0.064), ('trees', 0.062), ('mansour', 0.061), ('outputting', 0.06), ('queries', 0.06), ('correlation', 0.06), ('unlabeled', 0.059), ('er', 0.054), ('conservative', 0.052), ('smoothboost', 0.052), ('classi', 0.05), ('decision', 0.05), ('yi', 0.049), ('membership', 0.048), ('kanade', 0.045), ('potential', 0.044), ('noise', 0.042), ('reweighted', 0.042), ('parity', 0.041), ('adaboost', 0.04), ('annual', 0.038), ('weighting', 0.038), ('boost', 0.038), ('query', 0.037), ('xs', 0.037), ('xi', 0.037), ('reweighting', 0.037), ('succeed', 0.037), ('theorem', 0.036), ('noiseless', 0.036), ('branching', 0.035), ('boosters', 0.034), ('domingo', 0.034), ('gopalan', 0.034), ('negated', 0.034), ('negation', 0.034), ('pendigits', 0.034), ('wi', 0.034), ('xm', 0.033), ('outputs', 0.032), ('examples', 0.031), ('oracle', 0.031), ('stumps', 0.03), ('klivans', 0.03), ('yw', 0.03), ('corollaries', 0.03), ('servedio', 0.03), ('kearns', 0.029), ('ym', 0.029), ('yh', 0.028), ('arbitrary', 0.027), ('algorithm', 0.026), ('sup', 0.026), ('provably', 0.026), ('pima', 0.026), ('ys', 0.026), ('active', 0.025), ('epoch', 0.024), ('presence', 0.023), ('datasets', 0.023), ('decreases', 0.023), ('formal', 0.023), ('pac', 0.022), ('guarantees', 0.022), ('poly', 0.022), ('fractional', 0.022), ('calls', 0.022), ('preliminary', 0.021), ('lt', 0.021), ('returned', 0.02), ('german', 0.02), ('return', 0.02), ('learners', 0.02), ('additive', 0.019), ('everywhere', 0.019), ('nds', 0.018), ('long', 0.018), ('theory', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 193 nips-2009-Potential-Based Agnostic Boosting
Author: Varun Kanade, Adam Kalai
Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1
2 0.13703313 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have advantage. In this paper, we propose intuitively reasonable sufficient conditions under which agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.
3 0.13445398 47 nips-2009-Boosting with Spatial Regularization
Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang
Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1
4 0.079084896 166 nips-2009-Noisy Generalized Binary Search
Author: Robert Nowak
Abstract: This paper addresses the problem of noisy Generalized Binary Search (GBS). GBS is a well-known greedy algorithm for determining a binary-valued hypothesis through a sequence of strategically selected queries. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. GBS is used in many applications, including fault testing, machine diagnostics, disease diagnosis, job scheduling, image processing, computer vision, and active learning. In most of these cases, the responses to queries can be noisy. Past work has provided a partial characterization of GBS, but existing noise-tolerant versions of GBS are suboptimal in terms of query complexity. This paper presents an optimal algorithm for noisy GBS and demonstrates its application to learning multidimensional threshold functions. 1
5 0.067848466 129 nips-2009-Learning a Small Mixture of Trees
Author: M. P. Kumar, Daphne Koller
Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.
6 0.063556761 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
7 0.060631584 76 nips-2009-Efficient Learning using Forward-Backward Splitting
8 0.054342169 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
9 0.050745614 14 nips-2009-A Parameter-free Hedging Algorithm
10 0.047843844 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors
11 0.047722273 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
12 0.046477981 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
13 0.043152187 24 nips-2009-Adapting to the Shifting Intent of Search Queries
14 0.042750634 72 nips-2009-Distribution Matching for Transduction
15 0.041138548 71 nips-2009-Distribution-Calibrated Hierarchical Classification
16 0.04066683 136 nips-2009-Learning to Rank by Optimizing NDCG Measure
17 0.039752305 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
18 0.038470801 157 nips-2009-Multi-Label Prediction via Compressed Sensing
19 0.038318411 260 nips-2009-Zero-shot Learning with Semantic Output Codes
20 0.037909485 190 nips-2009-Polynomial Semantic Indexing
topicId topicWeight
[(0, -0.129), (1, 0.077), (2, -0.003), (3, 0.028), (4, -0.011), (5, 0.017), (6, -0.058), (7, -0.051), (8, -0.036), (9, 0.031), (10, -0.06), (11, -0.013), (12, -0.032), (13, -0.047), (14, 0.043), (15, 0.014), (16, 0.054), (17, 0.014), (18, 0.062), (19, -0.039), (20, 0.102), (21, -0.074), (22, -0.084), (23, 0.005), (24, 0.026), (25, -0.088), (26, -0.039), (27, 0.004), (28, -0.011), (29, -0.06), (30, -0.138), (31, -0.158), (32, 0.101), (33, 0.048), (34, -0.024), (35, -0.028), (36, -0.052), (37, -0.146), (38, -0.024), (39, -0.041), (40, -0.027), (41, 0.102), (42, -0.09), (43, -0.15), (44, 0.045), (45, -0.17), (46, -0.005), (47, -0.004), (48, 0.035), (49, -0.118)]
simIndex simValue paperId paperTitle
same-paper 1 0.94096118 193 nips-2009-Potential-Based Agnostic Boosting
Author: Varun Kanade, Adam Kalai
Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1
2 0.74678373 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have advantage. In this paper, we propose intuitively reasonable sufficient conditions under which agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.
3 0.60336149 47 nips-2009-Boosting with Spatial Regularization
Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang
Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1
4 0.51053101 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
Author: Massih Amini, Nicolas Usunier, Cyril Goutte
Abstract: We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off. 1
5 0.48820519 24 nips-2009-Adapting to the Shifting Intent of Search Queries
Author: Umar Syed, Aleksandrs Slivkins, Nina Mishra
Abstract: Search engines today present results that are often oblivious to recent shifts in intent. For example, the meaning of the query ‘independence day’ shifts in early July to a US holiday and to a movie around the time of the box office release. While no studies exactly quantify the magnitude of intent-shifting traffic, studies suggest that news events, seasonal topics, pop culture, etc account for 1/2 the search queries. This paper shows that the signals a search engine receives can be used to both determine that a shift in intent happened, as well as find a result that is now more relevant. We present a meta-algorithm that marries a classifier with a bandit algorithm to achieve regret that depends logarithmically on the number of query impressions, under certain assumptions. We provide strong evidence that this regret is close to the best achievable. Finally, via a series of experiments, we demonstrate that our algorithm outperforms prior approaches, particularly as the amount of intent-shifting traffic increases. 1
6 0.45176768 71 nips-2009-Distribution-Calibrated Hierarchical Classification
7 0.44340706 166 nips-2009-Noisy Generalized Binary Search
8 0.43224639 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition
9 0.41919342 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
10 0.37894863 72 nips-2009-Distribution Matching for Transduction
11 0.37406424 14 nips-2009-A Parameter-free Hedging Algorithm
12 0.36982179 129 nips-2009-Learning a Small Mixture of Trees
13 0.3614983 234 nips-2009-Streaming k-means approximation
14 0.33553791 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains
15 0.33473849 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale
16 0.32546866 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors
17 0.32495326 54 nips-2009-Compositionality of optimal control laws
18 0.30709797 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines
19 0.30590788 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
20 0.29806679 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
topicId topicWeight
[(7, 0.011), (15, 0.296), (24, 0.094), (25, 0.054), (35, 0.032), (36, 0.145), (39, 0.037), (58, 0.073), (71, 0.046), (81, 0.024), (86, 0.061), (91, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.77916485 193 nips-2009-Potential-Based Agnostic Boosting
Author: Varun Kanade, Adam Kalai
Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1
2 0.76475847 135 nips-2009-Learning to Hash with Binary Reconstructive Embeddings
Author: Brian Kulis, Trevor Darrell
Abstract: Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques. 1
3 0.57190186 129 nips-2009-Learning a Small Mixture of Trees
Author: M. P. Kumar, Daphne Koller
Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.
4 0.57156628 239 nips-2009-Submodularity Cuts and Applications
Author: Yoshinobu Kawahara, Kiyohito Nagano, Koji Tsuda, Jeff A. Bilmes
Abstract: Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization. We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint — the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. It is well known that this problem is NP-hard, and the approximation factor achieved by the greedy algorithm is the theoretical limit for polynomial time. As for (non-polynomial time) exact algorithms that perform reasonably in practice, there has been very little in the literature although the problem is quite important for many applications. Our algorithm is guaranteed to find the exact solution finitely many iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. Moreover, we also provide a method that produces successively decreasing upper-bounds of the optimal solution, while our algorithm provides successively increasing lower-bounds. Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met. We evaluate our algorithm on sensor placement and feature selection applications showing good performance. 1
5 0.56826866 72 nips-2009-Distribution Matching for Transduction
Author: Novi Quadrianto, James Petterson, Alex J. Smola
Abstract: Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach. 1
6 0.56720531 166 nips-2009-Noisy Generalized Binary Search
7 0.56425112 180 nips-2009-On the Convergence of the Concave-Convex Procedure
8 0.56337553 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
9 0.5633229 76 nips-2009-Efficient Learning using Forward-Backward Splitting
10 0.56285679 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
11 0.56239462 220 nips-2009-Slow Learners are Fast
12 0.5603689 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
13 0.55963737 122 nips-2009-Label Selection on Graphs
14 0.55881864 45 nips-2009-Beyond Convexity: Online Submodular Minimization
15 0.55836856 157 nips-2009-Multi-Label Prediction via Compressed Sensing
16 0.5578866 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
17 0.55769157 71 nips-2009-Distribution-Calibrated Hierarchical Classification
18 0.55759043 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
19 0.5546611 260 nips-2009-Zero-shot Learning with Semantic Output Codes
20 0.55446768 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization