hunch_net hunch_net-2005 hunch_net-2005-45 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework. Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h w }: h w (x) = 1 if x > w, and 0 otherwise. VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (
sentIndex sentText sentNum sentScore
1 Often, unlabeled data is easy to come by but labels are expensive. [sent-1, score-0.626]
2 For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. [sent-2, score-0.847]
3 In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. [sent-3, score-1.199]
4 The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework. [sent-4, score-0.418]
5 Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h w }: h w (x) = 1 if x > w, and 0 otherwise. [sent-6, score-0.251]
6 But suppose we instead draw m unlabeled samples from P. [sent-8, score-0.543]
7 If we lay these points down on the line, their hidden labels are a sequence of 0′s followed by a sequence of 1′s, and the goal is to discover the point w at which the transition occurs. [sent-9, score-0.773]
8 This can be accomplished with a simple binary search which asks for just log m labels. [sent-10, score-0.187]
9 Thus active learning gives us an exponential improvement in the number of labels needed: by adaptively querying log m labels, we can automatically infer the rest of them. [sent-11, score-1.073]
10 To date, the single main theoretical result in the field is [FSST97] ‘s analysis of the query-by-committee (QBC) learning algorithm. [sent-13, score-0.167]
11 In their model, the learner observes a stream of unlabeled data and makes spot decisions about whether or not to ask for a point’s label. [sent-14, score-0.479]
12 They show that if the data is drawn uniformly from the surface of the d-dimensional unit sphere, and the hidden labels correspond perfectly to a homogeneous (i. [sent-15, score-0.844]
13 This remarkable result is tempered somewhat by the complexity of the QBC algorithm, which involves computing volumes of intermediate version spaces. [sent-18, score-0.195]
14 Some recent progress on active learning: [DKM05] show how a simple variant of the perceptron update can be used to achieve these same sample complexity bounds, in the same model. [sent-19, score-0.683]
15 [D04] shows a variety of upper and lower bounds for active learning — for instance, if you allow linear separators which are non-homogeneous then in the above model the sample complexity necessarily shoots up to 1/e . [sent-20, score-0.775]
16 The theoretical terrain of active learning remains something of an unexplored wilderness. [sent-21, score-0.296]
17 There has, however, been a lot of beautiful theory work (see [A02] for a roundup) on a related model in which the learner is allowed to synthesize query points, rather than simply choosing them from the pool of unlabeled data. [sent-22, score-0.574]
18 This ran into some practical problems: [BL92] found that the resulting synthetic instances were often very difficult for a human to classify! [sent-23, score-0.11]
19 Query learning can work poorly when a human oracle is used. [sent-30, score-0.181]
wordName wordTfidf (topN-words)
[('labels', 0.341), ('active', 0.225), ('unlabeled', 0.187), ('qbc', 0.172), ('draw', 0.153), ('querying', 0.153), ('speech', 0.146), ('separators', 0.142), ('hidden', 0.134), ('usual', 0.128), ('points', 0.124), ('learner', 0.123), ('perfectly', 0.118), ('line', 0.118), ('complexity', 0.118), ('samples', 0.114), ('human', 0.11), ('classifier', 0.106), ('query', 0.106), ('log', 0.105), ('sample', 0.105), ('instance', 0.101), ('linear', 0.098), ('data', 0.098), ('exponential', 0.094), ('analysis', 0.089), ('suppose', 0.089), ('model', 0.087), ('sequence', 0.087), ('improvement', 0.084), ('simple', 0.082), ('show', 0.082), ('field', 0.078), ('supervised', 0.077), ('classified', 0.077), ('intermediate', 0.077), ('origin', 0.077), ('separator', 0.077), ('achieve', 0.071), ('classify', 0.071), ('surface', 0.071), ('adaptively', 0.071), ('charge', 0.071), ('lie', 0.071), ('poorly', 0.071), ('sphere', 0.071), ('spot', 0.071), ('synthesize', 0.071), ('tedious', 0.071), ('unexplored', 0.071)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000004 45 hunch net-2005-03-22-Active learning
Introduction: Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework. Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h w }: h w (x) = 1 if x > w, and 0 otherwise. VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (
2 0.2220706 432 hunch net-2011-04-20-The End of the Beginning of Active Learning
Introduction: This post is by Daniel Hsu and John Langford. In selective sampling style active learning, a learning algorithm chooses which examples to label. We now have an active learning algorithm that is: Efficient in label complexity, unlabeled complexity, and computational complexity. Competitive with supervised learning anywhere that supervised learning works. Compatible with online learning, with any optimization-based learning algorithm, with any loss function, with offline testing, and even with changing learning algorithms. Empirically effective. The basic idea is to combine disagreement region-based sampling with importance weighting : an example is selected to be labeled with probability proportional to how useful it is for distinguishing among near-optimal classifiers, and labeled examples are importance-weighted by the inverse of these probabilities. The combination of these simple ideas removes the sampling bias problem that has plagued many previous he
3 0.20020176 183 hunch net-2006-06-14-Explorations of Exploration
Introduction: Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models. Reinforcement Learning (1) . Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s’ is given by a conditional distribution P(s’|s,a) given the current state s and action a . The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E 3 by Satinder Singh and Michael Kearns . Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E 3 and
4 0.19851036 127 hunch net-2005-11-02-Progress in Active Learning
Introduction: Several bits of progress have been made since Sanjoy pointed out the significant lack of theoretical understanding of active learning . This is an update on the progress I know of. As a refresher, active learning as meant here is: There is a source of unlabeled data. There is an oracle from which labels can be requested for unlabeled data produced by the source. The goal is to perform well with minimal use of the oracle. Here is what I’ve learned: Sanjoy has developed sufficient and semi-necessary conditions for active learning given the assumptions of IID data and “realizability” (that one of the classifiers is a correct classifier). Nina , Alina , and I developed an algorithm for active learning relying on only the assumption of IID data. A draft is here . Nicolo , Claudio , and Luca showed that it is possible to do active learning in an entirely adversarial setting for linear threshold classifiers here . This was published a year or two ago and I r
5 0.17460634 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design
Introduction: This is a summary of the workshop on Learning Problem Design which Alina and I ran at NIPS this year. The first question many people have is “What is learning problem design?” This workshop is about admitting that solving learning problems does not start with labeled data, but rather somewhere before. When humans are hired to produce labels, this is usually not a serious problem because you can tell them precisely what semantics you want the labels to have, and we can fix some set of features in advance. However, when other methods are used this becomes more problematic. This focus is important for Machine Learning because there are very large quantities of data which are not labeled by a hired human. The title of the workshop was a bit ambitious, because a workshop is not long enough to synthesize a diversity of approaches into a coherent set of principles. For me, the posters at the end of the workshop were quite helpful in getting approaches to gel. Here are some an
6 0.15045857 143 hunch net-2005-12-27-Automated Labeling
7 0.14019001 360 hunch net-2009-06-15-In Active Learning, the question changes
8 0.13619636 351 hunch net-2009-05-02-Wielding a New Abstraction
9 0.13270521 251 hunch net-2007-06-24-Interesting Papers at ICML 2007
10 0.12595662 41 hunch net-2005-03-15-The State of Tight Bounds
11 0.12281545 332 hunch net-2008-12-23-Use of Learning Theory
12 0.11786484 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
13 0.11483246 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
14 0.11479078 299 hunch net-2008-04-27-Watchword: Supervised Learning
15 0.10825843 61 hunch net-2005-04-25-Embeddings: what are they good for?
16 0.10617594 293 hunch net-2008-03-23-Interactive Machine Learning
17 0.10267827 33 hunch net-2005-02-28-Regularization
18 0.10149065 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
19 0.10114197 164 hunch net-2006-03-17-Multitask learning is Black-Boxable
20 0.099980749 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
topicId topicWeight
[(0, 0.231), (1, 0.147), (2, 0.002), (3, -0.077), (4, 0.15), (5, -0.038), (6, 0.028), (7, -0.007), (8, 0.049), (9, -0.045), (10, 0.099), (11, 0.027), (12, -0.101), (13, 0.056), (14, -0.15), (15, -0.064), (16, -0.072), (17, 0.105), (18, -0.06), (19, -0.025), (20, 0.025), (21, 0.036), (22, 0.058), (23, 0.005), (24, 0.083), (25, -0.078), (26, 0.025), (27, 0.022), (28, 0.004), (29, -0.14), (30, 0.078), (31, 0.027), (32, 0.044), (33, -0.004), (34, -0.001), (35, -0.029), (36, 0.042), (37, -0.016), (38, -0.053), (39, -0.101), (40, -0.101), (41, -0.039), (42, 0.061), (43, -0.123), (44, 0.083), (45, 0.092), (46, 0.055), (47, 0.049), (48, -0.02), (49, 0.063)]
simIndex simValue blogId blogTitle
same-blog 1 0.97679138 45 hunch net-2005-03-22-Active learning
Introduction: Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework. Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h w }: h w (x) = 1 if x > w, and 0 otherwise. VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (
2 0.71529496 432 hunch net-2011-04-20-The End of the Beginning of Active Learning
Introduction: This post is by Daniel Hsu and John Langford. In selective sampling style active learning, a learning algorithm chooses which examples to label. We now have an active learning algorithm that is: Efficient in label complexity, unlabeled complexity, and computational complexity. Competitive with supervised learning anywhere that supervised learning works. Compatible with online learning, with any optimization-based learning algorithm, with any loss function, with offline testing, and even with changing learning algorithms. Empirically effective. The basic idea is to combine disagreement region-based sampling with importance weighting : an example is selected to be labeled with probability proportional to how useful it is for distinguishing among near-optimal classifiers, and labeled examples are importance-weighted by the inverse of these probabilities. The combination of these simple ideas removes the sampling bias problem that has plagued many previous he
3 0.68892831 127 hunch net-2005-11-02-Progress in Active Learning
Introduction: Several bits of progress have been made since Sanjoy pointed out the significant lack of theoretical understanding of active learning . This is an update on the progress I know of. As a refresher, active learning as meant here is: There is a source of unlabeled data. There is an oracle from which labels can be requested for unlabeled data produced by the source. The goal is to perform well with minimal use of the oracle. Here is what I’ve learned: Sanjoy has developed sufficient and semi-necessary conditions for active learning given the assumptions of IID data and “realizability” (that one of the classifiers is a correct classifier). Nina , Alina , and I developed an algorithm for active learning relying on only the assumption of IID data. A draft is here . Nicolo , Claudio , and Luca showed that it is possible to do active learning in an entirely adversarial setting for linear threshold classifiers here . This was published a year or two ago and I r
4 0.68202049 143 hunch net-2005-12-27-Automated Labeling
Introduction: One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this: Unsupervised Learning . You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about. Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way. Active Learning . You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy. It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows: Identify some subset of observed values to predict
5 0.62377584 183 hunch net-2006-06-14-Explorations of Exploration
Introduction: Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models. Reinforcement Learning (1) . Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s’ is given by a conditional distribution P(s’|s,a) given the current state s and action a . The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E 3 by Satinder Singh and Michael Kearns . Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E 3 and
6 0.61724055 360 hunch net-2009-06-15-In Active Learning, the question changes
7 0.59688479 310 hunch net-2008-07-15-Interesting papers at COLT (and a bit of UAI & workshops)
8 0.55629343 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design
9 0.5367282 224 hunch net-2006-12-12-Interesting Papers at NIPS 2006
10 0.52969354 139 hunch net-2005-12-11-More NIPS Papers
11 0.51916796 61 hunch net-2005-04-25-Embeddings: what are they good for?
12 0.51607388 299 hunch net-2008-04-27-Watchword: Supervised Learning
13 0.51460314 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”
14 0.51080078 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting
15 0.4957203 444 hunch net-2011-09-07-KDD and MUCMD 2011
16 0.49234426 161 hunch net-2006-03-05-“Structural” Learning
17 0.49195847 155 hunch net-2006-02-07-Pittsburgh Mind Reading Competition
18 0.48752564 251 hunch net-2007-06-24-Interesting Papers at ICML 2007
19 0.48148739 338 hunch net-2009-01-23-An Active Learning Survey
20 0.46673933 351 hunch net-2009-05-02-Wielding a New Abstraction
topicId topicWeight
[(3, 0.031), (10, 0.039), (27, 0.665), (33, 0.022), (38, 0.012), (53, 0.025), (55, 0.034), (94, 0.066), (95, 0.024)]
simIndex simValue blogId blogTitle
same-blog 1 0.99916285 45 hunch net-2005-03-22-Active learning
Introduction: Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework. Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h w }: h w (x) = 1 if x > w, and 0 otherwise. VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (
2 0.99566483 400 hunch net-2010-06-13-The Good News on Exploration and Learning
Introduction: Consider the contextual bandit setting where, repeatedly: A context x is observed. An action a is taken given the context x . A reward r is observed, dependent on x and a . Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward. This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions. A decade ago the best machine learning techniques for this setting where implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and but
3 0.99298388 245 hunch net-2007-05-12-Loss Function Semantics
Introduction: Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself. Optimizing squared loss l sq (y,y’)=(y-y’) 2 means predicting the (conditional) mean of y . Optimizing absolute value loss l av (y,y’)=|y-y’| means predicting the (conditional) median of y . Variants can handle other quantiles . 0/1 loss for classification is a special case. Optimizing log loss l log (y,y’)=log (1/Pr z~y’ (z=y)) means minimizing the description length of y . The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form: For all distributions D over Y , if y’ = arg min y’ E y ~ D l sq (y,y’) then y’ = E y~D y Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y . There are 3 points to this post. Everyone doing general machine lear
4 0.99251062 172 hunch net-2006-04-14-JMLR is a success
Introduction: In 2001, the “ Journal of Machine Learning Research ” was created in reaction to unadaptive publisher policies at MLJ . Essentially, with the creation of the internet, the bottleneck in publishing research shifted from publishing to research. The declaration of independence accompanying this move expresses the reasons why in greater detail. MLJ has strongly changed its policy in reaction to this. In particular, there is no longer an assignment of copyright to the publisher (*), and MLJ regularly sponsors many student “best paper awards” across several conferences with cash prizes. This is an advantage of MLJ over JMLR: MLJ can afford to sponsor cash prizes for the machine learning community. The remaining disadvantage is that reading papers in MLJ sometimes requires searching for the author’s website where the free version is available. In contrast, JMLR articles are freely available to everyone off the JMLR website. Whether or not this disadvantage cancels the advantage i
5 0.99127984 247 hunch net-2007-06-14-Interesting Papers at COLT 2007
Introduction: Here are two papers that seem particularly interesting at this year’s COLT. Gilles Blanchard and François Fleuret , Occam’s Hammer . When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight . A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper. Adam Tauman Kalai . Learning Nested Halfspaces and Uphill Decision Trees . Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X is extraordinarily ch
6 0.99053317 308 hunch net-2008-07-06-To Dual or Not
7 0.99044371 274 hunch net-2007-11-28-Computational Consequences of Classification
8 0.98906446 166 hunch net-2006-03-24-NLPers
9 0.98906446 246 hunch net-2007-06-13-Not Posting
10 0.98906446 418 hunch net-2010-12-02-Traffic Prediction Problem
11 0.98897564 288 hunch net-2008-02-10-Complexity Illness
12 0.98834008 352 hunch net-2009-05-06-Machine Learning to AI
13 0.98554146 9 hunch net-2005-02-01-Watchword: Loss
14 0.98300034 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
15 0.98010039 304 hunch net-2008-06-27-Reviewing Horror Stories
16 0.97503316 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
17 0.96885556 244 hunch net-2007-05-09-The Missing Bound
18 0.96042174 293 hunch net-2008-03-23-Interactive Machine Learning
19 0.95779109 483 hunch net-2013-06-10-The Large Scale Learning class notes
20 0.95600313 67 hunch net-2005-05-06-Don’t mix the solution into the problem