hunch_net hunch_net-2011 hunch_net-2011-426 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered. There are many different kinds of scaling. Scaling in examples: This is the most basic kind of scaling. Online Gradient Descent: This is an old algorithm—I’m not sure if anyone can be credited with it in particular. Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function, of which squared loss, logistic loss, hinge loss, and quantile loss are all worth covering. It’s important to cover the semantics of these loss functions as well. Vowpal Wabbit is a reasonably fast codebase implementing these. Second Order Gradient Descent methods: For some problems, methods taking into account second derivative information can be more effective. I’ve seen preconditioned conjugate gradient work well, for which Jonathan Shewchuck’s writeup is reasonable.
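To make the online gradient descent and loss function discussion concrete, here is a minimal sketch in Python. It is not Vowpal Wabbit's implementation; the function names, the learning rate eta, the dimension, and the sparse-dict feature encoding are all illustrative choices, and only the loss gradients matter for the point being made.

    import math

    def squared_grad(p, y):            # d/dp of (p - y)^2 / 2
        return p - y

    def logistic_grad(p, y):           # y in {-1, +1}; d/dp of log(1 + exp(-y*p))
        return -y / (1.0 + math.exp(y * p))

    def hinge_grad(p, y):              # y in {-1, +1}; subgradient of max(0, 1 - y*p)
        return -y if y * p < 1 else 0.0

    def quantile_grad(p, y, tau=0.5):  # pinball loss; tau is the target quantile
        return -tau if y > p else (1.0 - tau)

    def online_gradient_descent(examples, grad, eta=0.1, dim=10):
        w = [0.0] * dim
        for x, y in examples:          # x is a sparse dict: feature index -> value
            p = sum(w[i] * v for i, v in x.items())
            g = grad(p, y)
            for i, v in x.items():     # only the features present in x are updated
                w[i] -= eta * g * v
        return w

Swapping grad between these functions is the entire difference between estimating a conditional mean, a conditional quantile, or a classifier, which is one way to motivate spending class time on loss semantics.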
sentIndex sentText sentNum sentScore
1 Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function, of which squared loss, logistic loss, hinge loss, and quantile loss are all worth covering. [sent-6, score-0.549]
2 It’s important to cover the semantics of these loss functions as well (summarized after this list). [sent-7, score-0.361]
3 I’ve seen preconditioned conjugate gradient work well, for which Jonathan Shewchuck’s writeup is reasonable. [sent-10, score-0.221]
4 I liked what Markus said at the LCCC workshop : nobody wants to give up on the idea of distributed fault tolerant storage and moving small amounts of code to large amounts of data rather than vice-versa. [sent-14, score-0.359]
5 Hashing approaches are surprisingly effective in my experience (a sketch follows this list). [sent-18, score-0.15]
6 Online L1 regularization is via truncated gradient (a sketch follows this list). [sent-20, score-0.221]
7 John Duchi’s composite mirror descent generalization is also a useful general treatment. [sent-22, score-0.197]
8 Boosting based approaches can also be effective, although training time can become problematic. [sent-23, score-0.15]
9 This is partially mitigated by parallelization algorithms as discussed at the LCCC workshop; see Jerry Ye’s talk and Krysta’s talk. [sent-24, score-0.201]
10 Test-time Evaluation: Ultrafast and efficient test-time evaluation seems to be a goal independent of training. [sent-27, score-0.25]
11 Aside from the basic data structure, I’d cover WAND and predictive indexing. [sent-29, score-0.249]
12 GPU: The use of GPUs to make evaluation both more efficient and fast seems to make sense in many applications. [sent-30, score-0.25]
13 Burr Settles’ active learning survey is pretty comprehensive, although if I were to cover one algorithm, it would be this one, which enjoys a compelling combination of strong theoretical guarantees, computational tractability, empirical performance, and generality. [sent-35, score-0.383]
14 The other common approach is user-feedback information, where bias and exploration effects become a critical concern. [sent-36, score-0.242]
15 The tutorial Alina and I did on learning and exploration is critical here. [sent-37, score-0.169]
16 Online tree building and conditional probability trees are also potentially extremely useful. [sent-41, score-0.343]
17 Building smarter trees can help, such as with a confusion matrix or in an iterative fashion. [sent-42, score-0.236]
18 Label Tree approaches break down when the number of labels becomes so large that filtering eliminates too many examples. [sent-43, score-0.393]
19 I’d cover Searn as well as some of Drew Bagnell’s work, such as this one. [sent-45, score-0.178]
20 A more recent instance is the structured prediction cascade . [sent-50, score-0.239]
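For sentence 2's point about loss semantics, the following standard identities (not a quote from the post) summarize what the minimizer of each loss estimates. Here p is the raw prediction, q = Pr(y = 1 | x), Q_tau denotes the conditional tau-quantile, (t)_+ = max(t, 0), and y is in {-1,+1} for the logistic and hinge rows:

    \begin{align*}
    \arg\min_p \; \mathbb{E}\left[(y-p)^2 \mid x\right] &= \mathbb{E}[y \mid x] && \text{(squared loss: conditional mean)}\\
    \arg\min_p \; \mathbb{E}\left[\tau (y-p)_+ + (1-\tau)(p-y)_+ \mid x\right] &= Q_\tau(y \mid x) && \text{(quantile loss: conditional quantile)}\\
    \arg\min_p \; \mathbb{E}\left[\log(1+e^{-yp}) \mid x\right] &= \log\tfrac{q}{1-q} && \text{(logistic loss: conditional log odds)}\\
    \arg\min_p \; \mathbb{E}\left[(1-yp)_+ \mid x\right] &= \mathrm{sign}(2q-1) && \text{(hinge loss: the Bayes classifier only)}
    \end{align*}

Squared and logistic losses recover conditional means and probabilities, while hinge loss only recovers the classification decision; that distinction is the kind of semantics referred to above.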
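Sentence 5 mentions hashing approaches; the sketch below shows the generic hashing trick of mapping feature names into a fixed 2^b dimensional vector. The hash function, bit width, and collision handling here are illustrative choices and are not Vowpal Wabbit's exact scheme.

    import zlib

    def hash_features(raw, b=18):
        # raw: dict of feature name -> value; returns a sparse dict of index -> value
        mask = (1 << b) - 1
        x = {}
        for name, value in raw.items():
            i = zlib.crc32(name.encode("utf8")) & mask
            x[i] = x.get(i, 0.0) + value   # colliding features simply add together
        return x

    x = hash_features({"word:large": 1.0, "word:scale": 1.0, "bias": 1.0})

The attraction at scale is that no dictionary of feature names needs to be stored or synchronized; the weight vector has a fixed, bounded size.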
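Sentence 6 points to truncated gradient for online L1 regularization. A simplified version of the idea (shrink weights toward zero every k-th update and truncate at zero so they cannot cross it) is sketched below; the parameter names eta, lam, and k are illustrative, and the exact schedule in the truncated gradient paper differs.

    def truncate(w, gravity):
        # pull each weight toward zero by `gravity`, stopping at zero
        for i, wi in enumerate(w):
            if wi > 0:
                w[i] = max(0.0, wi - gravity)
            elif wi < 0:
                w[i] = min(0.0, wi + gravity)

    def online_gd_l1(examples, grad, eta=0.1, lam=1e-4, k=10, dim=10):
        w = [0.0] * dim
        for t, (x, y) in enumerate(examples, 1):
            p = sum(w[i] * v for i, v in x.items())
            g = grad(p, y)
            for i, v in x.items():
                w[i] -= eta * g * v
            if t % k == 0:                 # lazy L1 step, amortized over k updates
                truncate(w, eta * lam * k)
        return w

The grad argument can be any of the loss gradients from the sketch after the introduction above.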
wordName wordTfidf (topN-words)
[('loss', 0.183), ('cover', 0.178), ('nando', 0.172), ('labels', 0.17), ('tree', 0.157), ('cascade', 0.153), ('approaches', 0.15), ('gradient', 0.145), ('gpu', 0.142), ('lccc', 0.142), ('evaluation', 0.14), ('scaling', 0.134), ('compelling', 0.129), ('parallelization', 0.127), ('descent', 0.126), ('label', 0.113), ('efficient', 0.11), ('amounts', 0.106), ('implementation', 0.099), ('building', 0.097), ('trees', 0.089), ('exploration', 0.089), ('structured', 0.086), ('missing', 0.085), ('critical', 0.08), ('smarter', 0.076), ('detector', 0.076), ('tractability', 0.076), ('inconsistency', 0.076), ('writeup', 0.076), ('tolerant', 0.076), ('burr', 0.076), ('settles', 0.076), ('indicies', 0.076), ('cascades', 0.076), ('crf', 0.076), ('enjoys', 0.076), ('sourcing', 0.076), ('truncated', 0.076), ('discussed', 0.074), ('becomes', 0.073), ('online', 0.073), ('nobody', 0.071), ('indexing', 0.071), ('composite', 0.071), ('iterative', 0.071), ('derivative', 0.071), ('jonathan', 0.071), ('bloom', 0.071), ('duchi', 0.071)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000007 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
2 0.23787352 420 hunch net-2010-12-26-NIPS 2010
Introduction: I enjoyed attending NIPS this year, with several things interesting me. For the conference itself: Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona, The Multidimensional Wisdom of Crowds. This paper is about using Mechanical Turk to get label information, with results superior to a majority vote approach. David McAllester, Tamir Hazan, and Joseph Keshet, Direct Loss Minimization for Structured Prediction. This is about another technique for directly optimizing the loss in structured prediction, with an application to speech recognition. Mohammad Saberian and Nuno Vasconcelos, Boosting Classifier Cascades. This is about an algorithm for simultaneously optimizing loss and computation in a classifier cascade construction. There were several other papers on cascades which are worth looking at if interested. Alan Fern and Prasad Tadepalli, A Computational Decision Theory for Interactive Assistants. This paper carves out some
3 0.20395921 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
Introduction: Many people in machine learning take advantage of the notion of a proxy loss: A loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing squared loss (y - f_w(x))^2 over many samples is much more tractable than optimizing 0-1 loss I(y ≠ Threshold(f_w(x) - 0.5)). While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log 1/p’(y|x) where p’(y|x) is the predicted probability of the observed label. A standard method for confi
4 0.18082498 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”
Introduction: Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time? When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss (written out as a formula after this list). In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set. The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that s
5 0.17459691 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy
Introduction: I’ve released version 5.0 of the Vowpal Wabbit online learning software. The major number has changed since the last release because I regard all earlier versions as obsolete—there are several new algorithms & features including substantial changes and upgrades to the default learning algorithm. The biggest changes are new algorithms: Nikos and I improved the default algorithm. The basic update rule still uses gradient descent, but the size of the update is carefully controlled so that it’s impossible to overrun the label. In addition, the normalization has changed. Computationally, these changes are virtually free and yield better results, sometimes much better. Less careful updates can be reenabled with --loss_function classic, although results are still not identical to previous due to normalization changes. Nikos also implemented the per-feature learning rates as per these two papers. Often, this works better than the default algorithm. It isn’t the defa
6 0.17234617 111 hunch net-2005-09-12-Fast Gradient Descent
7 0.17012346 9 hunch net-2005-02-01-Watchword: Loss
8 0.16661792 281 hunch net-2007-12-21-Vowpal Wabbit Code Release
9 0.16577521 235 hunch net-2007-03-03-All Models of Learning have Flaws
10 0.16469029 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design
11 0.16010006 245 hunch net-2007-05-12-Loss Function Semantics
12 0.15571758 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project
13 0.15329009 258 hunch net-2007-08-12-Exponentiated Gradient
14 0.14815441 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
15 0.1451101 346 hunch net-2009-03-18-Parallel ML primitives
16 0.14372456 274 hunch net-2007-11-28-Computational Consequences of Classification
17 0.13755792 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning
18 0.13710839 432 hunch net-2011-04-20-The End of the Beginning of Active Learning
19 0.13606781 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
20 0.13246365 259 hunch net-2007-08-19-Choice of Metrics
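The Bayesian recipe described in entry 4 above (“When is the right time to insert the loss function?”) can be written compactly as the expected-loss-minimizing decision rule. The formula below is standard and is added here for reference rather than quoted from the post:

    \[
    \hat{y}(x) \;=\; \arg\min_{a} \; \mathbb{E}_{y \sim p(y \mid x)}\!\left[\mathrm{loss}(y, a)\right]
    \;=\; \arg\min_{a} \sum_{y} p(y \mid x)\, \mathrm{loss}(y, a)
    \]

The alternative discussed there, optimizing the loss at training time, roughly corresponds to minimizing the empirical loss sum_t loss(y_t, h(x_t)) directly over hypotheses h.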
topicId topicWeight
[(0, 0.31), (1, 0.143), (2, -0.031), (3, -0.09), (4, -0.006), (5, 0.141), (6, -0.221), (7, -0.037), (8, 0.008), (9, 0.148), (10, 0.003), (11, -0.096), (12, -0.042), (13, 0.019), (14, -0.055), (15, 0.014), (16, -0.043), (17, 0.068), (18, -0.056), (19, -0.023), (20, 0.0), (21, -0.001), (22, -0.027), (23, 0.03), (24, 0.016), (25, -0.013), (26, -0.047), (27, 0.074), (28, 0.015), (29, -0.036), (30, -0.01), (31, -0.029), (32, -0.093), (33, -0.037), (34, -0.097), (35, 0.028), (36, -0.062), (37, 0.023), (38, 0.006), (39, 0.064), (40, 0.071), (41, -0.062), (42, -0.009), (43, -0.102), (44, 0.073), (45, 0.042), (46, -0.002), (47, 0.048), (48, -0.042), (49, -0.051)]
simIndex simValue blogId blogTitle
same-blog 1 0.96457446 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
2 0.7839092 420 hunch net-2010-12-26-NIPS 2010
3 0.6719895 281 hunch net-2007-12-21-Vowpal Wabbit Code Release
Introduction: We are releasing the Vowpal Wabbit (Fast Online Learning) code as open source under a BSD (revised) license. This is a project at Yahoo! Research to build a useful large scale learning algorithm which Lihong Li, Alex Strehl, and I have been working on. To appreciate the meaning of “large”, it’s useful to define “small” and “medium”. A “small” supervised learning problem is one where a human could use a labeled dataset and come up with a reasonable predictor. A “medium” supervised learning problem has a dataset that fits into the RAM of a modern desktop computer. A “large” supervised learning problem is one which does not fit into the RAM of a normal machine. VW tackles large scale learning problems by this definition of large. I’m not aware of any other open source Machine Learning tools which can handle this scale (although they may exist). A few close ones are: IBM’s Parallel Machine Learning Toolbox isn’t quite open source. The approach used by this toolbox is essenti
4 0.6627444 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy
5 0.65297121 258 hunch net-2007-08-12-Exponentiated Gradient
Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it (a sketch of the update appears after this list). The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n) (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reve
6 0.63063145 348 hunch net-2009-04-02-Asymmophobia
7 0.62999171 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project
8 0.62011552 111 hunch net-2005-09-12-Fast Gradient Descent
9 0.61756486 167 hunch net-2006-03-27-Gradients everywhere
10 0.5945347 179 hunch net-2006-05-16-The value of the orthodox view of Boosting
11 0.58973646 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning
12 0.58684897 138 hunch net-2005-12-09-Some NIPS papers
13 0.5774532 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”
14 0.55877763 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features
15 0.55725342 279 hunch net-2007-12-19-Cool and interesting things seen at NIPS
16 0.55693001 441 hunch net-2011-08-15-Vowpal Wabbit 6.0
17 0.55067742 346 hunch net-2009-03-18-Parallel ML primitives
18 0.55033809 451 hunch net-2011-12-13-Vowpal Wabbit version 6.1 & the NIPS tutorial
19 0.54489255 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design
20 0.53512138 77 hunch net-2005-05-29-Maximum Margin Mismatch?
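For the Exponentiated Gradient entry above (entry 5), a minimal sketch of one round of the update with squared loss is given below. The function name and the learning rate eta are illustrative; the sketch keeps weights on the probability simplex as described in that introduction and omits the variants that handle general signed weights.

    import math

    def eg_round(w, x, y, eta=0.5):
        # w: current weights on the probability simplex; x: feature/expert values
        p = sum(wi * xi for wi, xi in zip(w, x))      # prediction y' = w . x
        g = 2.0 * (p - y)                             # d/dp of squared loss (p - y)^2
        unnorm = [wi * math.exp(-eta * g * xi) for wi, xi in zip(w, x)]
        z = sum(unnorm)
        return [u / z for u in unnorm]                # renormalize onto the simplex

    n = 4
    w = [1.0 / n] * n                                 # start uniform, as described above
    w = eg_round(w, x=[1.0, 0.0, 1.0, 0.0], y=1.0)

The multiplicative, renormalized form of the update is what yields loss bounds that grow only logarithmically with the number of experts, one of the standard results for EG.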
topicId topicWeight
[(3, 0.013), (9, 0.026), (27, 0.239), (34, 0.021), (38, 0.031), (49, 0.049), (51, 0.056), (53, 0.034), (55, 0.052), (64, 0.05), (85, 0.183), (94, 0.124), (95, 0.05)]
simIndex simValue blogId blogTitle
1 0.93485504 70 hunch net-2005-05-12-Math on the Web
Introduction: Andrej Bauer has set up a Mathematics and Computation Blog. As a first step he has tried to address the persistent and annoying problem of math on the web. As a basic tool for precisely stating and transferring understanding of technical subjects, mathematics is very necessary. Despite this necessity, every mechanism for expressing mathematics on the web seems unnaturally clumsy. Here are some of the methods and their drawbacks: MathML: This was supposed to be the answer, but it has two severe drawbacks: “Internet Explorer” doesn’t read it and the language is an example of push-XML-to-the-limit which no one would ever consider writing in. (In contrast, html is easy to write in.) It’s also very annoying that math fonts must be installed independent of the browser, even for mozilla based browsers. Create inline images: This has several big drawbacks: font size is fixed for all viewers, you can’t cut & paste inside the images, and you can’t hyperlink from (say) symbol to de
same-blog 2 0.90425223 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
3 0.87214255 382 hunch net-2009-12-09-Future Publication Models @ NIPS
Introduction: Yesterday, there was a discussion about future publication models at NIPS. Yann and Zoubin have specific detailed proposals which I’ll add links to when I get them (Yann’s proposal and Zoubin’s proposal). What struck me about the discussion is that there are many simultaneous concerns as well as many simultaneous proposals, which makes it difficult to keep all the distinctions straight in a verbal conversation. It also seemed like people were serious enough about this that we may see some real movement. Certainly, my personal experience motivates that as I’ve posted many times about the substantial flaws in our review process, including some very poor personal experiences. Concerns include the following: (Several) Reviewers are overloaded, boosting the noise in decision making. (Yann) A new system should run with as little built-in delay and friction to the process of research as possible. (Hanna Wallach (updated)) Double-blind review is particularly impor
4 0.81272739 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
Introduction: Suppose we have a set of observations over time x_1, x_2, …, x_t and want to predict some future event y_{t+1}. An inevitable problem arises, because learning a predictor h(x_1, …, x_t) of y_{t+1} is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required. Existing approaches for deriving state have some limitations. Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a priori specification of the observations. Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front. Dynamic Bayesian Networks (graphical models through time) require substantial a priori specification and often re
5 0.80984974 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge
Introduction: The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured. From the instructions page, several issues come up: Large Definition: My personal definition of dataset size is: small: A dataset is small if a human could look at the dataset and plausibly find a good solution. medium: A dataset is medium-sized if it fits in the RAM of a reasonably priced computer. large: A large dataset does not fit in the RAM of a reasonably priced computer. By this definition, all of the datasets are medium sized. This might sound like a pissing match over dataset size, but I believe it is more than that. The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require ran
6 0.80812359 235 hunch net-2007-03-03-All Models of Learning have Flaws
7 0.80219686 360 hunch net-2009-06-15-In Active Learning, the question changes
8 0.80166852 136 hunch net-2005-12-07-Is the Google way the way for machine learning?
9 0.8013252 348 hunch net-2009-04-02-Asymmophobia
10 0.79997981 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
11 0.79857934 351 hunch net-2009-05-02-Wielding a New Abstraction
12 0.79780895 258 hunch net-2007-08-12-Exponentiated Gradient
13 0.79760247 345 hunch net-2009-03-08-Prediction Science
14 0.79352367 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
15 0.79158133 132 hunch net-2005-11-26-The Design of an Optimal Research Environment
16 0.79155636 371 hunch net-2009-09-21-Netflix finishes (and starts)
17 0.79096931 466 hunch net-2012-06-05-ICML acceptance statistics
18 0.79066432 143 hunch net-2005-12-27-Automated Labeling
19 0.78908825 95 hunch net-2005-07-14-What Learning Theory might do
20 0.78831273 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design