hunch_net-2005-103 — SVM Adaptability (hunch net, 2005-08-18)
Source: html
Introduction: Several recent papers have shown that SVM-like optimizations can be used to handle several large families of loss functions. This is a good thing because it is implausible that the loss function imposed by the world cannot be taken into account in the process of solving a prediction problem. Even people used to the hard-core Bayesian approach to learning often note that some approximations are almost inevitable in specifying a prior and/or integrating to achieve a posterior. Taking into account how the system will be evaluated can allow both computational effort and design effort to be focused so as to improve performance. A current laundry list of capabilities includes:
- 2002: multiclass SVM, including arbitrary cost matrices
- ICML 2003: Hidden Markov Models
- NIPS 2003: Markov Networks (see some discussion)
- EMNLP 2004: Context free grammars
- ICML 2004: Any loss (with much computation)
- ICML 2005: Any constrained linear prediction model (that’s my own …
- ICML 2005: Any loss dependent on a contingency table

I am personally interested in how this relates to the learning reductions work, which has similar goals but works at a different abstraction level (the learning problem rather than the algorithmic mechanism).
The difference in abstraction implies that anything solvable by reduction should be solvable by a direct algorithmic mechanism. However, comparing and contrasting the results I know of, it seems that what is solvable via reduction to classification versus what is solvable via direct SVM-like methods is currently incomparable.
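As one concrete instance of the reduction side — a standard translation, sketched here with hypothetical names rather than code from the post — example-dependent cost-sensitive binary classification reduces to importance-weighted binary classification: label each example with its cheaper prediction, and weight it by the regret of choosing the other one.

```python
def cost_sensitive_to_weighted(examples):
    """Map (x, c0, c1) triples, where c_y is the cost of predicting label y on x,
    to importance-weighted binary examples (x, label, weight): the optimal label
    is the cheaper one, and the weight |c0 - c1| is the regret for flipping it."""
    return [(x, int(c0 > c1), abs(c0 - c1)) for x, c0, c1 in examples]

weighted = cost_sensitive_to_weighted([("a", 0.0, 1.0), ("b", 5.0, 2.0), ("c", 4.0, 4.0)])
```

Any importance-weighted binary learner (or an unweighted one, via rejection sampling on the weights) then applies directly.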
Can SVMs be tuned to directly solve (example-dependent) cost-sensitive classification? Obviously, they can be tuned indirectly via reduction, but it is easy to imagine more tractable direct optimizations.
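One plausible direct optimization, sketched as an illustration rather than as any published method (function name, data, and step sizes below are all hypothetical): keep the usual hinge-loss objective, but scale each example's hinge term by its misclassification cost, and minimize by subgradient descent.

```python
import numpy as np

def cost_weighted_svm(X, y, costs, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + (1/n) * sum_i costs[i] * max(0, 1 - y_i * w.x_i)
    by full-batch subgradient descent. y[i] is in {-1, +1}; costs[i] >= 0 is the
    example-dependent cost of misclassifying example i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1.0  # examples whose hinge term is nonzero
        pull = (costs[active][:, None] * y[active][:, None] * X[active]).sum(axis=0)
        w -= lr * (lam * w - pull / n)
    return w

# Four cheap examples and one conflicting example at x = (1, 0):
X = np.array([[2.0, 0.0], [3.0, 0.0], [-2.0, 0.0], [-3.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
w_uniform = cost_weighted_svm(X, y, costs=np.ones(5))
w_costly = cost_weighted_svm(X, y, costs=np.array([1.0, 1.0, 1.0, 1.0, 100.0]))
```

With uniform costs the majority pattern wins and the conflicting example is sacrificed; raising its cost to 100 flips the decision boundary to respect it.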
How efficiently can learning reductions be used to solve structured prediction problems? Structured prediction problems are instances of cost-sensitive classification, but the regret transform efficiency which occurs when this embedding is done is too weak to be of interest.
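To make the embedding concrete, here is a toy sketch (names hypothetical) of turning one sequence-labeling example into a single cost-sensitive multiclass example: every complete labeling is a "class", with cost equal to its Hamming distance from the truth. The class space grows as |alphabet|^length, which is one way to see why per-class guarantees become uninterestingly weak.

```python
from itertools import product

def embed_as_cost_sensitive(true_seq, alphabet):
    """One cost-sensitive multiclass example: each complete labeling is a
    class; its cost is the Hamming distance to the true labeling."""
    classes = product(alphabet, repeat=len(true_seq))
    return {c: sum(a != b for a, b in zip(c, true_seq)) for c in classes}

# A length-3 sequence over a 2-symbol alphabet yields 2**3 = 8 classes.
costs = embed_as_cost_sensitive(("N", "V", "N"), ("N", "V"))
```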
Are there any problems efficiently solvable by SVM-like algorithms which are not efficiently solvable via learning reductions?