hunch_net-2011-435 knowledge-graph by maker-knowledge-mining

435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms


meta info for this blog

Source: html

Introduction: Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment. There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n² dependence on the number of states in a Hidden Markov Model and Kernelized Support Vector Machines where optimization is typically quadratic or worse.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. [sent-8, score-0.426]

2 Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n² dependence on the number of states in a Hidden Markov Model and Kernelized Support Vector Machines where optimization is typically quadratic or worse. [sent-9, score-0.41]
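
To make the superlinear dependence concrete, here is a minimal sketch (not from the original post; dimensions and names are illustrative) of an HMM forward pass: every one of the T time steps touches all k×k transition pairs, so inference costs O(Tk²) and grows quadratically as states are added.

```python
import numpy as np

def hmm_forward(obs_lik, trans, init):
    """Forward pass of a Hidden Markov Model.

    obs_lik: (T, k) observation likelihoods per state
    trans:   (k, k) transition matrix
    init:    (k,)   initial state distribution
    Each time step touches all k*k transition pairs, so total cost is O(T * k^2).
    """
    T, k = obs_lik.shape
    alpha = init * obs_lik[0]
    for t in range(1, T):
        # k^2 work per step: sum over previous states for each next state
        alpha = (alpha @ trans) * obs_lik[t]
        alpha /= alpha.sum()  # normalize for numerical stability
    return alpha

# toy example: 3 states, 5 observations
rng = np.random.default_rng(0)
k, T = 3, 5
trans = rng.dirichlet(np.ones(k), size=k)  # rows are from-state distributions
init = np.ones(k) / k
obs_lik = rng.random((T, k))
print(hmm_forward(obs_lik, trans, init))
```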

3 The most obvious is that linear time allows you to deal with large datasets. [sent-11, score-0.327]

4 A less obvious but critical point is that a superlinear time algorithm is inherently buggy—it has an unreliable running time that can easily explode if you accidentally give it too much input. [sent-12, score-0.865]
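
A quick back-of-the-envelope sketch of that fragility (illustrative, not from the post): doubling the input doubles a linear algorithm's work but quadruples a quadratic one's, so an accidental overload degrades superlinear algorithms disproportionately.

```python
import time

def linear_work(n):
    return sum(range(n))  # O(n)

def quadratic_work(n):
    # O(n^2), e.g. a naive all-pairs comparison
    total = 0
    for i in range(n):
        for j in range(n):
            total += i ^ j
    return total

for n in (1000, 2000, 4000):
    t0 = time.perf_counter(); linear_work(n);    t_lin = time.perf_counter() - t0
    t0 = time.perf_counter(); quadratic_work(n); t_quad = time.perf_counter() - t0
    print(f"n={n}: linear {t_lin:.4f}s, quadratic {t_quad:.4f}s")
# linear time roughly doubles as n doubles; quadratic time roughly quadruples
```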

5 This observation means that development of systems with clean abstractions can be extraordinarily helpful, as it allows people to work independently. [sent-18, score-0.273]

6 How do we efficiently learn in settings where exploration is required? [sent-22, score-0.373]

7 This is deeply critical to many applications, because the learning with exploration setting is inherently more natural than the standard supervised learning setting. [sent-24, score-0.455]
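
As one concrete instance of the learning-with-exploration setting, below is a hedged epsilon-greedy contextual bandit sketch (the simulator, linear model, and all names are illustrative assumptions): the key difference from supervised learning is that only the chosen action's reward is ever observed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, dim, eps, lr = 4, 8, 0.1, 0.05
W = np.zeros((n_actions, dim))        # one linear reward model per action

def true_reward(x, a):                # unknown environment (simulated here)
    return float(x[a % dim] > 0)

for t in range(10_000):
    x = rng.normal(size=dim)          # context
    if rng.random() < eps:            # explore with probability eps
        a = int(rng.integers(n_actions))
    else:                             # otherwise exploit the current model
        a = int(np.argmax(W @ x))
    r = true_reward(x, a)             # reward observed only for action a
    W[a] += lr * (r - W[a] @ x) * x   # squared-loss update for the chosen arm
```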

8 How can we do efficient offline evaluation of algorithms (see here for a first attempt)? [sent-26, score-0.265]
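
One standard route to offline evaluation, in the spirit of the linked first attempt, is importance weighting over logged exploration data; this sketch (the function names are mine) estimates a new policy's value from logs recorded with known action probabilities, without running the policy live.

```python
import numpy as np

def ips_value(logs, policy):
    """Inverse-propensity-scored estimate of a policy's average reward.

    logs: iterable of (x, a, r, p) where action a was chosen with
          probability p by the logging policy and r is the observed reward.
    policy: maps a context x to an action.
    Unbiased when p > 0 for every action the new policy can choose.
    """
    vals = [r * (policy(x) == a) / p for (x, a, r, p) in logs]
    return float(np.mean(vals))
```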

9 How can we best construct reward functions operating on different time scales? [sent-33, score-0.265]
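
To make "different time scales" concrete, one hedged reading uses discounted returns with different discount factors: a discount near 1 spreads credit over a long horizon, while a small discount rewards only immediate consequences.

```python
def discounted_return(rewards, gamma):
    # sum over t of gamma^t * r_t; effective horizon is roughly 1 / (1 - gamma)
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0] * 200
print(discounted_return(rewards, 0.5))   # ~2.0: only the next couple of steps matter
print(discounted_return(rewards, 0.99))  # ~86.6: credit reaches ~100 steps out
```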

10 But linear predictors are not enough—we would like learning algorithms that can, for example, learn from all the images in the world. [sent-41, score-0.407]

11 The standard solution in information retrieval is to evaluate (or approximately evaluate) all objects in a database, returning the elements with the largest score according to some learned or constructed scoring function. [sent-45, score-0.311]

12 This is an inherently O(n) operation, which is frustrating when it’s plausible that an exponentially faster O(log(n)) solution exists. [sent-46, score-0.323]

13 A good solution involves both theory and empirical work here, as we need to think about how to think about the problem, and of course we need to solve it. [sent-47, score-0.485]
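
The O(n) operation in question looks like the sketch below (a learned linear scorer over a toy database; all names are illustrative): every candidate is scored per query, so cost grows with the database, whereas the hoped-for O(log n) solution would require a search structure over the objects themselves.

```python
import numpy as np

rng = np.random.default_rng(2)
n, dim = 100_000, 32
database = rng.normal(size=(n, dim))   # n candidate objects
w = rng.normal(size=dim)               # learned scoring weights

def top_k(query_weights, items, k=10):
    scores = items @ query_weights     # O(n * dim): touches every object
    return np.argpartition(-scores, k)[:k]  # indices of the k largest scores

print(top_k(w, database))
```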

14 What is a flexible, inherently efficient language for architecting representations for learning algorithms? [sent-48, score-0.571]

15 It’s easy and natural to pose a computationally intractable graphical model, implying many real applications involve approximations. [sent-50, score-0.342]

16 A better solution would be to use a different representation language that is always computationally tractable yet flexible enough to solve real-world problems. [sent-51, score-0.809]

17 These are inherently related, as coarse-to-fine is a pruned breadth-first search. [sent-54, score-0.285]

18 Restated, it is not enough to have a language for specifying your prior structural beliefs—instead we must have a language for such beliefs that results in computationally tractable solutions. [sent-55, score-0.539]
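
Here is a minimal sketch of coarse-to-fine as pruned breadth-first search (the refine/score interface is an illustrative assumption): each level expands only the best few hypotheses from the previous one, so the work is roughly depth times beam width rather than the full exponential hypothesis space.

```python
def coarse_to_fine(root, refine, score, beam=10, depth=3):
    """Pruned breadth-first search over successively finer hypotheses.

    refine(h) yields the finer-grained children of hypothesis h;
    score(h) is a cheap (possibly coarse) quality estimate.
    Keeping only `beam` hypotheses per level prunes the search,
    trading exactness for tractability.
    """
    frontier = [root]
    for _ in range(depth):
        children = [c for h in frontier for c in refine(h)]
        children.sort(key=score, reverse=True)
        frontier = children[:beam]    # prune: keep only the best hypotheses
    return max(frontier, key=score)
```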

19 How do you effectively learn complex nonlinearities capable of better performance than a basic linear predictor? [sent-57, score-0.352]
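
One simple, hedged answer: explicit nonlinear feature expansions let a fast linear learner represent nonlinear functions, as in this sketch where pairwise feature crosses make an XOR-style target exactly fittable by least squares.

```python
import numpy as np

def cross_features(X):
    # append all pairwise products x_i * x_j (i <= j) to the raw features
    n, d = X.shape
    crosses = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack([X] + crosses)

# XOR-style target: no linear predictor on (x1, x2) fits it,
# but the x1*x2 cross feature does.
X = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
y = np.array([1., -1., -1., 1.])              # sign(x1 * x2)
Z = cross_features(X)
w, *_ = np.linalg.lstsq(Z, y, rcond=None)     # linear fit in the expanded space
print(np.sign(Z @ w))                         # recovers y exactly
```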

20 Good solutions to each of the research directions above would result in revolutions in their area, and every one of them would plausibly see wide applicability. [sent-60, score-0.308]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('superlinear', 0.212), ('inherently', 0.192), ('language', 0.137), ('efficient', 0.136), ('solution', 0.131), ('graphical', 0.13), ('algorithms', 0.129), ('operation', 0.124), ('coarse', 0.118), ('learn', 0.116), ('evaluate', 0.109), ('hashing', 0.109), ('natural', 0.107), ('flexible', 0.106), ('computationally', 0.105), ('solve', 0.099), ('observation', 0.097), ('fine', 0.093), ('construct', 0.092), ('development', 0.092), ('linear', 0.091), ('efficiently', 0.088), ('reward', 0.088), ('settings', 0.087), ('directions', 0.087), ('time', 0.085), ('remains', 0.085), ('allows', 0.084), ('exploration', 0.082), ('tractable', 0.081), ('tutorial', 0.081), ('model', 0.081), ('two', 0.081), ('enough', 0.079), ('algorithm', 0.079), ('plausibly', 0.079), ('think', 0.078), ('almost', 0.077), ('critical', 0.074), ('effectively', 0.074), ('requires', 0.072), ('phrasing', 0.071), ('newton', 0.071), ('interpolates', 0.071), ('nonlinearities', 0.071), ('unreliable', 0.071), ('returning', 0.071), ('would', 0.071), ('optimization', 0.068), ('obvious', 0.067)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9999997 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms


2 0.20001486 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x. The agent makes a choice a. The world announces a reward r. The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of…

3 0.18369478 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c…

4 0.17869347 351 hunch net-2009-05-02-Wielding a New Abstraction

Introduction: This post is partly meant as an advertisement for the reductions tutorial Alina, Bianca, and I are planning to do at ICML. Please come, if you are interested. Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems. In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult. It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, yo…

5 0.17733924 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

Introduction: Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m²) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. The above question can also be asked for test cases. In some applications, test-time performance is of great importance. How does the algorithm scale with the number of…

6 0.17257659 347 hunch net-2009-03-26-Machine Learning is too easy

7 0.17167114 332 hunch net-2008-12-23-Use of Learning Theory

8 0.16945647 454 hunch net-2012-01-30-ICML Posters and Scope

9 0.16575162 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

10 0.1654366 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

11 0.16065863 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

12 0.15029651 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

13 0.14867109 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

14 0.14815441 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

15 0.14760044 420 hunch net-2010-12-26-NIPS 2010

16 0.1427267 432 hunch net-2011-04-20-The End of the Beginning of Active Learning

17 0.13600641 95 hunch net-2005-07-14-What Learning Theory might do

18 0.13592072 237 hunch net-2007-04-02-Contextual Scaling

19 0.13440242 444 hunch net-2011-09-07-KDD and MUCMD 2011

20 0.13429165 370 hunch net-2009-09-18-Necessary and Sufficient Research


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.377), (1, 0.098), (2, -0.104), (3, 0.061), (4, 0.068), (5, -0.024), (6, -0.017), (7, 0.022), (8, 0.037), (9, 0.064), (10, -0.014), (11, -0.103), (12, -0.017), (13, 0.085), (14, -0.043), (15, 0.04), (16, 0.032), (17, 0.008), (18, -0.001), (19, -0.057), (20, -0.01), (21, 0.05), (22, 0.019), (23, -0.0), (24, -0.027), (25, 0.0), (26, -0.073), (27, -0.069), (28, -0.059), (29, -0.006), (30, -0.093), (31, -0.088), (32, 0.082), (33, 0.113), (34, -0.043), (35, 0.039), (36, 0.079), (37, -0.065), (38, 0.002), (39, 0.063), (40, 0.01), (41, -0.057), (42, 0.027), (43, -0.066), (44, 0.015), (45, 0.002), (46, -0.02), (47, -0.041), (48, -0.103), (49, -0.075)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96934128 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms


2 0.78984594 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

Introduction: Suppose we have a set of observations over time x_1, x_2, …, x_t and want to predict some future event y_{t+1}. An inevitable problem arises, because learning a predictor h(x_1, …, x_t) of y_{t+1} is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required. Existing approaches for deriving state have some limitations. Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a priori specification of the observations. Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front. Dynamic Bayesian Networks (graphical models through time) require substantial a priori specification and often re…

3 0.75791413 351 hunch net-2009-05-02-Wielding a New Abstraction

Introduction: This post is partly meant as an advertisement for the reductions tutorial Alina, Bianca, and I are planning to do at ICML. Please come, if you are interested. Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems. In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult. It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, yo…

4 0.73433954 253 hunch net-2007-07-06-Idempotent-capable Predictors

Introduction: One way to distinguish different learning algorithms is by their ability or inability to easily use an input variable as the predicted output. This is desirable for at least two reasons: (1) Modularity: if we want to build complex learning systems via reuse of a subsystem, it’s important to have compatible I/O. (2) “Prior” knowledge: machine learning is often applied in situations where we do have some knowledge of what the right solution is, often in the form of an existing system. In such situations, it’s good to start with a learning algorithm that can be at least as good as any existing system. When doing classification, most learning algorithms can do this. For example, a decision tree can split on a feature, and then classify. The real differences come up when we attempt regression. Many of the algorithms we know and commonly use are not idempotent predictors. Logistic regressors can not be idempotent, because all input features are mapped through a nonlinearity.

5 0.73407304 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

Introduction: One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations. There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail. In contrast, if you go to a machine learning conference, a large number of the new algorithms are v…

6 0.72755533 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

7 0.71116555 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

8 0.70634437 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

9 0.69904327 347 hunch net-2009-03-26-Machine Learning is too easy

10 0.6829856 235 hunch net-2007-03-03-All Models of Learning have Flaws

11 0.6755808 153 hunch net-2006-02-02-Introspectionism as a Disease

12 0.67510426 370 hunch net-2009-09-18-Necessary and Sufficient Research

13 0.66267169 215 hunch net-2006-10-22-Exemplar programming

14 0.66168451 420 hunch net-2010-12-26-NIPS 2010

15 0.65397745 42 hunch net-2005-03-17-Going all the Way, Sometimes

16 0.65269828 148 hunch net-2006-01-13-Benchmarks for RL

17 0.64852536 332 hunch net-2008-12-23-Use of Learning Theory

18 0.63130951 471 hunch net-2012-08-24-Patterns for research in machine learning

19 0.62826353 388 hunch net-2010-01-24-Specializations of the Master Problem

20 0.62682819 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.017), (3, 0.026), (9, 0.012), (10, 0.016), (16, 0.115), (27, 0.255), (38, 0.051), (48, 0.016), (49, 0.022), (53, 0.114), (55, 0.092), (64, 0.021), (77, 0.034), (94, 0.101), (95, 0.039)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97613847 414 hunch net-2010-10-17-Partha Niyogi has died

Introduction: from brain cancer. I asked Misha, who worked with him, to write about it. Partha Niyogi, Louis Block Professor in Computer Science and Statistics at the University of Chicago, passed away on October 1, 2010, aged 43. I first met Partha Niyogi almost exactly ten years ago when I was a graduate student in math and he had just started as a faculty member in Computer Science and Statistics at the University of Chicago. Strangely, we first talked at length due to a somewhat convoluted mathematical argument in a paper on pattern recognition. I asked him some questions about the paper, and, even though the topic was new to him, he had put serious thought into it and we started regular meetings. We made significant progress and developed a line of research stemming initially just from trying to understand that one paper and to simplify one derivation. I think this was typical of Partha, showing both his intellectual curiosity and his intuition for the serendipitous; having a sense and focus fo…

2 0.96367234 19 hunch net-2005-02-14-Clever Methods of Overfitting

Introduction: “Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets. We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid. Name: Traditional overfitting. Method: train a complex predictor on too-few examples. Remedy: hold out pristine examples for testing; use a simpler predictor; get more training examples; integrate over many predictors; reject papers which do this. Name: Parameter twe…

same-blog 3 0.9572528 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms


4 0.92288697 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

Introduction: Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m²) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. The above question can also be asked for test cases. In some applications, test-time performance is of great importance. How does the algorithm scale with the number of…

5 0.92127317 176 hunch net-2006-05-01-A conversation between Theo and Pat

Introduction: Pat (the practitioner): I need to do multiclass classification and I only have a decision tree. Theo (the theoretician): Use an error correcting output code. Pat: Oh, that’s cool. But the created binary problems seem unintuitive. I’m not sure the decision tree can solve them. Theo: Oh? Is your problem a decision list? Pat: No, I don’t think so. Theo: Hmm. Are the classes well separated by axis aligned splits? Pat: Err, maybe. I’m not sure. Theo: Well, if they are, under the IID assumption I can tell you how many samples you need. Pat: IID? The data is definitely not IID. Theo: Oh dear. Pat: Can we get back to the choice of ECOC? I suspect we need to build it dynamically in response to which subsets of the labels are empirically separable from each other. Theo: Ok. What do you know about your problem? Pat: Not much. My friend just gave me the dataset. Theo: Then, no one can help you. Pat: (What a fuzzy thinker. Theo keeps jumping t…

6 0.9180128 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

7 0.91226739 131 hunch net-2005-11-16-The Everything Ensemble Edge

8 0.91213447 351 hunch net-2009-05-02-Wielding a New Abstraction

9 0.91123092 343 hunch net-2009-02-18-Decision by Vetocracy

10 0.91094714 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

11 0.90939194 95 hunch net-2005-07-14-What Learning Theory might do

12 0.9092136 347 hunch net-2009-03-26-Machine Learning is too easy

13 0.90910733 370 hunch net-2009-09-18-Necessary and Sufficient Research

14 0.90861934 235 hunch net-2007-03-03-All Models of Learning have Flaws

15 0.90487516 371 hunch net-2009-09-21-Netflix finishes (and starts)

16 0.90453988 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

17 0.9038288 237 hunch net-2007-04-02-Contextual Scaling

18 0.90291274 12 hunch net-2005-02-03-Learning Theory, by assumption

19 0.90256333 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class

20 0.90231764 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning