hunch_net hunch_net-2008 hunch_net-2008-286 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m ? Any algorithm using all of the data is at least O(m) , but in many cases this is O(m^2) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. The above question can also be asked for test cases. In some applications, test-time performance is of great importance. How does the algorithm scale with the number of
sentIndex sentText sentNum sentScore
1 When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m ? [sent-6, score-0.661]
2 Any algorithm using all of the data is at least O(m) , but in many cases this is O(m^2) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). [sent-7, score-0.483]
3 The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. [sent-8, score-0.673]
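To make the O(m^2) case above concrete, here is a minimal sketch (plain Python, toy data, not any particular library) of naive leave-one-out nearest-neighbor self-prediction: every example is compared against every other example, giving m*(m-1) distance computations.

    # Naive leave-one-out nearest-neighbor self-prediction: for each of the m
    # examples, scan the other m-1 examples for its nearest neighbor, so the
    # total work is m*(m-1) distance computations, i.e. O(m^2).
    def nn_self_predict(X, y):
        m = len(X)
        predictions = []
        for i in range(m):
            best_j, best_d = None, float("inf")
            for j in range(m):
                if j == i:
                    continue
                d = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
                if d < best_d:
                    best_d, best_j = d, j
            predictions.append(y[best_j])
        return predictions

    # Tiny made-up dataset: two well-separated clusters.
    X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
    y = [0, 0, 1, 1]
    print(nn_self_predict(X, y))  # [0, 0, 1, 1]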
4 How does the algorithm scale with the number of features n per example? [sent-11, score-0.731]
5 Many second-order gradient descent algorithms are O(n^2) or worse, which becomes unacceptable as the number of parameters grows. [sent-12, score-0.464]
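A rough sketch of where the O(n^2) (or worse) comes from, assuming a made-up quadratic objective: a Newton-style step must store and solve against an n-by-n Hessian, while a plain gradient step only touches n numbers beyond the gradient itself.

    import numpy as np

    n = 500
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    H = A @ A.T + n * np.eye(n)        # positive-definite Hessian: n^2 floats just to store
    b = rng.standard_normal(n)
    w = np.zeros(n)
    grad = H @ w - b                   # gradient of 0.5*w'Hw - b'w at w

    # First-order update: O(n) work and memory given the gradient.
    w_gd = w - 0.01 * grad

    # Newton update: solving with the n-by-n Hessian costs O(n^2) memory
    # and roughly O(n^3) time, which is what stops scaling as n grows.
    w_newton = w + np.linalg.solve(H, -grad)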
6 Nonsparse algorithms applied to sparse datasets have an undefined dependence, which is typically terrible. [sent-13, score-0.31]
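As a sketch of the mismatch (indices and sizes here are made up): a prediction that loops over a dense weight vector costs O(n) per example no matter how sparse the example is, while looping over just the nonzero features costs O(s) for s nonzeros.

    # A sparse example: only 3 of the n possible features are present.
    n = 1_000_000
    example = {3: 1.0, 40_517: 0.5, 999_999: 2.0}   # feature index -> value
    w = [0.0] * n                                   # dense weight vector

    # Dense-minded dot product: touches all n coordinates, O(n) per example.
    pred_dense = sum(w[i] * example.get(i, 0.0) for i in range(n))

    # Sparse-minded dot product: touches only the nonzeros, O(s) per example.
    pred_sparse = sum(w[i] * v for i, v in example.items())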
7 What are the memory requirements of the learning algorithm? [sent-14, score-0.302]
8 Something linear in the number of features (or less) is nice. [sent-15, score-0.333]
9 Nearest neighbor and kernel methods can be problematic, because the memory requirement is uncontrolled. [sent-16, score-0.351]
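A back-of-the-envelope sketch of the memory contrast, with purely illustrative sizes: a linear predictor keeps only a weight per feature, while nearest neighbor or a kernel method that retains the training set keeps on the order of m*n values.

    m, n = 10_000_000, 1_000        # examples and features, illustrative only
    bytes_per_float = 8

    linear_model = n * bytes_per_float        # just the weight vector
    kernel_or_nn = m * n * bytes_per_float    # the retained training set dominates

    print(f"linear model: {linear_model / 1e3:.0f} KB")   # 8 KB
    print(f"kernel / NN : {kernel_or_nn / 1e9:.0f} GB")   # 80 GB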
10 One unfortunate aspect of big-O notation is that it doesn’t give a good intuitive sense of the scale of problems solvable by a machine. [sent-17, score-0.599]
11 For various reasons (memory size, number of web pages, FLOPS of a modern machine), a scale of 10^10 is currently appropriate. [sent-19, score-0.495]
12 Computing scales, you get: O(m): 10^10, O(m log(m)): 5*10^8, O(m^2): 10^5, O(m^3): 2*10^3, O(e^m): 25. There is good reason to stick with big-O notation over the long term, because the scale of problems we tackle keeps growing. [sent-20, score-0.747]
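The numbers in that sentence can be reproduced mechanically: for each complexity, find roughly the largest m whose cost fits a budget of 10^10 operations. A crude numerical sketch (the O(e^m) entry comes out near 23 with natural logs, in the same ballpark as the 25 quoted above):

    import math

    BUDGET = 1e10
    costs = {
        "O(m)":       lambda m: m,
        "O(m log m)": lambda m: m * math.log(m),
        "O(m^2)":     lambda m: m ** 2,
        "O(m^3)":     lambda m: m ** 3,
        "O(e^m)":     lambda m: math.exp(m),
    }

    for name, cost in costs.items():
        m = 1.0
        while cost(m * 1.001) <= BUDGET:   # crude geometric search for the largest feasible m
            m *= 1.001
        print(f"{name:11s} ~ {m:.2g}")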
13 Having a good understanding of the implied scale remains very handy for understanding the practicality of algorithms for problems. [sent-21, score-0.711]
14 The Turing’s Razor application would be “a learning algorithm isn’t interesting unless it runs in time linear in the number of bytes input”. [sent-23, score-0.874]
15 This isn’t crazy—for people with a primary interest in large scale learning (where you explicitly have large datasets) or AI (where any effective system must scale to very large amounts of experience), an O(mn log(mn)) or better dependence is the target. [sent-24, score-1.048]
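A sketch of what a near-linear-in-the-input learner looks like (a toy, not a description of any particular system): one pass of stochastic gradient descent for logistic regression does O(n) work per example, so O(mn) total for dense data, and O(total nonzeros) if the inner loop runs over sparse features only.

    import math

    def sgd_logistic(stream, n, lr=0.1):
        """One pass of logistic-regression SGD: O(n) work and memory per example."""
        w = [0.0] * n
        for x, y in stream:                          # y in {0, 1}
            margin = sum(w_i * x_i for w_i, x_i in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-margin))
            g = p - y                                # gradient of the log loss w.r.t. the margin
            for i, x_i in enumerate(x):
                w[i] -= lr * g * x_i
        return w

    # Toy stream: label is 1 when the first feature is positive.
    data = [([1.0, 0.2], 1), ([0.9, -0.1], 1), ([-1.1, 0.3], 0), ([-0.8, -0.2], 0)]
    print(sgd_logistic(data, n=2))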
16 For someone deeply steeped in computer science algorithms and complexity theory, the application is: “a learning algorithm isn’t interesting unless it has a polynomial dependence on the number of bytes input”. [sent-25, score-1.219]
17 It’s too crude because O(m^9) algorithms are interesting to basically no one. [sent-27, score-0.367]
18 The right degree of care about computation is what I’ll call “Turing’s club”. [sent-29, score-0.339]
19 Algorithms with controlled guarantees on computational requirements are strongly preferred. [sent-33, score-0.537]
20 Restated: there are often many algorithms capable of solving a particular problem reasonably well, so fast algorithms with controlled resource guarantees distinguish themselves by requiring less TLC to make them work well. [sent-35, score-0.679]
wordName wordTfidf (topN-words)
[('scale', 0.338), ('computation', 0.243), ('algorithms', 0.225), ('bytes', 0.186), ('mn', 0.186), ('memory', 0.175), ('algorithm', 0.166), ('number', 0.157), ('dependence', 0.157), ('isn', 0.146), ('controlled', 0.132), ('requirements', 0.127), ('notation', 0.123), ('interest', 0.123), ('turing', 0.12), ('unless', 0.111), ('unknown', 0.109), ('neighbor', 0.107), ('linear', 0.106), ('nearest', 0.101), ('guarantees', 0.097), ('programs', 0.096), ('care', 0.096), ('computational', 0.095), ('primary', 0.092), ('input', 0.086), ('strongly', 0.086), ('datasets', 0.085), ('practicality', 0.082), ('unacceptable', 0.082), ('convergent', 0.082), ('crazy', 0.082), ('flops', 0.082), ('size', 0.079), ('tackle', 0.076), ('log', 0.075), ('application', 0.075), ('amount', 0.073), ('interesting', 0.073), ('problems', 0.072), ('mismatched', 0.072), ('features', 0.07), ('requirement', 0.069), ('crude', 0.069), ('stick', 0.069), ('keeps', 0.069), ('polynomial', 0.069), ('category', 0.066), ('intuitive', 0.066), ('implied', 0.066)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000002 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
2 0.17733924 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
Introduction: Muthu invited me to the workshop on algorithms in the field , with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment. There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n^2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vecto
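For the HMM point in that excerpt, a minimal sketch of the forward recursion with made-up parameters: each time step mixes every state with every other state, so inference costs O(T n^2) in the number of states n, which is the superlinear dependence being described.

    def hmm_forward(init, trans, emit, observations):
        """Forward (alpha) recursion; the double loop over states is the n^2 per-step cost."""
        n = len(init)
        alpha = [init[s] * emit[s][observations[0]] for s in range(n)]
        for obs in observations[1:]:
            alpha = [emit[s][obs] * sum(alpha[r] * trans[r][s] for r in range(n))
                     for s in range(n)]
        return sum(alpha)   # likelihood of the observation sequence

    # Two hidden states, two observation symbols, made-up parameters.
    init  = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emit  = [[0.9, 0.1], [0.2, 0.8]]
    print(hmm_forward(init, trans, emit, [0, 1, 0]))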
3 0.17534176 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge
Introduction: The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured. From the instructions page , several issues come up: Large Definition My personal definition of dataset size is: small A dataset is small if a human could look at the dataset and plausibly find a good solution. medium A dataset is medium-sized if it fits in the RAM of a reasonably priced computer. large A large dataset does not fit in the RAM of a reasonably priced computer. By this definition, all of the datasets are medium sized. This might sound like a pissing match over dataset size, but I believe it is more than that. The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require ran
4 0.17215139 235 hunch net-2007-03-03-All Models of Learning have Flaws
Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c
5 0.17160492 332 hunch net-2008-12-23-Use of Learning Theory
Introduction: I’ve had serious conversations with several people who believe that the theory in machine learning is “only useful for getting papers published”. That’s a compelling statement, as I’ve seen many papers where the algorithm clearly came first, and the theoretical justification for it came second, purely as a perceived means to improve the chance of publication. Naturally, I disagree and believe that learning theory has much more substantial applications. Even in core learning algorithm design, I’ve found learning theory to be useful, although its application is more subtle than many realize. The most straightforward applications can fail, because (as expectation suggests) worst case bounds tend to be loose in practice (*). In my experience, considering learning theory when designing an algorithm has two important effects in practice: It can help make your algorithm behave right at a crude level of analysis, leaving finer details to tuning or common sense. The best example
6 0.158875 111 hunch net-2005-09-12-Fast Gradient Descent
7 0.15732557 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?
8 0.15458025 420 hunch net-2010-12-26-NIPS 2010
9 0.1514947 281 hunch net-2007-12-21-Vowpal Wabbit Code Release
10 0.14586706 136 hunch net-2005-12-07-Is the Google way the way for machine learning?
11 0.14263184 347 hunch net-2009-03-26-Machine Learning is too easy
12 0.14149582 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity
13 0.13913676 388 hunch net-2010-01-24-Specializations of the Master Problem
14 0.1375078 352 hunch net-2009-05-06-Machine Learning to AI
15 0.13181007 35 hunch net-2005-03-04-The Big O and Constants in Learning
16 0.12646165 346 hunch net-2009-03-18-Parallel ML primitives
17 0.12416722 269 hunch net-2007-10-24-Contextual Bandits
18 0.12395934 237 hunch net-2007-04-02-Contextual Scaling
19 0.12375943 454 hunch net-2012-01-30-ICML Posters and Scope
20 0.11942103 353 hunch net-2009-05-08-Computability in Artificial Intelligence
topicId topicWeight
[(0, 0.305), (1, 0.118), (2, -0.083), (3, 0.032), (4, 0.097), (5, -0.004), (6, -0.067), (7, 0.005), (8, 0.006), (9, 0.108), (10, -0.051), (11, -0.005), (12, 0.054), (13, -0.013), (14, -0.017), (15, 0.029), (16, 0.107), (17, -0.086), (18, -0.024), (19, 0.032), (20, -0.007), (21, 0.062), (22, -0.032), (23, 0.007), (24, -0.014), (25, -0.047), (26, -0.002), (27, -0.08), (28, -0.032), (29, -0.089), (30, -0.11), (31, -0.019), (32, 0.005), (33, 0.009), (34, -0.014), (35, -0.024), (36, -0.097), (37, -0.05), (38, 0.095), (39, 0.026), (40, 0.034), (41, -0.003), (42, 0.041), (43, -0.043), (44, 0.027), (45, 0.045), (46, 0.043), (47, -0.135), (48, -0.073), (49, 0.007)]
simIndex simValue blogId blogTitle
same-blog 1 0.97195184 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
2 0.78844953 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge
3 0.74006599 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
4 0.70213926 111 hunch net-2005-09-12-Fast Gradient Descent
Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be
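As a toy illustration of the step-size problem mentioned in that excerpt (plain gradient descent on a one-dimensional quadratic, not SMD itself): with curvature c, any step size above 2/c diverges, while smaller steps converge.

    def gd(step, iters=50):
        # Minimize f(x) = 0.5 * c * x^2 with curvature c = 10; each update is
        # x <- x - step * c * x, which diverges once |1 - step * c| > 1.
        c, x = 10.0, 1.0
        for _ in range(iters):
            x -= step * c * x
        return x

    print(gd(0.05))   # contraction factor 0.5 per step: converges toward 0
    print(gd(0.25))   # factor -1.5 per step: blows up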
5 0.69993192 393 hunch net-2010-04-14-MLcomp: a website for objectively comparing ML algorithms
Introduction: Much of the success and popularity of machine learning has been driven by its practical impact. Of course, the evaluation of empirical work is an integral part of the field. But are the existing mechanisms for evaluating algorithms and comparing results good enough? We ( Percy and Jake ) believe there are currently a number of shortcomings: Incomplete Disclosure: You read a paper that proposes Algorithm A which is shown to outperform SVMs on two datasets. Great. But what about on other datasets? How sensitive is this result? What about compute time – does the algorithm take two seconds on a laptop or two weeks on a 100-node cluster? Lack of Standardization: Algorithm A beats Algorithm B on one version of a dataset. Algorithm B beats Algorithm A on another version yet uses slightly different preprocessing. Though doing a head-on comparison would be ideal, it would be tedious since the programs probably use different dataset formats and have a large array of options
6 0.69014537 262 hunch net-2007-09-16-Optimizing Machine Learning Programs
7 0.67454511 332 hunch net-2008-12-23-Use of Learning Theory
8 0.66766268 42 hunch net-2005-03-17-Going all the Way, Sometimes
9 0.66723174 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning
10 0.66381419 148 hunch net-2006-01-13-Benchmarks for RL
11 0.65928805 253 hunch net-2007-07-06-Idempotent-capable Predictors
12 0.65804321 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms
13 0.65676522 346 hunch net-2009-03-18-Parallel ML primitives
14 0.65286541 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project
15 0.64888197 348 hunch net-2009-04-02-Asymmophobia
16 0.64486808 153 hunch net-2006-02-02-Introspectionism as a Disease
17 0.64363182 138 hunch net-2005-12-09-Some NIPS papers
18 0.64257503 35 hunch net-2005-03-04-The Big O and Constants in Learning
19 0.63922179 179 hunch net-2006-05-16-The value of the orthodox view of Boosting
20 0.63770592 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features
topicId topicWeight
[(3, 0.021), (9, 0.011), (10, 0.013), (11, 0.029), (16, 0.026), (27, 0.234), (38, 0.055), (45, 0.033), (51, 0.016), (53, 0.112), (55, 0.1), (77, 0.021), (78, 0.016), (79, 0.013), (94, 0.154), (95, 0.048)]
simIndex simValue blogId blogTitle
same-blog 1 0.98409659 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
2 0.95325232 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
3 0.95325017 95 hunch net-2005-07-14-What Learning Theory might do
Introduction: I wanted to expand on this post and some of the previous problems/research directions about where learning theory might make large strides. Why theory? The essential reason for theory is “intuition extension”. A very good applied learning person can master some particular application domain yielding the best computer algorithms for solving that problem. A very good theory can take the intuitions discovered by this and other applied learning people and extend them to new domains in a relatively automatic fashion. To do this, we take these basic intuitions and try to find a mathematical model that: Explains the basic intuitions. Makes new testable predictions about how to learn. Succeeds in so learning. This is “intuition extension”: taking what we have learned somewhere else and applying it in new domains. It is fundamentally useful to everyone because it increases the level of automation in solving problems. Where next for learning theory? I like the a
4 0.9498322 297 hunch net-2008-04-22-Taking the next step
Introduction: At the last ICML , Tom Dietterich asked me to look into systems for commenting on papers. I’ve been slow getting to this, but it’s relevant now. The essential observation is that we now have many tools for online collaboration, but they are not yet much used in academic research. If we can find the right way to use them, then perhaps great things might happen, with extra kudos to the first conference that manages to really create an online community. Various conferences have been poking at this. For example, UAI has setup a wiki , COLT has started using Joomla , with some dynamic content, and AAAI has been setting up a “ student blog “. Similarly, Dinoj Surendran setup a twiki for the Chicago Machine Learning Summer School , which was quite useful for coordinating events and other things. I believe the most important thing is a willingness to experiment. A good place to start seems to be enhancing existing conference websites. For example, the ICML 2007 papers pag
5 0.94882333 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
Introduction: Suppose we have a set of observations over time x_1, x_2, …, x_t and want to predict some future event y_{t+1}. An inevitable problem arises, because learning a predictor h(x_1, …, x_t) of y_{t+1} is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required. Existing approaches for deriving state have some limitations. Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a.priori specification of the observations. Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front. Dynamic Bayesian Networks ( graphical models through time) require substantial a.priori specification and often re
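A minimal sketch of the general idea in that excerpt (a fixed-size state folded forward from each observation, not the specific models the post discusses): the "state" here is just two exponential moving averages that stand in for the unbounded history x_1, …, x_t.

    def update_state(state, x):
        """Fold a new observation into a fixed-size summary of the past."""
        fast, slow = state
        return (0.5 * fast + 0.5 * x, 0.9 * slow + 0.1 * x)

    def predict_next(state):
        fast, slow = state
        return fast + (fast - slow)   # crude trend extrapolation for y_{t+1}

    state = (0.0, 0.0)
    for x in [1.0, 1.2, 1.4, 1.6]:    # the observation stream
        state = update_state(state, x)
    print(predict_next(state))        # memory used is constant, not O(t)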
6 0.94489175 237 hunch net-2007-04-02-Contextual Scaling
7 0.94174153 235 hunch net-2007-03-03-All Models of Learning have Flaws
8 0.93866092 423 hunch net-2011-02-02-User preferences for search engines
9 0.93788993 351 hunch net-2009-05-02-Wielding a New Abstraction
10 0.93727171 371 hunch net-2009-09-21-Netflix finishes (and starts)
11 0.93717986 132 hunch net-2005-11-26-The Design of an Optimal Research Environment
12 0.93677533 221 hunch net-2006-12-04-Structural Problems in NIPS Decision Making
13 0.93659508 19 hunch net-2005-02-14-Clever Methods of Overfitting
14 0.93496788 358 hunch net-2009-06-01-Multitask Poisoning
15 0.93325692 253 hunch net-2007-07-06-Idempotent-capable Predictors
16 0.93311042 98 hunch net-2005-07-27-Not goal metrics
17 0.93288821 347 hunch net-2009-03-26-Machine Learning is too easy
18 0.93189508 96 hunch net-2005-07-21-Six Months
19 0.93141973 370 hunch net-2009-09-18-Necessary and Sufficient Research
20 0.93064433 207 hunch net-2006-09-12-Incentive Compatible Reviewing