hunch_net hunch_net-2009 hunch_net-2009-346 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Previously, we discussed parallel machine learning a bit. As parallel ML is rather difficult, I’d like to describe my thinking at the moment, and ask for advice from the rest of the world. This is particularly relevant right now, as I’m attending a workshop tomorrow on parallel ML. Parallelizing slow algorithms seems uncompelling. Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. This leaves the question: Which one fast algorithm is the best to parallelize? What is a substantially different second? One compellingly fast simple algorithm is online gradient descent on a linear representation. This is the core of Leon’s sgd code and Vowpal Wabbit. Antoine Bordes showed a variant was competitive in the large scale learning challenge. It’s also a decades-old primitive which has been reused in many algorithms, and continues to be reused. It also applies to online learning rather than just online optimization, implying the algorithm can be used in a host of situations where batch algorithms are awkward or unviable.
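For concreteness, here is a minimal sketch of that primitive: online gradient descent on a linear representation, assuming squared loss, a fixed learning rate, and dense feature vectors. It is an illustration only, not the implementation in Leon’s sgd code or Vowpal Wabbit, which are heavily optimized C++ with sparse features, feature hashing, and adaptive learning rates.

```python
# Minimal sketch of online gradient descent for a linear predictor.
# Assumptions: squared loss, fixed learning rate, dense feature vectors.
import random

def online_sgd(stream, dim, learning_rate=0.1):
    """Learn a weight vector from an iterator of (features, label) pairs in one pass."""
    w = [0.0] * dim
    for x, y in stream:                       # x: list of dim floats, y: float
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y                        # d/dw of 0.5*(pred - y)^2 is err * x
        for i in range(dim):
            w[i] -= learning_rate * err * x[i]
    return w

# Tiny synthetic check: recover y = 2*x0 - 1*x1 from noisy examples.
if __name__ == "__main__":
    random.seed(0)
    def examples(n=10000):
        for _ in range(n):
            x = [random.uniform(-1, 1), random.uniform(-1, 1)]
            yield x, 2 * x[0] - 1 * x[1] + random.gauss(0, 0.01)
    print(online_sgd(examples(), dim=2))      # approaches [2.0, -1.0]
```

The whole learner is a single pass over the example stream, which is part of what makes it fast and easy to reuse inside other algorithms.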
sentIndex sentText sentNum sentScore
1 Previously, we discussed parallel machine learning a bit. [sent-1, score-0.308]
2 As parallel ML is rather difficult, I’d like to describe my thinking at the moment, and ask for advice from the rest of the world. [sent-2, score-0.464]
3 This is particularly relevant right now, as I’m attending a workshop tomorrow on parallel ML. [sent-3, score-0.487]
4 Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. [sent-5, score-0.438]
5 This leaves the question: Which one fast algorithm is the best to parallelize? [sent-6, score-0.388]
6 One compellingly fast simple algorithm is online gradient descent on a linear representation. [sent-8, score-0.542]
7 This is the core of Leon’s sgd code and Vowpal Wabbit. [sent-9, score-0.099]
8 Antoine Bordes showed a variant was competitive in the large scale learning challenge. [sent-10, score-0.349]
9 It’s also a decades-old primitive which has been reused in many algorithms, and continues to be reused. [sent-11, score-0.513]
10 It also applies to online learning rather than just online optimization, implying the algorithm can be used in a host of situations where batch algorithms are awkward or unviable. [sent-12, score-0.864]
11 If we start with a fast learning algorithm as a baseline, there seem to be two directions we can go with parallel ML: (easier) Try to do more in the same amount of time. [sent-13, score-0.791]
12 For example, Paul and Neil suggest using parallelism to support ensemble methods. [sent-14, score-0.293]
13 (harder) Try to use parallelism to reduce the amount of time required to effectively learn on large amounts of data. [sent-15, score-0.703]
14 For this approach, bandwidth and latency become uncomfortably critical implementation details. [sent-16, score-0.412]
15 Due to these issues, it appears important to at least loosen the goal to competing with learning on large amounts of data. [sent-17, score-0.361]
16 Alternatively, we could consider this as evidence some other learning primitive is desirable, although I’m not sure which one. [sent-18, score-0.205]
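Sentences 11 and 12 above describe the easier direction: use parallelism to do more, such as training an ensemble, in the same wall-clock time. Below is a hedged sketch of that idea, assuming Python’s multiprocessing, a placeholder per-shard trainer, and simple prediction averaging; it is not a description of any particular system.

```python
# Sketch of the "easier" direction: use parallelism to train an ensemble in
# the same wall-clock time as a single model. Each worker fits an independent
# linear model on its own shard; predictions are averaged. Shard loading and
# the base learner are placeholders (synthetic data stands in for a real shard).
from multiprocessing import Pool
import random

def train_on_shard(shard_id):
    """Placeholder per-shard trainer: one online SGD pass over synthetic data."""
    rng = random.Random(shard_id)
    w = [0.0, 0.0]
    for _ in range(10000):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        err = sum(wi * xi for wi, xi in zip(w, x)) - (2 * x[0] - 1 * x[1])
        w = [wi - 0.1 * err * xi for wi, xi in zip(w, x)]
    return w

def ensemble_predict(models, x):
    """Average the linear predictions of all models."""
    return sum(sum(wi * xi for wi, xi in zip(w, x)) for w in models) / len(models)

if __name__ == "__main__":
    with Pool(4) as pool:                          # 4 shards, 4 workers
        models = pool.map(train_on_shard, range(4))
    print(ensemble_predict(models, [0.5, -0.5]))   # roughly 1.5
```

The harder direction, actually reducing the time needed to reach a given accuracy on a large dataset, is where the bandwidth and latency concerns of sentence 14 bite.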
wordName wordTfidf (topN-words)
[('parallel', 0.308), ('parallelizing', 0.256), ('parallelize', 0.205), ('primitive', 0.205), ('parallelism', 0.205), ('fast', 0.177), ('amounts', 0.177), ('host', 0.128), ('neil', 0.128), ('compellingly', 0.128), ('antoine', 0.128), ('bordes', 0.128), ('uncompelling', 0.128), ('alternatively', 0.128), ('uncomfortably', 0.128), ('online', 0.122), ('required', 0.117), ('algorithms', 0.116), ('ml', 0.116), ('algorithm', 0.115), ('amount', 0.113), ('decades', 0.107), ('try', 0.105), ('bandwidth', 0.102), ('reused', 0.102), ('awkward', 0.099), ('continues', 0.099), ('latency', 0.099), ('sgd', 0.099), ('leaves', 0.096), ('tomorrow', 0.096), ('paul', 0.093), ('competing', 0.093), ('baseline', 0.093), ('large', 0.091), ('leon', 0.088), ('ensemble', 0.088), ('variant', 0.086), ('competitive', 0.086), ('showed', 0.086), ('desirable', 0.083), ('implementation', 0.083), ('slow', 0.083), ('attending', 0.083), ('applies', 0.081), ('batch', 0.081), ('moment', 0.08), ('describe', 0.078), ('directions', 0.078), ('advice', 0.078)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999988 346 hunch net-2009-03-18-Parallel ML primitives
Introduction: Previously, we discussed parallel machine learning a bit. As parallel ML is rather difficult, I’d like to describe my thinking at the moment, and ask for advice from the rest of the world. This is particularly relevant right now, as I’m attending a workshop tomorrow on parallel ML. Parallelizing slow algorithms seems uncompelling. Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. This leaves the question: Which one fast algorithm is the best to parallelize? What is a substantially different second? One compellingly fast simple algorithm is online gradient descent on a linear representation. This is the core of Leon’s sgd code and Vowpal Wabbit. Antoine Bordes showed a variant was competitive in the large scale learning challenge. It’s also a decades-old primitive which has been reused in many algorithms, and continues to be reused. It also applies to online learning rather than just online optimization, implying the algorithm can be used in a host of situations where batch algorithms are awkward or unviable.
2 0.18902324 229 hunch net-2007-01-26-Parallel Machine Learning Problems
Introduction: Parallel machine learning is a subject rarely addressed at machine learning conferences. Nevertheless, it seems likely to increase in importance because: Data set sizes appear to be growing substantially faster than computation. Essentially, this happens because more and more sensors of various sorts are being hooked up to the internet. Serial speedups of processors seem relatively stalled. The new trend is to make processors more powerful by making them multicore. Both AMD and Intel are making dual core designs standard, with plans for more parallelism in the future. IBM’s Cell processor has (essentially) 9 cores. Modern graphics chips can have an order of magnitude more separate execution units. The meaning of ‘core’ varies a bit from processor to processor, but the overall trend seems quite clear. So, how do we parallelize machine learning algorithms? The simplest and most common technique is to simply run the same learning algorithm with di
3 0.18634909 404 hunch net-2010-08-20-The Workshop on Cores, Clusters, and Clouds
Introduction: Alekh, John, Ofer, and I are organizing a workshop at NIPS this year on learning in parallel and distributed environments. The general interest level in parallel learning seems to be growing rapidly, so I expect quite a bit of attendance. Please join us if you are parallel-interested. And, if you are working in the area of parallel learning, please consider submitting an abstract due Oct. 17 for presentation at the workshop.
4 0.18618935 451 hunch net-2011-12-13-Vowpal Wabbit version 6.1 & the NIPS tutorial
Introduction: I just made version 6.1 of Vowpal Wabbit. Relative to 6.0, there are few new features, but many refinements. The cluster parallel learning code better supports multiple simultaneous runs, and other forms of parallelism have been mostly removed. This incidentally significantly simplifies the learning core. The online learning algorithms are more general, with support for l1 (via a truncated gradient variant) and l2 regularization, and a generalized form of variable metric learning. There is a solid persistent server mode which can train online, as well as serve answers to many simultaneous queries, either in text or binary. This should be a very good release if you are just getting started, as we’ve made it compile more automatically out of the box, have several new examples and updated documentation. As per tradition, we’re planning to do a tutorial at NIPS during the break at the parallel learning workshop at 2pm Spanish time Friday. I’ll cover the
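The truncated gradient mentioned above is, to my understanding, the approach of Langford, Li, and Zhang for inducing l1-style sparsity in online learning. A minimal one-step sketch, simplified to truncate after every update and with no truncation threshold; it is not the actual Vowpal Wabbit implementation, and the parameter names are illustrative only.

```python
# Illustrative truncated-gradient step for l1-style sparsity in online learning.
# Simplified: truncation is applied on every step with no threshold; a real
# implementation may truncate only every K steps.
def truncated_gradient_step(w, x, y, eta=0.1, l1_amount=0.01):
    """One squared-loss SGD step followed by shrinking each weight toward zero."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    w = [wi - eta * err * xi for wi, xi in zip(w, x)]    # plain gradient step
    shrink = eta * l1_amount                             # amount pulled toward zero
    return [max(wi - shrink, 0.0) if wi > 0 else min(wi + shrink, 0.0) for wi in w]
```

Weights that the data does not keep pushing away from zero get clipped to exactly zero, which is what produces sparse solutions.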
5 0.16727184 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge
Introduction: The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured. From the instructions page, several issues come up: Large Definition: My personal definition of dataset size is: small: A dataset is small if a human could look at the dataset and plausibly find a good solution. medium: A dataset is medium-sized if it fits in the RAM of a reasonably priced computer. large: A large dataset does not fit in the RAM of a reasonably priced computer. By this definition, all of the datasets are medium-sized. This might sound like a pissing match over dataset size, but I believe it is more than that. The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require ran
6 0.16284244 281 hunch net-2007-12-21-Vowpal Wabbit Code Release
7 0.1451101 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
8 0.13953672 267 hunch net-2007-10-17-Online as the new adjective
9 0.13262947 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning
10 0.13233405 441 hunch net-2011-08-15-Vowpal Wabbit 6.0
11 0.12646165 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
12 0.12353014 54 hunch net-2005-04-08-Fast SVMs
13 0.12043139 442 hunch net-2011-08-20-The Large Scale Learning Survey Tutorial
14 0.11712749 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project
15 0.11270771 366 hunch net-2009-08-03-Carbon in Computer Science Research
16 0.11076246 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy
17 0.10768677 381 hunch net-2009-12-07-Vowpal Wabbit version 4.0, and a NIPS heresy
18 0.10733838 111 hunch net-2005-09-12-Fast Gradient Descent
19 0.10398486 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
20 0.10109231 378 hunch net-2009-11-15-The Other Online Learning
topicId topicWeight
[(0, 0.216), (1, 0.065), (2, -0.124), (3, -0.024), (4, 0.111), (5, 0.077), (6, -0.174), (7, -0.062), (8, -0.12), (9, 0.158), (10, -0.062), (11, -0.057), (12, 0.142), (13, -0.028), (14, 0.048), (15, -0.049), (16, 0.055), (17, -0.018), (18, -0.007), (19, -0.098), (20, -0.017), (21, 0.039), (22, -0.054), (23, 0.072), (24, 0.022), (25, -0.048), (26, 0.018), (27, -0.057), (28, 0.008), (29, 0.014), (30, -0.029), (31, 0.014), (32, -0.06), (33, 0.043), (34, 0.034), (35, -0.037), (36, -0.007), (37, 0.042), (38, -0.018), (39, -0.066), (40, 0.046), (41, -0.097), (42, 0.035), (43, 0.12), (44, -0.044), (45, 0.029), (46, 0.095), (47, 0.04), (48, 0.022), (49, -0.012)]
simIndex simValue blogId blogTitle
same-blog 1 0.95196623 346 hunch net-2009-03-18-Parallel ML primitives
Introduction: Previously, we discussed parallel machine learning a bit. As parallel ML is rather difficult, I’d like to describe my thinking at the moment, and ask for advice from the rest of the world. This is particularly relevant right now, as I’m attending a workshop tomorrow on parallel ML. Parallelizing slow algorithms seems uncompelling. Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. This leaves the question: Which one fast algorithm is the best to parallelize? What is a substantially different second? One compellingly fast simple algorithm is online gradient descent on a linear representation. This is the core of Leon’s sgd code and Vowpal Wabbit. Antoine Bordes showed a variant was competitive in the large scale learning challenge. It’s also a decades-old primitive which has been reused in many algorithms, and continues to be reused. It also applies to online learning rather than just online optimization, implying the algorithm can be used in a host of situations where batch algorithms are awkward or unviable.
2 0.817406 451 hunch net-2011-12-13-Vowpal Wabbit version 6.1 & the NIPS tutorial
Introduction: I just made version 6.1 of Vowpal Wabbit. Relative to 6.0, there are few new features, but many refinements. The cluster parallel learning code better supports multiple simultaneous runs, and other forms of parallelism have been mostly removed. This incidentally significantly simplifies the learning core. The online learning algorithms are more general, with support for l1 (via a truncated gradient variant) and l2 regularization, and a generalized form of variable metric learning. There is a solid persistent server mode which can train online, as well as serve answers to many simultaneous queries, either in text or binary. This should be a very good release if you are just getting started, as we’ve made it compile more automatically out of the box, have several new examples and updated documentation. As per tradition, we’re planning to do a tutorial at NIPS during the break at the parallel learning workshop at 2pm Spanish time Friday. I’ll cover the
3 0.76607323 229 hunch net-2007-01-26-Parallel Machine Learning Problems
Introduction: Parallel machine learning is a subject rarely addressed at machine learning conferences. Nevertheless, it seems likely to increase in importance because: Data set sizes appear to be growing substantially faster than computation. Essentially, this happens because more and more sensors of various sorts are being hooked up to the internet. Serial speedups of processors seem relatively stalled. The new trend is to make processors more powerful by making them multicore. Both AMD and Intel are making dual core designs standard, with plans for more parallelism in the future. IBM’s Cell processor has (essentially) 9 cores. Modern graphics chips can have an order of magnitude more separate execution units. The meaning of ‘core’ varies a bit from processor to processor, but the overall trend seems quite clear. So, how do we parallelize machine learning algorithms? The simplest and most common technique is to simply run the same learning algorithm with di
4 0.72487837 281 hunch net-2007-12-21-Vowpal Wabbit Code Release
Introduction: We are releasing the Vowpal Wabbit (Fast Online Learning) code as open source under a BSD (revised) license. This is a project at Yahoo! Research to build a useful large scale learning algorithm which Lihong Li, Alex Strehl, and I have been working on. To appreciate the meaning of “large”, it’s useful to define “small” and “medium”. A “small” supervised learning problem is one where a human could use a labeled dataset and come up with a reasonable predictor. A “medium” supervised learning problem is one whose dataset fits into the RAM of a modern desktop computer. A “large” supervised learning problem is one which does not fit into the RAM of a normal machine. VW tackles large scale learning problems by this definition of large. I’m not aware of any other open source Machine Learning tools which can handle this scale (although they may exist). A few close ones are: IBM’s Parallel Machine Learning Toolbox isn’t quite open source. The approach used by this toolbox is essenti
5 0.72376478 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning
Introduction: Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy. Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output an optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation. It is necessary but not sufficient to have an effective communication infrastructure. It is ne
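The argument above leans on an allreduce operation: every node contributes a local vector (for example a local gradient or weight vector) and every node ends up with the coordinate-wise sum. Here is a minimal sketch of just that contract, using in-process lists rather than the nodes of a Hadoop cluster; real implementations are about achieving the same result with low latency and bandwidth over a spanning tree or ring.

```python
# Sketch of allreduce semantics: after the call, every node holds the
# coordinate-wise sum of all nodes' local vectors. This shows the contract
# only; a real allreduce runs over the network (e.g. a spanning tree of a
# cluster) rather than over an in-memory list of lists.
def allreduce(local_vectors):
    """Return, for every node, the element-wise sum across all nodes."""
    total = [sum(coords) for coords in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]

# Example: 3 nodes average their local gradients via allreduce.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
summed = allreduce(grads)
averaged = [[g / len(grads) for g in node] for node in summed]
print(averaged[0])   # [3.0, 4.0] on every node
```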
6 0.66550797 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project
7 0.66392738 441 hunch net-2011-08-15-Vowpal Wabbit 6.0
8 0.66143626 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy
9 0.66017658 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge
10 0.6583373 442 hunch net-2011-08-20-The Large Scale Learning Survey Tutorial
11 0.62455791 136 hunch net-2005-12-07-Is the Google way the way for machine learning?
12 0.59353715 262 hunch net-2007-09-16-Optimizing Machine Learning Programs
13 0.58694685 366 hunch net-2009-08-03-Carbon in Computer Science Research
14 0.57753766 404 hunch net-2010-08-20-The Workshop on Cores, Clusters, and Clouds
15 0.57518947 128 hunch net-2005-11-05-The design of a computing cluster
16 0.55973786 381 hunch net-2009-12-07-Vowpal Wabbit version 4.0, and a NIPS heresy
17 0.55603862 436 hunch net-2011-06-22-Ultra LDA
18 0.54516554 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
19 0.54041368 53 hunch net-2005-04-06-Structured Regret Minimization
20 0.53874475 37 hunch net-2005-03-08-Fast Physics for Learning
topicId topicWeight
[(27, 0.173), (55, 0.062), (94, 0.638), (95, 0.02)]
simIndex simValue blogId blogTitle
1 0.9890039 42 hunch net-2005-03-17-Going all the Way, Sometimes
Introduction: At many points in research, you face a choice: should I keep on improving some old piece of technology or should I do something new? For example: Should I refine bounds to make them tighter? Should I take some learning theory and turn it into a learning algorithm? Should I implement the learning algorithm? Should I test the learning algorithm widely? Should I release the algorithm as source code? Should I go see what problems people actually need to solve? The universal temptation of people attracted to research is doing something new. That is sometimes the right decision, but is also often not. I’d like to discuss some reasons why not. Expertise: Once expertise are developed on some subject, you are the right person to refine them. What is the real problem? Continually improving a piece of technology is a mechanism forcing you to confront this question. In many cases, this confrontation is uncomfortable because you discover that your method has fundamen
2 0.96539211 81 hunch net-2005-06-13-Wikis for Summer Schools and Workshops
Introduction: Chicago ’05 ended a couple of weeks ago. This was the sixth Machine Learning Summer School, and the second one that used a wiki. (The first was Berder ’04, thanks to Gunnar Raetsch.) Wikis are relatively easy to set up, greatly aid social interaction, and should be used a lot more at summer schools and workshops. They can even be used as the meeting’s webpage, as a permanent record of its participants’ collaborations — see for example the wiki/website for last year’s NVO Summer School. A basic wiki is a collection of editable webpages, maintained by software called a wiki engine. The engine used at both Berder and Chicago was TikiWiki — it is well documented and gets you something running fast. It uses PHP and MySQL, but doesn’t require you to know either. Tikiwiki has far more features than most wikis, as it is really a full Content Management System. (My thanks to Sebastian Stark for pointing this out.) Here are the features we found most useful: Bulletin boa
3 0.96501559 35 hunch net-2005-03-04-The Big O and Constants in Learning
Introduction: The notation g(n) = O(f(n)) means that in the limit as n approaches infinity there exists a constant C such that g(n) is less than Cf(n). In learning theory, there are many statements about learning algorithms of the form “under assumptions x, y, and z, the classifier learned has an error rate of at most O(f(m))“. There is one very good reason to use O(): it helps you understand the big picture and neglect the minor details which are not important in the big picture. However, there are some important reasons not to do this as well. Unspeedup: In algorithm analysis, the use of O() for time complexity is pervasive and well-justified. Determining the exact value of C is inherently computer architecture dependent. (The “C” for x86 processors might differ from the “C” on PowerPC processors.) Since many learning theorists come from a CS theory background, the O() notation is applied to generalization error. The O() abstraction breaks here—you can not genera
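For reference, a precise form of the definition paraphrased above:

```latex
% g(n) = O(f(n)) iff g is eventually bounded by a constant multiple of f.
\[
  g(n) = O(f(n)) \iff \exists\, C > 0,\ \exists\, n_0 \ \text{such that} \ \forall n \ge n_0:\; g(n) \le C\, f(n).
\]
```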
same-blog 4 0.96448153 346 hunch net-2009-03-18-Parallel ML primitives
Introduction: Previously, we discussed parallel machine learning a bit. As parallel ML is rather difficult, I’d like to describe my thinking at the moment, and ask for advice from the rest of the world. This is particularly relevant right now, as I’m attending a workshop tomorrow on parallel ML. Parallelizing slow algorithms seems uncompelling. Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. This leaves the question: Which one fast algorithm is the best to parallelize? What is a substantially different second? One compellingly fast simple algorithm is online gradient descent on a linear representation. This is the core of Leon’s sgd code and Vowpal Wabbit. Antoine Bordes showed a variant was competitive in the large scale learning challenge. It’s also a decades-old primitive which has been reused in many algorithms, and continues to be reused. It also applies to online learning rather than just online optimization, implying the algorithm can be used in a host of situations where batch algorithms are awkward or unviable.
5 0.9601509 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science
Introduction: “Science” has many meanings, but one common meaning is “the scientific method”, which is a principled method for investigating the world using the following steps: Form a hypothesis about the world. Use the hypothesis to make predictions. Run experiments to confirm or disprove the predictions. The ordering of these steps is very important to the scientific method. In particular, predictions must be made before experiments are run. Given that we all believe in the scientific method of investigation, it may be surprising to learn that cheating is very common. This happens for many reasons, some innocent and some not. Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine their effect. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here: including “reprobleming”, “data set selection”, and probably “overfitting by review”
6 0.91012889 120 hunch net-2005-10-10-Predictive Search is Coming
7 0.8695668 276 hunch net-2007-12-10-Learning Track of International Planning Competition
8 0.82083619 221 hunch net-2006-12-04-Structural Problems in NIPS Decision Making
9 0.7531088 136 hunch net-2005-12-07-Is the Google way the way for machine learning?
10 0.74651253 229 hunch net-2007-01-26-Parallel Machine Learning Problems
11 0.72430313 441 hunch net-2011-08-15-Vowpal Wabbit 6.0
12 0.68244827 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
13 0.67998761 73 hunch net-2005-05-17-A Short Guide to PhD Graduate Study
14 0.6678409 471 hunch net-2012-08-24-Patterns for research in machine learning
15 0.66399181 13 hunch net-2005-02-04-JMLG
16 0.65708959 178 hunch net-2006-05-08-Big machine learning
17 0.65563476 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning
18 0.65508831 253 hunch net-2007-07-06-Idempotent-capable Predictors
19 0.65359437 200 hunch net-2006-08-03-AOL’s data drop
20 0.64449918 193 hunch net-2006-07-09-The Stock Prediction Machine Learning Problem