hunch_net hunch_net-2011 hunch_net-2011-450 knowledge-graph by maker-knowledge-mining

450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning


meta info for this blog

Source: html

Introduction: Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case: there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy. Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output an optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation. It is necessary but not sufficient to have an effective communication infrastructure. It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. …


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The proof of this is simple: We can output an optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation. [sent-5, score-0.279]
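As a rough check of that claim, the arithmetic below streams the post’s 2 terafeatures through a single network interface. The per-feature wire cost and the NIC speed are assumptions for illustration, not figures from the post.

    # Back-of-envelope version of the streaming argument.
    # Assumed (hypothetical) constants: ~10 bytes per nonzero feature on
    # the wire, and a 10 Gb/s network interface on the single machine.
    nonzeros = 2e12           # 2 terafeatures, from the post
    bytes_per_feature = 10    # assumption
    nic_bits_per_sec = 10e9   # assumption: 10 gigabit NIC

    seconds = nonzeros * bytes_per_feature * 8 / nic_bits_per_sec
    print(f"~{seconds / 3600:.1f} hours just to read the data")  # ~4.4 hours

Under these assumptions, any single-machine learner needs over four hours before it has even seen all the data, so a cluster that finishes sooner wins by construction.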

2 It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. [sent-7, score-0.387]

3 For communication infrastructures, the two most prevalent approaches are MPI and MapReduce, both of which have substantial drawbacks for machine learning with lots of data. [sent-10, score-0.35]

4 When the cluster is shared, preshuffling the data is awkward to impossible and you must expect that some nodes will run slower than others because they will be executing other jobs. [sent-13, score-0.361]

5 The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers. [sent-25, score-0.434]

6 You just sprinkle allreduce in a few locations in your single machine code. [sent-29, score-0.477]
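A minimal sketch of the two sentences above, using a simulated in-process allreduce rather than VW’s actual network implementation; the node count, learning rate, and toy data are made up. The only change to single-machine SGD is one allreduce call where gradients get combined.

    import numpy as np

    def allreduce(values):
        """Simulated allreduce: every node ends with the sum of all inputs."""
        total = np.sum(values, axis=0)
        return [total.copy() for _ in values]

    n_nodes, dim, lr = 4, 5, 0.1
    rng = np.random.default_rng(0)
    w = np.zeros(dim)  # every node starts with the same weights
    shards = [(rng.normal(size=dim), rng.normal()) for _ in range(n_nodes)]

    for _ in range(10):
        # Each node computes a gradient from its local shard only...
        grads = [(w @ x - y) * x for x, y in shards]
        # ...and one allreduce is the entire communication step.
        summed = allreduce(grads)
        w -= lr * summed[0] / n_nodes  # identical update on every node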

7 And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. [sent-34, score-0.284]

8 MPI: Fast optimization approaches remain within the conceptual scope. [sent-38, score-0.431]

9 Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. [sent-39, score-0.426]

10 But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures giving you something with >99% success rate on a running time reliable to within a factor of 2. [sent-44, score-0.403]

11 To test this hypothesis, I visited Clement for a day, where we connected things to make Allreduce work in Lua twice—once with an online approach and once with an LBFGS optimization approach for convolutional neural networks. [sent-47, score-0.35]

12 As a parallel programming paradigm, it’s amazingly easier than many other approaches, because you take your existing code and figure out which pieces of state to synchronize. [sent-48, score-0.354]

13 It’s superior enough that I’ve now eliminated the multithreaded and parallel online learning approaches within Vowpal Wabbit. [sent-49, score-0.474]

14 This approach is also great in terms of the amount of incremental learning required—you just need to learn one function to be able to create useful parallel machine learning algorithms. [sent-50, score-0.352]

15 Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine. [sent-52, score-0.445]

16 You also need to get optimization approaches right. [sent-53, score-0.4]

17 The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. [sent-66, score-0.283]

18 Optimization problems are an easy example, but I suspect there are a number of iterative computation problems where allreduce can be very effective. [sent-76, score-0.385]
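One such iterative problem, as a sketch under the same simulated-transport assumption (this example is illustrative, not from the post): Lloyd’s k-means, where each iteration needs only global per-cluster sums and counts, both of which are exactly allreduce-shaped.

    import numpy as np

    def kmeans_step(local_points, centers, allreduce_sum):
        """One Lloyd iteration from a single node's point of view."""
        k, dim = centers.shape
        sums, counts = np.zeros((k, dim)), np.zeros(k)
        for p in local_points:  # local pass over this node's shard
            c = int(np.argmin(((centers - p) ** 2).sum(axis=1)))
            sums[c] += p
            counts[c] += 1
        # Two allreduces give every node the global statistics, so all
        # nodes compute identical new centers with no further messages.
        sums, counts = allreduce_sum(sums), allreduce_sum(counts)
        return sums / np.maximum(counts, 1.0)[:, None]

    pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
    centers = np.array([[1.0, 1.0], [4.0, 4.0]])
    print(kmeans_step(pts, centers, lambda a: a))  # single-node stand-in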

19 While it might appear a limited operation, you can easily do average, weighted average, max, etc… And, the scope of allreduce is also easily broadened with an arbitrary reduce function, as per MPI’s version. [sent-77, score-0.454]
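A sketch of that broadening, with the reduction passed in as a function (again using a simulated transport; the operation must be associative for a real tree-structured allreduce):

    import numpy as np

    def allreduce_op(values, op):
        """Allreduce with a caller-supplied associative reduction."""
        out = values[0]
        for v in values[1:]:
            out = op(out, v)
        return [np.copy(out) for _ in values]

    nodes = [np.array([1.0, 5.0]), np.array([3.0, 2.0]), np.array([2.0, 9.0])]
    sums = allreduce_op(nodes, np.add)       # average = sum / len(nodes)
    maxes = allreduce_op(nodes, np.maximum)  # elementwise max
    # Weighted average: allreduce the pairs (w_i * x_i, w_i), then divide.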

20 The Allreduce code itself is not yet native in Hadoop, so you’ll need to grab it from the VW source code which has a BSD license. [sent-78, score-0.335]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('mapreduce', 0.438), ('allreduce', 0.385), ('mpi', 0.28), ('approaches', 0.178), ('nodes', 0.168), ('lbfgs', 0.162), ('hadoop', 0.131), ('optimization', 0.118), ('programming', 0.115), ('cluster', 0.113), ('node', 0.112), ('parallel', 0.105), ('need', 0.104), ('drawbacks', 0.099), ('single', 0.092), ('code', 0.085), ('suffers', 0.084), ('ram', 0.081), ('data', 0.08), ('function', 0.08), ('local', 0.08), ('temporary', 0.079), ('communication', 0.073), ('within', 0.07), ('scope', 0.069), ('drawback', 0.067), ('significantly', 0.067), ('practice', 0.066), ('eliminated', 0.065), ('conceptual', 0.065), ('approach', 0.063), ('inefficient', 0.061), ('disk', 0.061), ('native', 0.061), ('scaling', 0.061), ('conceptually', 0.061), ('parameters', 0.059), ('faster', 0.057), ('execute', 0.056), ('algorithms', 0.056), ('online', 0.056), ('failures', 0.054), ('sgd', 0.054), ('flaw', 0.052), ('sufficient', 0.052), ('discussed', 0.051), ('predictor', 0.05), ('visited', 0.05), ('state', 0.049), ('query', 0.048)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999964 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning


2 0.22138451 136 hunch net-2005-12-07-Is the Google way the way for machine learning?

Introduction: Urs Hoelzle from Google gave an invited presentation at NIPS. In the presentation, he strongly advocates interacting with data in a particular scalable manner which is something like the following: Make a cluster of machines. Build a unified filesystem. (Google uses GFS, but NFS or other approaches work reasonably well for smaller clusters.) Interact with data via MapReduce. Creating a cluster of machines is, by this point, relatively straightforward. Unified filesystems are a little bit tricky—GFS is capable by design of essentially unlimited speed throughput to disk. NFS can bottleneck because all of the data has to move through one machine. Nevertheless, this may not be a limiting factor for smaller clusters. MapReduce is a programming paradigm. Essentially, it is a combination of a data element transform (map) and an aggregator/selector (reduce). These operations are highly parallelizable and the claim is that they support the forms of data interaction…
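To make the map/reduce combination concrete, here is the classic word-count example as a single-process Python sketch (real MapReduce distributes both phases and the shuffle across the cluster):

    from collections import defaultdict
    from itertools import chain

    docs = ["the cat sat", "the cat ran"]

    # map: transform each data element into (key, value) pairs
    mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

    # shuffle + reduce: aggregate all values that share a key
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n
    print(dict(counts))  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}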

3 0.18062027 441 hunch net-2011-08-15-Vowpal Wabbit 6.0

Introduction: I just released Vowpal Wabbit 6.0. Since the last version: VW is now 2-3 orders of magnitude faster at linear learning, primarily thanks to Alekh. Given the baseline, this is loads of fun, allowing us to easily deal with terafeature datasets, and dwarfing the scale of any other open source projects. The core improvement here comes from effective parallelization over kilonode clusters (either Hadoop or not). This code is highly scalable, so it even helps with clusters of size 2 (and doesn’t hurt for clusters of size 1). The core allreduce technique appears widely and easily reused—we’ve already used it to parallelize Conjugate Gradient, LBFGS, and two variants of online learning. We’ll be documenting how to do this more thoroughly, but for now “README_cluster” and associated scripts should provide a good starting point. The new LBFGS code from Miro seems to commonly dominate the existing conjugate gradient code in time/quality tradeoffs. The new matrix factorization…

4 0.1595488 229 hunch net-2007-01-26-Parallel Machine Learning Problems

Introduction: Parallel machine learning is a subject rarely addressed at machine learning conferences. Nevertheless, it seems likely to increase in importance because: Data set sizes appear to be growing substantially faster than computation. Essentially, this happens because more and more sensors of various sorts are being hooked up to the internet. Serial speedups of processors seem relatively stalled. The new trend is to make processors more powerful by making them multicore. Both AMD and Intel are making dual core designs standard, with plans for more parallelism in the future. IBM’s Cell processor has (essentially) 9 cores. Modern graphics chips can have an order of magnitude more separate execution units. The meaning of ‘core’ varies a bit from processor to processor, but the overall trend seems quite clear. So, how do we parallelize machine learning algorithms? The simplest and most common technique is to simply run the same learning algorithm with di…

5 0.15254392 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

Introduction: We are releasing the Vowpal Wabbit (Fast Online Learning) code as open source under a BSD (revised) license. This is a project at Yahoo! Research to build a useful large scale learning algorithm which Lihong Li, Alex Strehl, and I have been working on. To appreciate the meaning of “large”, it’s useful to define “small” and “medium”. A “small” supervised learning problem is one where a human could use a labeled dataset and come up with a reasonable predictor. A “medium” supervised learning problem dataset fits into the RAM of a modern desktop computer. A “large” supervised learning problem is one which does not fit into the RAM of a normal machine. VW tackles large scale learning problems by this definition of large. I’m not aware of any other open source Machine Learning tools which can handle this scale (although they may exist). A few close ones are: IBM’s Parallel Machine Learning Toolbox isn’t quite open source. The approach used by this toolbox is essentially…

6 0.14867109 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

7 0.13755792 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

8 0.13262947 346 hunch net-2009-03-18-Parallel ML primitives

9 0.12616239 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge

10 0.12480675 451 hunch net-2011-12-13-Vowpal Wabbit version 6.1 & the NIPS tutorial

11 0.1202106 347 hunch net-2009-03-26-Machine Learning is too easy

12 0.11823484 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

13 0.11637802 235 hunch net-2007-03-03-All Models of Learning have Flaws

14 0.11562029 215 hunch net-2006-10-22-Exemplar programming

15 0.11442544 420 hunch net-2010-12-26-NIPS 2010

16 0.11377785 128 hunch net-2005-11-05-The design of a computing cluster

17 0.11314935 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

18 0.10935119 237 hunch net-2007-04-02-Contextual Scaling

19 0.10219556 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

20 0.10045073 366 hunch net-2009-08-03-Carbon in Computer Science Research


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.261), (1, 0.072), (2, -0.094), (3, 0.014), (4, 0.107), (5, 0.028), (6, -0.133), (7, -0.031), (8, -0.073), (9, 0.117), (10, -0.141), (11, -0.059), (12, 0.065), (13, -0.004), (14, -0.003), (15, -0.112), (16, 0.01), (17, 0.006), (18, -0.033), (19, -0.073), (20, 0.027), (21, 0.02), (22, -0.012), (23, 0.059), (24, -0.038), (25, 0.056), (26, 0.014), (27, -0.001), (28, 0.046), (29, -0.052), (30, -0.0), (31, -0.008), (32, 0.042), (33, 0.06), (34, -0.013), (35, 0.0), (36, 0.044), (37, -0.011), (38, -0.029), (39, -0.066), (40, 0.036), (41, -0.027), (42, 0.063), (43, 0.057), (44, 0.0), (45, 0.025), (46, -0.043), (47, 0.016), (48, -0.073), (49, -0.024)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.93854833 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning


2 0.80171442 229 hunch net-2007-01-26-Parallel Machine Learning Problems


3 0.800026 136 hunch net-2005-12-07-Is the Google way the way for machine learning?


4 0.77580941 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge

Introduction: The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured. From the instructions page, several issues come up: Large Definition My personal definition of dataset size is: small A dataset is small if a human could look at the dataset and plausibly find a good solution. medium A dataset is medium sized if it fits in the RAM of a reasonably priced computer. large A large dataset does not fit in the RAM of a reasonably priced computer. By this definition, all of the datasets are medium sized. This might sound like a pissing match over dataset size, but I believe it is more than that. The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require ran…

5 0.77508086 346 hunch net-2009-03-18-Parallel ML primitives

Introduction: Previously, we discussed parallel machine learning a bit. As parallel ML is rather difficult, I’d like to describe my thinking at the moment, and ask for advice from the rest of the world. This is particularly relevant right now, as I’m attending a workshop tomorrow on parallel ML. Parallelizing slow algorithms seems uncompelling. Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. This leaves the question: Which one fast algorithm is the best to parallelize? What is a substantially different second? One compellingly fast simple algorithm is online gradient descent on a linear representation. This is the core of Leon’s sgd code and Vowpal Wabbit. Antoine Bordes showed a variant was competitive in the large scale learning challenge. It’s also a decades old primitive which has been reused in many algorithms, and continues to be reused. It also applies to online learning rather than just online optimization…
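That primitive fits in a few lines. Below is a sketch of one pass of online gradient descent for squared loss on a linear representation (the learning rate and data are illustrative; this is not the actual sgd or VW code):

    def online_sgd(stream, dim, lr=0.1):
        """Online gradient descent for a linear predictor, squared loss."""
        w = [0.0] * dim
        for x, y in stream:  # examples arrive one at a time
            g = sum(wi * xi for wi, xi in zip(w, x)) - y  # prediction error
            for i in range(dim):
                w[i] -= lr * g * x[i]
        return w

    # Converges toward w = [1, 2] for this toy stream.
    w = online_sgd([([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0)] * 50, dim=2)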

6 0.76324719 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

7 0.75444984 451 hunch net-2011-12-13-Vowpal Wabbit version 6.1 & the NIPS tutorial

8 0.74374151 128 hunch net-2005-11-05-The design of a computing cluster

9 0.74317384 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project

10 0.73001748 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

11 0.72902274 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

12 0.705778 441 hunch net-2011-08-15-Vowpal Wabbit 6.0

13 0.67831218 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

14 0.66433203 471 hunch net-2012-08-24-Patterns for research in machine learning

15 0.61883634 366 hunch net-2009-08-03-Carbon in Computer Science Research

16 0.60094851 381 hunch net-2009-12-07-Vowpal Wabbit version 4.0, and a NIPS heresy

17 0.59868389 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

18 0.59399593 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

19 0.59294039 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

20 0.58133155 210 hunch net-2006-09-28-Programming Languages for Machine Learning Implementations


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.02), (3, 0.021), (10, 0.039), (27, 0.172), (37, 0.017), (38, 0.044), (48, 0.02), (51, 0.027), (53, 0.047), (55, 0.077), (64, 0.013), (71, 0.197), (78, 0.016), (91, 0.012), (94, 0.136), (95, 0.045)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.94954276 161 hunch net-2006-03-05-“Structural” Learning

Introduction: Fernando Pereira pointed out Ando and Zhang’s paper on “structural” learning. Structural learning is multitask learning on subproblems created from unlabeled data. The basic idea is to take a look at the unlabeled data and create many supervised problems. On text data, which they test on, these subproblems might be of the form “Given surrounding words predict the middle word”. The hope here is that successfully predicting on these subproblems is relevant to the prediction of your core problem. In the long run, the precise mechanism used (essentially, linear predictors with parameters tied by a common matrix) and the precise problems formed may not be critical. What seems critical is that the hope is realized: the technique provides a significant edge in practice. Some basic questions about this approach are: Are there effective automated mechanisms for creating the subproblems? Is it necessary to use a shared representation?
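A sketch of just the subproblem-creation step (the helper below is hypothetical, and the paper’s actual mechanism of tying linear predictors through a shared matrix is omitted): turning unlabeled sentences into “predict the middle word from its neighbors” examples.

    def middle_word_problems(sentences, window=1):
        """Create supervised (context -> middle word) examples from raw text."""
        examples = []
        for s in sentences:
            words = s.split()
            for i in range(window, len(words) - window):
                context = words[i - window:i] + words[i + 1:i + window + 1]
                examples.append((context, words[i]))
        return examples

    print(middle_word_problems(["the cat sat on the mat"]))
    # [(['the', 'sat'], 'cat'), (['cat', 'on'], 'sat'), ...]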

2 0.92333555 147 hunch net-2006-01-08-Debugging Your Brain

Introduction: One part of doing research is debugging your understanding of reality. This is hard work: How do you even discover where you misunderstand? If you discover a misunderstanding, how do you go about removing it? The process of debugging computer programs is quite analogous to debugging reality misunderstandings. This is natural—a bug in a computer program is a misunderstanding between you and the computer about what you said. Many of the familiar techniques from debugging have exact parallels. Details When programming, there are often signs that some bug exists like: “the graph my program output is shifted a little bit” = maybe you have an indexing error. In debugging yourself, we often have some impression that something is “not right”. These impressions should be addressed directly and immediately. (Some people have the habit of suppressing worries in favor of excess certainty. That’s not healthy for research.) Corner Cases A “corner case” is an input to a program wh…

3 0.91356152 417 hunch net-2010-11-18-ICML 2011 – Call for Tutorials

Introduction: I would like to encourage people to consider giving a tutorial at next year’s ICML. The ideal tutorial attracts a wide audience, provides a gentle and easily taught introduction to the chosen research area, and also covers the most important contributions in depth. Submissions are due January 14 (about two weeks before the paper deadline). http://www.icml-2011.org/tutorials.php Regards, Ulf

same-blog 4 0.8935625 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning


5 0.81627285 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

Introduction: Let’s suppose that we are trying to create a general purpose machine learning box. The box is fed many examples of the function it is supposed to learn and (hopefully) succeeds. To date, most such attempts to produce a box of this form take a vector as input. The elements of the vector might be bits, real numbers, or ‘categorical’ data (a discrete set of values). On the other hand, there are a number of successful applications of machine learning which do not seem to use a vector representation as input. For example, in vision, convolutional neural networks have been used to solve several vision problems. The input to the convolutional neural network is essentially the raw camera image as a matrix. In learning for natural languages, several people have had success on problems like parts-of-speech tagging using predictors restricted to a window surrounding the word to be predicted. A vector window and a matrix both imply a notion of locality which is being actively and…

6 0.78087831 229 hunch net-2007-01-26-Parallel Machine Learning Problems

7 0.78053844 136 hunch net-2005-12-07-Is the Google way the way for machine learning?

8 0.76347244 221 hunch net-2006-12-04-Structural Problems in NIPS Decision Making

9 0.75896937 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

10 0.75048286 379 hunch net-2009-11-23-ICML 2009 Workshops (and Tutorials)

11 0.7454716 95 hunch net-2005-07-14-What Learning Theory might do

12 0.74159491 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge

13 0.73954582 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

14 0.73932844 276 hunch net-2007-12-10-Learning Track of International Planning Competition

15 0.73549271 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

16 0.73382413 423 hunch net-2011-02-02-User preferences for search engines

17 0.73379064 371 hunch net-2009-09-21-Netflix finishes (and starts)

18 0.73286176 235 hunch net-2007-03-03-All Models of Learning have Flaws

19 0.73198044 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

20 0.73106664 297 hunch net-2008-04-22-Taking the next step