hunch_net hunch_net-2005 hunch_net-2005-136 knowledge-graph by maker-knowledge-mining

136 hunch net-2005-12-07-Is the Google way the way for machine learning?


meta info for this blog

Source: html

Introduction: Urs Hoelzle from Google gave an invited presentation at NIPS. In the presentation, he strongly advocates interacting with data in a particular scalable manner which is something like the following: Make a cluster of machines. Build a unified filesystem. (Google uses GFS, but NFS or other approaches work reasonably well for smaller clusters.) Interact with data via MapReduce. Creating a cluster of machines is, by this point, relatively straightforward. Unified filesystems are a little bit tricky—GFS is capable by design of essentially unlimited speed throughput to disk. NFS can bottleneck because all of the data has to move through one machine. Nevertheless, this may not be a limiting factor for smaller clusters. MapReduce is a programming paradigm. Essentially, it is a combination of a data element transform (map) and an aggregator/selector (reduce). These operations are highly parallelizable and the claim is that they support the forms of data interaction which are necessary.
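The map/reduce combination described above can be illustrated with a toy, single-process sketch. The function names (map_fn, reduce_fn, mapreduce) are illustrative, not from any real framework; production systems like Google's MapReduce distribute both stages across the cluster:

```python
# Toy single-process MapReduce: word counting.
# map    = per-element transform (line -> (word, 1) pairs)
# reduce = aggregator/selector over all values sharing a key

def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return key, sum(values)

def mapreduce(data, map_fn, reduce_fn):
    groups = {}
    # "shuffle" step: group the mapped pairs by key
    for element in data:
        for key, value in map_fn(element):
            groups.setdefault(key, []).append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["the cat", "the dog"], map_fn, reduce_fn)
# counts == {'the': 2, 'cat': 1, 'dog': 1}
```

Because the map touches each element independently and the reduce only folds values per key, both stages parallelize across machines, which is exactly the parallelizability claim above.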


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In the presentation, he strongly advocates interacting with data in a particular scalable manner which is something like the following: Make a cluster of machines. [sent-2, score-0.54]

2 Creating a cluster of machines is, by this point, relatively straightforward. [sent-6, score-0.365]

3 Unified filesystems are a little bit tricky—GFS is capable by design of essentially unlimited speed throughput to disk. [sent-7, score-0.151]

4 NFS can bottleneck because all of the data has to move through one machine. [sent-8, score-0.273]

5 Essentially, it is a combination of a data element transform (map) and an aggregator/selector (reduce). [sent-11, score-0.376]

6 These operations are highly parallelizable and the claim is that they support the forms of data interaction which are necessary. [sent-12, score-0.467]

7 Apparently, the Nutch project has an open source implementation of MapReduce (but this is clearly the most nonstandard element). [sent-13, score-0.301]

8 Shifting towards this paradigm has several effects: It makes “big data” applications more viable. [sent-14, score-0.133]

9 It makes some learning algorithms more viable than others. [sent-15, score-0.303]

10 One way to think about this is in terms of statistical query learning algorithms . [sent-16, score-0.907]

11 The (generalized) notion of a statistical query algorithm is an algorithm that relies only upon the results of expectations of a (relatively small) number of functions. [sent-17, score-1.13]

12 The “naive bayes” algorithm and most decision tree algorithms can be easily phrased as statistical query algorithms. [sent-19, score-1.09]

13 Support vector machines can (technically) be phrased as statistical query algorithms, but the number of queries scales with the number of datapoints. [sent-20, score-1.47]

14 Gradient descent algorithms can also be phrased as statistical query algorithms. [sent-21, score-1.09]

15 Learning algorithms which work on one example at a time are not generally statistical query algorithms. [sent-22, score-0.832]

16 Roughly speaking, as the amount of data scales, only O(n) or (perhaps) O(n log(n)) algorithms are tractable. [sent-24, score-0.488]

17 Decision trees and naive bayes are (again) relatively reasonable. [sent-26, score-0.327]

18 Support vector machines (or gaussian processes) encounter difficulties related to scaling. [sent-27, score-0.243]

19 There is a reasonable argument that the “low hanging fruit” of machine learning research is in the big data with enabling tools paradigm. [sent-28, score-0.436]

20 This is because (a) the amount of data available has been growing far faster than the amount of computation and (b) we just haven’t had the tools to scale here, until recently. [sent-29, score-0.529]
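The statistical query view in sentences 10–15 can be made concrete. Below is a sketch (my illustration, not from the original post) of gradient descent for 1-D least squares phrased as a statistical query algorithm: each step asks for a single expectation over the data, E[(w·x − y)·x], which is exactly the kind of query a map (per-example gradient) plus reduce (average) pass can answer:

```python
import random

# Gradient descent phrased as a statistical query algorithm:
# each update consults only E[(w*x - y) * x], an expectation over
# the data distribution, never individual examples directly.

def gradient_query(data, w):
    # map: per-example gradient; reduce: average (the expectation)
    grads = [(w * x - y) * x for x, y in data]
    return sum(grads) / len(grads)

def fit(data, lr=0.1, steps=300):
    w = 0.0
    for _ in range(steps):
        w -= lr * gradient_query(data, w)
    return w

random.seed(0)
data = [(x, 3.0 * x) for x in (random.uniform(-1.0, 1.0) for _ in range(100))]
w = fit(data)
# w converges toward the true slope 3.0
```

By contrast, an online algorithm such as the perceptron updates on one example at a time rather than through expectations, which is why sentence 15 says such algorithms are not generally statistical query algorithms, and why they fit this paradigm less naturally.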


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('query', 0.356), ('statistical', 0.3), ('phrased', 0.258), ('mapreduce', 0.215), ('data', 0.198), ('gfs', 0.194), ('nfs', 0.194), ('urs', 0.194), ('algorithms', 0.176), ('unified', 0.15), ('machines', 0.15), ('naive', 0.129), ('scales', 0.125), ('support', 0.119), ('element', 0.116), ('amount', 0.114), ('cluster', 0.111), ('relatively', 0.104), ('tools', 0.103), ('smaller', 0.1), ('google', 0.097), ('bayes', 0.094), ('presentation', 0.094), ('vector', 0.093), ('strongly', 0.09), ('technically', 0.086), ('parallelizable', 0.086), ('nonstandard', 0.086), ('throughput', 0.086), ('favors', 0.08), ('bottleneck', 0.075), ('terms', 0.075), ('paradigm', 0.072), ('enabling', 0.072), ('interacting', 0.072), ('scalable', 0.069), ('queries', 0.066), ('viable', 0.066), ('essentially', 0.065), ('via', 0.065), ('operations', 0.064), ('interact', 0.064), ('big', 0.063), ('transform', 0.062), ('apparently', 0.062), ('generalized', 0.062), ('number', 0.061), ('makes', 0.061), ('rely', 0.061), ('shifting', 0.061)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999976 136 hunch net-2005-12-07-Is the Google way the way for machine learning?


2 0.22138451 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

Introduction: Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case; there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy. Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output an optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation. It is necessary but not sufficient to have an effective communication infrastructure. It is ne

3 0.19470757 229 hunch net-2007-01-26-Parallel Machine Learning Problems

Introduction: Parallel machine learning is a subject rarely addressed at machine learning conferences. Nevertheless, it seems likely to increase in importance because: Data set sizes appear to be growing substantially faster than computation. Essentially, this happens because more and more sensors of various sorts are being hooked up to the internet. Serial speedups of processors seem relatively stalled. The new trend is to make processors more powerful by making them multicore. Both AMD and Intel are making dual core designs standard, with plans for more parallelism in the future. IBM’s Cell processor has (essentially) 9 cores. Modern graphics chips can have an order of magnitude more separate execution units. The meaning of ‘core’ varies a bit from processor to processor, but the overall trend seems quite clear. So, how do we parallelize machine learning algorithms? The simplest and most common technique is to simply run the same learning algorithm with di

4 0.14586706 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

Introduction: Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m^2) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. The above question can also be asked for test cases. In some applications, test-time performance is of great importance. How does the algorithm scale with the number of

5 0.12279023 183 hunch net-2006-06-14-Explorations of Exploration

Introduction: Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models. Reinforcement Learning (1). Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s’ is given by a conditional distribution P(s’|s,a) given the current state s and action a. The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E^3 by Satinder Singh and Michael Kearns. Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E^3 and

6 0.11465806 12 hunch net-2005-02-03-Learning Theory, by assumption

7 0.11355792 235 hunch net-2007-03-03-All Models of Learning have Flaws

8 0.11149916 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

9 0.10515106 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

10 0.096545093 332 hunch net-2008-12-23-Use of Learning Theory

11 0.089480497 451 hunch net-2011-12-13-Vowpal Wabbit version 6.1 & the NIPS tutorial

12 0.087004349 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

13 0.086559862 347 hunch net-2009-03-26-Machine Learning is too easy

14 0.084204949 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

15 0.083628148 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

16 0.083507657 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

17 0.083441474 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

18 0.081319891 111 hunch net-2005-09-12-Fast Gradient Descent

19 0.080306046 444 hunch net-2011-09-07-KDD and MUCMD 2011

20 0.078995846 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.205), (1, 0.078), (2, -0.083), (3, 0.007), (4, 0.105), (5, 0.016), (6, -0.093), (7, 0.018), (8, -0.015), (9, 0.058), (10, -0.094), (11, -0.012), (12, 0.027), (13, -0.024), (14, -0.042), (15, -0.086), (16, -0.024), (17, 0.005), (18, -0.051), (19, -0.113), (20, 0.034), (21, -0.033), (22, 0.005), (23, 0.044), (24, -0.035), (25, 0.04), (26, 0.057), (27, 0.054), (28, 0.085), (29, -0.055), (30, -0.005), (31, -0.093), (32, -0.026), (33, 0.063), (34, 0.038), (35, -0.057), (36, 0.032), (37, -0.005), (38, 0.037), (39, -0.063), (40, 0.006), (41, -0.043), (42, 0.075), (43, 0.064), (44, -0.081), (45, 0.02), (46, -0.028), (47, 0.03), (48, -0.023), (49, 0.061)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9608354 136 hunch net-2005-12-07-Is the Google way the way for machine learning?


2 0.82275867 229 hunch net-2007-01-26-Parallel Machine Learning Problems


3 0.79326314 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning


4 0.71807504 128 hunch net-2005-11-05-The design of a computing cluster

Introduction: This is about the design of a computing cluster from the viewpoint of applied machine learning using current technology. We just built a small one at TTI so this is some evidence of what is feasible and thoughts about the design choices. Architecture There are several architectural choices. AMD Athlon64 based system. This seems to have the cheapest bang/buck. Maximum RAM is typically 2-3GB. AMD Opteron based system. Opterons provide the additional capability to buy an SMP motherboard with two chips, and the motherboards often support 16GB of RAM. The RAM is also the more expensive error correcting type. Intel PIV or Xeon based system. The PIV and Xeon based systems are the intel analog of the above 2. Due to architectural design reasons, these chips tend to run a bit hotter and be a bit more expensive. Dual core chips. Both Intel and AMD have chips that actually have 2 processors embedded in them. In the end, we decided to go with option (2). Roughly speaking,

5 0.6665194 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge

Introduction: The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured. From the instructions page, several issues come up: Large Definition My personal definition of dataset size is: small A dataset is small if a human could look at the dataset and plausibly find a good solution. medium A dataset is medium-sized if it fits in the RAM of a reasonably priced computer. large A large dataset does not fit in the RAM of a reasonably priced computer. By this definition, all of the datasets are medium sized. This might sound like a pissing match over dataset size, but I believe it is more than that. The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require ran

6 0.65966523 346 hunch net-2009-03-18-Parallel ML primitives

7 0.62072462 250 hunch net-2007-06-23-Machine Learning Jobs are Growing on Trees

8 0.57075351 451 hunch net-2011-12-13-Vowpal Wabbit version 6.1 & the NIPS tutorial

9 0.56947458 260 hunch net-2007-08-25-The Privacy Problem

10 0.56522733 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

11 0.56446272 442 hunch net-2011-08-20-The Large Scale Learning Survey Tutorial

12 0.56278348 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

13 0.55053151 471 hunch net-2012-08-24-Patterns for research in machine learning

14 0.54850924 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

15 0.54724097 366 hunch net-2009-08-03-Carbon in Computer Science Research

16 0.52674502 444 hunch net-2011-09-07-KDD and MUCMD 2011

17 0.52498043 235 hunch net-2007-03-03-All Models of Learning have Flaws

18 0.52440149 455 hunch net-2012-02-20-Berkeley Streaming Data Workshop

19 0.52378219 12 hunch net-2005-02-03-Learning Theory, by assumption

20 0.52294004 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.02), (9, 0.021), (27, 0.213), (30, 0.013), (38, 0.056), (51, 0.031), (52, 0.204), (53, 0.03), (55, 0.078), (56, 0.01), (94, 0.147), (95, 0.082)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.91432118 169 hunch net-2006-04-05-What is state?

Introduction: In reinforcement learning (and sometimes other settings), there is a notion of “state”. Based upon the state various predictions are made such as “Which action should be taken next?” or “How much cumulative reward do I expect if I take some action from this state?” Given the importance of state, it is important to examine the meaning. There are actually several distinct options and it turns out the definition variation is very important in motivating different pieces of work. Newtonian State. State is the physical pose of the world. Under this definition, there are very many states, often too many for explicit representation. This is also the definition typically used in games. Abstracted State. State is an abstracted physical state of the world. “Is the door open or closed?” “Are you in room A or not?” The number of states is much smaller here. A basic issue here is: “How do you compute the state from observations?” Mathematical State. State is a sufficient stati

same-blog 2 0.89680582 136 hunch net-2005-12-07-Is the Google way the way for machine learning?


3 0.88082278 5 hunch net-2005-01-26-Watchword: Probability

Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y. The probability is then Y/(Y+1). Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in

4 0.7873717 286 hunch net-2008-01-25-Turing’s Club for Machine Learning


5 0.78615618 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge


6 0.78470516 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

7 0.77805674 221 hunch net-2006-12-04-Structural Problems in NIPS Decision Making

8 0.77616072 235 hunch net-2007-03-03-All Models of Learning have Flaws

9 0.77518982 360 hunch net-2009-06-15-In Active Learning, the question changes

10 0.77477747 95 hunch net-2005-07-14-What Learning Theory might do

11 0.77298021 371 hunch net-2009-09-21-Netflix finishes (and starts)

12 0.77177352 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

13 0.77121788 229 hunch net-2007-01-26-Parallel Machine Learning Problems

14 0.77069134 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

15 0.76907551 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

16 0.7645787 345 hunch net-2009-03-08-Prediction Science

17 0.7645275 276 hunch net-2007-12-10-Learning Track of International Planning Competition

18 0.76392686 351 hunch net-2009-05-02-Wielding a New Abstraction

19 0.76377356 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

20 0.76363969 237 hunch net-2007-04-02-Contextual Scaling