hunch_net hunch_net-2007 hunch_net-2007-262 knowledge-graph by maker-knowledge-mining

262 hunch net-2007-09-16-Optimizing Machine Learning Programs


meta info for this blog

Source: html

Introduction: Machine learning is often computationally bounded which implies that the ability to write fast code becomes important if you ever want to implement a machine learning algorithm. Basic tactical optimizations are covered well elsewhere, but I haven’t seen a reasonable guide to higher level optimizations, which are the most important in my experience. Here are some of the higher level optimizations I’ve often found useful. Algorithmic Improvement First. This is hard, but it is the most important consideration, and typically yields the most benefits. Good optimizations here are publishable. In the context of machine learning, you should be familiar with the arguments for online vs. batch learning. Choice of Language. There are many arguments about the choice of language. Sometimes you don’t have a choice when interfacing with other people. Personally, I favor C/C++ when I want to write fast code. This (admittedly) makes me a slower programmer than when using higher lev…
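The introduction's pointer to the online vs. batch arguments can be made concrete with a small sketch (mine, not the post's). Both routines below fit a linear predictor with squared loss: the batch version pays a full pass over the data for a single weight update, while the online version performs one cheap update per example. The names (batch_pass, online_pass) and the fixed learning rate eta are illustrative assumptions.

```cpp
// Minimal sketch (not from the post) contrasting batch and online updates for
// linear least squares. Assumes dense feature vectors of the same length as w.
#include <cstddef>
#include <vector>

struct Example { std::vector<double> x; double y; };

double predict(const std::vector<double>& w, const std::vector<double>& x) {
    double p = 0;
    for (std::size_t i = 0; i < w.size(); ++i) p += w[i] * x[i];
    return p;
}

// Batch: accumulate the gradient over all examples, then update once.
void batch_pass(std::vector<double>& w, const std::vector<Example>& data, double eta) {
    if (data.empty()) return;
    std::vector<double> grad(w.size(), 0.0);
    for (const Example& e : data) {
        double err = predict(w, e.x) - e.y;
        for (std::size_t i = 0; i < w.size(); ++i) grad[i] += err * e.x[i];
    }
    for (std::size_t i = 0; i < w.size(); ++i) w[i] -= eta * grad[i] / data.size();
}

// Online: update immediately after seeing each example; the data can be streamed.
void online_pass(std::vector<double>& w, const std::vector<Example>& data, double eta) {
    for (const Example& e : data) {
        double err = predict(w, e.x) - e.y;
        for (std::size_t i = 0; i < w.size(); ++i) w[i] -= eta * err * e.x[i];
    }
}
```

The online form also streams naturally from disk, which connects to the I/O points in the summary below.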


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Machine learning is often computationally bounded which implies that the ability to write fast code becomes important if you ever want to implement a machine learning algorithm. [sent-1, score-0.61]

2 Basic tactical optimizations are covered well elsewhere, but I haven’t seen a reasonable guide to higher level optimizations, which are the most important in my experience. [sent-2, score-0.452]

3 Here are some of the higher level optimizations I’ve often found useful. [sent-3, score-0.526]

4 There are many arguments about the choice of language. [sent-10, score-0.287]

5 Personally, I favor C/C++ when I want to write fast code. [sent-12, score-0.296]

6 My rule of thumb is “for fast programs, it’s all arrays in the end”. [sent-18, score-0.419]

7 The Google dense_hash_map replaces the array of pointers with a plain old array and (unsurprisingly) is even faster. [sent-23, score-0.541]

8 What’s fundamentally happening here is locality: dereferencing pointers is a very expensive operation on modern computers because the CPU runs much faster than the latency to RAM. [sent-24, score-0.403]

9 Converting things into an array is not always easy, but it is often possible with a little bit of thought and care. [sent-26, score-0.358]

10 Unfortunately, the computational process of reading and parsing the data is often intensive. [sent-29, score-0.331]

11 By caching parsed examples in a machine representation format (either in RAM or on disk), a substantial performance boost is achievable. [sent-30, score-0.465]

12 The machine representation can be more concise implying improved system caching effects. [sent-32, score-0.308]

13 There are many reasons to write less code where you can. [sent-38, score-0.375]

14 For the purposes of optimization, having less code in the bottleneck is the surest way to reduce work in the bottleneck. [sent-39, score-0.336]

15 In particular, don’t trust library calls inside the bottleneck. [sent-42, score-0.287]

16 It’s often the case that a library function is general purpose while you can get away with a much faster hand-crafted (and inlined) function. [sent-43, score-0.324]

17 There is a huge difference in performance between reading and writing directly from the disk and doing this through a buffering layer. [sent-45, score-0.494]

18 C++ I/O libraries seem to handle this better than C libraries in general. [sent-48, score-0.398]

19 A reasonable rule of thumb is to spend time on optimization when you are waiting for the program to finish running. [sent-55, score-0.436]

20 In all program optimization, it is critical to know where the bottleneck is, and optimize preferentially there. [sent-57, score-0.293]
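The sketches below are mine, not the post's; they make several of the summary's points concrete in C++, the language favored above. First, the arrays-and-locality point (sentences 6-9): summing weights held in a node-based std::map chases a pointer (and walks a tree) on every access, while the same values in a contiguous std::vector are scanned sequentially out of cache. std::map stands in here for a generic pointer-heavy structure; the Google dense_hash_map mentioned above is not shown.

```cpp
// Illustrative sketch of the locality argument: identical work, different layout.
#include <cstdio>
#include <map>
#include <vector>

int main() {
    const int kFeatures = 1 << 20;  // ~1M hypothetical feature weights

    // Pointer-heavy: each map entry lives in a separately allocated tree node.
    std::map<int, double> weights_map;
    for (int i = 0; i < kFeatures; ++i) weights_map[i] = 1.0;

    // Flat array: contiguous memory, plain index arithmetic.
    std::vector<double> weights_array(kFeatures, 1.0);

    double sum_map = 0, sum_array = 0;
    for (int i = 0; i < kFeatures; ++i) sum_map += weights_map[i];      // tree walk per access
    for (int i = 0; i < kFeatures; ++i) sum_array += weights_array[i];  // sequential, cache-friendly scan

    std::printf("%f %f\n", sum_map, sum_array);
    return 0;
}
```

Timing the two loops (for example with the scope timer sketched at the end) typically shows the flat array winning by a large factor.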
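Next, the caching point (sentences 10-12): a minimal sketch of writing parsed sparse examples to a binary cache and streaming them back. The record layout (label, count, indices, values) is an assumption for illustration, not the post's actual format.

```cpp
// Illustrative binary cache for parsed sparse examples: parse the text once,
// then reload the machine representation directly on later runs.
#include <cstdint>
#include <fstream>
#include <vector>

struct SparseExample {
    std::vector<std::uint32_t> indices;
    std::vector<float> values;
    float label = 0;
};

void write_cache(const char* path, const std::vector<SparseExample>& examples) {
    std::ofstream out(path, std::ios::binary);
    for (const SparseExample& e : examples) {
        std::uint32_t n = static_cast<std::uint32_t>(e.indices.size());
        out.write(reinterpret_cast<const char*>(&e.label), sizeof(e.label));
        out.write(reinterpret_cast<const char*>(&n), sizeof(n));
        if (n != 0) {
            out.write(reinterpret_cast<const char*>(e.indices.data()),
                      static_cast<std::streamsize>(n * sizeof(std::uint32_t)));
            out.write(reinterpret_cast<const char*>(e.values.data()),
                      static_cast<std::streamsize>(n * sizeof(float)));
        }
    }
}

bool read_next(std::ifstream& in, SparseExample& e) {
    std::uint32_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&e.label), sizeof(e.label))) return false;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof(n))) return false;
    e.indices.resize(n);
    e.values.resize(n);
    if (n != 0) {
        in.read(reinterpret_cast<char*>(e.indices.data()),
                static_cast<std::streamsize>(n * sizeof(std::uint32_t)));
        in.read(reinterpret_cast<char*>(e.values.data()),
                static_cast<std::streamsize>(n * sizeof(float)));
    }
    return static_cast<bool>(in);
}
```

Reloading this cache skips tokenizing and float parsing entirely, and the compact records also pull fewer bytes from disk, which matches the system-caching remark in sentence 12.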
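For the library-call point (sentences 15-16), an illustrative contrast: strtol handles leading whitespace, signs, arbitrary bases, and errno, while a hand-written inlined parser for tokens known to be plain non-negative decimals does strictly less work per call. The assumption about the input format is what buys the speed.

```cpp
// Illustrative specialization of a bottleneck call (not from the post).
#include <cstdlib>

// General-purpose library call: flexible, but pays for generality on every call.
inline long parse_general(const char* s) {
    return std::strtol(s, nullptr, 10);
}

// Hand-crafted substitute: assumes a well-formed non-negative decimal token.
inline long parse_fast(const char* s) {
    long v = 0;
    while (*s >= '0' && *s <= '9') v = v * 10 + (*s++ - '0');
    return v;
}
```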
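For the buffering point (sentences 17-18), a POSIX-specific sketch (illustrative): both functions read the same bytes, but the first issues one read() system call per 16-byte record while the second issues one per megabyte and leaves record handling to in-memory code.

```cpp
// Illustrative contrast between tiny unbuffered reads and large buffered reads
// (POSIX open()/read(); on other platforms substitute the equivalent calls).
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// One system call per 16-byte record.
std::size_t bytes_unbuffered(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    char rec[16];
    std::size_t total = 0;
    ssize_t got;
    while ((got = read(fd, rec, sizeof(rec))) > 0) total += static_cast<std::size_t>(got);
    close(fd);
    return total;
}

// One system call per megabyte; records are then parsed out of the buffer.
std::size_t bytes_buffered(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    std::vector<char> buf(1 << 20);
    std::size_t total = 0;
    ssize_t got;
    while ((got = read(fd, buf.data(), buf.size())) > 0) total += static_cast<std::size_t>(got);
    close(fd);
    return total;
}
```

std::ifstream and stdio's FILE* already insert a buffering layer like the second function, which is one reason buffered library I/O beats raw small reads.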
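Finally, for the bottleneck point (sentences 19-20): a real profiler (perf, gprof) gives per-function detail, but even a minimal scope timer like the sketch below finds the big surprises. The phase names in the usage comment are hypothetical.

```cpp
// Minimal scope timer (illustrative): wrap suspected phases and see where the
// time actually goes before optimizing anything.
#include <chrono>
#include <cstdio>

struct Timer {
    const char* name;
    std::chrono::steady_clock::time_point start;
    explicit Timer(const char* n) : name(n), start(std::chrono::steady_clock::now()) {}
    ~Timer() {
        double seconds =
            std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
        std::fprintf(stderr, "%s: %.3f s\n", name, seconds);
    }
};

// Usage sketch (parse_examples and run_learner are hypothetical phases):
//   { Timer t("parse"); parse_examples(); }
//   { Timer t("train"); run_learner(); }
```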


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('optimizations', 0.239), ('array', 0.208), ('libraries', 0.199), ('disk', 0.188), ('parsing', 0.179), ('library', 0.172), ('write', 0.164), ('amortization', 0.161), ('copying', 0.161), ('buffering', 0.143), ('code', 0.139), ('optimization', 0.138), ('thumb', 0.133), ('fast', 0.132), ('bottleneck', 0.125), ('pointers', 0.125), ('caching', 0.125), ('higher', 0.123), ('converting', 0.119), ('trust', 0.115), ('latency', 0.111), ('representation', 0.111), ('language', 0.104), ('implement', 0.101), ('observe', 0.099), ('arguments', 0.097), ('avoid', 0.093), ('level', 0.09), ('modern', 0.089), ('choice', 0.086), ('performance', 0.085), ('optimize', 0.085), ('program', 0.083), ('rule', 0.082), ('improvement', 0.079), ('reading', 0.078), ('faster', 0.078), ('always', 0.076), ('often', 0.074), ('less', 0.072), ('locality', 0.072), ('dramatic', 0.072), ('boost', 0.072), ('arrays', 0.072), ('concise', 0.072), ('ocaml', 0.072), ('substitute', 0.072), ('amortized', 0.072), ('cached', 0.072), ('parsed', 0.072)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9999997 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

2 0.21994746 210 hunch net-2006-09-28-Programming Languages for Machine Learning Implementations

Introduction: Machine learning algorithms have a much better chance of being widely adopted if they are implemented in some easy-to-use code. There are several important concerns associated with machine learning which stress programming languages on the ease-of-use vs. speed frontier. Speed The rate at which data sources are growing seems to be outstripping the rate at which computational power is growing, so it is important that we be able to eke out every bit of computational power. Garbage collected languages (java, ocaml, perl and python) often have several issues here. Garbage collection often implies that floating point numbers are “boxed”: every float is represented by a pointer to a float. Boxing can cause an order of magnitude slowdown because an extra nonlocalized memory reference is made, and accesses to main memory can be many CPU cycles long. Garbage collection often implies that considerably more memory is used than is necessary. This has a variable effect. I

3 0.15375407 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project

Introduction: Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, developing via github. For example, the latest and greatest can be downloaded via: git clone git://github.com/JohnLangford/vowpal_wabbit.git If you aren’t familiar with git, it’s a distributed version control system which supports quick and easy branching, as well as reconciliation. This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes. As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience. The program can effectively scale up to batch-style training on sparse terafeature (i.e. 10^12 sparse feature) size datasets. The limiting factor is typically i/o. I started using the real datasets from the large-scale learning workshop as a conve

4 0.12261014 471 hunch net-2012-08-24-Patterns for research in machine learning

Introduction: There are a handful of basic code patterns that I wish I was more aware of when I started research in machine learning. Each on its own may seem pointless, but collectively they go a long way towards making the typical research workflow more efficient. Here they are: Separate code from data. Separate input data, working data and output data. Save everything to disk frequently. Separate options from parameters. Do not use global variables. Record the options used to generate each run of the algorithm. Make it easy to sweep options. Make it easy to execute only portions of the code. Use checkpointing. Write demos and tests. Click here for discussion and examples for each item. Also see Charles Sutton’s and HackerNews’ thoughts on the same topic. My guess is that these patterns will not only be useful for machine learning, but also any other computational work that involves either a) processing large amounts of data, or b) algorithms that take a signif

5 0.12028599 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

Introduction: Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment. There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n^2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vecto

6 0.11314935 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

7 0.10921284 235 hunch net-2007-03-03-All Models of Learning have Flaws

8 0.099491082 346 hunch net-2009-03-18-Parallel ML primitives

9 0.097941518 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

10 0.095017456 19 hunch net-2005-02-14-Clever Methods of Overfitting

11 0.094682775 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

12 0.090844437 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge

13 0.089997873 49 hunch net-2005-03-30-What can Type Theory teach us about Machine Learning?

14 0.089078784 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

15 0.088767171 454 hunch net-2012-01-30-ICML Posters and Scope

16 0.087614357 473 hunch net-2012-09-29-Vowpal Wabbit, version 7.0

17 0.086208999 437 hunch net-2011-07-10-ICML 2011 and the future

18 0.085871853 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

19 0.085472807 393 hunch net-2010-04-14-MLcomp: a website for objectively comparing ML algorithms

20 0.084181644 148 hunch net-2006-01-13-Benchmarks for RL


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.229), (1, 0.045), (2, -0.055), (3, 0.054), (4, 0.009), (5, 0.012), (6, -0.067), (7, -0.002), (8, -0.015), (9, 0.042), (10, -0.107), (11, -0.057), (12, 0.029), (13, -0.034), (14, -0.007), (15, -0.095), (16, 0.071), (17, -0.03), (18, 0.013), (19, -0.07), (20, 0.062), (21, 0.017), (22, -0.026), (23, 0.005), (24, -0.054), (25, -0.021), (26, -0.032), (27, -0.045), (28, -0.007), (29, 0.022), (30, -0.139), (31, 0.074), (32, 0.105), (33, 0.087), (34, 0.018), (35, 0.005), (36, -0.002), (37, 0.035), (38, -0.036), (39, -0.046), (40, 0.049), (41, 0.018), (42, 0.045), (43, -0.014), (44, 0.016), (45, 0.005), (46, 0.035), (47, 0.015), (48, -0.015), (49, -0.007)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95909894 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

2 0.82236117 210 hunch net-2006-09-28-Programming Languages for Machine Learning Implementations

3 0.70609403 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

Introduction: Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy. Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output an optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation. It is necessary but not sufficient to have an effective communication infrastructure. It is ne

4 0.69463271 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project

5 0.6773302 471 hunch net-2012-08-24-Patterns for research in machine learning

6 0.66661245 128 hunch net-2005-11-05-The design of a computing cluster

7 0.65795714 84 hunch net-2005-06-22-Languages of Learning

8 0.62673718 229 hunch net-2007-01-26-Parallel Machine Learning Problems

9 0.61845863 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

10 0.60925204 346 hunch net-2009-03-18-Parallel ML primitives

11 0.60570771 215 hunch net-2006-10-22-Exemplar programming

12 0.60192221 441 hunch net-2011-08-15-Vowpal Wabbit 6.0

13 0.59685457 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge

14 0.59397495 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

15 0.59379816 366 hunch net-2009-08-03-Carbon in Computer Science Research

16 0.58521438 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

17 0.58241469 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?

18 0.58061361 153 hunch net-2006-02-02-Introspectionism as a Disease

19 0.57918543 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

20 0.57536274 162 hunch net-2006-03-09-Use of Notation


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.02), (3, 0.027), (27, 0.196), (38, 0.05), (48, 0.016), (49, 0.025), (51, 0.015), (53, 0.06), (55, 0.082), (64, 0.022), (91, 0.267), (94, 0.124), (95, 0.025)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.89345586 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

2 0.86350805 103 hunch net-2005-08-18-SVM Adaptability

Introduction: Several recent papers have shown that SVM-like optimizations can be used to handle several large family loss functions. This is a good thing because it is implausible that the loss function imposed by the world can not be taken into account in the process of solving a prediction problem. Even people used to the hard-core Bayesian approach to learning often note that some approximations are almost inevitable in specifying a prior and/or integrating to achieve a posterior. Taking into account how the system will be evaluated can allow both computational effort and design effort to be focused so as to improve performance. A current laundry list of capabilities includes: 2002 multiclass SVM including arbitrary cost matrices ICML 2003 Hidden Markov Models NIPS 2003 Markov Networks (see some discussion ) EMNLP 2004 Context free grammars ICML 2004 Any loss (with much computation) ICML 2005 Any constrained linear prediction model (that’s my own

3 0.68802178 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

Introduction: Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m^2) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. The above question can also be asked for test cases. In some applications, test-time performance is of great importance. How does the algorithm scale with the number of

4 0.68610942 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

Introduction: Suppose we have a set of observations over time x_1, x_2, …, x_t and want to predict some future event y_{t+1}. An inevitable problem arises, because learning a predictor h(x_1,…,x_t) of y_{t+1} is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required. Existing approaches for deriving state have some limitations. Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a priori specification of the observations. Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front. Dynamic Bayesian Networks (graphical models through time) require substantial a priori specification and often re

5 0.67793423 95 hunch net-2005-07-14-What Learning Theory might do

Introduction: I wanted to expand on this post and some of the previous problems/research directions about where learning theory might make large strides. Why theory? The essential reason for theory is “intuition extension”. A very good applied learning person can master some particular application domain yielding the best computer algorithms for solving that problem. A very good theory can take the intuitions discovered by this and other applied learning people and extend them to new domains in a relatively automatic fashion. To do this, we take these basic intuitions and try to find a mathematical model that: Explains the basic intuitions. Makes new testable predictions about how to learn. Succeeds in so learning. This is “intuition extension”: taking what we have learned somewhere else and applying it in new domains. It is fundamentally useful to everyone because it increases the level of automation in solving problems. Where next for learning theory? I like the a

6 0.67487925 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

7 0.67194003 351 hunch net-2009-05-02-Wielding a New Abstraction

8 0.67190188 237 hunch net-2007-04-02-Contextual Scaling

9 0.67079699 221 hunch net-2006-12-04-Structural Problems in NIPS Decision Making

10 0.66826683 371 hunch net-2009-09-21-Netflix finishes (and starts)

11 0.66681111 235 hunch net-2007-03-03-All Models of Learning have Flaws

12 0.66580468 229 hunch net-2007-01-26-Parallel Machine Learning Problems

13 0.6652723 143 hunch net-2005-12-27-Automated Labeling

14 0.66525888 98 hunch net-2005-07-27-Not goal metrics

15 0.66505003 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

16 0.66497105 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

17 0.66434759 136 hunch net-2005-12-07-Is the Google way the way for machine learning?

18 0.66312271 343 hunch net-2009-02-18-Decision by Vetocracy

19 0.6623261 306 hunch net-2008-07-02-Proprietary Data in Academic Research?

20 0.6614123 297 hunch net-2008-04-22-Taking the next step