hunch_net hunch_net-2005 hunch_net-2005-35 knowledge-graph by maker-knowledge-mining

35 hunch net-2005-03-04-The Big O and Constants in Learning

meta infos for this blog

Source: html

Introduction: The notation g(n) = O(f(n)) means that in the limit as n approaches infinity there exists a constant C such that the g(n) is less than Cf(n) . In learning theory, there are many statements about learning algorithms of the form “under assumptions x , y , and z , the classifier learned has an error rate of at most O(f(m)) “. There is one very good reason to use O(): it helps you understand the big picture and neglect the minor details which are not important in the big picture. However, there are some important reasons not to do this as well. Unspeedup In algorithm analysis, the use of O() for time complexity is pervasive and well-justified. Determining the exact value of C is inherently computer architecture dependent. (The “C” for x86 processors might differ from the “C” on PowerPC processors.) Since many learning theorists come from a CS theory background, the O() notation is applied to generalization error. The O() abstraction breaks here—you can not genera

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 The notation g(n) = O(f(n)) means that in the limit as n approaches infinity there exists a constant C such that the g(n) is less than Cf(n) . [sent-1, score-0.627]

2 In learning theory, there are many statements about learning algorithms of the form “under assumptions x , y , and z , the classifier learned has an error rate of at most O(f(m)) “. [sent-2, score-0.44]

3 There is one very good reason to use O(): it helps you understand the big picture and neglect the minor details which are not important in the big picture. [sent-3, score-0.836]

4 However, there are some important reasons not to do this as well. [sent-4, score-0.088]

5 Unspeedup In algorithm analysis, the use of O() for time complexity is pervasive and well-justified. [sent-5, score-0.455]

6 Determining the exact value of C is inherently computer architecture dependent. [sent-6, score-0.308]

7 (The “C” for x86 processors might differ from the “C” on PowerPC processors. [sent-7, score-0.232]

8 ) Since many learning theorists come from a CS theory background, the O() notation is applied to generalization error. [sent-8, score-0.564]

9 The O() abstraction breaks here—you can not generally change learning algorithm and decrease your error rate by some constant factor. [sent-9, score-1.088]

10 You’re fired When you go and run a learning algorithm to acquire some predictor, the performance it achieves is a key quantity of significant interest. [sent-10, score-0.674]

11 Using an algorithm with merely twice the error rate of a better algorithm can easily make the difference between “it works” and “it doesn’t”. [sent-11, score-0.918]

12 This is not often true when programming, as evidenced by the large number of people who use computationally inefficient languages to solve problems. [sent-12, score-0.554]

13 We can’t say “never use O() “, because sometimes abstracting away details is the right thing to do. [sent-13, score-0.541]

14 However, any use of O() should be analyzed for “reasonableness” more thoroughly than for computational complexity. [sent-14, score-0.402]

15 Similarly, more interest should be allocated to improving constants in such analysis than is done in algorithms. [sent-15, score-0.472]

similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('notation', 0.228), ('error', 0.176), ('constant', 0.175), ('rate', 0.172), ('algorithm', 0.171), ('use', 0.162), ('abstracting', 0.153), ('decrease', 0.153), ('fired', 0.153), ('reasonableness', 0.153), ('evidenced', 0.141), ('theorists', 0.141), ('details', 0.134), ('constants', 0.133), ('allocated', 0.133), ('inefficient', 0.133), ('processors', 0.133), ('picture', 0.127), ('infinity', 0.127), ('achieves', 0.127), ('breaks', 0.127), ('determining', 0.127), ('pervasive', 0.122), ('acquire', 0.122), ('architecture', 0.122), ('thoroughly', 0.122), ('analysis', 0.119), ('analyzed', 0.118), ('languages', 0.118), ('merely', 0.114), ('abstraction', 0.114), ('twice', 0.114), ('big', 0.112), ('however', 0.11), ('cs', 0.105), ('generalization', 0.103), ('exact', 0.103), ('minor', 0.101), ('background', 0.101), ('quantity', 0.101), ('re', 0.101), ('differ', 0.099), ('limit', 0.097), ('theory', 0.092), ('statements', 0.092), ('away', 0.092), ('important', 0.088), ('improving', 0.087), ('programming', 0.084), ('inherently', 0.083)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 35 hunch net-2005-03-04-The Big O and Constants in Learning

2 0.2057671 351 hunch net-2009-05-02-Wielding a New Abstraction

Introduction: This post is partly meant as an advertisement for the reductions tutorial Alina , Bianca , and I are planning to do at ICML . Please come, if you are interested. Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems. In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult. It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, yo

3 0.20278485 78 hunch net-2005-06-06-Exact Online Learning for Classification

Introduction: Jacob Abernethy and I have found a computationally tractable method for computing an optimal (or near optimal depending on setting) master algorithm combining expert predictions addressing this open problem . A draft is here . The effect of this improvement seems to be about a factor of 2 decrease in the regret (= error rate minus best possible error rate) for the low error rate situation. (At large error rates, there may be no significant difference.) There are some unfinished details still to consider: When we remove all of the approximation slack from online learning, is the result a satisfying learning algorithm, in practice? I consider online learning is one of the more compelling methods of analyzing and deriving algorithms, but that expectation must be either met or not by this algorithm Some extra details: The algorithm is optimal given a small amount of side information ( k in the draft). What is the best way to remove this side information? The removal

4 0.18906716 162 hunch net-2006-03-09-Use of Notation

Introduction: For most people, a mathematical notation is like a language: you learn it and stick with it. For people doing mathematical research, however, this is not enough: they must design new notations for new problems. The design of good notation is both hard and worthwhile since a bad initial notation can retard a line of research greatly. Before we had mathematical notation, equations were all written out in language. Since words have multiple meanings and variable precedences, long equations written out in language can be extraordinarily difficult and sometimes fundamentally ambiguous. A good representative example of this is the legalese in the tax code. Since we want greater precision and clarity, we adopt mathematical notation. One fundamental thing to understand about mathematical notation, is that humans as logic verifiers, are barely capable. This is the fundamental reason why one notation can be much better than another. This observation is easier to miss than you might

5 0.17340034 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

Introduction: This post is about a reductions-related problem that I find mysterious. There are two kinds of reductions analysis currently under consideration. Error limiting reductions. Here, the goal is to bound the error rate of the created classifier in terms of the error rate of the binary classifiers that you reduce to. A very simple example of this is that error correcting output codes where it is possible to prove that for certain codes, the multiclass error rate is at most 4 * the binary classifier error rate. Regret minimizing reductions. Here, the goal is to bound the regret of the created classifier in terms of the regret of the binary classifiers reduced to. The regret is the error rate minus the minimum error rate. When the learning problem is noisy the minimum error rate may not be 0 . An analagous result for reget is that for a probabilistic error correcting output code , multiclass regret is at most 4 * (binary regret) 0.5 . The use of “regret” is more desi

6 0.16265804 332 hunch net-2008-12-23-Use of Learning Theory

7 0.15414295 33 hunch net-2005-02-28-Regularization

8 0.13181007 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

9 0.12346947 14 hunch net-2005-02-07-The State of the Reduction

10 0.11837821 210 hunch net-2006-09-28-Programming Languages for Machine Learning Implementations

11 0.10956097 19 hunch net-2005-02-14-Clever Methods of Overfitting

12 0.10901002 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

13 0.10515319 67 hunch net-2005-05-06-Don’t mix the solution into the problem

14 0.10450834 84 hunch net-2005-06-22-Languages of Learning

15 0.10353272 218 hunch net-2006-11-20-Context and the calculation misperception

16 0.099169835 347 hunch net-2009-03-26-Machine Learning is too easy

17 0.099063545 202 hunch net-2006-08-10-Precision is not accuracy

18 0.098224103 42 hunch net-2005-03-17-Going all the Way, Sometimes

19 0.09604682 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

20 0.094368093 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.235), (1, 0.122), (2, 0.019), (3, 0.002), (4, 0.02), (5, -0.061), (6, 0.095), (7, 0.009), (8, -0.05), (9, 0.023), (10, -0.09), (11, 0.028), (12, 0.114), (13, -0.053), (14, 0.023), (15, -0.035), (16, 0.024), (17, 0.004), (18, -0.066), (19, -0.002), (20, 0.043), (21, 0.061), (22, 0.042), (23, 0.07), (24, -0.051), (25, -0.111), (26, 0.101), (27, -0.072), (28, -0.021), (29, -0.085), (30, -0.099), (31, 0.061), (32, 0.07), (33, 0.054), (34, -0.011), (35, -0.018), (36, 0.155), (37, -0.121), (38, 0.078), (39, -0.033), (40, -0.085), (41, 0.009), (42, 0.09), (43, -0.09), (44, 0.045), (45, -0.015), (46, 0.006), (47, -0.048), (48, 0.025), (49, 0.082)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96986145 35 hunch net-2005-03-04-The Big O and Constants in Learning

2 0.74246073 78 hunch net-2005-06-06-Exact Online Learning for Classification

3 0.68238455 33 hunch net-2005-02-28-Regularization

Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. Functionally . Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l 1 or l 2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint Regularization is about the failiure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S . There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S , e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a s

4 0.66769898 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

Introduction: Given John’s recent posts on CMU’s new machine learning department and “Deep Learning,” I asked for an opportunity to give a computational learning theory perspective on these issues. To my mind, the answer to the question “Are the core problems from machine learning different from the core problems of statistics?” is a clear Yes. The point of this post is to describe a core problem in machine learning that is computational in nature and will appeal to statistical learning folk (as an extreme example note that if P=NP– which, for all we know, is true– then we would suddenly find almost all of our favorite machine learning problems considerably more tractable). If the central question of statistical learning theory were crudely summarized as “given a hypothesis with a certain loss bound over a test set, how well will it generalize?” then the central question of computational learning theory might be “how can we find such a hypothesis efficently (e.g., in polynomial-time)?” With t

5 0.61402476 67 hunch net-2005-05-06-Don’t mix the solution into the problem

Introduction: A common defect of many pieces of research is defining the problem in terms of the solution. Here are some examples in learning: “The learning problem is finding a good seperating hyperplane.” “The goal of learning is to minimize (y-p) 2 + C w 2 where y = the observation, p = the prediction and w = a parameter vector.” Defining the loss function to be the one that your algorithm optimizes rather than the one imposed by the world. The fundamental reason why this is a defect is that it creates artificial boundaries to problem solution. Artificial boundaries lead to the possibility of being blind-sided. For example, someone committing (1) or (2) above might find themselves themselves surprised to find a decision tree working well on a problem. Example (3) might result in someone else solving a learning problem better for real world purposes, even if it’s worse with respect to the algorithm optimization. This defect should be avoided so as to not artificially l

6 0.60496134 332 hunch net-2008-12-23-Use of Learning Theory

7 0.59208339 351 hunch net-2009-05-02-Wielding a New Abstraction

8 0.58343065 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

9 0.5759356 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

10 0.57400721 176 hunch net-2006-05-01-A conversation between Theo and Pat

11 0.56181645 42 hunch net-2005-03-17-Going all the Way, Sometimes

12 0.56056052 43 hunch net-2005-03-18-Binomial Weighting

13 0.55634409 218 hunch net-2006-11-20-Context and the calculation misperception

14 0.55424738 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

15 0.54802287 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

16 0.54520226 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

17 0.54391152 163 hunch net-2006-03-12-Online learning or online preservation of learning?

18 0.53827697 210 hunch net-2006-09-28-Programming Languages for Machine Learning Implementations

19 0.53685218 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting

20 0.5306828 28 hunch net-2005-02-25-Problem: Online Learning

similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(27, 0.191), (53, 0.048), (94, 0.667)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.99083179 42 hunch net-2005-03-17-Going all the Way, Sometimes

Introduction: At many points in research, you face a choice: should I keep on improving some old piece of technology or should I do something new? For example: Should I refine bounds to make them tighter? Should I take some learning theory and turn it into a learning algorithm? Should I implement the learning algorithm? Should I test the learning algorithm widely? Should I release the algorithm as source code? Should I go see what problems people actually need to solve? The universal temptation of people attracted to research is doing something new. That is sometimes the right decision, but is also often not. I’d like to discuss some reasons why not. Expertise Once expertise are developed on some subject, you are the right person to refine them. What is the real problem? Continually improving a piece of technology is a mechanism forcing you to confront this question. In many cases, this confrontation is uncomfortable because you discover that your method has fundamen

same-blog 2 0.97302449 35 hunch net-2005-03-04-The Big O and Constants in Learning

3 0.96118194 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science

Introduction: “Science” has many meanings, but one common meaning is “the scientific method ” which is a principled method for investigating the world using the following steps: Form a hypothesis about the world. Use the hypothesis to make predictions. Run experiments to confirm or disprove the predictions. The ordering of these steps is very important to the scientific method. In particular, predictions must be made before experiments are run. Given that we all believe in the scientific method of investigation, it may be surprising to learn that cheating is very common. This happens for many reasons, some innocent and some not. Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine their effect. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here : including “reprobleming”, “data set selection”, and probably “overfitting by review”

4 0.95976782 346 hunch net-2009-03-18-Parallel ML primitives

Introduction: Previously, we discussed parallel machine learning a bit. As parallel ML is rather difficult, I’d like to describe my thinking at the moment, and ask for advice from the rest of the world. This is particularly relevant right now, as I’m attending a workshop tomorrow on parallel ML. Parallelizing slow algorithms seems uncompelling. Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. This leaves the question: Which one fast algorithm is the best to parallelize? What is a substantially different second? One compellingly fast simple algorithm is online gradient descent on a linear representation. This is the core of Leon’s sgd code and Vowpal Wabbit . Antoine Bordes showed a variant was competitive in the large scale learning challenge . It’s also a decades old primitive which has been reused in many algorithms, and continues to be reused. It also applies to online learning rather than just online optimiz

5 0.95518285 81 hunch net-2005-06-13-Wikis for Summer Schools and Workshops

Introduction: Chicago ’05 ended a couple of weeks ago. This was the sixth Machine Learning Summer School , and the second one that used a wiki . (The first was Berder ’04, thanks to Gunnar Raetsch.) Wikis are relatively easy to set up, greatly aid social interaction, and should be used a lot more at summer schools and workshops. They can even be used as the meeting’s webpage, as a permanent record of its participants’ collaborations — see for example the wiki/website for last year’s NVO Summer School . A basic wiki is a collection of editable webpages, maintained by software called a wiki engine . The engine used at both Berder and Chicago was TikiWiki — it is well documented and gets you something running fast. It uses PHP and MySQL, but doesn’t require you to know either. Tikiwiki has far more features than most wikis, as it is really a full Content Management System . (My thanks to Sebastian Stark for pointing this out.) Here are the features we found most useful: Bulletin boa

6 0.90709859 120 hunch net-2005-10-10-Predictive Search is Coming

7 0.8694973 276 hunch net-2007-12-10-Learning Track of International Planning Competition

8 0.80109453 221 hunch net-2006-12-04-Structural Problems in NIPS Decision Making

9 0.74255973 229 hunch net-2007-01-26-Parallel Machine Learning Problems

10 0.74137229 136 hunch net-2005-12-07-Is the Google way the way for machine learning?

11 0.70463103 441 hunch net-2011-08-15-Vowpal Wabbit 6.0

12 0.68807679 13 hunch net-2005-02-04-JMLG

13 0.68289721 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

14 0.66271812 200 hunch net-2006-08-03-AOL’s data drop

15 0.662606 253 hunch net-2007-07-06-Idempotent-capable Predictors

16 0.66162121 73 hunch net-2005-05-17-A Short Guide to PhD Graduate Study

17 0.65832305 471 hunch net-2012-08-24-Patterns for research in machine learning

18 0.64962602 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

19 0.64704609 178 hunch net-2006-05-08-Big machine learning

20 0.64497358 43 hunch net-2005-03-18-Binomial Weighting