hunch_net hunch_net-2009 hunch_net-2009-337 knowledge-graph by maker-knowledge-mining

337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity


meta info for this blog

Source: html

Introduction: One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations. There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail. In contrast, if you go to a machine learning conference, a large number of the new algorithms are variations of learning on a linear representation.
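
The claim that n-gram methods are essentially linear predictors on an enormous sparse feature space can be made concrete with a small sketch. This is not code from the post: the function names, the use of bigrams, the binary labels, and the learning rate are illustrative assumptions, and a real next-word predictor would be multiclass rather than binary.

```python
import math
import re
from collections import defaultdict

def ngram_features(text, n=2):
    """Map text to a sparse feature vector: one count per n-gram."""
    tokens = re.findall(r"\w+", text.lower())
    feats = defaultdict(float)
    for i in range(len(tokens) - n + 1):
        feats[" ".join(tokens[i:i + n])] += 1.0
    return feats

def score(weights, feats):
    """A linear predictor: the output is just a sparse dot product."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def sgd_step(weights, feats, label, lr=0.1):
    """One online logistic-regression update; label is +1 or -1."""
    margin = label * score(weights, feats)
    grad = -label / (1.0 + math.exp(margin))  # derivative of log(1 + exp(-y*s)) w.r.t. the score s
    for f, v in feats.items():
        weights[f] = weights.get(f, 0.0) - lr * grad * v

# Hypothetical usage on made-up labels:
weights = {}
for text, y in [("the cat sat on the mat", +1), ("dogs bark loudly at night", -1)]:
    sgd_step(weights, ngram_features(text), y)
```

The feature space here is enormous (every possible n-gram), but each example touches only a few coordinates, which is what makes the linear approach workable when learning from an enormous number of examples.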


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. [sent-1, score-1.229]

2 This conventional wisdom appears unsupported by empirical evidence as far as I can tell. [sent-2, score-0.342]

3 In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations. [sent-3, score-1.499]

4 There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. [sent-4, score-0.223]

5 Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. [sent-5, score-1.078]

6 In contrast, if you go to a machine learning conference, a large number of the new algorithms are variations of learning on a linear representation. [sent-7, score-0.824]

7 This claim should be understood broadly to include (for example) kernel methods, random projection methods, and more traditionally linear representations such as the perceptron. [sent-8, score-1.021]

8 A basic question is: Why is the study of linear representations so prevalent? [sent-9, score-0.809]

9 There are several reasons for investigating the linear viewpoint. [sent-10, score-0.74]

10 On one hand, there is a compelling directness to that approach, but on the other it’s not the kind of approach which transfers well to new problems. [sent-13, score-0.233]

11 Many of the effective approaches for nonlinear learning use some combination of linear primitives connected by nonlinearities to make a final prediction. [sent-15, score-1.138]

12 As such, there is a plausible hope that improvements in linear learning can be applied repeatedly in these more complex structures. [sent-16, score-0.75]

13 Linear learning is the only thing tractable, empirically. [sent-17, score-0.14]

14 This has a grain of truth to it, but it appears to be uncompelling when you get down to the nitty-gritty details. [sent-18, score-0.23]

15 And, when you use online learning with a pure linear representation, the limiting factor is the speed that data can be sucked into the CPU from the network or the disk. [sent-20, score-0.762]

16 If you aren’t doing something more interesting than plain vanilla linear prediction, you are wasting most of your CPU cycles. [sent-21, score-0.76]

17 There are certainly many statements and guarantees that we only know how to make with linear representations and (typically) convex losses. [sent-23, score-0.809]

18 In addition, there are some analysis methods which apply to nonlinear learning systems—my favorite example is learning reductions, but there are others also. [sent-25, score-0.531]

19 Some of the reasons for linear investigations appear sound, while others are simply variants of “looking where the light is”, which comes from an often retold story: At night you see someone searching the ground under a streetlight. [sent-26, score-1.065]

20 They say, “I’m looking for the keys I dropped in the bushes.” [sent-28, score-0.223]
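
Sentence 11 above describes nonlinear learners built from linear primitives connected by nonlinearities. A minimal sketch of that structure follows; the layer sizes and the tanh nonlinearity are arbitrary choices for illustration, not details from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 20, 8        # made-up sizes

W1 = rng.normal(scale=0.1, size=(d_hidden, d_in))  # first linear primitive
w2 = rng.normal(scale=0.1, size=d_hidden)          # second linear primitive

def predict(x):
    """Linear map -> elementwise nonlinearity -> linear map."""
    hidden = np.tanh(W1 @ x)   # nonlinearity connecting the two linear pieces
    return float(w2 @ hidden)  # final prediction is linear in the hidden units

x = rng.normal(size=d_in)
print(predict(x))
```

The hope expressed in sentence 12, that improvements in linear learning can be applied repeatedly in more complex structures, corresponds to better ways of fitting the linear pieces W1 and w2 inside a structure like this one.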
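
Sentences 15 and 16 above claim that with a pure linear representation learned online, moving and parsing data rather than arithmetic is the bottleneck. A rough back-of-envelope sketch of that accounting; every constant below is an assumption for illustration, not a measurement from the post.

```python
# Back-of-envelope accounting for one training example.
nonzeros = 100                      # sparse features present in the example
flops_per_update = 4 * nonzeros     # predict + gradient step: a few ops per nonzero
bytes_per_example = 1000            # raw text read from disk or network

cpu_flops_per_sec = 1e9             # conservative single-core estimate
disk_bytes_per_sec = 100e6          # sequential read estimate

compute_time = flops_per_update / cpu_flops_per_sec   # ~0.4 microseconds
io_time = bytes_per_example / disk_bytes_per_sec      # ~10 microseconds
print(compute_time, io_time)  # reading and parsing dominate the linear update
```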
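
Sentence 17 above refers to guarantees we only know how to make with linear representations and convex losses. One standard example of such a guarantee, a textbook online gradient descent regret bound rather than anything specific to this post: for convex losses \ell_t with gradient norms bounded by G, comparator norm at most B, starting point w_1 = 0, and step size \eta = B/(G\sqrt{T}),

```latex
\sum_{t=1}^{T} \ell_t(w_t) \;-\; \min_{\|w\|\le B} \sum_{t=1}^{T} \ell_t(w)
\;\le\; \frac{B^2}{2\eta} \;+\; \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|^2
\;\le\; B\,G\,\sqrt{T}.
```

No comparably clean statement is known once the predictor is nonlinear in its parameters and the loss is nonconvex, which is the asymmetry the sentence points at.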
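
Sentence 18 above mentions learning reductions as an analysis method that also applies to nonlinear learning systems. A minimal sketch of one familiar reduction (one-against-all, multiclass to binary), written so the base learner is a black box; the function names are hypothetical.

```python
def one_against_all(train_binary, examples, num_classes):
    """Reduce K-class classification to K binary problems.

    `train_binary` is any binary learner that returns a scoring function;
    the reduction (and its analysis) does not care whether it is linear.
    """
    classifiers = []
    for k in range(num_classes):
        relabeled = [(x, 1 if y == k else -1) for (x, y) in examples]
        classifiers.append(train_binary(relabeled))
    return lambda x: max(range(num_classes), key=lambda k: classifiers[k](x))
```

Guarantees for a reduction like this are stated in terms of the binary learner's performance, so they hold whether that learner uses a linear representation or not.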


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('linear', 0.608), ('representations', 0.201), ('nonlinear', 0.168), ('enormous', 0.158), ('nonlinearities', 0.158), ('methods', 0.154), ('wisdom', 0.138), ('cpu', 0.138), ('conventional', 0.132), ('representation', 0.122), ('say', 0.106), ('tractable', 0.091), ('compelling', 0.089), ('understood', 0.087), ('hand', 0.084), ('looking', 0.081), ('wasting', 0.079), ('grain', 0.079), ('investigations', 0.079), ('sucked', 0.079), ('uncompelling', 0.079), ('learning', 0.075), ('keys', 0.073), ('plain', 0.073), ('describes', 0.073), ('transfers', 0.073), ('aren', 0.073), ('appears', 0.072), ('approach', 0.071), ('dropped', 0.069), ('capturing', 0.069), ('investigating', 0.069), ('crafted', 0.069), ('exceptions', 0.069), ('applied', 0.067), ('searching', 0.066), ('ground', 0.066), ('projection', 0.066), ('connected', 0.066), ('night', 0.066), ('variations', 0.066), ('thing', 0.065), ('primitives', 0.063), ('prevalent', 0.063), ('reasons', 0.063), ('face', 0.061), ('traditionally', 0.059), ('robotics', 0.059), ('others', 0.059), ('light', 0.058)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

2 0.17596497 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

Introduction: Suppose we have a set of observations over time x_1, x_2, …, x_t and want to predict some future event y_{t+1}. An inevitable problem arises, because learning a predictor h(x_1, …, x_t) of y_{t+1} is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required. Existing approaches for deriving state have some limitations. Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a.priori specification of the observations. Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front. Dynamic Bayesian Networks (graphical models through time) require substantial a.priori specification and often re

3 0.17066656 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

Introduction: (I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.) Since Nemirovski’s Saint Flour lecture notes, numerous researchers have studied the following problem in least squares regression: predict as well as (MS) the best of d given functions (like in prediction with expert advice; model = finite set of d functions) (C) the best convex combination of these functions (i.e., model = convex hull of the d functions) (L) the best linear combination of these functions (i.e., model = linear span of the d functions) It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved since there exist estimators having an excess risk of order (log d)/n for (MS), min(sqrt((log d)/n), d/n) for (C) and d/n for (L), where n is the training set size. Here, “risk” is amount of extra loss per example which may be suffered due to the choice of random sample. The practical use of these results seems rather limited to trivial

4 0.1681325 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression

Introduction: Dean Foster and Daniel Hsu had a couple observations about reductions to regression that I wanted to share. This will make the most sense for people familiar with error correcting output codes (see the tutorial, page 11). Many people are comfortable using linear regression in a one-against-all style, where you try to predict the probability of choice i vs other classes, yet they are not comfortable with more complex error correcting codes because they fear that they create harder problems. This fear turns out to be mathematically incoherent under a linear representation: comfort in the linear case should imply comfort with more complex codes. In particular, If there exists a set of weight vectors w_i such that P(i|x) = <w_i, x>, then for any invertible error correcting output code C, there exists weight vectors w_c which decode to perfectly predict the probability of each class. The proof is simple and constructive: the weight vector w_c can be constructed acc

5 0.1654366 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

Introduction: Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment. There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n^2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vecto

6 0.14491341 332 hunch net-2008-12-23-Use of Learning Theory

7 0.14149582 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

8 0.12458973 444 hunch net-2011-09-07-KDD and MUCMD 2011

9 0.11483438 201 hunch net-2006-08-07-The Call of the Deep

10 0.11235454 351 hunch net-2009-05-02-Wielding a New Abstraction

11 0.11140922 235 hunch net-2007-03-03-All Models of Learning have Flaws

12 0.10538612 272 hunch net-2007-11-14-BellKor wins Netflix

13 0.10215151 258 hunch net-2007-08-12-Exponentiated Gradient

14 0.10166179 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

15 0.099042587 267 hunch net-2007-10-17-Online as the new adjective

16 0.098935962 95 hunch net-2005-07-14-What Learning Theory might do

17 0.098497547 347 hunch net-2009-03-26-Machine Learning is too easy

18 0.097677998 361 hunch net-2009-06-24-Interesting papers at UAICMOLT 2009

19 0.096578561 388 hunch net-2010-01-24-Specializations of the Master Problem

20 0.09410885 161 hunch net-2006-03-05-“Structural” Learning


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.224), (1, 0.104), (2, -0.05), (3, -0.009), (4, 0.095), (5, -0.014), (6, -0.051), (7, -0.006), (8, 0.031), (9, 0.03), (10, -0.082), (11, -0.093), (12, -0.044), (13, 0.017), (14, -0.032), (15, 0.075), (16, 0.041), (17, 0.01), (18, -0.003), (19, -0.038), (20, 0.007), (21, 0.018), (22, 0.016), (23, 0.01), (24, -0.012), (25, -0.078), (26, 0.001), (27, -0.057), (28, -0.059), (29, 0.028), (30, -0.033), (31, 0.048), (32, 0.062), (33, 0.047), (34, -0.066), (35, 0.134), (36, 0.011), (37, 0.024), (38, -0.034), (39, -0.029), (40, 0.038), (41, 0.099), (42, 0.065), (43, -0.064), (44, -0.043), (45, 0.025), (46, -0.078), (47, 0.035), (48, -0.072), (49, -0.139)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95603693 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

2 0.73184806 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

Introduction: Suppose we have a set of observations over time x_1, x_2, …, x_t and want to predict some future event y_{t+1}. An inevitable problem arises, because learning a predictor h(x_1, …, x_t) of y_{t+1} is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required. Existing approaches for deriving state have some limitations. Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a.priori specification of the observations. Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front. Dynamic Bayesian Networks (graphical models through time) require substantial a.priori specification and often re

3 0.70132548 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

Introduction: I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text. The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down). A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table. It turns out that redundancy is gr

4 0.66556764 253 hunch net-2007-07-06-Idempotent-capable Predictors

Introduction: One way to distinguish different learning algorithms is by their ability or inability to easily use an input variable as the predicted output. This is desirable for at least two reasons: Modularity If we want to build complex learning systems via reuse of a subsystem, it’s important to have compatible I/O. “Prior” knowledge Machine learning is often applied in situations where we do have some knowledge of what the right solution is, often in the form of an existing system. In such situations, it’s good to start with a learning algorithm that can be at least as good as any existing system. When doing classification, most learning algorithms can do this. For example, a decision tree can split on a feature, and then classify. The real differences come up when we attempt regression. Many of the algorithms we know and commonly use are not idempotent predictors. Logistic regressors can not be idempotent, because all input features are mapped through a nonlinearity.

5 0.66338551 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

Introduction: Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment. There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n^2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vecto

6 0.65446544 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression

7 0.64394033 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

8 0.63323343 348 hunch net-2009-04-02-Asymmophobia

9 0.62791872 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

10 0.61959195 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

11 0.60377365 444 hunch net-2011-09-07-KDD and MUCMD 2011

12 0.60243636 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

13 0.58295745 84 hunch net-2005-06-22-Languages of Learning

14 0.56943816 168 hunch net-2006-04-02-Mad (Neuro)science

15 0.56875169 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

16 0.56786156 153 hunch net-2006-02-02-Introspectionism as a Disease

17 0.56463033 272 hunch net-2007-11-14-BellKor wins Netflix

18 0.55867106 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

19 0.55208796 308 hunch net-2008-07-06-To Dual or Not

20 0.55184639 492 hunch net-2013-12-01-NIPS tutorials and Vowpal Wabbit 7.4


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.038), (10, 0.021), (27, 0.269), (37, 0.029), (38, 0.029), (53, 0.079), (55, 0.093), (84, 0.206), (94, 0.114), (95, 0.026)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97883856 121 hunch net-2005-10-12-The unrealized potential of the research lab

Introduction: I attended the IBM research 60th anniversary. IBM research is, by any reasonable account, the industrial research lab which has managed to bring the most value to its parent company over the long term. This can be seen by simply counting the survivors: IBM research is the only older research lab which has not gone through a period of massive firing. (Note that there are also new research labs.) Despite this impressive record, IBM research has failed, by far, to achieve its potential. Examples which came up in this meeting include: It took about a decade to produce DRAM after it was invented in the lab. (In fact, Intel produced it first.) Relational databases and SQL were invented and then languished. It was only under external competition that IBM released its own relational database. Why didn’t IBM grow an Oracle division? An early lead in IP networking hardware did not result in IBM growing a Cisco division. Why not? And remember … IBM research is a s

2 0.96892947 411 hunch net-2010-09-21-Regretting the dead

Introduction: Nikos pointed out this New York Times article about poor clinical design killing people. For those of us who study learning from exploration information this is a reminder that low regret algorithms are particularly important, as regret in clinical trials is measured by patient deaths. Two obvious improvements on the experimental design are: With reasonable record keeping of existing outcomes for the standard treatments, there is no need to explicitly assign people to a control group with the standard treatment, as that approach is effectively explored with great certainty. Asserting otherwise would imply that the nature of effective treatments for cancer has changed between now and a year ago, which denies the value of any clinical trial. An optimal experimental design will smoothly phase between exploration and exploitation as evidence for a new treatment shows that it can be effective. This is old tech, for example in the EXP3.P algorithm (page 12 aka 59) although

3 0.95536005 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

Introduction: There are a number of learning algorithms which explicitly incorporate randomness into their execution. This includes, amongst others: Neural Networks. Neural networks use randomization to assign initial weights. Boltzmann Machines/Deep Belief Networks. Boltzmann machines are something like a stochastic version of multinode logistic regression. The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness. Bagging. Bagging is a process where a learning algorithm is run several different times on several different datasets, creating a final predictor which makes a majority vote. Policy descent. Several algorithms in reinforcement learning such as Conservative Policy Iteration use random bits to create stochastic policies. Experts algorithms. Randomized weighted majority use random bits as a part of the prediction process to achieve better theoretical guarantees. A basic question is: “Should there

4 0.94583625 383 hunch net-2009-12-09-Inherent Uncertainty

Introduction: I’d like to point out Inherent Uncertainty, which I’ve added to the ML blog post scanner on the right. My understanding from Jake is that the intention is to have a multiauthor blog which is more specialized towards learning theory/game theory than this one. Nevertheless, several of the posts seem to be of wider interest.

5 0.94401169 200 hunch net-2006-08-03-AOL’s data drop

Introduction: AOL has released several large search engine related datasets. This looks like a pretty impressive data release, and it is a big opportunity for people everywhere to worry about search engine related learning problems, if they want.

6 0.9240737 142 hunch net-2005-12-22-Yes, I am applying

same-blog 7 0.91770041 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

8 0.84090507 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

9 0.82592005 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

10 0.8252905 95 hunch net-2005-07-14-What Learning Theory might do

11 0.82461834 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

12 0.82350385 351 hunch net-2009-05-02-Wielding a New Abstraction

13 0.82188487 347 hunch net-2009-03-26-Machine Learning is too easy

14 0.82047182 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?

15 0.81846869 194 hunch net-2006-07-11-New Models

16 0.81713021 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

17 0.81624758 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

18 0.81620729 258 hunch net-2007-08-12-Exponentiated Gradient

19 0.8144033 235 hunch net-2007-03-03-All Models of Learning have Flaws

20 0.81434339 343 hunch net-2009-02-18-Decision by Vetocracy