hunch_net hunch_net-2007 hunch_net-2007-230 knowledge-graph by maker-knowledge-mining

230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”


meta info for this blog

Source: html

Introduction: Given John’s recent posts on CMU’s new machine learning department and “Deep Learning,” I asked for an opportunity to give a computational learning theory perspective on these issues. To my mind, the answer to the question “Are the core problems from machine learning different from the core problems of statistics?” is a clear Yes. The point of this post is to describe a core problem in machine learning that is computational in nature and will appeal to statistical learning folk (as an extreme example note that if P=NP– which, for all we know, is true– then we would suddenly find almost all of our favorite machine learning problems considerably more tractable). If the central question of statistical learning theory were crudely summarized as “given a hypothesis with a certain loss bound over a test set, how well will it generalize?” then the central question of computational learning theory might be “how can we find such a hypothesis efficiently (e.g., in polynomial-time)?” With t
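To make the contrast concrete, here is a small illustration (mine, not from the post; all names and numbers are hypothetical) of the statistical side of the question: bounding the true error of one fixed hypothesis from its held-out test error via Hoeffding's inequality. Nothing in this calculation says how to find the hypothesis efficiently, which is the computational question.

```python
import math

def test_set_error_bound(test_error: float, m: int, delta: float = 0.05) -> float:
    """Hoeffding upper bound on the true error of ONE fixed hypothesis.

    With probability at least 1 - delta over an i.i.d. test set of size m,
    true_error <= test_error + sqrt(ln(1/delta) / (2 * m)).
    Illustrative only; delta and m below are arbitrary choices.
    """
    return test_error + math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# e.g. 8% error on a held-out test set of 10,000 points
print(round(test_set_error_bound(0.08, 10_000), 4))  # ~0.0922 at 95% confidence
```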


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Given John’s recent posts on CMU’s new machine learning department and “Deep Learning,” I asked for an opportunity to give a computational learning theory perspective on these issues. [sent-1, score-0.534]

2 If the central question of statistical learning theory were crudely summarized as “given a hypothesis with a certain loss bound over a test set, how well will it generalize? [sent-5, score-0.522]

3 ” then the central question of computational learning theory might be “how can we find such a hypothesis efficiently (e.g., in polynomial-time)?” [sent-6, score-0.685]

4 Given a training set drawn from an arbitrary distribution D and *arbitrarily labeled* (that is, we are given samples from an arbitrary distribution D over some set X x Y ), efficiently output a hypothesis that is competitive with the best function in C with respect to D. [sent-11, score-1.225]

5 If there exists a provably efficient solution for the above problem we say C is agnostically learnable with respect to D. [sent-12, score-0.709]

6 Then the goal is, given an *arbitrarily* labeled data set of points in R^n, find a hypothesis whose true error with respect to D is at most opt + eps. [sent-16, score-1.113] (A brute-force version of this search is sketched just after this list.)

7 Note that the hypothesis we output *need not be a halfspace*– we only require that its error rate is close to the error rate of the optimal halfspace. [sent-18, score-0.799]

8 At the heart of many of our most popular learning tools, such as Support Vector Machines, is an algorithm for learning a halfspace over a set of features in many dimensions. [sent-20, score-0.69]

9 Perhaps surprisingly, *it is still not known*, given samples from an arbitrary distribution on X x Y , how to efficiently ‘find the best fitting halfspace’ over a set of features. [sent-21, score-0.576]

10 In fact, if we require our algorithm to output a halfspace as its hypothesis then the agnostic halfspace learning problem is NP-hard. [sent-22, score-1.536]

11-12 Recent results have shown that even in the case where there is a halfspace with error rate .0001 with respect to D, it is NP-hard to output a halfspace with error rate . [sent-23, score-0.713] [sent-24, score-0.899]

13 The astute reader here may comment that I am ignoring the whole point of the ‘kernel trick,’ which is to provide an implicit representation of a halfspace over some feature set in many dimensions, therefore bypassing the above hardness result. [sent-27, score-0.546]

14 While this is true, I am still not aware of any results in the statistical learning literature that give provably efficient algorithms for agnostically learning halfspaces (regardless of the choice of kernel). [sent-28, score-1.29]

15 My larger point is that a fundamental approach from statistical learning theory (running an SVM) is deeply connected to strong negative results from computational complexity theory. [sent-29, score-0.603]

16 Negative results notwithstanding, it now seems natural to ask ‘well, can we find any efficient algorithms for agnostically learning halfspaces if we allow ourselves to output hypotheses that are not halfspaces?’ [sent-31, score-1.133]

17 The solution runs in polynomial-time for any constant eps > 0 (for various reasons involving the ‘noisy XOR’ topic three posts ago, obtaining efficient algorithms for subconstant eps seems intractable). [sent-33, score-0.69]

18 The complexity of agnostically learning halfspaces with respect to arbitrary distributions remains open. [sent-34, score-1.04]

19 Solving future machine learning problems will require a more sophisticated ‘bag of tricks’ than the one we have now, as most of our algorithms are based on learning halfspaces (in a noisy or agnostic setting). [sent-43, score-0.838]

20 Computational learning theory gives us a methodology for how to proceed: find provably efficient solutions to unresolved computational problems– novel learning algorithms will no doubt follow. [sent-44, score-0.846]
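The agnostic learning goal in sentences 4-7 can be made concrete with a minimal sketch (mine, not the post's): empirical risk minimization over a finite pool of candidate halfspaces on arbitrarily labeled data. Statistically this produces the right kind of "opt + eps" object; the post's point is that no provably efficient (polynomial-time) analogue of this search is known for halfspaces under arbitrary distributions, and requiring the output itself to be a halfspace makes the problem NP-hard.

```python
import numpy as np

def empirical_error(w, X, y):
    """Zero-one error of the halfspace sign(X @ w) on the labeled sample (X, y)."""
    return float(np.mean(np.sign(X @ w) != y))

def brute_force_agnostic_learner(candidates, X, y):
    """Return the candidate halfspace with the smallest empirical error.

    `candidates` stands in for the hypothesis class C.  Enumeration takes time
    linear in |C|, which blows up (e.g. for a discretized grid over R^n); the
    lack of a provably efficient replacement for this search is exactly the
    open problem discussed above.
    """
    errors = [empirical_error(w, X, y) for w in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]   # errors[best] approximates opt

# usage sketch on arbitrarily labeled data (all numbers are made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.choice([-1, 1], size=200)                      # arbitrary labels
candidates = [rng.normal(size=2) for _ in range(1000)]
w_hat, err_hat = brute_force_agnostic_learner(candidates, X, y)
print(err_hat)
```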


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('halfspace', 0.453), ('halfspaces', 0.326), ('agnostically', 0.227), ('hypothesis', 0.182), ('provably', 0.168), ('computational', 0.154), ('agnostic', 0.143), ('eps', 0.14), ('output', 0.136), ('find', 0.126), ('efficient', 0.121), ('respect', 0.118), ('statistical', 0.117), ('error', 0.116), ('let', 0.116), ('arbitrary', 0.115), ('bag', 0.113), ('opt', 0.101), ('posts', 0.1), ('require', 0.097), ('xor', 0.093), ('andrej', 0.093), ('set', 0.093), ('labeled', 0.089), ('distribution', 0.084), ('given', 0.084), ('arbitrarily', 0.081), ('rate', 0.076), ('theory', 0.076), ('fitting', 0.075), ('central', 0.075), ('solution', 0.075), ('pac', 0.073), ('learning', 0.072), ('core', 0.072), ('noisy', 0.071), ('favorite', 0.07), ('results', 0.068), ('true', 0.064), ('distributions', 0.063), ('efficiently', 0.063), ('still', 0.062), ('remains', 0.061), ('mind', 0.061), ('recent', 0.06), ('complexity', 0.058), ('function', 0.058), ('negative', 0.058), ('algorithms', 0.057), ('three', 0.057)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

Introduction: Given John’s recent posts on CMU’s new machine learning department and “Deep Learning,” I asked for an opportunity to give a computational learning theory perspective on these issues. To my mind, the answer to the question “Are the core problems from machine learning different from the core problems of statistics?” is a clear Yes. The point of this post is to describe a core problem in machine learning that is computational in nature and will appeal to statistical learning folk (as an extreme example note that if P=NP– which, for all we know, is true– then we would suddenly find almost all of our favorite machine learning problems considerably more tractable). If the central question of statistical learning theory were crudely summarized as “given a hypothesis with a certain loss bound over a test set, how well will it generalize?” then the central question of computational learning theory might be “how can we find such a hypothesis efficiently (e.g., in polynomial-time)?” With t

2 0.18511805 334 hunch net-2009-01-07-Interesting Papers at SODA 2009

Introduction: Several talks seem potentially interesting to ML folks at this year’s SODA. Maria-Florina Balcan , Avrim Blum , and Anupam Gupta , Approximate Clustering without the Approximation . This paper gives reasonable algorithms with provable approximation guarantees for k-median and other notions of clustering. It’s conceptually interesting, because it’s the second example I’ve seen where NP hardness is subverted by changing the problem definition in a subtle but reasonable way. Essentially, they show that if any near-approximation to an optimal solution is good, then it’s computationally easy to find a near-optimal solution. This subtle shift bears serious thought. A similar one occurred in our ranking paper with respect to minimum feedback arcset. With two known examples, it suggests that many more NP-complete problems might be finessed into irrelevance in this style. Yury Lifshits and Shengyu Zhang , Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates, and Smal

3 0.15641734 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

4 0.14230771 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

Introduction: “Deep learning” is used to describe learning architectures which have significant depth (as a circuit). One claim is that shallow architectures (one or two layers) can not concisely represent some functions while a circuit with more depth can concisely represent these same functions. Proving lower bounds on the size of a circuit is substantially harder than upper bounds (which are constructive), but some results are known. Luca Trevisan’s class notes detail how XOR is not concisely representable by “AC0” (= constant depth unbounded fan-in AND, OR, NOT gates). This doesn’t quite prove that depth is necessary for the representations commonly used in learning (such as a thresholded weighted sum), but it is strongly suggestive that this is so. Examples like this are a bit disheartening because existing algorithms for deep learning (deep belief nets, gradient descent on deep neural networks, and perhaps decision trees depending on who you ask) can’t learn XOR very easily.
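A tiny self-contained check of the XOR claim (my illustration, not from the excerpt): no single thresholded weighted sum classifies all four XOR points, so the best a single halfspace can do is 3 out of 4.

```python
from itertools import product

# XOR over {0,1}^2: label +1 iff the two inputs differ.
points = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]

def halfspace_accuracy(w1, w2, b):
    """Fraction of XOR points a single linear threshold unit gets right."""
    correct = 0
    for (x1, x2), y in points:
        pred = 1 if w1 * x1 + w2 * x2 + b > 0 else -1
        correct += (pred == y)
    return correct / 4

# Search a coarse grid of weights; nothing exceeds 3/4.  (A two-line averaging
# argument shows 4/4 is impossible: the two positive-point constraints sum to
# w1 + w2 + 2b > 0 while the two negative-point constraints sum to
# w1 + w2 + 2b <= 0.)
grid = [i / 4 for i in range(-8, 9)]
best = max(halfspace_accuracy(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
print(best)  # 0.75
```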

5 0.13070594 33 hunch net-2005-02-28-Regularization

Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. Functionally . Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint Regularization is about the failure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S . There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S , e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a s
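A minimal sketch of the "functional" view in the excerpt above (my code; the choice of squared loss with an l2 penalty, i.e. ridge regression, is an assumption, not something the excerpt specifies): fit the data while penalizing the l2 norm of the parameters.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Minimize ||X w - y||^2 + lam * ||w||^2 (l2-regularized least squares).

    Closed form: w = (X^T X + lam I)^{-1} X^T y.  `lam` trades off fitting the
    data against the l2 notion of predictor complexity.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative use on synthetic data (all numbers are made up).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=50)
print(np.round(ridge_fit(X, y, lam=5.0), 2))  # coefficients shrunk toward zero
```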

6 0.12284581 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

7 0.12067258 183 hunch net-2006-06-14-Explorations of Exploration

8 0.1191846 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

9 0.11510266 8 hunch net-2005-02-01-NIPS: Online Bayes

10 0.1146489 353 hunch net-2009-05-08-Computability in Artificial Intelligence

11 0.10945246 170 hunch net-2006-04-06-Bounds greater than 1

12 0.10937407 332 hunch net-2008-12-23-Use of Learning Theory

13 0.10744119 347 hunch net-2009-03-26-Machine Learning is too easy

14 0.10541939 67 hunch net-2005-05-06-Don’t mix the solution into the problem

15 0.10368757 6 hunch net-2005-01-27-Learning Complete Problems

16 0.10341641 351 hunch net-2009-05-02-Wielding a New Abstraction

17 0.10271549 251 hunch net-2007-06-24-Interesting Papers at ICML 2007

18 0.099139072 131 hunch net-2005-11-16-The Everything Ensemble Edge

19 0.098985419 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

20 0.098271474 454 hunch net-2012-01-30-ICML Posters and Scope


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.243), (1, 0.137), (2, 0.005), (3, -0.026), (4, 0.033), (5, -0.055), (6, 0.044), (7, -0.042), (8, 0.033), (9, -0.058), (10, 0.004), (11, 0.008), (12, 0.004), (13, -0.023), (14, 0.028), (15, 0.005), (16, -0.011), (17, -0.005), (18, 0.004), (19, 0.049), (20, 0.021), (21, 0.122), (22, -0.052), (23, 0.009), (24, 0.033), (25, -0.059), (26, 0.062), (27, -0.002), (28, 0.007), (29, -0.036), (30, -0.017), (31, -0.013), (32, 0.022), (33, 0.008), (34, -0.006), (35, 0.0), (36, 0.134), (37, -0.034), (38, 0.024), (39, -0.042), (40, -0.074), (41, -0.043), (42, 0.11), (43, 0.025), (44, -0.008), (45, -0.014), (46, -0.017), (47, -0.005), (48, -0.032), (49, -0.029)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9323287 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

Introduction: Given John’s recent posts on CMU’s new machine learning department and “Deep Learning,” I asked for an opportunity to give a computational learning theory perspective on these issues. To my mind, the answer to the question “Are the core problems from machine learning different from the core problems of statistics?” is a clear Yes. The point of this post is to describe a core problem in machine learning that is computational in nature and will appeal to statistical learning folk (as an extreme example note that if P=NP– which, for all we know, is true– then we would suddenly find almost all of our favorite machine learning problems considerably more tractable). If the central question of statistical learning theory were crudely summarized as “given a hypothesis with a certain loss bound over a test set, how well will it generalize?” then the central question of computational learning theory might be “how can we find such a hypothesis efficiently (e.g., in polynomial-time)?” With t

2 0.75413465 33 hunch net-2005-02-28-Regularization

Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. Functionally . Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint Regularization is about the failure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S . There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S , e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a s

3 0.72164625 35 hunch net-2005-03-04-The Big O and Constants in Learning

Introduction: The notation g(n) = O(f(n)) means that in the limit as n approaches infinity there exists a constant C such that g(n) is less than Cf(n) . In learning theory, there are many statements about learning algorithms of the form “under assumptions x , y , and z , the classifier learned has an error rate of at most O(f(m)) “. There is one very good reason to use O(): it helps you understand the big picture and neglect the minor details which are not important in the big picture. However, there are some important reasons not to do this as well. Unspeedup In algorithm analysis, the use of O() for time complexity is pervasive and well-justified. Determining the exact value of C is inherently computer architecture dependent. (The “C” for x86 processors might differ from the “C” on PowerPC processors.) Since many learning theorists come from a CS theory background, the O() notation is applied to generalization error. The O() abstraction breaks here—you can not genera

4 0.7016083 67 hunch net-2005-05-06-Don’t mix the solution into the problem

Introduction: A common defect of many pieces of research is defining the problem in terms of the solution. Here are some examples in learning: “The learning problem is finding a good separating hyperplane.” “The goal of learning is to minimize (y-p)^2 + C w^2 where y = the observation, p = the prediction and w = a parameter vector.” Defining the loss function to be the one that your algorithm optimizes rather than the one imposed by the world. The fundamental reason why this is a defect is that it creates artificial boundaries to problem solution. Artificial boundaries lead to the possibility of being blind-sided. For example, someone committing (1) or (2) above might find themselves surprised to find a decision tree working well on a problem. Example (3) might result in someone else solving a learning problem better for real world purposes, even if it’s worse with respect to the algorithm optimization. This defect should be avoided so as to not artificially l

5 0.663459 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting

Introduction: This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time. The basic trick has to do with importance weighting for monte carlo integration. Consider the problem of finding: N = E_{x ~ D} f(x) given samples from D and knowledge of f . Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q . In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula: E_{x ~ Q} f(x) D(x)/Q(x) A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions. Nevertheless, there is one special value Q(x) = f(x) D(x) / N where the sample complexity turns out to be 1 , which is typically substantially better than the sample complexity of the orig
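The estimator in the excerpt is easy to sanity-check numerically. Here is a short sketch (mine, with made-up choices of D, Q, and f) of the importance weighting formula E_{x ~ Q} f(x) D(x)/Q(x) = E_{x ~ D} f(x):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target: N = E_{x ~ D} f(x) with D = Normal(0, 1) and f(x) = x^2, so N = 1.
# We only draw from Q = Normal(mean 1, std 2) and reweight each sample by D(x)/Q(x).
rng = np.random.default_rng(2)
f = lambda x: x ** 2
x = rng.normal(loc=1.0, scale=2.0, size=200_000)             # samples from Q
weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 2.0)  # D(x) / Q(x)
print(np.mean(f(x) * weights))  # close to 1.0, the true E_{x ~ D} x^2
```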

6 0.65088773 6 hunch net-2005-01-27-Learning Complete Problems

7 0.6453439 332 hunch net-2008-12-23-Use of Learning Theory

8 0.64327657 351 hunch net-2009-05-02-Wielding a New Abstraction

9 0.63695133 78 hunch net-2005-06-06-Exact Online Learning for Classification

10 0.6356985 334 hunch net-2009-01-07-Interesting Papers at SODA 2009

11 0.62829977 347 hunch net-2009-03-26-Machine Learning is too easy

12 0.61783057 170 hunch net-2006-04-06-Bounds greater than 1

13 0.61629373 183 hunch net-2006-06-14-Explorations of Exploration

14 0.61278379 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

15 0.6076169 45 hunch net-2005-03-22-Active learning

16 0.59764773 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

17 0.59527278 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

18 0.5926075 168 hunch net-2006-04-02-Mad (Neuro)science

19 0.59074783 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

20 0.58980477 235 hunch net-2007-03-03-All Models of Learning have Flaws


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.015), (1, 0.011), (3, 0.027), (24, 0.01), (27, 0.248), (38, 0.086), (49, 0.022), (53, 0.08), (55, 0.088), (72, 0.197), (77, 0.015), (94, 0.067), (95, 0.055)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.90735 429 hunch net-2011-04-06-COLT open questions

Introduction: Alina and Jake point out the COLT Call for Open Questions due May 11. In general, this is cool, and worth doing if you can come up with a crisp question. In my case, I particularly enjoyed crafting an open question with precisely a form such that a critic targeting my papers would be forced to confront their fallacy or make a case for the reward. But less esoterically, this is a way to get the attention of some very smart people focused on a problem that really matters, which is the real value.

2 0.90594435 78 hunch net-2005-06-06-Exact Online Learning for Classification

Introduction: Jacob Abernethy and I have found a computationally tractable method for computing an optimal (or near optimal depending on setting) master algorithm combining expert predictions addressing this open problem . A draft is here . The effect of this improvement seems to be about a factor of 2 decrease in the regret (= error rate minus best possible error rate) for the low error rate situation. (At large error rates, there may be no significant difference.) There are some unfinished details still to consider: When we remove all of the approximation slack from online learning, is the result a satisfying learning algorithm, in practice? I consider online learning to be one of the more compelling methods of analyzing and deriving algorithms, but that expectation must be either met or not by this algorithm. Some extra details: The algorithm is optimal given a small amount of side information ( k in the draft). What is the best way to remove this side information? The removal
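For background on what a "master algorithm combining expert predictions" looks like, here is a standard exponential-weights (Hedge-style) sketch; it is generic textbook material and explicitly not the Abernethy-Langford algorithm the excerpt announces. With a suitable learning rate its regret (mistakes minus the best expert's mistakes) grows on the order of sqrt(T log n).

```python
import math
import random

def exponential_weights(expert_preds, labels, eta=0.5):
    """Hedge-style master over n experts making {0,1} predictions.

    expert_preds[t][i] is expert i's prediction in round t.  Each mistake
    multiplies that expert's weight by exp(-eta); the master predicts by
    sampling an expert in proportion to the current weights.
    Returns (master mistakes, best single expert's mistakes).
    """
    n = len(expert_preds[0])
    weights = [1.0] * n
    master_mistakes = 0
    expert_mistakes = [0] * n
    for preds, y in zip(expert_preds, labels):
        i = random.choices(range(n), weights=weights)[0]
        master_mistakes += (preds[i] != y)
        for j in range(n):
            if preds[j] != y:
                expert_mistakes[j] += 1
                weights[j] *= math.exp(-eta)
    return master_mistakes, min(expert_mistakes)

# Toy run (made-up data): 3 experts that are right 90%, 60%, and 50% of the time.
random.seed(0)
labels = [random.randint(0, 1) for _ in range(1000)]
preds = [[y if random.random() < p else 1 - y for p in (0.9, 0.6, 0.5)]
         for y in labels]
print(exponential_weights(preds, labels))
```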

same-blog 3 0.89330786 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

Introduction: Given John’s recent posts on CMU’s new machine learning department and “Deep Learning,” I asked for an opportunity to give a computational learning theory perspective on these issues. To my mind, the answer to the question “Are the core problems from machine learning different from the core problems of statistics?” is a clear Yes. The point of this post is to describe a core problem in machine learning that is computational in nature and will appeal to statistical learning folk (as an extreme example note that if P=NP– which, for all we know, is true– then we would suddenly find almost all of our favorite machine learning problems considerably more tractable). If the central question of statistical learning theory were crudely summarized as “given a hypothesis with a certain loss bound over a test set, how well will it generalize?” then the central question of computational learning theory might be “how can we find such a hypothesis efficiently (e.g., in polynomial-time)?” With t

4 0.81917137 12 hunch net-2005-02-03-Learning Theory, by assumption

Introduction: One way to organize learning theory is by assumption (in the assumption = axiom sense ), from no assumptions to many assumptions. As you travel down this list, the statements become stronger, but the scope of applicability decreases. No assumptions Online learning There exists a meta prediction algorithm which competes well with the best element of any set of prediction algorithms. Universal Learning Using a “bias” of 2^(-description length of Turing machine) in learning is equivalent to all other computable biases up to some constant. Reductions The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems. Independent and Identically Distributed (IID) Data Performance Prediction Based upon past performance, you can predict future performance. Uniform Convergence Performance prediction works even after choosing classifiers based on the data from large sets of classifiers.

5 0.81610662 131 hunch net-2005-11-16-The Everything Ensemble Edge

Introduction: Rich Caruana , Alexandru Niculescu , Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. Run the methods on each problem of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression . For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…} evaluate the average performance of the method. A series of conclusions can be drawn from the observations. ( Calibrated ) boosted decision trees appear to perform best, in general although support v

6 0.81564975 194 hunch net-2006-07-11-New Models

7 0.81376755 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

8 0.81007236 235 hunch net-2007-03-03-All Models of Learning have Flaws

9 0.80800724 95 hunch net-2005-07-14-What Learning Theory might do

10 0.80754924 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

11 0.80724192 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

12 0.80675763 370 hunch net-2009-09-18-Necessary and Sufficient Research

13 0.80663204 351 hunch net-2009-05-02-Wielding a New Abstraction

14 0.8057006 360 hunch net-2009-06-15-In Active Learning, the question changes

15 0.80518705 259 hunch net-2007-08-19-Choice of Metrics

16 0.80406213 19 hunch net-2005-02-14-Clever Methods of Overfitting

17 0.80367088 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class

18 0.80342102 14 hunch net-2005-02-07-The State of the Reduction

19 0.80278784 343 hunch net-2009-02-18-Decision by Vetocracy

20 0.80220443 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”