hunch_net hunch_net-2005 hunch_net-2005-33 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. Functionally. Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint. Regularization is about the failure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S. There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S, e(c,D) < e(c,S) + f(complexity) where complexity is some measure of the size of a set of functions.
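The two viewpoints above each compress a formula into prose. As a worked restatement, here is the regularized objective and the bound in LaTeX; the specific form of f() in the last display is an illustrative finite-class instance that I am assuming, not something the post commits to.

```latex
% Functional viewpoint: fit the data while penalizing predictor complexity.
% lambda > 0 trades off data fit against the l2 (or l1) norm of the parameters.
\[
\hat{w} = \arg\min_{w} \; \sum_{(x,y) \in S} \mathrm{loss}(c_w(x), y) \;+\; \lambda \, \|w\|_2^2
\]

% Statistical learning viewpoint: bounds of the form stated in the post.
\[
e(c,D) \;\le\; e(c,S) + f(\mathrm{complexity})
\]

% One illustrative (assumed) instance: for a finite class C of classifiers,
% with probability at least 1 - \delta over an i.i.d. sample S of size m,
\[
\forall c \in C: \quad e(c,D) \;\le\; e(c,S) + \sqrt{\frac{\ln |C| + \ln(1/\delta)}{2m}}
\]
```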
sentIndex sentText sentNum sentScore
1 Yaroslav Bulatov says that we should think about regularization a bit. [sent-1, score-0.437]
2 It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. [sent-2, score-0.197]
3 Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. [sent-4, score-0.239]
4 This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. [sent-5, score-0.526]
5 Empirically, this often works much better than simply fitting the data. [sent-6, score-0.079]
6 Statistical Learning Viewpoint. Regularization is about the failure of statistical learning to adequately predict generalization error. [sent-7, score-0.196]
7 Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S . [sent-8, score-0.324]
8-9 There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S, e(c,D) < e(c,S) + f(complexity) where complexity is some measure of the size of a set of functions. [sent-9, score-0.171] [sent-12, score-0.544]
10 Unfortunately, we have never convincingly nailed the exact value of f() . [sent-13, score-0.236]
11 We can note that f() is always monotonically increasing with the complexity measure and so there exists a unique constant C such that f(complexity)=C*complexity at the value of complexity which minimizes the bound. [sent-14, score-1.319]
12 Empirical parameter tuning such as for the C constant in a support vector machine can be regarded as searching for this “right” tradeoff. [sent-15, score-0.469]
13 Computationally. Regularization can be thought of as a computational shortcut to computing the f() above. [sent-16, score-0.249]
14 Hence, smoothness, convexity, and other computational constraints are important issues. [sent-17, score-0.081]
15 One thing which should be clear is that there is no one best method of regularization for all problems. [sent-18, score-0.376]
16 “What is the right regularizer?” is another “learning complete” question since solving it perfectly implies solving the learning problem (for example, consider the “regularizer” which assigns complexity 0 to the best prediction function and infinity to all others). [sent-20, score-0.735]
17 The choice of regularizer used when solving empirical problems is a degree of freedom with which prior information and biases can be incorporated in order to improve performance. [sent-23, score-1.018]
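Sentences 3-5 and 12 above describe fitting under a complexity penalty and empirically tuning the penalty constant. The sketch below is a minimal numpy illustration assuming squared loss with an l2 penalty (ridge regression); the synthetic data, the candidate lambda values, and the train/validation split are all invented for the example and are not from the post.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)

# Synthetic data (assumption): a noisy linear target with many irrelevant features.
n, d = 60, 30
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]          # only 3 features matter
y = X @ w_true + 0.5 * rng.normal(size=n)

# Split into train / validation and search for the "right" tradeoff empirically,
# analogous to tuning the C constant of a support vector machine.
X_tr, y_tr = X[:40], y[:40]
X_va, y_va = X[40:], y[40:]

for lam in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X_tr, y_tr, lam)
    print(f"lambda={lam:>6}: train mse={mse(X_tr, y_tr, w):.3f}  "
          f"validation mse={mse(X_va, y_va, w):.3f}")

# Typically the unregularized fit (lambda=0) has the lowest training error but a
# larger validation error than some intermediate lambda, which is the empirical
# point of "this often works much better than simply fitting the data".
```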
wordName wordTfidf (topN-words)
[('regularizer', 0.478), ('regularization', 0.376), ('complexity', 0.328), ('statistical', 0.124), ('constant', 0.122), ('solving', 0.118), ('empirically', 0.117), ('measure', 0.111), ('notion', 0.109), ('bulatov', 0.106), ('shortcut', 0.106), ('functionally', 0.106), ('monotonically', 0.106), ('smoothness', 0.106), ('samples', 0.105), ('empirical', 0.102), ('minimizes', 0.098), ('numerous', 0.098), ('regarded', 0.098), ('yaroslav', 0.093), ('convexity', 0.093), ('tuning', 0.093), ('convincingly', 0.089), ('incorporated', 0.089), ('infinity', 0.089), ('norm', 0.089), ('searching', 0.089), ('freedom', 0.085), ('biases', 0.082), ('perfectly', 0.082), ('error', 0.082), ('computational', 0.081), ('rate', 0.08), ('fitting', 0.079), ('increasing', 0.079), ('value', 0.075), ('explain', 0.075), ('assuming', 0.073), ('generalization', 0.072), ('exact', 0.072), ('unique', 0.072), ('parameter', 0.067), ('minimize', 0.066), ('degree', 0.064), ('optimizing', 0.064), ('hence', 0.064), ('computing', 0.062), ('partially', 0.061), ('says', 0.061), ('couple', 0.061)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 33 hunch net-2005-02-28-Regularization
2 0.17226627 123 hunch net-2005-10-16-Complexity: It’s all in your head
Introduction: One of the central concerns of learning is to understand and to prevent overfitting. Various notions of “function complexity” often arise: VC dimension, Rademacher complexity, comparison classes of experts, and program length are just a few. The term “complexity” to me seems somehow misleading; the terms never capture something that meets my intuitive notion of complexity. The Bayesian notion clearly captures what’s going on. Functions aren’t “complex” – they’re just “surprising”: we assign to them low probability. Most (all?) complexity notions I know boil down to some (generally loose) bound on the prior probability of the function. In a sense, “complexity” fundamentally arises because probability distributions must sum to one. You can’t believe in all possibilities at the same time, or at least not equally. Rather you have to carefully spread the probability mass over the options you’d like to consider. Large complexity classes mean that beliefs are spread thinly. In
3 0.15414295 35 hunch net-2005-03-04-The Big O and Constants in Learning
Introduction: The notation g(n) = O(f(n)) means that in the limit as n approaches infinity there exists a constant C such that g(n) is less than Cf(n). In learning theory, there are many statements about learning algorithms of the form “under assumptions x, y, and z, the classifier learned has an error rate of at most O(f(m))”. There is one very good reason to use O(): it helps you understand the big picture and neglect the minor details which are not important in the big picture. However, there are some important reasons not to do this as well. Unspeedup. In algorithm analysis, the use of O() for time complexity is pervasive and well-justified. Determining the exact value of C is inherently computer architecture dependent. (The “C” for x86 processors might differ from the “C” on PowerPC processors.) Since many learning theorists come from a CS theory background, the O() notation is applied to generalization error. The O() abstraction breaks here—you can not genera
4 0.13755475 34 hunch net-2005-03-02-Prior, “Prior” and Bias
Introduction: Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. This only scratches the surface—there are yet more subt
5 0.13070594 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”
Introduction: Given John’s recent posts on CMU’s new machine learning department and “Deep Learning,” I asked for an opportunity to give a computational learning theory perspective on these issues. To my mind, the answer to the question “Are the core problems from machine learning different from the core problems of statistics?” is a clear Yes. The point of this post is to describe a core problem in machine learning that is computational in nature and will appeal to statistical learning folk (as an extreme example note that if P=NP – which, for all we know, is true – then we would suddenly find almost all of our favorite machine learning problems considerably more tractable). If the central question of statistical learning theory were crudely summarized as “given a hypothesis with a certain loss bound over a test set, how well will it generalize?” then the central question of computational learning theory might be “how can we find such a hypothesis efficiently (e.g., in polynomial-time)?” With t
6 0.11862762 351 hunch net-2009-05-02-Wielding a New Abstraction
7 0.1167996 185 hunch net-2006-06-16-Regularization = Robustness
8 0.11430052 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting
9 0.11191386 14 hunch net-2005-02-07-The State of the Reduction
10 0.10267827 45 hunch net-2005-03-22-Active learning
11 0.099103495 131 hunch net-2005-11-16-The Everything Ensemble Edge
12 0.09665966 235 hunch net-2007-03-03-All Models of Learning have Flaws
13 0.096495807 332 hunch net-2008-12-23-Use of Learning Theory
14 0.094704211 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?
15 0.091576658 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
16 0.091274358 41 hunch net-2005-03-15-The State of Tight Bounds
17 0.090720281 15 hunch net-2005-02-08-Some Links
18 0.090247497 170 hunch net-2006-04-06-Bounds greater than 1
19 0.088990331 218 hunch net-2006-11-20-Context and the calculation misperception
20 0.086098157 67 hunch net-2005-05-06-Don’t mix the solution into the problem
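The excerpt for post 123 above argues that most complexity notions reduce to a bound on the prior probability of a function. One standard way to make that concrete is the Occam's razor bound below; this is my illustrative addition, with complexity(c) taken to be -ln p(c), not a formula quoted from either post.

```latex
% Occam's razor bound (illustrative): for any prior p over classifiers with
% \sum_c p(c) \le 1, with probability at least 1 - \delta over an i.i.d.
% sample S of size m, simultaneously for all c,
\[
e(c,D) \;\le\; e(c,S) + \sqrt{\frac{-\ln p(c) + \ln(1/\delta)}{2m}}
\]
% "Large complexity classes mean beliefs are spread thinly": the smaller p(c),
% the larger the penalty term f(complexity) attached to that particular c.
```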
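The excerpt for post 35 above paraphrases the definition of O() notation; a compact restatement (standard, not specific to that post) is:

```latex
\[
g(n) = O(f(n)) \iff \exists\, C > 0,\ n_0 \ \text{such that}\ \forall n \ge n_0:\ g(n) \le C \cdot f(n)
\]
% The excerpt's caution: when O() is applied to generalization error rather than
% run time, the hidden constant C is exactly what practice cares about.
```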
topicId topicWeight
[(0, 0.197), (1, 0.141), (2, 0.039), (3, -0.027), (4, 0.011), (5, -0.045), (6, 0.038), (7, 0.039), (8, 0.033), (9, -0.045), (10, -0.025), (11, 0.031), (12, 0.055), (13, -0.052), (14, 0.039), (15, -0.055), (16, -0.052), (17, 0.019), (18, 0.023), (19, -0.003), (20, 0.024), (21, 0.091), (22, 0.001), (23, 0.095), (24, -0.01), (25, -0.011), (26, 0.064), (27, -0.019), (28, -0.037), (29, -0.013), (30, 0.026), (31, 0.022), (32, 0.084), (33, 0.007), (34, -0.156), (35, -0.026), (36, 0.165), (37, 0.035), (38, -0.009), (39, 0.091), (40, -0.113), (41, -0.085), (42, 0.045), (43, -0.044), (44, -0.027), (45, 0.031), (46, -0.104), (47, -0.039), (48, -0.064), (49, 0.099)]
simIndex simValue blogId blogTitle
same-blog 1 0.96895778 33 hunch net-2005-02-28-Regularization
2 0.79611313 123 hunch net-2005-10-16-Complexity: It’s all in your head
3 0.72619641 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting
Introduction: This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time. The basic trick has to do with importance weighting for monte carlo integration. Consider the problem of finding: N = E_{x ~ D} f(x) given samples from D and knowledge of f. Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q. In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula: E_{x ~ Q} f(x) D(x)/Q(x) A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions. Nevertheless, there is one special value Q(x) = f(x) D(x) / N where the sample complexity turns out to be 1, which is typically substantially better than the sample complexity of the orig
4 0.67922395 218 hunch net-2006-11-20-Context and the calculation misperception
Introduction: This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict something. Hopefully, by precisely stating it, we can remove it. Suppose we have a set of events, each described by a vector of features. 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0 Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label (feature 1) is mostly 0. Using this simple rule we get an observed error rate of 3/7. There are two issues here. The first is that this is really a training
5 0.61314857 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”
6 0.60768723 170 hunch net-2006-04-06-Bounds greater than 1
7 0.60270071 35 hunch net-2005-03-04-The Big O and Constants in Learning
8 0.56733072 5 hunch net-2005-01-26-Watchword: Probability
9 0.55611974 351 hunch net-2009-05-02-Wielding a New Abstraction
10 0.5552665 34 hunch net-2005-03-02-Prior, “Prior” and Bias
11 0.54235375 62 hunch net-2005-04-26-To calibrate or not?
12 0.53623378 67 hunch net-2005-05-06-Don’t mix the solution into the problem
13 0.53491819 39 hunch net-2005-03-10-Breaking Abstractions
14 0.53060251 6 hunch net-2005-01-27-Learning Complete Problems
15 0.49224752 185 hunch net-2006-06-16-Regularization = Robustness
16 0.48860869 330 hunch net-2008-12-07-A NIPS paper
17 0.48304331 368 hunch net-2009-08-26-Another 10-year paper in Machine Learning
18 0.48216543 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
19 0.46636295 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
20 0.46337888 78 hunch net-2005-06-06-Exact Online Learning for Classification
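The excerpt for post 303 above uses the importance weighting identity E_{x ~ D} f(x) = E_{x ~ Q} f(x) D(x)/Q(x). Below is a minimal numpy sketch on a toy discrete example that I made up (the distributions D and Q and the function f are assumptions for illustration); it checks the identity empirically along with the excerpt's special case Q(x) = f(x) D(x) / N.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy discrete setting (assumption): 5 outcomes, target distribution D,
# sampling distribution Q, and a function f whose mean under D we want.
xs = np.arange(5)
D = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
Q = np.array([0.20, 0.20, 0.20, 0.20, 0.20])   # Q(x) > 0 wherever D(x) > 0
f = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

truth = np.sum(D * f)                          # N = E_{x ~ D} f(x)

# Draw samples from Q and reweight each by D(x)/Q(x).
m = 10000
sample = rng.choice(xs, size=m, p=Q)
estimate = np.mean(f[sample] * D[sample] / Q[sample])

print(f"exact N = {truth:.3f}, importance-weighted estimate = {estimate:.3f}")

# The excerpt's special case: sampling from Q*(x) proportional to f(x) D(x)
# makes every weighted sample equal exactly to N, so one sample suffices.
Qstar = f * D / np.sum(f * D)
i = rng.choice(xs, p=Qstar)
print(f"single-sample estimate under Q*(x) = f(x)D(x)/N: {f[i] * D[i] / Qstar[i]:.3f}")
```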
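The excerpt for post 218 above walks through predicting the first feature by binning on another feature and taking the majority label per bin, then reading off an observed error rate. Its table did not survive extraction, so the sketch below uses a hypothetical random binary dataset of my own; it illustrates the rule and why the observed (training) error is optimistic, not the post's exact 3/7 figure.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: column 0 is the label to predict, columns 1-4 are features.
data = rng.integers(0, 2, size=(200, 5))

def majority_rule(train, feature):
    """Map each value of `feature` to the majority label (column 0) in that bin."""
    rule = {}
    for v in (0, 1):
        labels = train[train[:, feature] == v, 0]
        rule[v] = int(round(labels.mean())) if len(labels) else 0
    return rule

def error(rule, rows, feature):
    preds = np.array([rule[v] for v in rows[:, feature]])
    return float(np.mean(preds != rows[:, 0]))

train, test = data[:100], data[100:]

# Pick the feature with the lowest *training* error, as the excerpt does informally...
best = min(range(1, 5), key=lambda j: error(majority_rule(train, j), train, j))
rule = majority_rule(train, best)
print(f"feature {best}: train error {error(rule, train, best):.2f}, "
      f"test error {error(rule, test, best):.2f}")
# ...and note that on random data the training error of the best-looking binning rule
# is optimistically low compared to the test error - the misperception the post targets.
```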
topicId topicWeight
[(3, 0.03), (27, 0.294), (38, 0.021), (53, 0.087), (54, 0.277), (55, 0.038), (94, 0.117), (95, 0.031)]
simIndex simValue blogId blogTitle
1 0.96682119 335 hunch net-2009-01-08-Predictive Analytics World
Introduction: Carla Vicens and Eric Siegel contacted me about Predictive Analytics World in San Francisco February 18&19, which I wasn’t familiar with. A quick look at the agenda reveals several people I know working on applications of machine learning in businesses, covering deployed applications topics. It’s interesting to see a business-focused machine learning conference, as it says that we are succeeding as a field. If you are interested in deployed applications, you might attend. Eric and I did a quick interview by email. John > I’ve mostly published and participated in academic machine learning conferences like ICML, COLT, and NIPS. When I look at the set of speakers and subjects for your conference I think “machine learning for business”. Is that your understanding of things? What I’m trying to ask is: what do you view as the primary goal for this conference? Eric > You got it. This is the business event focused on the commercial deployment of technology developed at
2 0.9394654 336 hunch net-2009-01-19-Netflix prize within epsilon
Introduction: The competitors for the Netflix Prize are tantalizingly close to winning the million dollar prize. This year, BellKor and Commendo Research sent a combined solution that won the progress prize. Reading the writeups 2 is instructive. Several aspects of solutions are taken for granted including stochastic gradient descent, ensemble prediction, and targeting residuals (a form of boosting). Relative to last year, it appears that many approaches have added parameterizations, especially for the purpose of modeling through time. The big question is: will they make the big prize? At this point, the level of complexity in entering the competition is prohibitive, so perhaps only the existing competitors will continue to try. (This equation might change drastically if the teams open source their existing solutions, including parameter settings.) One fear is that the progress is asymptoting on the wrong side of the 10% threshold. In the first year, the teams progressed through
same-blog 3 0.91452134 33 hunch net-2005-02-28-Regularization
4 0.91437984 376 hunch net-2009-11-06-Yisong Yue on Self-improving Systems
Introduction: I’d like to point out Yisong Yue ‘s post on Self-improving systems , which is a nicely readable description of the necessity and potential of interactive learning to deal with the information overload problem that is endemic to the modern internet.
5 0.83693528 458 hunch net-2012-03-06-COLT-ICML Open Questions and ICML Instructions
Introduction: Sasha is the open problems chair for both COLT and ICML . Open problems will be presented in a joint session in the evening of the COLT/ICML overlap day. COLT has a history of open sessions, but this is new for ICML. If you have a difficult theoretically definable problem in machine learning, consider submitting it for review, due March 16 . You’ll benefit three ways: The effort of writing down a precise formulation of what you want often helps you understand the nature of the problem. Your problem will be officially published and citable. You might have it solved by some very intelligent bored people. The general idea could easily be applied to any problem which can be crisply stated with an easily verifiable solution, and we may consider expanding this in later years, but for this year all problems need to be of a theoretical variety. Joelle and I (and Mahdi , and Laurent ) finished an initial assignment of Program Committee and Area Chairs to pap
6 0.8302924 494 hunch net-2014-03-11-The New York ML Symposium, take 2
7 0.75660682 258 hunch net-2007-08-12-Exponentiated Gradient
8 0.75655776 41 hunch net-2005-03-15-The State of Tight Bounds
9 0.75654197 347 hunch net-2009-03-26-Machine Learning is too easy
10 0.75506186 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
11 0.75418085 351 hunch net-2009-05-02-Wielding a New Abstraction
12 0.75373167 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning
13 0.7489897 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”
14 0.74807137 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity
15 0.74629372 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
16 0.7460357 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
17 0.74532956 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
18 0.74401486 143 hunch net-2005-12-27-Automated Labeling
19 0.74379563 252 hunch net-2007-07-01-Watchword: Online Learning
20 0.74375677 235 hunch net-2007-03-03-All Models of Learning have Flaws