hunch_net-2010-398 knowledge-graph by maker-knowledge-mining

398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility


meta info for this blog

Source: html

Introduction: (I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.) Since Nemirovski’s Saint Flour lecture notes, numerous researchers have studied the following problem in least squares regression: predict as well as

(MS) the best of d given functions (as in prediction with expert advice; model = finite set of d functions)

(C) the best convex combination of these functions (i.e., model = convex hull of the d functions)

(L) the best linear combination of these functions (i.e., model = linear span of the d functions)

It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved, since there exist estimators having an excess risk of order (log d)/n for (MS), min( sqrt((log d)/n), d/n ) for (C), and d/n for (L), where n is the training set size. Here, “risk” is the amount of extra loss per example which may be suffered due to the choice of a random sample. The practical use of these results seems rather limited to trivial
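For reference, the three rates quoted above written in display form (n is the training set size, d the number of base functions; nothing here goes beyond the paragraph above):

\[
\text{(MS)}:\ \frac{\log d}{n},
\qquad
\text{(C)}:\ \min\!\left(\sqrt{\frac{\log d}{n}},\ \frac{d}{n}\right),
\qquad
\text{(L)}:\ \frac{d}{n}.
\]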


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 (I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.) [sent-1, score-0.066]

2 (C) the best convex combination of these functions (i.e., model = convex hull of the d functions) [sent-4, score-0.909]

3 (L) the best linear combination of these functions (i.e., model = linear span of the d functions) [sent-6, score-0.347]

4 It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved, since there exist estimators having an excess risk of order (log d)/n for (MS), min( sqrt((log d)/n), d/n ) for (C), and d/n for (L), where n is the training set size. [sent-8, score-1.215]

5 Here, “risk” is the amount of extra loss per example which may be suffered due to the choice of a random sample. [sent-9, score-0.062]

6 The practical use of these results seems rather limited to trivial statements like: do not use the OLS estimator when the dimension d of the input vector is larger than n (here the d functions are the projections on each of the d components). [sent-10, score-0.588]

7 Nevertheless, it provides a rather easy way to prove that there exists a learning algorithm having an excess risk of order s (log d)/n, with respect to the best linear combination of s of the d functions (s-sparse linear model). [sent-11, score-1.857]

8 Indeed, it suffices to consider the algorithm which cuts the training set into two parts, say of equal size for simplicity, and uses the first part to train linear estimators corresponding to every possible subset of s features. [sent-12, score-1.013]

9 Here you can use your favorite linear estimator (the empirical risk minimizer on a compact set, or more robust but more involved estimators, rather than the OLS), as long as it solves (L) with minimal excess risk. [sent-13, score-1.58]

10 It then uses the second part to predict as well as the best of the “d choose s” linear estimators built on the first part. [sent-14, score-0.649]

11 … 5 of my NIPS’07 paper, but you might prefer the progressive mixture rule or the algorithm of Guillaume Lecué and Shahar Mendelson. [sent-17, score-0.205]

12 Note that empirical risk minimization and cross-validation completely fail for this task, with an excess risk of order sqrt((log d)/n) instead of (log d)/n. [sent-18, score-1.317]

13 It is an easy exercise to combine the different excess risk bounds and obtain that the above procedure achieves an excess risk of order s (log d)/n (the bound combination and a code sketch of the procedure are spelled out after this list). [sent-19, score-1.853]

14 Naturally, the important limitation of the above procedure, which is often encountered when using a classical model selection approach, is its computational intractability. [sent-21, score-0.249]

15 So this leaves open the following fundamental problem: is it possible to design a computationally efficient algorithm with the s (log d)/n guarantee without assuming low correlation between the explanatory variables? [sent-22, score-0.317]
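To make the bound combination in sentence 13 explicit: the second stage is an instance of (MS) over the C(d,s) = “d choose s” subset estimators, and each subset estimator solves (L) in dimension s, so (up to constants from splitting the sample in half) the two excess risks add to

\[
\frac{\log\binom{d}{s}}{n} + \frac{s}{n}
\;\le\; \frac{s\log(ed/s)}{n} + \frac{s}{n}
\;=\; O\!\left(\frac{s\log d}{n}\right),
\]

using \(\log\binom{d}{s} \le s\log(ed/s)\). This is the s (log d)/n guarantee of sentence 7.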
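Below is a minimal sketch of the two-stage procedure of sentences 8–13, under assumptions that are not in the post: plain least squares (numpy.linalg.lstsq) serves as the per-subset estimator, and exponentially weighted aggregation on the held-out half stands in for the progressive mixture rule or the other aggregation algorithms mentioned in sentence 11. The function name sparse_aggregate and the temperature parameter eta are invented for illustration.

```python
# Sketch only: split the sample, fit one least-squares estimator per size-s
# feature subset on the first half, then aggregate their predictions with
# exponential weights computed on the second half.
import itertools
import numpy as np

def sparse_aggregate(X, y, s, eta=1.0):
    n, d = X.shape
    half = n // 2
    X1, y1 = X[:half], y[:half]      # part 1: fit the subset estimators
    X2, y2 = X[half:], y[half:]      # part 2: aggregate them

    subsets, coefs = [], []
    for S in itertools.combinations(range(d), s):   # all "d choose s" subsets
        cols = list(S)
        w, *_ = np.linalg.lstsq(X1[:, cols], y1, rcond=None)
        subsets.append(cols)
        coefs.append(w)

    # Exponential weights from the held-out squared losses; selecting the single
    # empirical minimizer here would only give the slower sqrt((log d)/n) rate
    # noted in sentence 12, hence an aggregation rule rather than selection.
    losses = np.array([np.mean((X2[:, cols] @ w - y2) ** 2)
                       for cols, w in zip(subsets, coefs)])
    weights = np.exp(-eta * (n - half) * (losses - losses.min()))
    weights /= weights.sum()

    def predict(X_new):
        preds = np.stack([X_new[:, cols] @ w for cols, w in zip(subsets, coefs)],
                         axis=1)
        return preds @ weights       # weighted average of the subset predictions

    return predict
```

Usage would look like predict = sparse_aggregate(X_train, y_train, s=3) followed by y_hat = predict(X_test). The loop over all “d choose s” subsets is exactly the computational intractability sentence 14 points to, and avoiding it without correlation assumptions is the open problem of sentence 15.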


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('risk', 0.412), ('excess', 0.341), ('functions', 0.279), ('estimators', 0.239), ('ms', 0.239), ('log', 0.226), ('linear', 0.212), ('ols', 0.16), ('sqrt', 0.16), ('model', 0.135), ('estimator', 0.131), ('combination', 0.112), ('procedure', 0.103), ('favorite', 0.098), ('convex', 0.09), ('order', 0.084), ('prefer', 0.083), ('best', 0.081), ('choose', 0.077), ('set', 0.075), ('squares', 0.071), ('explanatory', 0.071), ('empirical', 0.068), ('possible', 0.066), ('numerous', 0.066), ('tweaking', 0.066), ('obtain', 0.066), ('training', 0.064), ('algorithm', 0.063), ('uses', 0.063), ('projections', 0.062), ('exercise', 0.062), ('suffered', 0.062), ('aggregate', 0.062), ('rather', 0.061), ('cuts', 0.059), ('achieves', 0.059), ('limitation', 0.059), ('progressive', 0.059), ('compact', 0.059), ('correlation', 0.059), ('equal', 0.059), ('part', 0.058), ('following', 0.058), ('solves', 0.057), ('combine', 0.057), ('simplicity', 0.055), ('classical', 0.055), ('corresponding', 0.055), ('dimension', 0.055)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

Introduction: (I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.) Since Nemirovski’s Saint Flour lecture notes , numerous researchers have studied the following problem in least squares regression: predict as well as (MS) the best of d given functions (like in prediction with expert advice; model = finite set of d functions) (C) the best convex combination of these functions (i.e., model = convex hull of the d functions) (L) the best linear combination of these functions (i.e., model = linear span of the d functions) It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved since there exist estimators having an excess risk of order (log d)/n for (MS), min( sqrt((log d)/n), d/n ) for (C) and d/n for (L), where n is the training set size. Here, “risk” is amount of extra loss per example which may be suffered due to the choice of random sample. The practical use of these results seems rather limited to trivial

2 0.17066656 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

Introduction: One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations. There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail. In contrast, if you go to a machine learning conference, a large number of the new algorithms are v

3 0.10749658 347 hunch net-2009-03-26-Machine Learning is too easy

Introduction: One of the remarkable things about machine learning is how diverse it is. The viewpoints of Bayesian learning, reinforcement learning, graphical models, supervised learning, unsupervised learning, genetic programming, etc… share little enough overlap that many people can and do make their careers within one without touching, or even necessarily understanding the others. There are two fundamental reasons why this is possible. For many problems, many approaches work in the sense that they do something useful. This is true empirically, where for many problems we can observe that many different approaches yield better performance than any constant predictor. It’s also true in theory, where we know that for any set of predictors representable in a finite amount of RAM, minimizing training error over the set of predictors does something nontrivial when there are a sufficient number of examples. There is nothing like a unifying problem defining the field. In many other areas there

4 0.097206861 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l 2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV

5 0.095637418 8 hunch net-2005-02-01-NIPS: Online Bayes

Introduction: One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms. From the paper: The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner —usually, Nature could provide examples in an adversarial manner. We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting). It’s a very nice analysis of some of my favorite algorithms that all hinges around a beautiful theorem: Let Q be any distribution over parameters theta. Then for all sequences S: L_{Bayes}(S) \leq L_Q(S)

6 0.08809489 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

7 0.08725702 258 hunch net-2007-08-12-Exponentiated Gradient

8 0.086883351 345 hunch net-2009-03-08-Prediction Science

9 0.084920987 45 hunch net-2005-03-22-Active learning

10 0.083725788 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

11 0.083636001 248 hunch net-2007-06-19-How is Compressed Sensing going to change Machine Learning ?

12 0.083325014 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

13 0.082335897 129 hunch net-2005-11-07-Prediction Competitions

14 0.081813112 374 hunch net-2009-10-10-ALT 2009

15 0.081117459 259 hunch net-2007-08-19-Choice of Metrics

16 0.080405459 69 hunch net-2005-05-11-Visa Casualties

17 0.080214754 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

18 0.080153458 403 hunch net-2010-07-18-ICML & COLT 2010

19 0.080115542 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

20 0.07964325 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.185), (1, 0.106), (2, 0.041), (3, -0.056), (4, 0.006), (5, 0.005), (6, -0.047), (7, 0.001), (8, 0.046), (9, 0.019), (10, 0.057), (11, 0.014), (12, -0.038), (13, -0.009), (14, 0.051), (15, -0.007), (16, 0.045), (17, 0.032), (18, 0.05), (19, -0.016), (20, -0.035), (21, -0.006), (22, -0.015), (23, -0.0), (24, 0.035), (25, 0.003), (26, 0.027), (27, -0.069), (28, -0.004), (29, -0.04), (30, -0.023), (31, 0.056), (32, 0.088), (33, 0.058), (34, -0.037), (35, 0.069), (36, -0.007), (37, -0.002), (38, 0.044), (39, -0.087), (40, -0.002), (41, 0.037), (42, 0.029), (43, -0.063), (44, -0.038), (45, -0.068), (46, -0.018), (47, 0.004), (48, -0.088), (49, -0.036)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97912723 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

Introduction: (I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.) Since Nemirovski’s Saint Flour lecture notes , numerous researchers have studied the following problem in least squares regression: predict as well as (MS) the best of d given functions (like in prediction with expert advice; model = finite set of d functions) (C) the best convex combination of these functions (i.e., model = convex hull of the d functions) (L) the best linear combination of these functions (i.e., model = linear span of the d functions) It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved since there exist estimators having an excess risk of order (log d)/n for (MS), min( sqrt((log d)/n), d/n ) for (C) and d/n for (L), where n is the training set size. Here, “risk” is amount of extra loss per example which may be suffered due to the choice of random sample. The practical use of these results seems rather limited to trivial

2 0.62737912 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

Introduction: I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text. The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down). A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table. It turns out that redundancy is gr

3 0.59339327 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

Introduction: One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations. There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail. In contrast, if you go to a machine learning conference, a large number of the new algorithms are v

4 0.59079415 189 hunch net-2006-07-05-more icml papers

Introduction: Here are a few other papers I enjoyed from ICML06. Topic Models: Dynamic Topic Models David Blei, John Lafferty A nice model for how topics in LDA type models can evolve over time, using a linear dynamical system on the natural parameters and a very clever structured variational approximation (in which the mean field parameters are pseudo-observations of a virtual LDS). Like all Blei papers, he makes it look easy, but it is extremely impressive. Pachinko Allocation Wei Li, Andrew McCallum A very elegant (but computationally challenging) model which induces correlation amongst topics using a multi-level DAG whose interior nodes are “super-topics” and “sub-topics” and whose leaves are the vocabulary words. Makes the slumbering monster of structure learning stir. Sequence Analysis (I missed these talks since I was chairing another session) Online Decoding of Markov Models with Latency Constraints Mukund Narasimhan, Paul Viola, Michael Shilman An “a

5 0.56717199 8 hunch net-2005-02-01-NIPS: Online Bayes

Introduction: One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms. From the paper: The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner —usually, Nature could provide examples in an adversarial manner. We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting). It’s a very nice analysis of some of my favorite algorithms that all hinges around a beautiful theorem: Let Q be any distribution over parameters theta. Then for all sequences S: L_{Bayes}(S) \leq L_Q(S)

6 0.55834806 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

7 0.54647988 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction

8 0.54637045 385 hunch net-2009-12-27-Interesting things at NIPS 2009

9 0.53469068 87 hunch net-2005-06-29-Not EM for clustering at COLT

10 0.53375816 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

11 0.52903783 258 hunch net-2007-08-12-Exponentiated Gradient

12 0.52685142 77 hunch net-2005-05-29-Maximum Margin Mismatch?

13 0.52297962 309 hunch net-2008-07-10-Interesting papers, ICML 2008

14 0.51810199 248 hunch net-2007-06-19-How is Compressed Sensing going to change Machine Learning ?

15 0.51664603 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

16 0.51460916 97 hunch net-2005-07-23-Interesting papers at ACL

17 0.51149702 251 hunch net-2007-06-24-Interesting Papers at ICML 2007

18 0.51074445 374 hunch net-2009-10-10-ALT 2009

19 0.50731999 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

20 0.50407106 259 hunch net-2007-08-19-Choice of Metrics


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.022), (10, 0.028), (27, 0.211), (38, 0.064), (53, 0.054), (55, 0.067), (65, 0.325), (77, 0.015), (94, 0.067), (95, 0.057)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.86230326 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

Introduction: (I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.) Since Nemirovski’s Saint Flour lecture notes , numerous researchers have studied the following problem in least squares regression: predict as well as (MS) the best of d given functions (like in prediction with expert advice; model = finite set of d functions) (C) the best convex combination of these functions (i.e., model = convex hull of the d functions) (L) the best linear combination of these functions (i.e., model = linear span of the d functions) It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved since there exist estimators having an excess risk of order (log d)/n for (MS), min( sqrt((log d)/n), d/n ) for (C) and d/n for (L), where n is the training set size. Here, “risk” is amount of extra loss per example which may be suffered due to the choice of random sample. The practical use of these results seems rather limited to trivial

2 0.83395505 249 hunch net-2007-06-21-Presentation Preparation

Introduction: A big part of doing research is presenting it at a conference. Since many people start out shy of public presentations, this can be a substantial challenge. Here are a few notes which might be helpful when thinking about preparing a presentation on research. Motivate . Talks which don’t start by describing the problem to solve cause many people to zone out. Prioritize . It is typical that you have more things to say than time to say them, and many presenters fall into the failure mode of trying to say too much. This is an easy-to-understand failure mode as it’s very natural to want to include everything. A basic fact is: you can’t. Example of this are: Your slides are so densely full of equations and words that you can’t cover them. Your talk runs over and a moderator prioritizes for you by cutting you off. You motor-mouth through the presentation, and the information absorption rate of the audience prioritizes in some uncontrolled fashion. The rate of flow of c

3 0.72955537 432 hunch net-2011-04-20-The End of the Beginning of Active Learning

Introduction: This post is by Daniel Hsu and John Langford. In selective sampling style active learning, a learning algorithm chooses which examples to label. We now have an active learning algorithm that is: Efficient in label complexity, unlabeled complexity, and computational complexity. Competitive with supervised learning anywhere that supervised learning works. Compatible with online learning, with any optimization-based learning algorithm, with any loss function, with offline testing, and even with changing learning algorithms. Empirically effective. The basic idea is to combine disagreement region-based sampling with importance weighting : an example is selected to be labeled with probability proportional to how useful it is for distinguishing among near-optimal classifiers, and labeled examples are importance-weighted by the inverse of these probabilities. The combination of these simple ideas removes the sampling bias problem that has plagued many previous he

4 0.72064 403 hunch net-2010-07-18-ICML & COLT 2010

Introduction: The papers which interested me most at ICML and COLT 2010 were: Thomas Walsh , Kaushik Subramanian , Michael Littman and Carlos Diuk Generalizing Apprenticeship Learning across Hypothesis Classes . This paper formalizes and provides algorithms with guarantees for mixed-mode apprenticeship and traditional reinforcement learning algorithms, allowing RL algorithms that perform better than for either setting alone. István Szita and Csaba Szepesvári Model-based reinforcement learning with nearly tight exploration complexity bounds . This paper and another represent the frontier of best-known algorithm for Reinforcement Learning in a Markov Decision Process. James Martens Deep learning via Hessian-free optimization . About a new not-quite-online second order gradient algorithm for learning deep functional structures. Potentially this is very powerful because while people have often talked about end-to-end learning, it has rarely worked in practice. Chrisoph

5 0.60524136 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

6 0.60466886 360 hunch net-2009-06-15-In Active Learning, the question changes

7 0.60225457 194 hunch net-2006-07-11-New Models

8 0.60206646 343 hunch net-2009-02-18-Decision by Vetocracy

9 0.60198516 12 hunch net-2005-02-03-Learning Theory, by assumption

10 0.60159278 95 hunch net-2005-07-14-What Learning Theory might do

11 0.59891611 351 hunch net-2009-05-02-Wielding a New Abstraction

12 0.59862161 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

13 0.59722686 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

14 0.59704792 36 hunch net-2005-03-05-Funding Research

15 0.59670848 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

16 0.5963546 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

17 0.59589666 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class

18 0.59560901 370 hunch net-2009-09-18-Necessary and Sufficient Research

19 0.59528691 259 hunch net-2007-08-19-Choice of Metrics

20 0.59400445 131 hunch net-2005-11-16-The Everything Ensemble Edge