hunch_net-2006-197: A Winner (knowledge graph by maker-knowledge-mining)
Source: html
Introduction: Ed Snelson won the Predictive Uncertainty in Environmental Modelling Competition in the temp(erature) category using this algorithm. Some characteristics of the algorithm are: gradient descent … on about 600 parameters … with local minima … to solve regression. This bears a strong resemblance to a neural network. The two main differences seem to be: (1) the system has a probabilistic interpretation (which may aid design), and (2) there are (perhaps) fewer parameters than a typical neural network might have for the same problem (aiding speed).
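The post points at the algorithm via a link rather than spelling it out, so as a purely illustrative sketch of the recipe described above (gradient descent on a few hundred parameters, with local minima, to solve regression), here is a small neural-network-style regression fit in Python. The architecture, data, and step size are assumptions made for illustration; this is not Snelson's actual winning method.

```python
# A minimal sketch (NOT the winning algorithm): batch gradient descent on the
# parameters of a small one-hidden-layer regression model. The dimensions are
# chosen so the parameter count lands in the "about 600" ballpark.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: 200 examples, 25 input features (arbitrary choices).
X = rng.normal(size=(200, 25))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# One hidden layer of 20 tanh units: 25*20 + 20 + 20 + 1 = 541 parameters.
H = 20
W1 = rng.normal(scale=0.1, size=(25, H))
b1 = np.zeros(H)
w2 = rng.normal(scale=0.1, size=H)
b2 = 0.0

step = 0.01
for it in range(2000):
    # Forward pass.
    hidden = np.tanh(X @ W1 + b1)       # (200, H)
    pred = hidden @ w2 + b2             # (200,)
    err = pred - y

    # Backward pass: gradients of the mean squared error.
    n = len(y)
    g_pred = 2.0 * err / n
    g_w2 = hidden.T @ g_pred
    g_b2 = g_pred.sum()
    g_hidden = np.outer(g_pred, w2) * (1.0 - hidden ** 2)
    g_W1 = X.T @ g_hidden
    g_b1 = g_hidden.sum(axis=0)

    # Plain gradient descent step; nonconvexity means local minima are possible.
    W1 -= step * g_W1
    b1 -= step * g_b1
    w2 -= step * g_w2
    b2 -= step * g_b2

print("final training MSE:", np.mean((np.tanh(X @ W1 + b1) @ w2 + b2 - y) ** 2))
```

The hidden layer is what makes the objective nonconvex (hence the local minima); the probabilistic interpretation and parameter economy credited to the winning method above are exactly what this generic sketch lacks.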
wordName wordTfidf (topN-words)
[('parameters', 0.283), ('neural', 0.261), ('ed', 0.251), ('environmental', 0.251), ('resemblance', 0.251), ('snelson', 0.251), ('characteristics', 0.209), ('category', 0.201), ('minima', 0.201), ('bears', 0.201), ('interpretation', 0.194), ('differences', 0.182), ('uncertainty', 0.169), ('aid', 0.162), ('fewer', 0.159), ('main', 0.151), ('competition', 0.148), ('speed', 0.144), ('local', 0.144), ('won', 0.142), ('predictive', 0.142), ('descent', 0.138), ('network', 0.138), ('probabilistic', 0.132), ('typical', 0.122), ('gradient', 0.119), ('algorithm', 0.112), ('design', 0.1), ('strong', 0.099), ('solve', 0.088), ('seem', 0.082), ('system', 0.081), ('perhaps', 0.078), ('using', 0.061), ('two', 0.057), ('might', 0.055), ('may', 0.05), ('problem', 0.045)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999988 197 hunch net-2006-07-17-A Winner
2 0.2052519 111 hunch net-2005-09-12-Fast Gradient Descent
Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be
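SMD's actual update adapts per-parameter step sizes using gradient and curvature information; the sketch below is not SMD, just a toy illustration of the step-size problem raised at the end of the excerpt, using a classic sign-agreement ("delta-bar-delta" style) gain adaptation on linear least squares. All constants are arbitrary.

```python
# Hedged illustration of adapting per-parameter step sizes (not Schraudolph's
# SMD): grow a parameter's gain when successive gradients agree in sign,
# shrink it when they disagree.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=500)

w = np.zeros(10)
gain = np.full(10, 0.01)      # per-parameter step sizes (arbitrary initial value)
prev_g = np.zeros(10)

for it in range(200):
    g = 2.0 * X.T @ (X @ w - y) / len(y)    # gradient of mean squared error
    agree = np.sign(g) == np.sign(prev_g)
    gain *= np.where(agree, 1.05, 0.7)      # arbitrary grow/shrink factors
    w -= gain * g
    prev_g = g

print("parameter error:", np.linalg.norm(w - true_w))
```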
3 0.1142295 199 hunch net-2006-07-26-Two more UAI papers of interest
Introduction: In addition to Ed Snelson’s paper, there were (at least) two other papers that caught my eye at UAI. One was this paper by Sanjoy Dasgupta, Daniel Hsu and Nakul Verma at UCSD which shows in a surprisingly general and strong way that almost all linear projections of any jointly distributed vector random variable with finite first and second moments look spherical and unimodal (in fact look like a scale mixture of Gaussians). Great result, as you’d expect from Sanjoy. The other paper which I found intriguing but which I just haven’t grokked yet is this beast by Manfred and Dima Kuzmin. You can check out the (beautiful) slides if that helps. I feel like there is something deep here, but my brain is too small to understand it. The COLT and last NIPS papers/slides are also on Manfred’s page. Hopefully someone here can illuminate.
4 0.10908284 167 hunch net-2006-03-27-Gradients everywhere
Introduction: One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance. It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down. Sometimes there are clever approximations of the gradient. (Simon Osindero) Sometimes we can compute constrained gradients via iterated gradient/project steps. (Ben Taskar) Sometimes we can compute gradients anyway over mildly nondifferentiable functions. (Drew Bagnell) Even give
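As a small, hedged illustration of the "iterated gradient/project steps" idea mentioned above, here is projected gradient descent on a made-up box-constrained least-squares problem; the objective and constraint are placeholders, not anything from the talks.

```python
# Sketch of handling constraints by alternating a gradient step with a
# projection back onto the feasible set (here a simple box constraint).
import numpy as np

def project_box(w, lo=-1.0, hi=1.0):
    """Project each coordinate back into [lo, hi]."""
    return np.clip(w, lo, hi)

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)

w = np.zeros(5)
step = 0.01
for _ in range(500):
    grad = 2.0 * A.T @ (A @ w - b) / len(b)   # gradient of mean squared error
    w = project_box(w - step * grad)          # gradient step, then project

print("constrained solution:", w)
```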
5 0.10704906 201 hunch net-2006-08-07-The Call of the Deep
Introduction: Many learning algorithms used in practice are fairly simple. Viewed representationally, many prediction algorithms either compute a linear separator of basic features (perceptron, winnow, weighted majority, SVM) or perhaps a linear separator of slightly more complex features (2-layer neural networks or kernelized SVMs). Should we go beyond this, and start using “deep” representations? What is deep learning? Intuitively, deep learning is about learning to predict in ways which can involve complex dependencies between the input (observed) features. Specifying this more rigorously turns out to be rather difficult. Consider the following cases: SVM with Gaussian Kernel. This is not considered deep learning, because an SVM with a Gaussian kernel can’t succinctly represent certain decision surfaces. One of Yann LeCun’s examples is recognizing objects based on pixel values. An SVM will need a new support vector for each significantly different background. Since the number
6 0.10370748 438 hunch net-2011-07-11-Interesting Neural Network Papers at ICML 2011
7 0.091327578 352 hunch net-2009-05-06-Machine Learning to AI
8 0.085677512 258 hunch net-2007-08-12-Exponentiated Gradient
9 0.082678638 268 hunch net-2007-10-19-Second Annual Reinforcement Learning Competition
10 0.08153493 267 hunch net-2007-10-17-Online as the new adjective
11 0.0806164 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models
12 0.076137833 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning
13 0.071881674 179 hunch net-2006-05-16-The value of the orthodox view of Boosting
14 0.069323428 152 hunch net-2006-01-30-Should the Input Representation be a Vector?
15 0.067410074 21 hunch net-2005-02-17-Learning Research Programs
16 0.064386733 177 hunch net-2006-05-05-An ICML reject
17 0.063628726 394 hunch net-2010-04-24-COLT Treasurer is now Phil Long
18 0.061877318 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms
19 0.061526787 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
20 0.060154941 276 hunch net-2007-12-10-Learning Track of International Planning Competition
simIndex simValue blogId blogTitle
same-blog 1 0.98102295 197 hunch net-2006-07-17-A Winner
2 0.81018889 111 hunch net-2005-09-12-Fast Gradient Descent
3 0.74942988 167 hunch net-2006-03-27-Gradients everywhere
4 0.70848674 179 hunch net-2006-05-16-The value of the orthodox view of Boosting
Introduction: The term “boosting” comes from the idea of using a meta-algorithm which takes “weak” learners (that may be able to only barely predict slightly better than random) and turns them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although there were theoretical boosting algorithms floating around since 1990 (see the bottom of this page). Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in the Annals of Statistics, including a response by Yoav Freund and Robert Schapire. I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to really capture important properties. Comparing wit
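As a concrete companion to the orthodox view described above (a meta-algorithm turning weak learners into a strong one), here is a minimal Adaboost sketch with one-feature threshold stumps as the weak learners. The toy data and the number of rounds are arbitrary choices for illustration, not anything from the post.

```python
# Minimal Adaboost sketch: reweight examples, fit a weak learner (a decision
# stump), and combine the weak learners by a weighted vote.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)         # labels in {-1, +1}

def best_stump(X, y, w):
    """Weighted-error-minimizing threshold on a single feature."""
    best = None
    for j in range(X.shape[1]):
        for thresh in X[:, j]:
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thresh, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    return best

n = len(y)
w = np.full(n, 1.0 / n)                            # example weights
ensemble = []                                      # (alpha, feature, thresh, sign)
for t in range(20):
    err, j, thresh, sign = best_stump(X, y, w)
    err = max(err, 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)          # weight of this weak learner
    pred = sign * np.where(X[:, j] > thresh, 1, -1)
    w *= np.exp(-alpha * y * pred)                 # upweight the mistakes
    w /= w.sum()
    ensemble.append((alpha, j, thresh, sign))

# The "strong" classifier: sign of the weighted vote of the weak learners.
score = sum(a * s * np.where(X[:, j] > th, 1, -1) for a, j, th, s in ensemble)
print("training accuracy:", np.mean(np.sign(score) == y))
```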
5 0.59207428 258 hunch net-2007-08-12-Exponentiated Gradient
Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it. The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n) (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary; you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reve
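The excerpt cuts off mid-protocol; assuming the round ends with the world revealing the true label y and squared loss being charged, a minimal sketch of the multiplicative EG update looks like the following. The learning rate and data are illustrative assumptions, not details from the post.

```python
# Sketch of Exponentiated Gradient for online regression with squared loss:
# start with uniform weights over n "experts", predict with a dot product, and
# update multiplicatively by the exponentiated negative gradient.
import numpy as np

rng = np.random.default_rng(4)
n = 20
target = rng.random(n)
target /= target.sum()                   # hidden convex combination of experts

w = np.full(n, 1.0 / n)                  # initial weights (1/n, ..., 1/n)
eta = 0.5                                # learning rate (arbitrary)

for t in range(1000):
    x = rng.integers(0, 2, size=n).astype(float)   # features/experts in {0,1}^n
    y_hat = w @ x                                   # EG's prediction y' = w . x
    y = target @ x                                  # label revealed by the world
    # Gradient of (y' - y)^2 with respect to w is 2*(y' - y)*x.
    w *= np.exp(-eta * 2.0 * (y_hat - y) * x)
    w /= w.sum()                                    # renormalize to the simplex

print("max weight error:", np.abs(w - target).max())
```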
6 0.47776872 152 hunch net-2006-01-30-Should the Input Representation be a Vector?
7 0.47565696 438 hunch net-2011-07-11-Interesting Neural Network Papers at ICML 2011
8 0.4683328 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms
9 0.44414485 348 hunch net-2009-04-02-Asymmophobia
10 0.43975392 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
11 0.43816829 205 hunch net-2006-09-07-Objective and subjective interpretations of probability
12 0.41883835 16 hunch net-2005-02-09-Intuitions from applied learning
13 0.4178943 201 hunch net-2006-08-07-The Call of the Deep
14 0.41134581 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models
15 0.39975992 32 hunch net-2005-02-27-Antilearning: When proximity goes bad
16 0.38598999 131 hunch net-2005-11-16-The Everything Ensemble Edge
17 0.37386689 345 hunch net-2009-03-08-Prediction Science
18 0.3727065 308 hunch net-2008-07-06-To Dual or Not
19 0.36230761 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
20 0.36177626 336 hunch net-2009-01-19-Netflix prize within epsilon
simIndex simValue blogId blogTitle
1 0.9654513 59 hunch net-2005-04-22-New Blog: [Lowerbounds,Upperbounds]
Introduction: Maverick Woo and the Aladdin group at CMU have started a CS theory-related blog here.
same-blog 2 0.93309909 197 hunch net-2006-07-17-A Winner
3 0.79993302 69 hunch net-2005-05-11-Visa Casualties
Introduction: For the Chicago 2005 machine learning summer school we are organizing, at least 5 international students cannot come due to visa issues. There seem to be two aspects to visa issues: Inefficiency. The system rejected the student simply by being incapable of even starting to evaluate their visa in less than 1 month of time. Politics. Border controls became much tighter after the September 11 attack. Losing a big chunk of downtown of the largest city in a country will do that. What I (and the students) learned is that (1) is a much larger problem than (2). Only 1 prospective student seems to have achieved an explicit visa rejection. Fixing problem (1) should be a no-brainer, because the lag time almost surely indicates overload, and overload on border controls should worry even people concerned with (2). The obvious fixes to overload are “spend more money” and “make the system more efficient”. With respect to (2) (which is a more minor issue by the numbers), it i
4 0.7329548 176 hunch net-2006-05-01-A conversation between Theo and Pat
Introduction: Pat (the practitioner): I need to do multiclass classification and I only have a decision tree. Theo (the theoretician): Use an error correcting output code. Pat: Oh, that’s cool. But the created binary problems seem unintuitive. I’m not sure the decision tree can solve them. Theo: Oh? Is your problem a decision list? Pat: No, I don’t think so. Theo: Hmm. Are the classes well separated by axis aligned splits? Pat: Err, maybe. I’m not sure. Theo: Well, if they are, under the IID assumption I can tell you how many samples you need. Pat: IID? The data is definitely not IID. Theo: Oh dear. Pat: Can we get back to the choice of ECOC? I suspect we need to build it dynamically in response to which subsets of the labels are empirically separable from each other. Theo: Ok. What do you know about your problem? Pat: Not much. My friend just gave me the dataset. Theo: Then, no one can help you. Pat: (What a fuzzy thinker. Theo keeps jumping t
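As a concrete version of Theo's suggestion, here is a hedged sketch of an error correcting output code: each column of a code matrix defines a binary relabeling of the classes, one decision tree (scikit-learn's, matching the dialogue's decision tree learner) is trained per column, and decoding picks the class whose codeword is nearest in Hamming distance. The code matrix and toy data are made up for illustration.

```python
# Error correcting output code sketch: reduce a 4-class problem to 6 binary
# problems, train a decision tree per binary problem, decode by Hamming distance.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# A small code matrix with pairwise Hamming distance 4 (one row per class).
code = np.array([[0, 0, 0, 1, 1, 1],
                 [0, 1, 1, 0, 0, 1],
                 [1, 0, 1, 0, 1, 0],
                 [1, 1, 0, 1, 0, 0]])
k, n_bits = code.shape

# Toy 4-class data: the class is the quadrant of the first two features.
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int) * 2 + (X[:, 1] > 0).astype(int)

# One binary problem (and one tree) per column of the code matrix.
trees = []
for j in range(n_bits):
    binary_labels = code[y, j]                     # relabel examples by bit j
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X, binary_labels))

# Decode: predict all bits, then pick the nearest codeword.
bits = np.column_stack([t.predict(X) for t in trees])          # (400, n_bits)
dists = np.array([[np.sum(b != c) for c in code] for b in bits])
pred = dists.argmin(axis=1)
print("training accuracy:", np.mean(pred == y))
```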
5 0.58205855 414 hunch net-2010-10-17-Partha Niyogi has died
Introduction: from brain cancer. I asked Misha who worked with him to write about it. Partha Niyogi, Louis Block Professor in Computer Science and Statistics at the University of Chicago passed away on October 1, 2010, aged 43. I first met Partha Niyogi almost exactly ten years ago when I was a graduate student in math and he had just started as a faculty in Computer Science and Statistics at the University of Chicago. Strangely, we first talked at length due to a somewhat convoluted mathematical argument in a paper on pattern recognition. I asked him some questions about the paper, and, even though the topic was new to him, he had put serious thought into it and we started regular meetings. We made significant progress and developed a line of research stemming initially just from trying to understand that one paper and to simplify one derivation. I think this was typical of Partha, showing both his intellectual curiosity and his intuition for the serendipitous; having a sense and focus fo
6 0.56702435 447 hunch net-2011-10-10-ML Symposium and ICML details
7 0.40509447 19 hunch net-2005-02-14-Clever Methods of Overfitting
8 0.32126635 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
9 0.27107489 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
10 0.26072496 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning
11 0.25607827 371 hunch net-2009-09-21-Netflix finishes (and starts)
12 0.25470176 276 hunch net-2007-12-10-Learning Track of International Planning Competition
13 0.25395507 120 hunch net-2005-10-10-Predictive Search is Coming
14 0.25238734 343 hunch net-2009-02-18-Decision by Vetocracy
15 0.25132501 221 hunch net-2006-12-04-Structural Problems in NIPS Decision Making
16 0.24733582 229 hunch net-2007-01-26-Parallel Machine Learning Problems
17 0.24691314 346 hunch net-2009-03-18-Parallel ML primitives
18 0.24469961 106 hunch net-2005-09-04-Science in the Government
19 0.24340591 136 hunch net-2005-12-07-Is the Google way the way for machine learning?
20 0.24260443 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science