A NIPS paper (hunch net, 2008-12-07)
I’m skipping NIPS this year in favor of Ada, but I wanted to point out this paper by Andriy Mnih and Geoff Hinton. The basic claim of the paper is that by carefully but automatically constructing a binary tree over words, it’s possible to predict words well with huge computational resource savings over unstructured approaches.

I’m interested in this beyond the application to word prediction because it is relevant to the general normalization problem: if you want to predict the probability of one of a large number of events, you often must compute a predicted score for all the events and then normalize, a computationally inefficient operation. The problem comes up in many places where probabilistic models are used, but I’ve run into it with high-dimensional regression.
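To make the bottleneck concrete, here is a minimal sketch of my own (the linear scoring model and all names are illustrative assumptions, not anything from the paper): to report the probability of a single event under a flat softmax, you must score every event and then normalize.

```python
import numpy as np

def full_softmax_prob(x, W, event):
    """Probability of `event` under a flat softmax: every event must be scored.

    x     : (d,) feature vector
    W     : (n_events, d) one weight vector per event
    event : index of the event whose probability we want

    Cost is O(n_events * d) per query, even though only one event matters.
    """
    scores = W @ x                     # score every event: the expensive step
    scores -= scores.max()             # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores[event] / exp_scores.sum()

# With a 100k-event "vocabulary", every single query touches all 100k rows of W.
rng = np.random.default_rng(0)
d, n_events = 50, 100_000
x, W = rng.normal(size=d), rng.normal(size=(n_events, d))
print(full_softmax_prob(x, W, event=42))
```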
There are a couple of workarounds for this computational bug:

Approximate. There are many ways to do this, but the approximations are often uncontrolled (i.e. they can be arbitrarily bad), and hence finicky in application.

Avoid. Often you don’t really want a probability; you want the most probable choice, which can be found more directly (a minimal sketch appears below). Energy based model update rules are an example of that approach, and there are many other direct methods from supervised learning. This is great when it applies, but sometimes a probability is actually needed.
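As a hedged illustration of the “Avoid” option (again my own toy linear model, not anything from the paper): the softmax normalizer is the same for every event, so it cancels out of the argmax and never needs to be computed.

```python
import numpy as np

def most_probable_event(x, W):
    """The 'Avoid' workaround: return the argmax of the raw scores.

    The normalizer Z(x) is identical for every event, so it cannot change
    which event wins; no probabilities are computed at all.
    """
    return int(np.argmax(W @ x))

rng = np.random.default_rng(0)
x, W = rng.normal(size=50), rng.normal(size=(100_000, 50))
print(most_probable_event(x, W))
```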
This paper points out that a third approach can be viable empirically: use a self-normalizing structure. It seems highly likely that this is true in other applications as well.
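Here is a hedged sketch of what a self-normalizing structure can look like, in the spirit of the paper’s binary tree over words but not its exact model (the path encoding, weights, and toy vocabulary are my own illustrative choices): each word sits at a leaf, its probability is a product of binary decisions along the root-to-leaf path, and the leaf probabilities sum to one by construction, so one prediction costs O(log n) rather than O(n).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_word_prob(x, path, node_weights):
    """Self-normalizing probability of one word via its root-to-leaf path.

    x            : (d,) context/feature vector
    path         : list of (node_id, branch) pairs, branch in {+1, -1}
                   meaning "go right" / "go left" at each internal node
    node_weights : dict node_id -> (d,) weight vector for that internal node

    The two branch probabilities at each node sum to one, so the leaf
    probabilities sum to one automatically -- no vocabulary-wide
    normalization is ever computed.
    """
    prob = 1.0
    for node_id, branch in path:
        prob *= sigmoid(branch * (node_weights[node_id] @ x))
    return prob

# Toy usage: a 4-word vocabulary needs a depth-2 tree (3 internal nodes).
rng = np.random.default_rng(1)
d = 50
x = rng.normal(size=d)
weights = {node: rng.normal(size=d) for node in ["root", "left", "right"]}
paths = {                                   # hypothetical word-to-leaf assignment
    "the": [("root", -1), ("left", -1)],
    "cat": [("root", -1), ("left", +1)],
    "sat": [("root", +1), ("right", -1)],
    "mat": [("root", +1), ("right", +1)],
}
probs = {w: tree_word_prob(x, p, weights) for w, p in paths.items()}
print(probs, sum(probs.values()))           # the four probabilities sum to 1
```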