hunch_net hunch_net-2008 hunch_net-2008-330 knowledge-graph by maker-knowledge-mining

330 hunch net-2008-12-07-A NIPS paper


meta info for this blog

Source: html

Introduction: I’m skipping NIPS this year in favor of Ada, but I wanted to point out this paper by Andriy Mnih and Geoff Hinton. The basic claim of the paper is that by carefully but automatically constructing a binary tree over words, it’s possible to predict words well with huge computational resource savings over unstructured approaches. I’m interested in this beyond the application to word prediction because it is relevant to the general normalization problem: If you want to predict the probability of one of a large number of events, often you must compute a predicted score for all the events and then normalize, a computationally inefficient operation. The problem comes up in many places using probabilistic models, but I’ve run into it with high-dimensional regression. There are a couple workarounds for this computational bug: Approximate. There are many ways. Often the approximations are uncontrolled (i.e. can be arbitrarily bad), and hence finicky in application. Avoid. Y
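To make the normalization problem concrete, here is a minimal sketch in Python contrasting the flat approach (score every word, then normalize) with a binary-tree factorization in which a word’s probability is a product of left/right decisions along its root-to-leaf path and is therefore normalized by construction. The toy vocabulary, random parameters, and heap-style tree layout are illustrative assumptions of mine, not the actual model or code from the Mnih and Hinton paper.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 8                              # toy vocabulary size (a power of two keeps the tree balanced)
d = 4                              # context representation dimension
h = rng.normal(size=d)             # a hypothetical context vector
W = rng.normal(size=(V, d))        # flat model: one score vector per word

# Flat approach: score every word, then normalize -- O(V) work per prediction.
scores = W @ h
p_flat = np.exp(scores - scores.max())
p_flat /= p_flat.sum()

# Tree approach: place the V words at the leaves of a balanced binary tree.
# Each internal node holds one parameter vector, and a word's probability is
# the product of sigmoid left/right decisions along its root-to-leaf path,
# so the distribution sums to 1 by construction and costs O(log V) per word.
depth = int(np.log2(V))
node_vecs = rng.normal(size=(V - 1, d))   # one vector per internal node (heap layout)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_prob(word, h):
    """Probability of `word` under the binary-tree model."""
    p, node = 1.0, 0                        # start at the root (node 0)
    for bit in format(word, f"0{depth}b"):  # bits of the word index give its path
        q = sigmoid(node_vecs[node] @ h)    # P(go right | context)
        p *= q if bit == "1" else 1.0 - q
        node = 2 * node + 1 + int(bit)      # descend to the left/right child
    return p

p_tree = np.array([tree_prob(w, h) for w in range(V)])
print(p_flat.sum(), p_tree.sum())  # both sum to 1; the tree used log2(V)=3 dot products per word
```

With a realistic vocabulary the difference is what matters: the flat model needs one dot product per word per prediction (O(V) total), while the tree needs only about log2(V), which is the source of the computational resource savings the post refers to.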


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 I’m skipping NIPS this year in favor of Ada , but I wanted to point out this paper by Andriy Mnih and Geoff Hinton . [sent-1, score-0.345]

2 The basic claim of the paper is that by carefully but automatically constructing a binary tree over words, it’s possible to predict words well with huge computational resource savings over unstructured approaches. [sent-2, score-1.645]

3 The problem comes up in many places using probabilistic models, but I’ve run into it with high-dimensional regression. [sent-4, score-0.196]

4 There are a couple workarounds for this computational bug: Approximate. [sent-5, score-0.233]

5 can be arbitrarily bad), and hence finicky in application. [sent-9, score-0.395]

6 You don’t really want a probability, you want the most probable choice which can be found more directly. [sent-11, score-0.456]

7 Energy based model update rules are an example of that approach and there are many other direct methods from supervised learning. [sent-12, score-0.516]

8 This is great when it applies, but sometimes a probability is actually needed. [sent-13, score-0.224]

9 This paper points out that a third approach can be viable empirically: use a self-normalizing structure. [sent-14, score-0.47]

10 It seems highly likely that this is true in other applications as well. [sent-15, score-0.094]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('probability', 0.224), ('events', 0.212), ('words', 0.199), ('unstructured', 0.174), ('ada', 0.174), ('uncontrolled', 0.161), ('savings', 0.161), ('want', 0.152), ('finicky', 0.152), ('inefficient', 0.152), ('probable', 0.152), ('normalization', 0.145), ('hinton', 0.145), ('approximations', 0.139), ('resource', 0.139), ('arbitrarily', 0.139), ('bug', 0.134), ('viable', 0.134), ('computational', 0.133), ('energy', 0.13), ('constructing', 0.13), ('predict', 0.129), ('paper', 0.124), ('geoff', 0.123), ('favor', 0.12), ('predicted', 0.117), ('score', 0.115), ('rules', 0.115), ('direct', 0.11), ('applies', 0.11), ('third', 0.108), ('word', 0.104), ('automatically', 0.104), ('compute', 0.104), ('hence', 0.104), ('places', 0.104), ('approach', 0.104), ('beyond', 0.103), ('wanted', 0.101), ('couple', 0.1), ('update', 0.1), ('empirically', 0.095), ('highly', 0.094), ('probabilistic', 0.092), ('tree', 0.089), ('binary', 0.088), ('claim', 0.088), ('supervised', 0.087), ('huge', 0.087), ('computationally', 0.086)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 330 hunch net-2008-12-07-A NIPS paper


2 0.15261725 218 hunch net-2006-11-20-Context and the calculation misperception

Introduction: This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict something. Hopefully, by precisely stating it, we can remove it. Suppose we have a set of events, each described by a vector of features.
0 1 0 1 1
1 0 1 0 1
1 1 0 1 0
0 0 1 1 1
1 1 0 0 1
1 0 0 0 1
0 1 1 1 0
Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label (feature 1) is mostly 0. Using this simple rule we get an observed error rate of 3/7. There are two issues here. The first is that this is really a training
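For readers who want to check the arithmetic, here is a small Python sketch that applies the "bin by feature 2" rule and recovers the 3/7 training error. It assumes the 7-rows-by-5-features reading of the data above, which is my reconstruction of the flattened table rather than anything stated explicitly in the excerpt.

```python
import numpy as np

# Reconstructed 7x5 data from the excerpt above; column 0 is the label
# ("feature 1") and column 1 is "feature 2".  This layout is an assumption.
data = np.array([[0, 1, 0, 1, 1],
                 [1, 0, 1, 0, 1],
                 [1, 1, 0, 1, 0],
                 [0, 0, 1, 1, 1],
                 [1, 1, 0, 0, 1],
                 [1, 0, 0, 0, 1],
                 [0, 1, 1, 1, 0]])
label, f2 = data[:, 0], data[:, 1]

# The simple rule: predict the majority label within each bin of feature 2.
# (The feature-2 == 1 bin is actually a 2-2 tie here; round() breaks it
# toward 0, which matches the "mostly 0" description in the excerpt.)
predictions = np.empty_like(label)
for v in (0, 1):
    mask = f2 == v
    majority = int(round(label[mask].mean()))
    predictions[mask] = majority

errors = int((predictions != label).sum())
print(f"training error: {errors}/{len(label)}")   # 3/7 with this reconstruction
```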

3 0.13931176 5 hunch net-2005-01-26-Watchword: Probability

Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y. The probability is then Y/(Y+1). Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in

4 0.13390207 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

Introduction: This is a paper by Yann LeCun and Fu Jie Huang published at AISTATS 2005. I found this paper very difficult to read, but it does have some point about a computational shortcut. This paper takes for granted that the method of solving a problem is gradient descent on parameters. Given this assumption, the question arises: Do you want to do gradient descent on a probabilistic model or something else? All (conditional) probabilistic models have the form p(y|x) = f(x,y)/Z(x) where Z(x) = sum_y f(x,y) (the paper calls -log f(x,y) an “energy”). If f is parameterized by some w, the gradient has a term for Z(x), and hence for every value of y. The paper claims that such models can be optimized for classification purposes using only the correct y and the other y’ (not y) which maximizes f(x,y). This can even be done on unnormalizable models. The paper further claims that this can be done with an approximate maximum. These claims are plausible based on experimen
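A one-line side note on why such normalizer-free optimization is possible for classification (a generic observation about the argmax, not a summary of this paper’s particular loss functions): since Z(x) = sum_{y'} f(x,y') does not depend on which y is being scored,

argmax_y p(y|x) = argmax_y f(x,y)/Z(x) = argmax_y f(x,y),

so picking the most probable label never requires computing Z(x) at all; the normalizer only matters when an actual probability value is needed.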

5 0.13158503 440 hunch net-2011-08-06-Interesting thing at UAI 2011

Introduction: I had a chance to attend UAI this year, where several papers interested me, including: Hoifung Poon and Pedro Domingos, Sum-Product Networks: A New Deep Architecture. We’ve already discussed this one, but in a nutshell, they identify a large class of efficiently normalizable distributions and do learning with it. Yao-Liang Yu and Dale Schuurmans, Rank/norm regularization with closed-form solutions: Application to subspace clustering. This paper is about matrices, and in particular they prove that certain matrices are the solution of matrix optimizations. I’m not matrix inclined enough to fully appreciate this one, but I believe many others may be, and anytime closed form solutions come into play, you get 2 orders of magnitude speedups, as they show experimentally. Laurent Charlin, Richard Zemel and Craig Boutilier, A Framework for Optimizing Paper Matching. This is about what works in matching papers to reviewers, as has been tested at several previous

6 0.12539865 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

7 0.1026208 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

8 0.098142244 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

9 0.097070202 235 hunch net-2007-03-03-All Models of Learning have Flaws

10 0.093394272 160 hunch net-2006-03-02-Why do people count for learning?

11 0.093333058 304 hunch net-2008-06-27-Reviewing Horror Stories

12 0.092400208 360 hunch net-2009-06-15-In Active Learning, the question changes

13 0.087865591 299 hunch net-2008-04-27-Watchword: Supervised Learning

14 0.086026341 385 hunch net-2009-12-27-Interesting things at NIPS 2009

15 0.083808012 113 hunch net-2005-09-19-NIPS Workshops

16 0.083765492 114 hunch net-2005-09-20-Workshop Proposal: Atomic Learning

17 0.083589628 388 hunch net-2010-01-24-Specializations of the Master Problem

18 0.083478332 301 hunch net-2008-05-23-Three levels of addressing the Netflix Prize

19 0.083464839 347 hunch net-2009-03-26-Machine Learning is too easy

20 0.082876049 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.218), (1, 0.059), (2, 0.025), (3, -0.015), (4, 0.046), (5, 0.031), (6, 0.004), (7, 0.021), (8, 0.051), (9, -0.06), (10, 0.014), (11, -0.037), (12, -0.09), (13, -0.048), (14, -0.064), (15, -0.028), (16, -0.047), (17, 0.07), (18, 0.112), (19, -0.003), (20, -0.113), (21, 0.098), (22, 0.054), (23, 0.057), (24, -0.041), (25, 0.017), (26, 0.051), (27, -0.076), (28, -0.025), (29, 0.002), (30, -0.031), (31, 0.074), (32, -0.012), (33, -0.034), (34, -0.046), (35, 0.072), (36, 0.007), (37, 0.01), (38, 0.008), (39, 0.03), (40, -0.064), (41, -0.06), (42, -0.074), (43, 0.178), (44, 0.048), (45, 0.078), (46, -0.091), (47, -0.009), (48, -0.036), (49, 0.054)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97191268 330 hunch net-2008-12-07-A NIPS paper


2 0.7797128 440 hunch net-2011-08-06-Interesting thing at UAI 2011

Introduction: I had a chance to attend UAI this year, where several papers interested me, including: Hoifung Poon and Pedro Domingos, Sum-Product Networks: A New Deep Architecture. We’ve already discussed this one, but in a nutshell, they identify a large class of efficiently normalizable distributions and do learning with it. Yao-Liang Yu and Dale Schuurmans, Rank/norm regularization with closed-form solutions: Application to subspace clustering. This paper is about matrices, and in particular they prove that certain matrices are the solution of matrix optimizations. I’m not matrix inclined enough to fully appreciate this one, but I believe many others may be, and anytime closed form solutions come into play, you get 2 orders of magnitude speedups, as they show experimentally. Laurent Charlin, Richard Zemel and Craig Boutilier, A Framework for Optimizing Paper Matching. This is about what works in matching papers to reviewers, as has been tested at several previous

3 0.67847234 5 hunch net-2005-01-26-Watchword: Probability

Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y. The probability is then Y/(Y+1). Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in

4 0.67621386 62 hunch net-2005-04-26-To calibrate or not?

Introduction: A calibrated predictor is one which predicts the probability of a binary event with the property: For all predictions p , the proportion of the time that 1 is observed is p . Since there are infinitely many p , this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p . A great deal of effort has been devoted to strategies for achieving calibrated (such as here ) prediction. With statements like: (under minimal conditions) you can always make calibrated predictions. Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way: We want good probabilistic predictions. Good probabilistic predictions are calibrated. Therefore, we want calibrated predictions. The “Therefore” step misses the fact that calibration is a necessary b
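As an illustration of the “softened” definition, here is a small Python simulation that bins predictions into neighborhoods of width 0.1 and compares the observed rate of 1s in each bin to the predictions there. The bin width, sample size, and the synthetic predictor (which is calibrated by construction) are arbitrary choices of mine, not anything from the post.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a predictor that is calibrated by construction: outcomes are drawn
# with exactly the predicted probability.
n = 100_000
p_hat = rng.uniform(size=n)                     # the predictor's probability outputs
y = (rng.uniform(size=n) < p_hat).astype(int)   # observed binary outcomes

edges = np.linspace(0.0, 1.0, 11)               # "small neighborhoods" of width 0.1
bin_index = np.digitize(p_hat, edges) - 1
for b in range(10):
    mask = bin_index == b
    print(f"predictions in [{edges[b]:.1f}, {edges[b + 1]:.1f}): "
          f"mean prediction {p_hat[mask].mean():.3f}, observed rate of 1s {y[mask].mean():.3f}")
```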

5 0.58837396 140 hunch net-2005-12-14-More NIPS Papers II

Introduction: I thought this was a very good NIPS with many excellent papers. The following are a few NIPS papers which I liked and I hope to study more carefully when I get the chance. The list is not exhaustive and in no particular order… Preconditioner Approximations for Probabilistic Graphical Models. Pradeep Ravikumar and John Lafferty. I thought the use of preconditioner methods from solving linear systems in the context of approximate inference was novel and interesting. The results look good and I’d like to understand the limitations. Rodeo: Sparse nonparametric regression in high dimensions. John Lafferty and Larry Wasserman. A very interesting approach to feature selection in nonparametric regression from a frequentist framework. The use of lengthscale variables in each dimension reminds me a lot of ‘Automatic Relevance Determination’ in Gaussian process regression — it would be interesting to compare Rodeo to ARD in GPs. Interpolating between types and tokens by estimating

6 0.5735305 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

7 0.56907761 218 hunch net-2006-11-20-Context and the calculation misperception

8 0.55544096 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

9 0.55233568 118 hunch net-2005-10-07-On-line learning of regular decision rules

10 0.52400303 39 hunch net-2005-03-10-Breaking Abstractions

11 0.51310235 77 hunch net-2005-05-29-Maximum Margin Mismatch?

12 0.5065729 97 hunch net-2005-07-23-Interesting papers at ACL

13 0.50176311 123 hunch net-2005-10-16-Complexity: It’s all in your head

14 0.48792744 33 hunch net-2005-02-28-Regularization

15 0.48195159 310 hunch net-2008-07-15-Interesting papers at COLT (and a bit of UAI & workshops)

16 0.47514305 302 hunch net-2008-05-25-Inappropriate Mathematics for Machine Learning

17 0.47374547 299 hunch net-2008-04-27-Watchword: Supervised Learning

18 0.46266091 139 hunch net-2005-12-11-More NIPS Papers

19 0.45746967 280 hunch net-2007-12-20-Cool and Interesting things at NIPS, take three

20 0.44868726 90 hunch net-2005-07-07-The Limits of Learning Theory


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.059), (9, 0.344), (27, 0.224), (38, 0.038), (53, 0.061), (55, 0.067), (94, 0.112)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.95094168 399 hunch net-2010-05-20-Google Predict

Introduction: Slashdot points out Google Predict. I’m not privy to the details, but this has the potential to be extremely useful, as in many applications simply having an easy mechanism to apply existing learning algorithms can be extremely helpful. This differs goalwise from MLcomp — instead of public comparisons for research purposes, it’s about private utilization of good existing algorithms. It also differs infrastructurally, since a system designed to do this is much less awkward than using Amazon’s cloud computing. The latter implies that datasets several orders of magnitude larger can be handled up to limits imposed by network and storage.

2 0.91081703 361 hunch net-2009-06-24-Interesting papers at UAICMOLT 2009

Introduction: Here’s a list of papers that I found interesting at ICML / COLT / UAI in 2009. Elad Hazan and Comandur Seshadhri, Efficient learning algorithms for changing environments, at ICML. This paper shows how to adapt learning algorithms that compete with fixed predictors to compete with changing policies. The definition of regret they deal with seems particularly useful in many situations. Hal Daume, Unsupervised Search-based Structured Prediction, at ICML. This paper shows a technique for reducing unsupervised learning to supervised learning which (a) makes a fast unsupervised learning algorithm and (b) makes semisupervised learning both easy and highly effective. There were two papers with similar results on active learning in the KWIK framework for linear regression, both reducing the sample complexity to . One was Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona, Robust Bounds for Classification via Selective Sampling, at ICML and the other was Thoma

same-blog 3 0.90937793 330 hunch net-2008-12-07-A NIPS paper


4 0.88781202 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * product over features i of Pr’(feature i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is
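A minimal sketch of the prediction rule quoted above for binary features, with add-one (Laplace) smoothing standing in for “some prior”. The toy data, smoothing choice, and function names are illustrative assumptions, not anything specified in the post.

```python
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])          # rows: examples, columns: binary features
y = np.array([1, 1, 0, 0])         # binary labels

def fit_naive_bayes(X, y):
    """Estimate Pr'(label) and Pr'(feature_i = 1 | label) by smoothed counting."""
    params = {}
    for label in (0, 1):
        Xl = X[y == label]
        prior = (len(Xl) + 1) / (len(X) + 2)           # Pr'(label)
        cond = (Xl.sum(axis=0) + 1) / (len(Xl) + 2)    # Pr'(feature_i = 1 | label)
        params[label] = (prior, cond)
    return params

def predict(params, x):
    """Return the label maximizing Pr'(label) * prod_i Pr'(feature_i | label)."""
    best_label, best_score = None, -1.0
    for label, (prior, cond) in params.items():
        # the multiplication of learned probabilities the post warns about
        per_feature = np.where(x == 1, cond, 1.0 - cond)
        score = prior * per_feature.prod()
        if score > best_score:
            best_label, best_score = label, score
    return best_label

params = fit_naive_bayes(X, y)
print(predict(params, np.array([1, 0, 0])))   # predicts 1 on this toy data
```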

5 0.88722426 490 hunch net-2013-11-09-Graduates and Postdocs

Introduction: Several strong graduates are on the job market this year. Alekh Agarwal made the most scalable public learning algorithm as an intern two years ago. He has a deep and broad understanding of optimization and learning as well as the ability and will to make things happen programming-wise. I’ve been privileged to have Alekh visiting me in NY where he will be sorely missed. John Duchi created Adagrad which is a commonly helpful improvement over online gradient descent that is seeing wide adoption, including in Vowpal Wabbit . He has a similarly deep and broad understanding of optimization and learning with significant industry experience at Google . Alekh and John have often coauthored together. Stephane Ross visited me a year ago over the summer, implementing many new algorithms and working out the first scale free online update rule which is now the default in Vowpal Wabbit. Stephane is not on the market—Google robbed the cradle successfully I’m sure that

6 0.83847886 174 hunch net-2006-04-27-Conferences, Workshops, and Tutorials

7 0.67756975 403 hunch net-2010-07-18-ICML & COLT 2010

8 0.66154397 160 hunch net-2006-03-02-Why do people count for learning?

9 0.6493324 8 hunch net-2005-02-01-NIPS: Online Bayes

10 0.64150232 258 hunch net-2007-08-12-Exponentiated Gradient

11 0.63353509 259 hunch net-2007-08-19-Choice of Metrics

12 0.63353252 360 hunch net-2009-06-15-In Active Learning, the question changes

13 0.63325822 325 hunch net-2008-11-10-ICML Reviewing Criteria

14 0.63248396 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

15 0.62759399 309 hunch net-2008-07-10-Interesting papers, ICML 2008

16 0.6244294 345 hunch net-2009-03-08-Prediction Science

17 0.6208986 177 hunch net-2006-05-05-An ICML reject

18 0.6206758 183 hunch net-2006-06-14-Explorations of Exploration

19 0.61949646 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

20 0.61848027 347 hunch net-2009-03-26-Machine Learning is too easy