hunch_net hunch_net-2009 hunch_net-2009-359 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Suppose we have a set of observations over time x_1, x_2, …, x_t and want to predict some future event y_{t+1}. An inevitable problem arises, because learning a predictor h(x_1, …, x_t) of y_{t+1} is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required.

Existing approaches for deriving state have some limitations.

- Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a priori specification of the observations.
- Kalman Filters and Particle Filters are very parametric, in the sense that substantial information must be specified up front.
- Dynamic Bayesian Networks (graphical models through time) require substantial a priori specification and often require the solution of difficult computational problems to use.
- The Subspace-ID approach from control theory uses a linear representation, with the basic claim that it works well when all transformations are linear, but not so well when things are nonlinear. In making this post, I ran across this two-day tutorial which discusses extensions of this idea to nonlinear systems.

The point of this paper at ICML is that some dynamic systems (those which are “invertible”) can be decomposed into separate bounded-resource prediction problems which, when solved, create an implicit definition of state. This allows us to use any general purpose supervised learning algorithm to solve the state formation problem without requiring linearity or any specific representation. This approach has several advantages:

- It doesn’t require lots of prior specification & information when you have lots of data.
- It leverages the huge amount of work that has gone into supervised learning algorithm design.
- It works in controlled systems also, where the control is simply another observation.
- It works with generalization from the start, rather than requiring the (often awkward) addition of generalization later.
- It doesn’t require predicting everything in order to predict what you want.
- It can work with very large observation spaces, and can even work better the larger the observation space, because larger observations imply more invertibility.

For those who aren’t, note that this is (in some sense) a generalization of subspace ID, and hence that there are other applications of the approach known to work in practice. It’s relatively rare to have a paper about a new approach to solving a problem as intractable as nonlinear dynamics has proved to be, so if you see a flaw please speak up.
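The reduction of state formation to supervised prediction can be loosely illustrated in code. This is a sketch only, not the paper's construction (which derives state implicitly from bounded-resource prediction problems on invertible systems): here a fixed window of past observations stands in for state, and a generic supervised learner (ridge regression) predicts the next observation. The system, window length, and regularizer are all illustrative choices.

```python
import numpy as np

# Toy nonlinear system: a hidden angle drifts each step; we observe a
# nonlinear function of it plus noise. (Illustrative, not from the paper.)
rng = np.random.default_rng(0)
T = 2000
theta = np.cumsum(rng.normal(0.1, 0.01, size=T))
obs = np.sin(theta) + 0.05 * rng.normal(size=T)

k = 5  # window length: the past k observations play the role of state
X = np.stack([obs[i:i + k] for i in range(T - k)])  # (T-k, k) feature matrix
y = obs[k:]                                         # next observation to predict

# Any supervised learner would do; ridge regression keeps the sketch short.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

pred = X @ w
mse = np.mean((pred - y) ** 2)
baseline = np.mean((y - y.mean()) ** 2)  # error of a constant predictor
print(mse < baseline)
```

The point of the comparison at the end is only that a generic learner over a summarized past recovers predictive power without any parametric model of the dynamics being specified.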
simIndex simValue blogId blogTitle
same-blog 1 0.99999988 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
2 0.18504468 388 hunch net-2010-01-24-Specializations of the Master Problem
Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x . The agent makes a choice a . The world announces a reward r . The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of
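The interaction loop described above (world announces x, agent chooses a, world announces r) can be sketched directly. Everything here is illustrative, including the toy world and the trivial memorizing agent; it only shows the shape of the master problem, not an effective general algorithm (which, as the post notes, cannot exist without further structure).

```python
import random

random.seed(1)

def world_observe(t):
    # A toy observation process (hypothetical, for illustration only).
    return t % 3

def world_reward(x, a):
    # In this toy world the reward happens to depend on x and a;
    # the setting itself imposes no such structure.
    return 1.0 if a == x else 0.0

def agent_act(x, memory):
    # A trivial agent: repeat whichever action earned reward for this x before.
    return memory.get(x, random.choice([0, 1, 2]))

memory, total = {}, 0.0
for t in range(300):
    x = world_observe(t)          # the world announces an observation x
    a = agent_act(x, memory)      # the agent makes a choice a
    r = world_reward(x, a)        # the world announces a reward r
    if r > 0:
        memory[x] = a
    total += r
print(total > 250)
```

In this structured toy world the memorizer succeeds; in the unstructured general case (actions unrelated to rewards) no agent can.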
3 0.17596497 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity
Introduction: One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations. There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail. In contrast, if you go to a machine learning conference, a large number of the new algorithms are v
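The "n-gram methods as linear predictors on an enormous sparse feature space" view above can be made concrete with a tiny sketch (the corpus and count-based weights are illustrative, not from the post): each (context, next-word) pair is one sparse indicator feature, and prediction is an argmax over a sparse dot product.

```python
from collections import Counter, defaultdict

# Bigram counts: one sparse "weight" per (context word, next word) feature.
corpus = "the cat sat on the mat the cat ran".split()
weights = defaultdict(Counter)
for c, w in zip(corpus, corpus[1:]):
    weights[c][w] += 1

def predict_next(context):
    # Only features whose context matches fire, so the "dot product"
    # reduces to a lookup over the sparse counter for that context.
    return weights[context].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in this corpus
```

Replacing raw counts with learned weights gives exactly a linear predictor on this sparse feature space, which is the point of the paragraph above.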
4 0.16065863 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
Introduction: Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment. There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n^2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vector Machines
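The n^2 dependence mentioned above is visible in the HMM forward pass: each time step multiplies the belief vector by an n x n transition matrix, touching every pair of states. A minimal sketch with illustrative toy parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 10  # n hidden states, T observations (toy sizes)

# Row-normalized transition and emission probabilities (binary observations).
trans = rng.random((n, n)); trans /= trans.sum(axis=1, keepdims=True)
emit = rng.random((n, 2)); emit /= emit.sum(axis=1, keepdims=True)
obs = rng.integers(0, 2, size=T)

alpha = np.full(n, 1.0 / n) * emit[:, obs[0]]  # uniform prior over states
for t in range(1, T):
    # O(n^2) work per step: alpha @ trans sums over every (state, state) pair.
    alpha = (alpha @ trans) * emit[:, obs[t]]
likelihood = alpha.sum()
print(0.0 < likelihood < 1.0)
```

With n states and T steps the total cost is O(T n^2), which is exactly the superlinear scaling in the number of states that the paragraph warns about.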
5 0.14836198 169 hunch net-2006-04-05-What is state?
Introduction: In reinforcement learning (and sometimes other settings), there is a notion of “state”. Based upon the state various predictions are made such as “Which action should be taken next?” or “How much cumulative reward do I expect if I take some action from this state?” Given the importance of state, it is important to examine the meaning. There are actually several distinct options and it turns out the definition variation is very important in motivating different pieces of work. Newtonian State. State is the physical pose of the world. Under this definition, there are very many states, often too many for explicit representation. This is also the definition typically used in games. Abstracted State. State is an abstracted physical state of the world. “Is the door open or closed?” “Are you in room A or not?” The number of states is much smaller here. A basic issue here is: “How do you compute the state from observations?” Mathematical State. State is a sufficient statistic
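The "sufficient statistic" notion of state can be shown with a toy example of my own (not from the post): to predict the running mean of coin flips, the pair (heads so far, flips so far) summarizes the entire history, so predictions from the compact state match predictions from the full history.

```python
# Full observation history of a toy coin.
flips = [1, 0, 1, 1, 0, 1]

# Prediction computed from the entire history...
full_history_pred = sum(flips) / len(flips)

# ...equals the prediction from a two-number state, updated online.
heads, n = 0, 0
for f in flips:
    heads, n = heads + f, n + 1  # the state update discards everything else
state_pred = heads / n

print(full_history_pred == state_pred)
```

Nothing about the discarded history could change the prediction, which is what makes this state "mathematical" rather than physical or abstracted.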
6 0.14680529 235 hunch net-2007-03-03-All Models of Learning have Flaws
7 0.13758403 351 hunch net-2009-05-02-Wielding a New Abstraction
8 0.13203353 237 hunch net-2007-04-02-Contextual Scaling
9 0.12823269 347 hunch net-2009-03-26-Machine Learning is too easy
10 0.12539865 330 hunch net-2008-12-07-A NIPS paper
11 0.11767701 332 hunch net-2008-12-23-Use of Learning Theory
12 0.11549646 343 hunch net-2009-02-18-Decision by Vetocracy
13 0.11260829 370 hunch net-2009-09-18-Necessary and Sufficient Research
14 0.10524583 454 hunch net-2012-01-30-ICML Posters and Scope
15 0.1032688 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
16 0.10272788 95 hunch net-2005-07-14-What Learning Theory might do
17 0.1023219 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
18 0.10194097 58 hunch net-2005-04-21-Dynamic Programming Generalizations and Their Use
19 0.10002732 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction
20 0.099302508 177 hunch net-2006-05-05-An ICML reject
simIndex simValue blogId blogTitle
same-blog 1 0.97181594 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
2 0.7616694 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity
3 0.73399675 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
4 0.69933248 169 hunch net-2006-04-05-What is state?
5 0.6631338 153 hunch net-2006-02-02-Introspectionism as a Disease
Introduction: In the AI-related parts of machine learning, it is often tempting to examine how you do things in order to imagine how a machine should do things. This is introspection, and it can easily go awry. I will call introspection gone awry introspectionism. Introspectionism is almost unique to AI (and the AI-related parts of machine learning) and it can lead to huge wasted effort in research. It’s easiest to show how introspectionism arises by an example. Suppose we want to solve the problem of navigating a robot from point A to point B given a camera. Then, the following research action plan might seem natural when you examine your own capabilities: Build an edge detector for still images. Build an object recognition system given the edge detector. Build a system to predict distance and orientation to objects given the object recognition system. Build a system to plan a path through the scene you construct from {object identification, distance, orientation} predictions.
6 0.65924823 351 hunch net-2009-05-02-Wielding a New Abstraction
7 0.65086764 330 hunch net-2008-12-07-A NIPS paper
8 0.62815011 100 hunch net-2005-08-04-Why Reinforcement Learning is Important
9 0.62414956 253 hunch net-2007-07-06-Idempotent-capable Predictors
10 0.61067313 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning
11 0.61036235 276 hunch net-2007-12-10-Learning Track of International Planning Competition
12 0.60519868 370 hunch net-2009-09-18-Necessary and Sufficient Research
13 0.60482395 215 hunch net-2006-10-22-Exemplar programming
14 0.60408992 347 hunch net-2009-03-26-Machine Learning is too easy
15 0.60370946 217 hunch net-2006-11-06-Data Linkage Problems
16 0.60284871 68 hunch net-2005-05-10-Learning Reductions are Reductionist
17 0.59956998 388 hunch net-2010-01-24-Specializations of the Master Problem
18 0.59835821 148 hunch net-2006-01-13-Benchmarks for RL
19 0.597633 168 hunch net-2006-04-02-Mad (Neuro)science
20 0.59742808 237 hunch net-2007-04-02-Contextual Scaling
simIndex simValue blogId blogTitle
same-blog 1 0.98085141 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
2 0.96547651 348 hunch net-2009-04-02-Asymmophobia
Introduction: One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms. feature symmetry Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance. example symmetry Every example is treated the same. Batch learning algorithms are great exemplars of this approach. label symmetry Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to arg max_l w_l x but it occurs in many other places as well. Empirically, breaking symmetry well seems to yield great algorithms. feature asymmetry For those who like t
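The label-symmetric multiclass rule arg max_l w_l x mentioned above is short enough to sketch directly (sizes and weights are illustrative): every label's weight vector enters the prediction in exactly the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
num_labels, dim = 3, 5
W = rng.normal(size=(num_labels, dim))  # one weight vector w_l per label
x = rng.normal(size=dim)

# Vectorized form of arg max_l w_l . x ...
pred = int(np.argmax(W @ x))

# ...which matches the explicit symmetric loop over labels.
loop_pred = max(range(num_labels), key=lambda l: float(W[l] @ x))
print(pred == loop_pred)
```

Breaking this symmetry would mean, for instance, scoring some labels by a different rule or with a shared default; the symmetric form above is the baseline the post is questioning.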
3 0.96545208 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project
Introduction: Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, developing via github . For example, the latest and greatest can be downloaded via: git clone git://github.com/JohnLangford/vowpal_wabbit.git If you aren’t familiar with git , it’s a distributed version control system which supports quick and easy branching, as well as reconciliation. This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes. As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience. The program can effectively scale up to batch-style training on sparse terafeature (i.e. 10^12 sparse feature) size datasets. The limiting factor is typically i/o. I started using the real datasets from the large-scale learning workshop as a conve
4 0.93759161 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models
Introduction: This is a paper by Yann LeCun and Fu Jie Huang published at AISTATS 2005. I found this paper very difficult to read, but it does have some point about a computational shortcut. This paper takes for granted that the method of solving a problem is gradient descent on parameters. Given this assumption, the question arises: Do you want to do gradient descent on a probabilistic model or something else? All (conditional) probabilistic models have the form p(y|x) = f(x,y)/Z(x) where Z(x) = sum_y f(x,y) (the paper calls -log f(x,y) an “energy”). If f is parameterized by some w, the gradient has a term for Z(x), and hence for every value of y. The paper claims that such models can be optimized for classification purposes using only the correct y and the incorrect y’ which maximizes f(x,y’). This can even be done on unnormalizable models. The paper further claims that this can be done with an approximate maximum. These claims are plausible based on experiments
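The shortcut above, updating only on the correct label and the highest-scoring incorrect label rather than summing over all y via Z(x), can be instantiated as a perceptron-style multiclass update. This is a sketch of the idea with f(x,y) = w_y · x, not the paper's exact loss; the data generator and epoch count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_labels, dim = 4, 6

# Separable toy data: labels come from a hidden linear scorer.
W_true = rng.normal(size=(num_labels, dim))
X = rng.normal(size=(50, dim))
Y = np.argmax(X @ W_true.T, axis=1)

W = np.zeros((num_labels, dim))  # f(x, y) = W[y] . x; "energy" is -f
for _ in range(20):
    for x, y in zip(X, Y):
        scores = W @ x
        scores_wrong = scores.copy()
        scores_wrong[y] = -np.inf
        y_bad = int(np.argmax(scores_wrong))  # the single contrastive label
        if scores[y] <= scores[y_bad]:
            W[y] += x       # only two labels are touched per update,
            W[y_bad] -= x   # with no normalizer Z(x) ever computed

train_acc = float(np.mean(np.argmax(X @ W.T, axis=1) == Y))
print(train_acc >= 0.5)
```

Note that no normalization over y appears anywhere, which is why the same update applies to unnormalizable models.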
5 0.93490434 37 hunch net-2005-03-08-Fast Physics for Learning
Introduction: While everyone is silently working on ICML submissions, I found this discussion about a fast physics simulator chip interesting from a learning viewpoint. In many cases, learning attempts to predict the outcome of physical processes. Access to a fast simulator for these processes might be quite helpful in predicting the outcome. Bayesian learning in particular may directly benefit while many other algorithms (like support vector machines) might have their speed greatly increased. The biggest drawback is that writing software for these odd architectures is always difficult and time consuming, but a several-orders-of-magnitude speedup might make that worthwhile.
6 0.93030369 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
7 0.92789447 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
8 0.92276192 95 hunch net-2005-07-14-What Learning Theory might do
9 0.92075431 143 hunch net-2005-12-27-Automated Labeling
10 0.91807038 237 hunch net-2007-04-02-Contextual Scaling
11 0.91477436 351 hunch net-2009-05-02-Wielding a New Abstraction
12 0.91350758 360 hunch net-2009-06-15-In Active Learning, the question changes
13 0.91332567 132 hunch net-2005-11-26-The Design of an Optimal Research Environment
14 0.91295534 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
15 0.91219151 297 hunch net-2008-04-22-Taking the next step
16 0.91218013 235 hunch net-2007-03-03-All Models of Learning have Flaws
17 0.91124988 194 hunch net-2006-07-11-New Models
18 0.91089123 371 hunch net-2009-09-21-Netflix finishes (and starts)
19 0.91035008 347 hunch net-2009-03-26-Machine Learning is too easy
20 0.90907711 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”