hunch_net hunch_net-2007 hunch_net-2007-237 knowledge-graph by maker-knowledge-mining

237 hunch net-2007-04-02-Contextual Scaling


meta info for this blog

Source: html

Introduction: Machine learning has a new kind of “scaling to larger problems” to worry about: scaling with the amount of contextual information. The standard development path for a machine learning application in practice seems to be the following: Marginal. In the beginning, there was “majority vote”. At this stage, it isn’t necessary to understand that you have a prediction problem. People just realize that one answer is right sometimes and another answer other times. In machine learning terms, this corresponds to making a prediction without side information. First context. A clever person realizes that some bit of information x_1 could be helpful. If x_1 is discrete, they condition on it and make a predictor h(x_1), typically by counting. If they are clever, then they also do some smoothing. If x_1 is some real-valued parameter, it’s very common to make a threshold cutoff. Often, these tasks are simply done by hand. Second. Another clever person (or perhaps the s
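The “first context” step described above (condition on a discrete x_1, count, smooth) can be sketched in code. The following is a minimal hypothetical illustration, not from the post; the function name, data, and smoothing choice are all assumptions.

```python
from collections import Counter, defaultdict

def counting_predictor(examples, alpha=1.0):
    """Build h(x_1): count label frequencies per discrete x_1 value,
    with additive (Laplace) smoothing controlled by alpha.
    A hypothetical sketch of the "condition and count" step."""
    counts = defaultdict(Counter)
    labels = set()
    for x1, y in examples:
        counts[x1][y] += 1
        labels.add(y)

    def h(x1):
        c = counts[x1]
        # smoothed majority vote over the observed labels
        return max(sorted(labels), key=lambda y: c[y] + alpha)

    return h

# toy data: predict a binary outcome from one discrete feature
h = counting_predictor([("sunny", 1), ("sunny", 1), ("sunny", 0),
                        ("rainy", 0), ("rainy", 0)])
```

For a real-valued x_1, the analogous hand-built step is a threshold cutoff rather than a per-value count.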


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Machine learning has a new kind of “scaling to larger problems” to worry about: scaling with the amount of contextual information. [sent-1, score-0.488]

2 A clever person realizes that some bit of information x_1 could be helpful. [sent-8, score-0.83]

3 Another clever person (or perhaps the same one) realizes that some other bit of information x_2 could be helpful. [sent-14, score-0.83]

4 … The previous step repeats for information x_3, …, x_100. [sent-17, score-0.386]

5 It’s no longer possible to visualize the data but a human can still function as a learning algorithm by carefully tweaking parameters and testing with the right software support to learn h(x_1, …, x_100). [sent-18, score-0.478]

6 Graphical models can sometimes help scale up counting based approaches. [sent-19, score-0.536]

7 The “human learning algorithm” approach starts breaking down, because it becomes hard to integrate new information sources in the context of all others. [sent-21, score-0.699]

8 People realize “we must automate this process of including new information to keep up”, and a learning algorithm is adopted. [sent-23, score-0.555]

9 Understanding the process of contextual scaling seems particularly helpful for teaching about machine learning. [sent-29, score-0.396]

10 It’s often the case that the switch to the last step could and should have happened before the 100th bit of information was integrated. [sent-30, score-0.538]

11 Number of examples required is generally exponential in the number of features. [sent-33, score-0.43]

12 Counting based approaches with smoothing and some prior language (graphical models, bayes nets, etc…). [sent-36, score-0.574]

13 Number of examples required is no longer exponential, but can still be intractably large. [sent-37, score-0.405]

14 No particular number of examples required, but sane prior specification from a human may be required. [sent-40, score-0.65]

15 A similarity measure is a weaker form of prior information which can be substantially easier to specify. [sent-42, score-0.703]

16 “Just throw the new information as a feature and let the learning algorithms sort it out”. [sent-44, score-0.444]

17 At each step in this order, less effort is required to integrate new information. [sent-45, score-0.441]

18 Obviously, when specific prior information is available, we want to incorporate it. [sent-47, score-0.601]

19 Equally obviously, when specific prior information is not available, we want to be able to take advantage of new information which happens to be easily useful. [sent-48, score-0.977]

20 When we have so much information that counting could work, a learning algorithm should behave similarly to counting (with smoothing). [sent-49, score-1.042]
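Sentence 20's point, that with enough data a learner should reduce to counting with smoothing, can be illustrated with a simple additive-smoothing estimate. This is a hypothetical sketch (function name and constants assumed, not from the post):

```python
def smoothed_rate(successes, trials, alpha=1.0, k=2):
    """Additive (Laplace) smoothing over k outcomes: with little data the
    estimate is pulled toward the uniform prior 1/k; with much data it
    converges to the raw relative frequency successes/trials."""
    return (successes + alpha) / (trials + k * alpha)

small = smoothed_rate(3, 4)          # (3+1)/(4+2) ~ 0.667, hedged toward 0.5
large = smoothed_rate(3000, 4000)    # ~ 0.75, smoothing barely matters
```

The estimator behaves like counting in the large-sample limit while staying sane when counts are tiny, which is the behavior the sentence asks of a learning algorithm.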


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('counting', 0.295), ('information', 0.284), ('prior', 0.235), ('contextual', 0.201), ('scaling', 0.195), ('based', 0.171), ('smoothing', 0.168), ('clever', 0.167), ('realizes', 0.149), ('human', 0.143), ('automation', 0.138), ('valued', 0.138), ('exponential', 0.137), ('required', 0.136), ('discrete', 0.13), ('similarity', 0.119), ('specification', 0.115), ('integrate', 0.111), ('ease', 0.105), ('step', 0.102), ('realize', 0.096), ('new', 0.092), ('still', 0.092), ('graphical', 0.091), ('longer', 0.091), ('remains', 0.09), ('examples', 0.086), ('could', 0.085), ('obviously', 0.084), ('algorithm', 0.083), ('specific', 0.082), ('person', 0.078), ('context', 0.075), ('saner', 0.074), ('becomes', 0.072), ('number', 0.071), ('predictor', 0.071), ('sometimes', 0.07), ('stage', 0.069), ('tweaking', 0.069), ('algorithms', 0.068), ('bit', 0.067), ('space', 0.066), ('real', 0.065), ('breaking', 0.065), ('breakthrough', 0.065), ('weaker', 0.065), ('condition', 0.065), ('boundary', 0.065), ('crafted', 0.065)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999964 237 hunch net-2007-04-02-Contextual Scaling


2 0.1934406 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

3 0.16370802 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe

4 0.15408714 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.
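The Bayesian recipe quoted above (specify a prior P(D), observe data, condition via Bayes law to get P(D|x), then choose the loss-minimizing decision) can be sketched for a finite set of models. The coin-bias setup below is a hypothetical example; all names are assumptions, not from the post:

```python
def posterior(prior, likelihood, data):
    """P(D|x) = P(x|D)P(D)/P(x) over a finite dict of models D."""
    unnorm = {d: p * likelihood(d, data) for d, p in prior.items()}
    z = sum(unnorm.values())  # P(x), the normalizer
    return {d: u / z for d, u in unnorm.items()}

def coin_likelihood(theta, flips):
    """P(flips | bias theta) for independent 0/1 flips."""
    p = 1.0
    for f in flips:
        p *= theta if f else (1.0 - theta)
    return p

# two candidate coins (fair vs. biased), uniform prior, three heads observed
post = posterior({0.5: 0.5, 0.9: 0.5}, coin_likelihood, [1, 1, 1])
```

A hard decision ("turn left" or "turn right") would then pick the action minimizing expected loss under `post`.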

5 0.15380429 95 hunch net-2005-07-14-What Learning Theory might do

Introduction: I wanted to expand on this post and some of the previous problems/research directions about where learning theory might make large strides. Why theory? The essential reason for theory is “intuition extension”. A very good applied learning person can master some particular application domain yielding the best computer algorithms for solving that problem. A very good theory can take the intuitions discovered by this and other applied learning people and extend them to new domains in a relatively automatic fashion. To do this, we take these basic intuitions and try to find a mathematical model that: Explains the basic intuitions. Makes new testable predictions about how to learn. Succeeds in so learning. This is “intuition extension”: taking what we have learned somewhere else and applying it in new domains. It is fundamentally useful to everyone because it increases the level of automation in solving problems. Where next for learning theory? I like the a

6 0.1455791 3 hunch net-2005-01-24-The Humanloop Spectrum of Machine Learning

7 0.14468879 160 hunch net-2006-03-02-Why do people count for learning?

8 0.13607034 347 hunch net-2009-03-26-Machine Learning is too easy

9 0.13592072 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

10 0.13203353 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

11 0.12948377 34 hunch net-2005-03-02-Prior, “Prior” and Bias

12 0.12741494 269 hunch net-2007-10-24-Contextual Bandits

13 0.12395934 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

14 0.11953644 400 hunch net-2010-06-13-The Good News on Exploration and Learning

15 0.11835603 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

16 0.11637949 217 hunch net-2006-11-06-Data Linkage Problems

17 0.1157499 43 hunch net-2005-03-18-Binomial Weighting

18 0.11396156 19 hunch net-2005-02-14-Clever Methods of Overfitting

19 0.11171436 345 hunch net-2009-03-08-Prediction Science

20 0.11028498 134 hunch net-2005-12-01-The Webscience Future


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.287), (1, 0.114), (2, -0.043), (3, 0.053), (4, 0.051), (5, -0.038), (6, -0.046), (7, 0.054), (8, 0.126), (9, -0.015), (10, -0.038), (11, -0.039), (12, 0.066), (13, 0.022), (14, 0.031), (15, -0.056), (16, 0.001), (17, -0.06), (18, 0.103), (19, -0.007), (20, -0.002), (21, -0.08), (22, 0.06), (23, 0.024), (24, 0.026), (25, 0.11), (26, 0.07), (27, 0.081), (28, 0.05), (29, -0.037), (30, -0.007), (31, -0.056), (32, 0.089), (33, -0.056), (34, 0.003), (35, -0.044), (36, -0.038), (37, -0.071), (38, -0.043), (39, 0.067), (40, 0.026), (41, -0.045), (42, 0.003), (43, -0.015), (44, -0.008), (45, 0.027), (46, -0.05), (47, -0.015), (48, 0.132), (49, -0.103)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96043944 237 hunch net-2007-04-02-Contextual Scaling


2 0.79209507 217 hunch net-2006-11-06-Data Linkage Problems

Introduction: Data linkage is a problem which seems to come up in various applied machine learning problems. I have heard it mentioned in various data mining contexts, but it seems relatively less studied for systemic reasons. A very simple version of the data linkage problem is a cross hospital patient record merge. Suppose a patient (John Doe) is admitted to a hospital (General Health), treated, and released. Later, John Doe is admitted to a second hospital (Health General), treated, and released. Given a large number of records of this sort, it becomes very tempting to try and predict the outcomes of treatments. This is reasonably straightforward as a machine learning problem if there is a shared unique identifier for John Doe used by General Health and Health General along with time stamps. We can merge the records and create examples of the form “Given symptoms and treatment, did the patient come back to a hospital within the next year?” These examples could be fed into a learning algo

3 0.72455835 160 hunch net-2006-03-02-Why do people count for learning?

Introduction: This post is about a confusion of mine with respect to many commonly used machine learning algorithms. A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes, where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express: P(A,B,C) = P(A | B,C)P(B)P(C) What I don’t understand is the mechanism commonly used to estimate P(A | B,C) . If we let N(A,B,C) be the number of instances of A,B,C then people sometimes form an estimate according to: P’(A | B,C) = [N(A,B,C)/N] / [(N(B)/N)(N(C)/N)] = N(A,B,C)N / [N(B)N(C)] … in other words, people just estimate P’(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size of the space A x B x C, but it (nat
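The relative-frequency estimate quoted above can be written out directly. This sketch implements the formula as stated, N(A,B,C)·N / (N(B)·N(C)), with hypothetical data; with few samples the estimate is as unreliable as the post warns:

```python
def conditional_estimate(samples, a, b, c):
    """Estimate P'(A=a | B=b, C=c) by the quoted relative-frequency rule:
    N(a,b,c) * N / (N(b) * N(c))."""
    n = len(samples)
    n_abc = sum(1 for s in samples if s == (a, b, c))
    n_b = sum(1 for (_, B, _) in samples if B == b)
    n_c = sum(1 for (_, _, C) in samples if C == c)
    return n_abc * n / (n_b * n_c)

# four hypothetical (A, B, C) samples
samples = [(1, 0, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
est = conditional_estimate(samples, 1, 0, 0)  # 1 * 4 / (2 * 2) = 1.0
```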

4 0.72186863 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

5 0.70874453 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

6 0.69326818 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

7 0.68433177 312 hunch net-2008-08-04-Electoralmarkets.com

8 0.62488455 347 hunch net-2009-03-26-Machine Learning is too easy

9 0.61221838 34 hunch net-2005-03-02-Prior, “Prior” and Bias

10 0.61161292 253 hunch net-2007-07-06-Idempotent-capable Predictors

11 0.6114651 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?

12 0.60392493 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

13 0.60322618 120 hunch net-2005-10-10-Predictive Search is Coming

14 0.60178846 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

15 0.5959602 153 hunch net-2006-02-02-Introspectionism as a Disease

16 0.58529657 68 hunch net-2005-05-10-Learning Reductions are Reductionist

17 0.58151591 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

18 0.57998556 102 hunch net-2005-08-11-Why Manifold-Based Dimension Reduction Techniques?

19 0.57863832 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

20 0.57723236 95 hunch net-2005-07-14-What Learning Theory might do


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.026), (16, 0.011), (21, 0.013), (27, 0.237), (38, 0.07), (49, 0.011), (53, 0.098), (55, 0.06), (64, 0.025), (77, 0.036), (82, 0.189), (94, 0.111), (95, 0.039)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.93040901 439 hunch net-2011-08-01-Interesting papers at COLT 2011

Introduction: Since John did not attend COLT this year, I have been volunteered to report back on the hot stuff at this year’s meeting. The conference seemed to have pretty high quality stuff this year, and I found plenty of interesting papers on all the three days. I’m gonna pick some of my favorites going through the program in a chronological order. The first session on matrices seemed interesting for two reasons. First, the papers were quite nice. But more interestingly, this is a topic that has had a lot of presence in Statistics and Compressed sensing literature recently. So it was good to see high-dimensional matrices finally make their entry at COLT. The paper of Ohad and Shai on Collaborative Filtering with the Trace Norm: Learning, Bounding, and Transducing provides non-trivial guarantees on trace norm regularization in an agnostic setup, while Rina and Nati show how Rademacher averages can be used to get sharper results for matrix completion problems in their paper Concentr

same-blog 2 0.92170191 237 hunch net-2007-04-02-Contextual Scaling


3 0.82928008 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

Introduction: Muthu invited me to the workshop on algorithms in the field , with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment. There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n^2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vecto

4 0.82538134 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

Introduction: Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m ? Any algorithm using all of the data is at least O(m) , but in many cases this is O(m^2) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. The above question can also be asked for test cases. In some applications, test-time performance is of great importance. How does the algorithm scale with the number of

5 0.82395053 351 hunch net-2009-05-02-Wielding a New Abstraction

Introduction: This post is partly meant as an advertisement for the reductions tutorial Alina , Bianca , and I are planning to do at ICML . Please come, if you are interested. Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems. In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult. It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, yo

6 0.82298881 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

7 0.82267088 131 hunch net-2005-11-16-The Everything Ensemble Edge

8 0.81993937 235 hunch net-2007-03-03-All Models of Learning have Flaws

9 0.81684232 95 hunch net-2005-07-14-What Learning Theory might do

10 0.81499666 12 hunch net-2005-02-03-Learning Theory, by assumption

11 0.81448328 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

12 0.81283504 19 hunch net-2005-02-14-Clever Methods of Overfitting

13 0.8125574 143 hunch net-2005-12-27-Automated Labeling

14 0.81246585 347 hunch net-2009-03-26-Machine Learning is too easy

15 0.81152618 370 hunch net-2009-09-18-Necessary and Sufficient Research

16 0.81082481 41 hunch net-2005-03-15-The State of Tight Bounds

17 0.80943549 258 hunch net-2007-08-12-Exponentiated Gradient

18 0.80834639 259 hunch net-2007-08-19-Choice of Metrics

19 0.80770493 14 hunch net-2005-02-07-The State of the Reduction

20 0.80720073 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”