hunch_net hunch_net-2005 hunch_net-2005-62 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: A calibrated predictor is one which predicts the probability of a binary event with the property: for all predictions p, the proportion of the time that 1 is observed is p. Since there are infinitely many p, this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p. A great deal of effort has been devoted to strategies for achieving calibrated prediction (such as here), with statements like: (under minimal conditions) you can always make calibrated predictions. Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way: We want good probabilistic predictions. Good probabilistic predictions are calibrated. Therefore, we want calibrated predictions. The “Therefore” step misses the fact that calibration is a necessary but not a sufficient characterization of good probabilities.
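The binned “softening” described above can be sketched directly. This is a hypothetical illustration rather than code from the post; the bin count, sample size, and the true probability 0.7 are arbitrary choices:

```python
import random

def max_calibration_gap(preds, outcomes, bins=10):
    # Bucket predictions into equal-width bins (the "small neighborhood"
    # softening), then compare each bin's mean prediction with the
    # observed frequency of 1s in that bin.
    buckets = [[] for _ in range(bins)]
    for p, y in zip(preds, outcomes):
        buckets[min(int(p * bins), bins - 1)].append((p, y))
    gap = 0.0
    for b in buckets:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            freq_one = sum(y for _, y in b) / len(b)
            gap = max(gap, abs(mean_pred - freq_one))
    return gap

random.seed(0)
# Outcomes drawn with true probability 0.7; always predicting 0.7 is calibrated,
# so the gap between predicted and observed frequencies should be small.
outcomes = [1 if random.random() < 0.7 else 0 for _ in range(10000)]
print(max_calibration_gap([0.7] * 10000, outcomes))
```

A predictor that always says 0.0 on the same outcomes would show a gap near 0.7, which is what a calibration test is designed to catch.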
sentIndex sentText sentNum sentScore
1 A calibrated predictor is one which predicts the probability of a binary event with the property: for all predictions p, the proportion of the time that 1 is observed is p. [sent-1, score-0.965]
2 Since there are infinitely many p, this definition must be “softened” to make sense for any finite number of samples. [sent-2, score-0.239]
3 The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p. [sent-3, score-0.425]
4 A great deal of effort has been devoted to strategies for achieving calibrated prediction (such as here). [sent-4, score-0.729]
5 With statements like: (under minimal conditions) you can always make calibrated predictions. [sent-5, score-0.765]
6 Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. [sent-6, score-0.177]
7 A confusion of ends arises in the following way: We want good probabilistic predictions. [sent-7, score-0.629]
8 The “Therefore” step misses the fact that calibration is a necessary but not a sufficient characterization of good probabilities. [sent-10, score-0.599]
9 For example, on the sequence “010101010…”, always predicting p=0.5 is calibrated, yet conveys nothing about which outcome comes next. [sent-11, score-0.228]
10 This leads to the question: What is a sufficient characterization of good probabilities? [sent-13, score-0.492]
11 Small log probability: sum_x log(1/p_x). I don’t yet understand which of these candidates is preferable. [sent-16, score-0.472]
12 There is a sense in which none of them can be preferred. [sent-17, score-0.141]
13 In any complete prediction system, the probabilities are used in some manner, and there is some loss (or utility) associated with its use. [sent-18, score-0.522]
14 Depending on the sanity of the method using the probabilities, this may even imply that lying about the probabilities is preferred. [sent-20, score-0.636]
15 Nevertheless, we can hope for a sane use of probabilities and a sufficient mechanism for predicting good probabilities might eventually result in good performance for any sane use. [sent-21, score-1.769]
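The alternating-sequence example above can be made concrete. Assuming log loss (sum_x log(1/p_x)) as the scoring rule, the constant predictor p=0.5 is perfectly calibrated on “0101…” yet scores far worse than a hypothetical predictor that exploits the alternation:

```python
import math

def log_loss(preds, outcomes):
    # sum over x of log(1/p_x), where p_x is the probability the
    # predictor assigned to the outcome that actually occurred
    return sum(math.log(1 / (p if y == 1 else 1 - p))
               for p, y in zip(preds, outcomes))

seq = [0, 1] * 50                      # the sequence "010101..."

constant = [0.5] * len(seq)            # calibrated, but uninformative
informed = [0.99 if i % 2 else 0.01    # exploits the alternation
            for i in range(len(seq))]

print(log_loss(constant, seq))   # 100 * log 2, about 69.3
print(log_loss(informed, seq))   # 100 * log(1/0.99), about 1.0
```

The constant predictor passes any calibration test on this sequence, so only a loss like this separates the two — which is the sense in which calibration is necessary but not sufficient.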
simIndex simValue blogId blogTitle
same-blog 1 0.99999988 62 hunch net-2005-04-26-To calibrate or not?
2 0.17853141 5 hunch net-2005-01-26-Watchword: Probability
Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent better will switch from ‘for’ to ‘against’ at some single value of Y . The probability is then Y/(Y+1) . Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in
3 0.17355987 218 hunch net-2006-11-20-Context and the calculation misperception
Introduction: This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict somethinng. Hopefully, by precisely stating it, we can remove it. Suppose we have a set of events, each described by a vector of features. 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0 Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label (feature 1) is mostly 0. Using this simple rule we get an observed error rate of 3/7. There are two issues here. The first is that this is really a training
4 0.14576711 118 hunch net-2005-10-07-On-line learning of regular decision rules
Introduction: Many decision problems can be represented in the form FOR n =1,2,…: — Reality chooses a datum x n . — Decision Maker chooses his decision d n . — Reality chooses an observation y n . — Decision Maker suffers loss L ( y n , d n ). END FOR. The observation y n can be, for example, tomorrow’s stock price and the decision d n the number of shares Decision Maker chooses to buy. The datum x n ideally contains all information that might be relevant in making this decision. We do not want to assume anything about the way Reality generates the observations and data. Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D ( x ). Can we perform as well, or almost as well, as D , without knowing it? This is essentially a special case of the problem of on-line learning . This is a simple result of this kind. Suppose the data x n are taken from [0,1] and L ( y , d )=| y – d |. A norm || h || of a function h on
5 0.1151427 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * Product features i Pr’(feature i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is
6 0.098719992 131 hunch net-2005-11-16-The Everything Ensemble Edge
7 0.09324082 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models
8 0.083301932 14 hunch net-2005-02-07-The State of the Reduction
9 0.080391556 259 hunch net-2007-08-19-Choice of Metrics
10 0.080250919 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
11 0.077544853 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability
12 0.077043012 274 hunch net-2007-11-28-Computational Consequences of Classification
13 0.076706648 370 hunch net-2009-09-18-Necessary and Sufficient Research
14 0.070857711 388 hunch net-2010-01-24-Specializations of the Master Problem
15 0.070460208 245 hunch net-2007-05-12-Loss Function Semantics
16 0.070004903 77 hunch net-2005-05-29-Maximum Margin Mismatch?
17 0.068431385 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics
18 0.068343624 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
19 0.066722199 127 hunch net-2005-11-02-Progress in Active Learning
20 0.066564322 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive