hunch_net hunch_net-2005 hunch_net-2005-118 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Many decision problems can be represented in the form FOR n =1,2,…: — Reality chooses a datum x n . — Decision Maker chooses his decision d n . — Reality chooses an observation y n . — Decision Maker suffers loss L ( y n , d n ). END FOR. The observation y n can be, for example, tomorrow’s stock price and the decision d n the number of shares Decision Maker chooses to buy. The datum x n ideally contains all information that might be relevant in making this decision. We do not want to assume anything about the way Reality generates the observations and data. Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D ( x ). Can we perform as well, or almost as well, as D , without knowing it? This is essentially a special case of the problem of on-line learning . This is a simple result of this kind. Suppose the data x n are taken from [0,1] and L ( y , d )=| y – d |. A norm || h || of a function h on
sentIndex sentText sentNum sentScore
1 Many decision problems can be represented in the form FOR n =1,2,…: — Reality chooses a datum x n . [sent-1, score-0.744]
2 — Decision Maker suffers loss L ( y n , d n ). [sent-4, score-0.134]
3 The observation y n can be, for example, tomorrow’s stock price and the decision d n the number of shares Decision Maker chooses to buy. [sent-6, score-0.65]
4 The datum x n ideally contains all information that might be relevant in making this decision. [sent-7, score-0.284]
5 We do not want to assume anything about the way Reality generates the observations and data. [sent-8, score-0.076]
6 Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D ( x ). [sent-9, score-0.83]
7 A norm || h || of a function h on [0,1] is defined by || h || 2 = (Integral 0 1 h ( t ) dt ) 2 + Integral 0 1 ( h '( t )) 2 dt . [sent-14, score-0.508]
8 Decision Maker has a strategy that guarantees, for all N and all D with finite || D ||, Sum n =1 N L ( y n , d n ) is at most Sum n =1 N L ( y n , D ( x n )) + (||2 D –1|| + 1) N 1/2 . [sent-15, score-0.135]
9 Therefore, Decision Maker is competitive with D provided the “mean slope” (Integral 0 1 ( D '( t )) 2 dt ) 1/2 of D is significantly less than N 1/2 . [sent-16, score-0.226]
10 This can be extended to general reproducing kernel Hilbert spaces of decision rules and to many other loss functions. [sent-17, score-0.458]
11 The standard definition (“ Kolmogorov’s axiomatic “) is that probability is a normalized measure. [sent-19, score-0.144]
12 The alternative is to define probability in terms of games rather than measure (this was started by von Mises and greatly advanced by Ville, who in 1939 replaced von Mises’s awkward gambling strategies with what he called martingales). [sent-20, score-0.764]
13 This game-theoretic probability allows one to state the most important laws of probability as continuous strategies for gambling against the forecaster: the gambler becomes rich if the forecasts violate the law. [sent-21, score-0.76]
14 In 2004 Akimichi Takemura noticed that for every such strategy the forecaster can play in such a way as not to lose a penny. [sent-22, score-0.344]
15 Takemura’s procedure is simple and constructive: for any continuous law of probability we have an explicit forecasting strategy that satisfies that law perfectly. [sent-23, score-1.176]
16 Takemura’s version was about binary observations, but it has been greatly extended since. [sent-25, score-0.167]
17 This would be easy if we knew the true probabilities for the observations. [sent-27, score-0.279]
18 At each step we would choose the decision with the smallest expected loss and the law of large numbers would allow us to prove that our goal would be achieved. [sent-28, score-0.924]
19 Alas, we do not know the true probabilities (if they exist at all). [sent-29, score-0.212]
20 But with defensive forecasting at our disposal we do not need them: the fake probabilities produced by defensive forecasting are ideal as far as the law of large numbers is concerned. [sent-30, score-1.4]
wordName wordTfidf (topN-words)
[('maker', 0.335), ('decision', 0.275), ('forecasting', 0.268), ('datum', 0.226), ('dt', 0.226), ('takemura', 0.226), ('defensive', 0.201), ('chooses', 0.189), ('law', 0.173), ('sum', 0.161), ('integral', 0.161), ('probabilities', 0.155), ('forecaster', 0.151), ('mises', 0.151), ('reality', 0.146), ('probability', 0.144), ('strategy', 0.135), ('continuous', 0.134), ('gambling', 0.117), ('von', 0.112), ('strategies', 0.107), ('extended', 0.103), ('procedure', 0.097), ('loss', 0.08), ('suppose', 0.078), ('observations', 0.076), ('numbers', 0.072), ('martingales', 0.067), ('shares', 0.067), ('smallest', 0.067), ('would', 0.067), ('greatly', 0.064), ('kolmogorov', 0.062), ('disposal', 0.062), ('observation', 0.061), ('ideally', 0.058), ('laws', 0.058), ('noticed', 0.058), ('price', 0.058), ('true', 0.057), ('norm', 0.056), ('violate', 0.056), ('goal', 0.056), ('result', 0.056), ('replaced', 0.054), ('represented', 0.054), ('mapping', 0.054), ('suffers', 0.054), ('games', 0.054), ('satisfies', 0.052)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000002 118 hunch net-2005-10-07-On-line learning of regular decision rules
Introduction: Many decision problems can be represented in the form FOR n =1,2,…: — Reality chooses a datum x n . — Decision Maker chooses his decision d n . — Reality chooses an observation y n . — Decision Maker suffers loss L ( y n , d n ). END FOR. The observation y n can be, for example, tomorrow’s stock price and the decision d n the number of shares Decision Maker chooses to buy. The datum x n ideally contains all information that might be relevant in making this decision. We do not want to assume anything about the way Reality generates the observations and data. Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D ( x ). Can we perform as well, or almost as well, as D , without knowing it? This is essentially a special case of the problem of on-line learning . This is a simple result of this kind. Suppose the data x n are taken from [0,1] and L ( y , d )=| y – d |. A norm || h || of a function h on
2 0.14576711 62 hunch net-2005-04-26-To calibrate or not?
Introduction: A calibrated predictor is one which predicts the probability of a binary event with the property: For all predictions p , the proportion of the time that 1 is observed is p . Since there are infinitely many p , this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p . A great deal of effort has been devoted to strategies for achieving calibrated (such as here ) prediction. With statements like: (under minimal conditions) you can always make calibrated predictions. Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way: We want good probabilistic predictions. Good probabilistic predictions are calibrated. Therefore, we want calibrated predictions. The “Therefore” step misses the fact that calibration is a necessary b
3 0.14352575 397 hunch net-2010-05-02-What’s the difference between gambling and rewarding good prediction?
Introduction: After a major financial crisis , there is much discussion about how finance has become a casino gambling with other’s money, keeping the winnings, and walking away when the money is lost. When thinking about financial reform, all the many losers in the above scenario are apt to take the view that this activity should be completely, or nearly completely curtailed. But, a more thoughtful view is that sometimes there is a real sense in which there are right and wrong decisions, and we as a society would really prefer that the people most likely to make right decisions are making them. A crucial question then is: “What is the difference between gambling and rewarding good prediction?” We discussed this before the financial crisis . The cheat-sheet sketch is that the online learning against an adversary problem, algorithm, and theorems, provide a good mathematical model for thinking about this question. What I would like to do here is map this onto various types of financial t
4 0.12812044 5 hunch net-2005-01-26-Watchword: Probability
Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent better will switch from ‘for’ to ‘against’ at some single value of Y . The probability is then Y/(Y+1) . Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in
5 0.096707523 218 hunch net-2006-11-20-Context and the calculation misperception
Introduction: This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict somethinng. Hopefully, by precisely stating it, we can remove it. Suppose we have a set of events, each described by a vector of features. 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0 Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label (feature 1) is mostly 0. Using this simple rule we get an observed error rate of 3/7. There are two issues here. The first is that this is really a training
6 0.088686667 388 hunch net-2010-01-24-Specializations of the Master Problem
7 0.087454915 131 hunch net-2005-11-16-The Everything Ensemble Edge
8 0.08606673 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
9 0.084705472 235 hunch net-2007-03-03-All Models of Learning have Flaws
10 0.083167642 9 hunch net-2005-02-01-Watchword: Loss
11 0.082775019 112 hunch net-2005-09-14-The Predictionist Viewpoint
12 0.077825889 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
13 0.076343313 245 hunch net-2007-05-12-Loss Function Semantics
14 0.073798239 304 hunch net-2008-06-27-Reviewing Horror Stories
15 0.073530957 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”
16 0.069840327 3 hunch net-2005-01-24-The Humanloop Spectrum of Machine Learning
17 0.069792137 100 hunch net-2005-08-04-Why Reinforcement Learning is Important
18 0.06892205 343 hunch net-2009-02-18-Decision by Vetocracy
19 0.067117944 123 hunch net-2005-10-16-Complexity: It’s all in your head
20 0.064499557 274 hunch net-2007-11-28-Computational Consequences of Classification
topicId topicWeight
[(0, 0.142), (1, 0.08), (2, 0.049), (3, -0.016), (4, -0.058), (5, 0.005), (6, -0.004), (7, 0.042), (8, 0.053), (9, -0.004), (10, 0.024), (11, -0.033), (12, 0.032), (13, -0.057), (14, -0.016), (15, -0.002), (16, -0.048), (17, -0.02), (18, 0.037), (19, 0.04), (20, -0.066), (21, 0.042), (22, 0.014), (23, 0.072), (24, -0.043), (25, 0.044), (26, -0.018), (27, -0.014), (28, 0.06), (29, 0.036), (30, 0.017), (31, 0.005), (32, -0.022), (33, 0.036), (34, -0.01), (35, 0.057), (36, 0.006), (37, 0.085), (38, 0.114), (39, -0.018), (40, -0.022), (41, 0.056), (42, -0.073), (43, 0.032), (44, -0.08), (45, 0.132), (46, -0.054), (47, 0.051), (48, -0.072), (49, 0.077)]
simIndex simValue blogId blogTitle
same-blog 1 0.97669911 118 hunch net-2005-10-07-On-line learning of regular decision rules
Introduction: Many decision problems can be represented in the form FOR n =1,2,…: — Reality chooses a datum x n . — Decision Maker chooses his decision d n . — Reality chooses an observation y n . — Decision Maker suffers loss L ( y n , d n ). END FOR. The observation y n can be, for example, tomorrow’s stock price and the decision d n the number of shares Decision Maker chooses to buy. The datum x n ideally contains all information that might be relevant in making this decision. We do not want to assume anything about the way Reality generates the observations and data. Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D ( x ). Can we perform as well, or almost as well, as D , without knowing it? This is essentially a special case of the problem of on-line learning . This is a simple result of this kind. Suppose the data x n are taken from [0,1] and L ( y , d )=| y – d |. A norm || h || of a function h on
2 0.73006415 62 hunch net-2005-04-26-To calibrate or not?
Introduction: A calibrated predictor is one which predicts the probability of a binary event with the property: For all predictions p , the proportion of the time that 1 is observed is p . Since there are infinitely many p , this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p . A great deal of effort has been devoted to strategies for achieving calibrated (such as here ) prediction. With statements like: (under minimal conditions) you can always make calibrated predictions. Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way: We want good probabilistic predictions. Good probabilistic predictions are calibrated. Therefore, we want calibrated predictions. The “Therefore” step misses the fact that calibration is a necessary b
3 0.65356076 5 hunch net-2005-01-26-Watchword: Probability
Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent better will switch from ‘for’ to ‘against’ at some single value of Y . The probability is then Y/(Y+1) . Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in
4 0.54730827 397 hunch net-2010-05-02-What’s the difference between gambling and rewarding good prediction?
Introduction: After a major financial crisis , there is much discussion about how finance has become a casino gambling with other’s money, keeping the winnings, and walking away when the money is lost. When thinking about financial reform, all the many losers in the above scenario are apt to take the view that this activity should be completely, or nearly completely curtailed. But, a more thoughtful view is that sometimes there is a real sense in which there are right and wrong decisions, and we as a society would really prefer that the people most likely to make right decisions are making them. A crucial question then is: “What is the difference between gambling and rewarding good prediction?” We discussed this before the financial crisis . The cheat-sheet sketch is that the online learning against an adversary problem, algorithm, and theorems, provide a good mathematical model for thinking about this question. What I would like to do here is map this onto various types of financial t
5 0.5467428 218 hunch net-2006-11-20-Context and the calculation misperception
Introduction: This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict somethinng. Hopefully, by precisely stating it, we can remove it. Suppose we have a set of events, each described by a vector of features. 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0 Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label (feature 1) is mostly 0. Using this simple rule we get an observed error rate of 3/7. There are two issues here. The first is that this is really a training
6 0.53379357 123 hunch net-2005-10-16-Complexity: It’s all in your head
7 0.5107677 330 hunch net-2008-12-07-A NIPS paper
8 0.4701947 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability
9 0.46522132 302 hunch net-2008-05-25-Inappropriate Mathematics for Machine Learning
10 0.45879108 112 hunch net-2005-09-14-The Predictionist Viewpoint
11 0.44967973 39 hunch net-2005-03-10-Breaking Abstractions
12 0.44302812 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
13 0.42934713 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
14 0.42638925 185 hunch net-2006-06-16-Regularization = Robustness
15 0.4226577 413 hunch net-2010-10-08-An easy proof of the Chernoff-Hoeffding bound
16 0.42226484 282 hunch net-2008-01-06-Research Political Issues
17 0.42167491 176 hunch net-2006-05-01-A conversation between Theo and Pat
18 0.42023975 70 hunch net-2005-05-12-Math on the Web
19 0.41689837 160 hunch net-2006-03-02-Why do people count for learning?
20 0.41155323 140 hunch net-2005-12-14-More NIPS Papers II
topicId topicWeight
[(0, 0.032), (3, 0.01), (27, 0.215), (38, 0.065), (53, 0.016), (55, 0.075), (77, 0.073), (89, 0.336), (94, 0.058), (95, 0.023)]
simIndex simValue blogId blogTitle
1 0.95369107 328 hunch net-2008-11-26-Efficient Reinforcement Learning in MDPs
Introduction: Claude Sammut is attempting to put together an Encyclopedia of Machine Learning . I volunteered to write one article on Efficient RL in MDPs , which I would like to invite comment on. Is something critical missing?
2 0.92390049 50 hunch net-2005-04-01-Basic computer science research takes a hit
Introduction: The New York Times has an interesting article about how DARPA has dropped funding for computer science to universities by about a factor of 2 over the last 5 years and become less directed towards basic research. Partially in response, the number of grant submissions to NSF has grown by a factor of 3 (with the NSF budget staying approximately constant in the interim). This is the sort of policy decision which may make sense for the defense department, but which means a large hit for basic research on information technology development in the US. For example “darpa funded the invention of the internet” is reasonably correct. This policy decision is particularly painful in the context of NSF budget cuts and the end of extensive phone monopoly funded research at Bell labs. The good news from a learning perspective is that (based on anecdotal evidence) much of the remaining funding is aimed at learning and learning-related fields. Methods of making good automated predictions obv
same-blog 3 0.87196064 118 hunch net-2005-10-07-On-line learning of regular decision rules
Introduction: Many decision problems can be represented in the form FOR n =1,2,…: — Reality chooses a datum x n . — Decision Maker chooses his decision d n . — Reality chooses an observation y n . — Decision Maker suffers loss L ( y n , d n ). END FOR. The observation y n can be, for example, tomorrow’s stock price and the decision d n the number of shares Decision Maker chooses to buy. The datum x n ideally contains all information that might be relevant in making this decision. We do not want to assume anything about the way Reality generates the observations and data. Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D ( x ). Can we perform as well, or almost as well, as D , without knowing it? This is essentially a special case of the problem of on-line learning . This is a simple result of this kind. Suppose the data x n are taken from [0,1] and L ( y , d )=| y – d |. A norm || h || of a function h on
4 0.85371387 84 hunch net-2005-06-22-Languages of Learning
Introduction: A language is a set of primitives which can be combined to succesfully create complex objects. Languages arise in all sorts of situations: mechanical construction, martial arts, communication, etc… Languages appear to be the key to succesfully creating complex objects—it is difficult to come up with any convincing example of a complex object which is not built using some language. Since languages are so crucial to success, it is interesting to organize various machine learning research programs by language. The most common language in machine learning are languages for representing the solution to machine learning. This includes: Bayes Nets and Graphical Models A language for representing probability distributions. The key concept supporting modularity is conditional independence. Michael Kearns has been working on extending this to game theory. Kernelized Linear Classifiers A language for representing linear separators, possibly in a large space. The key form of
5 0.82685733 480 hunch net-2013-03-22-I’m a bandit
Introduction: Sebastien Bubeck has a new ML blog focused on optimization and partial feedback which may interest people.
6 0.77927279 299 hunch net-2008-04-27-Watchword: Supervised Learning
7 0.64561343 36 hunch net-2005-03-05-Funding Research
8 0.5831992 388 hunch net-2010-01-24-Specializations of the Master Problem
9 0.58263046 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?
10 0.58214545 375 hunch net-2009-10-26-NIPS workshops
11 0.58203179 259 hunch net-2007-08-19-Choice of Metrics
12 0.57635105 269 hunch net-2007-10-24-Contextual Bandits
13 0.57346267 165 hunch net-2006-03-23-The Approximation Argument
14 0.57318383 235 hunch net-2007-03-03-All Models of Learning have Flaws
15 0.57214451 220 hunch net-2006-11-27-Continuizing Solutions
16 0.57071197 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”
17 0.56784153 449 hunch net-2011-11-26-Giving Thanks
18 0.56555456 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
19 0.56548089 14 hunch net-2005-02-07-The State of the Reduction
20 0.56522059 351 hunch net-2009-05-02-Wielding a New Abstraction