hunch_net hunch_net-2005 hunch_net-2005-74 knowledge-graph by maker-knowledge-mining

74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?


meta info for this blog

Source: html

Introduction: Suppose you are given a sequence of observations x_1,…,x_T from some space and wish to predict a sequence of labels y_1,…,y_T so as to minimize the Hamming loss: sum_{i=1}^{T} I(y_i != c(x_1,…,x_T)_i), where c(x_1,…,x_T)_i is the i-th predicted component. For simplicity, suppose each label y_i is in {0,1}. We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component y_i independently, since the loss is a linear combination of losses on each individual component i. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of Te. This breakup into T different prediction problems is not the standard form of modularity in structured prediction. A more typical form of modularity is to predict y_i given x_i, y_{i-1}, y_{i+1}, where the circularity (predicting given other predictions) is handled in various ways.
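The independent-classifier reduction above can be checked numerically. The sketch below is illustrative only: the "classifiers" are stand-ins that err independently with rate e on a fixed all-zero label sequence (none of these names come from the post), and the empirical average Hamming loss lands near the predicted T*e.

```python
import random

def hamming_loss(y_true, y_pred):
    # Hamming loss decomposes into a sum of per-component 0/1 losses,
    # which is why each component can be optimized independently.
    return sum(yt != yp for yt, yp in zip(y_true, y_pred))

def noisy_predict(y_true, e, rng):
    # Stand-in for T independent per-component classifiers, each with
    # error rate e (an illustrative model, not a learned predictor).
    return [y ^ (rng.random() < e) for y in y_true]

rng = random.Random(0)
T, e, trials = 20, 0.1, 2000
avg = sum(hamming_loss([0] * T, noisy_predict([0] * T, e, rng))
          for _ in range(trials)) / trials
# The empirical average Hamming loss lands near the expected T*e = 2.0.
print(round(avg, 2))
```

Because the loss is linear in the per-component losses, no interaction between components appears: the simulation's average stays within sampling noise of T*e.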


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Suppose you are given a sequence of observations x_1,…,x_T from some space and wish to predict a sequence of labels y_1,…,y_T so as to minimize the Hamming loss: sum_{i=1}^{T} I(y_i != c(x_1,…,x_T)_i), where c(x_1,…,x_T)_i is the i-th predicted component. [sent-1, score-0.69] [sent-2, score-0.177]

3 For simplicity, suppose each label y_i is in {0,1}. [sent-3, score-0.123]

4 We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component y_i independently, since the loss is a linear combination of losses on each individual component i. [sent-4, score-1.845]

5 From a learning reductions viewpoint, we can learn a different classifier for each individual component. [sent-5, score-0.25]

6 An average error rate of e over these classifiers implies an expected Hamming loss of Te. [sent-6, score-0.601]

7 This breakup into T different prediction problems is not the standard form of modularity in structured prediction. [sent-7, score-0.743]

8 A more typical form of modularity is to predict y_i given x_i, y_{i-1}, y_{i+1}, where the circularity (predicting given other predictions) is handled in various ways. [sent-8, score-1.001]

9 This is often represented with a graphical model. This form of modularity seems to be preferred for several reasons: Graphical models of this sort are a natural language for expressing what we know (or believe we know) about a problem in advance. [sent-9, score-0.718]

10 There may be computational advantages to learning to predict from fewer features. [sent-10, score-0.492]

11 (But note that handling the circularity is sometimes computationally difficult.) [sent-11, score-0.331]

12 There may be sample complexity advantages to learning to predict from fewer features. [sent-12, score-0.492]

13 In particular, an average error rate of e for each of the predictors can easily imply a Hamming loss of O(eT^2). [sent-15, score-1.142]

14 Matti Kaariainen convinced me this is not improvable for predictors of this form. [sent-16, score-0.206]
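The O(eT^2) accumulation can be seen in a toy simulation. This is a hedged sketch, not Kaariainen's construction: each predictor here simply copies the previous *prediction* (the true labels are all 0) and errs with rate e, so one early mistake keeps corrupting later outputs and the loss grows roughly quadratically in T rather than linearly.

```python
import random

def chained_loss(T, e, rng):
    # Each predictor copies the previous *prediction* (true labels are all 0)
    # and errs with probability e, so one early mistake keeps propagating.
    pred, loss = 0, 0
    for _ in range(T):
        if rng.random() < e:
            pred ^= 1          # a local error flips the carried prediction
        loss += (pred != 0)    # compare against the true label 0
    return loss

rng = random.Random(1)
T, e, trials = 20, 0.05, 4000
avg = sum(chained_loss(T, e, rng) for _ in range(trials)) / trials
# Independent predictors would give expected loss T*e = 1.0; the chain's
# average is several times larger, growing roughly like e*T^2 for small e*T.
print(round(avg, 1))
```

The contrast with the Te guarantee of the independent reduction is the point: the same per-predictor error rate e costs far more once predictions feed into later predictions.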

15 One is driven by the loss function while the other is driven by simplicity of prediction descriptions. [sent-18, score-0.843]

16 Each has advantages and disadvantages from a practical viewpoint. [sent-19, score-0.289]

17 Is there a compelling algorithm for solving structured prediction which incorporates both intuitions? [sent-21, score-0.366]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('hamming', 0.423), ('loss', 0.253), ('circularity', 0.238), ('modularity', 0.238), ('advantages', 0.201), ('component', 0.185), ('individual', 0.168), ('simplicity', 0.164), ('driven', 0.164), ('predict', 0.157), ('fewer', 0.134), ('graphical', 0.129), ('suppose', 0.123), ('error', 0.122), ('rate', 0.12), ('sequence', 0.12), ('structured', 0.12), ('predictors', 0.118), ('average', 0.106), ('breakup', 0.106), ('accumulate', 0.106), ('reconciled', 0.106), ('th', 0.106), ('predicting', 0.105), ('form', 0.099), ('prediction', 0.098), ('matti', 0.098), ('et', 0.093), ('handled', 0.093), ('handling', 0.093), ('expressing', 0.088), ('incorporated', 0.088), ('convinced', 0.088), ('independently', 0.088), ('disadvantages', 0.088), ('given', 0.088), ('represented', 0.085), ('different', 0.082), ('preferred', 0.079), ('intuitions', 0.077), ('wish', 0.075), ('errors', 0.073), ('losses', 0.071), ('predicted', 0.071), ('minimize', 0.066), ('forms', 0.066), ('sum', 0.064), ('optimizing', 0.064), ('optimize', 0.063), ('compelling', 0.06)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?


2 0.23505117 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l_2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV

3 0.22387797 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

Introduction: Many people in machine learning take advantage of the notion of a proxy loss: a loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing squared loss (y - f_w(x))^2 over many samples is much more tractable than optimizing 0-1 loss I(y != Threshold(f_w(x) - 0.5)). While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log 1/p’(y|x) where p’(y|x) is the predicted probability of the observed label. A standard method for confi

4 0.19429749 245 hunch net-2007-05-12-Loss Function Semantics

Introduction: Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself. Optimizing squared loss l_sq(y,y') = (y-y')^2 means predicting the (conditional) mean of y. Optimizing absolute value loss l_av(y,y') = |y-y'| means predicting the (conditional) median of y. Variants can handle other quantiles. 0/1 loss for classification is a special case. Optimizing log loss l_log(y,y') = log(1/Pr_{z~y'}(z=y)) means minimizing the description length of y. The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form: For all distributions D over Y, if y' = arg min_{y'} E_{y~D} l_sq(y,y') then y' = E_{y~D} y. Similar theorems hold for the other examples above, and they can all be extended to predictors of y' for distributions D over a context X and a value Y. There are 3 points to this post. Everyone doing general machine lear
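The mean/median semantics in this excerpt are easy to verify numerically. The brute-force sketch below (toy data and names are mine, not from the post) searches a grid of constant predictions and confirms that squared loss is minimized at the mean and absolute loss at the median.

```python
# Toy data chosen so the mean (3.0) and median (1.0) differ visibly.
ys = [0.0, 1.0, 1.0, 10.0]

def best_constant(loss, candidates):
    # Brute-force search for the constant prediction minimizing total loss.
    return min(candidates, key=lambda c: sum(loss(y, c) for y in ys))

candidates = [i / 100 for i in range(0, 1001)]  # grid over [0, 10]
sq = best_constant(lambda y, c: (y - c) ** 2, candidates)
av = best_constant(lambda y, c: abs(y - c), candidates)

print(sq)  # 3.0, the mean of ys
print(av)  # 1.0, the median of ys
```

The outlier 10.0 pulls the squared-loss minimizer toward it but leaves the absolute-loss minimizer at the median, which is exactly the robustness distinction the loss semantics predict.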

5 0.19392745 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

Introduction: Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time? When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set. The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that s

6 0.1545051 259 hunch net-2007-08-19-Choice of Metrics

7 0.15081684 274 hunch net-2007-11-28-Computational Consequences of Classification

8 0.1488315 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

9 0.13519831 14 hunch net-2005-02-07-The State of the Reduction

10 0.13089095 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

11 0.12893002 77 hunch net-2005-05-29-Maximum Margin Mismatch?

12 0.12228636 235 hunch net-2007-03-03-All Models of Learning have Flaws

13 0.11420114 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

14 0.11045328 149 hunch net-2006-01-18-Is Multitask Learning Black-Boxable?

15 0.11034749 67 hunch net-2005-05-06-Don’t mix the solution into the problem

16 0.10735343 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

17 0.10528733 129 hunch net-2005-11-07-Prediction Competitions

18 0.10363545 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

19 0.10360167 218 hunch net-2006-11-20-Context and the calculation misperception

20 0.10326981 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.197), (1, 0.201), (2, 0.119), (3, -0.13), (4, -0.182), (5, 0.043), (6, -0.017), (7, 0.029), (8, 0.061), (9, -0.037), (10, -0.015), (11, -0.078), (12, -0.019), (13, 0.003), (14, 0.007), (15, -0.02), (16, 0.001), (17, 0.015), (18, -0.028), (19, -0.031), (20, 0.06), (21, -0.043), (22, -0.023), (23, 0.027), (24, 0.007), (25, -0.028), (26, 0.02), (27, 0.033), (28, -0.026), (29, -0.009), (30, -0.01), (31, 0.042), (32, 0.063), (33, -0.025), (34, -0.032), (35, 0.059), (36, 0.064), (37, -0.005), (38, -0.045), (39, -0.057), (40, -0.008), (41, -0.063), (42, 0.038), (43, -0.006), (44, 0.003), (45, -0.039), (46, 0.036), (47, 0.014), (48, -0.002), (49, -0.015)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97976476 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?


2 0.82315767 274 hunch net-2007-11-28-Computational Consequences of Classification

Introduction: In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e error classifier with just log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints

3 0.80678016 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


4 0.79394668 9 hunch net-2005-02-01-Watchword: Loss


5 0.7931906 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


6 0.79217267 259 hunch net-2007-08-19-Choice of Metrics

7 0.76182014 245 hunch net-2007-05-12-Loss Function Semantics

8 0.65697628 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

9 0.65245026 67 hunch net-2005-05-06-Don’t mix the solution into the problem

10 0.60856885 103 hunch net-2005-08-18-SVM Adaptability

11 0.60013902 374 hunch net-2009-10-10-ALT 2009

12 0.59496742 129 hunch net-2005-11-07-Prediction Competitions

13 0.56346995 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

14 0.55918819 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

15 0.51314342 235 hunch net-2007-03-03-All Models of Learning have Flaws

16 0.51158911 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

17 0.50291514 218 hunch net-2006-11-20-Context and the calculation misperception

18 0.50045127 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

19 0.49871835 133 hunch net-2005-11-28-A question of quantification

20 0.49806404 77 hunch net-2005-05-29-Maximum Margin Mismatch?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.416), (3, 0.032), (27, 0.29), (53, 0.09), (55, 0.056), (94, 0.014)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.90255308 473 hunch net-2012-09-29-Vowpal Wabbit, version 7.0

Introduction: A new version of VW is out. The primary changes are: Learning Reductions: I’ve wanted to get learning reductions working and we’ve finally done it. Not everything is implemented yet, but VW now supports direct: Multiclass Classification --oaa or --ect. Cost Sensitive Multiclass Classification --csoaa or --wap. Contextual Bandit Classification --cb. Sequential Structured Prediction --searn or --dagger. In addition, it is now easy to build your own custom learning reductions for various plausible uses: feature diddling, custom structured prediction problems, or alternate learning reductions. This effort is far from done, but it is now in a generally useful state. Note that all learning reductions inherit the ability to do cluster parallel learning. Library interface: VW now has a basic library interface. The library provides most of the functionality of VW, with the limitation that it is monolithic and nonreentrant. These will be improved over

same-blog 2 0.87821001 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?


3 0.86962169 62 hunch net-2005-04-26-To calibrate or not?

Introduction: A calibrated predictor is one which predicts the probability of a binary event with the property: For all predictions p , the proportion of the time that 1 is observed is p . Since there are infinitely many p , this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p . A great deal of effort has been devoted to strategies for achieving calibrated (such as here ) prediction. With statements like: (under minimal conditions) you can always make calibrated predictions. Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way: We want good probabilistic predictions. Good probabilistic predictions are calibrated. Therefore, we want calibrated predictions. The “Therefore” step misses the fact that calibration is a necessary b
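The "softened" definition in this excerpt can be checked empirically by bucketing predictions into neighborhoods. The sketch below (function name and toy data are mine, not from the post) compares each bucket's mean prediction with the observed frequency of 1s; a calibrated predictor makes the two match.

```python
from collections import defaultdict

def calibration_table(preds, outcomes, bins=10):
    # "Soften" calibration over neighborhoods: bucket predictions and
    # compare each bucket's mean prediction with its observed frequency.
    stats = defaultdict(lambda: [0, 0, 0.0])   # bin -> [count, ones, pred_sum]
    for p, y in zip(preds, outcomes):
        b = min(int(p * bins), bins - 1)
        stats[b][0] += 1
        stats[b][1] += y
        stats[b][2] += p
    return {b: (s[2] / s[0], s[1] / s[0]) for b, s in stats.items()}

# A toy predictor that is calibrated by construction: it says 0.2 on a
# group where 1 occurs 20% of the time, and 0.8 on a group at 80%.
preds = [0.2] * 10 + [0.8] * 10
outcomes = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0] + [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
table = calibration_table(preds, outcomes)
print(table)   # each bucket: (mean prediction, observed frequency)
```

Note this check only verifies calibration, not usefulness: a predictor that always outputs the base rate is calibrated too, which is the "confusion of ends" the excerpt warns about.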

4 0.85759974 215 hunch net-2006-10-22-Exemplar programming

Introduction: There are many different abstractions for problem definition and solution. Here are a few examples: Functional programming: a set of functions are defined. The composed execution of these functions yields the solution. Linear programming: a set of constraints and a linear objective function are defined. An LP solver finds the constrained optimum. Quadratic programming: Like linear programming, but the language is a little more flexible (and the solution slower). Convex programming: like quadratic programming, but the language is more flexible (and the solutions even slower). Dynamic programming: a recursive definition of the problem is defined and then solved efficiently via caching tricks. SAT programming: A problem is specified as a satisfiability involving a conjunction of a disjunction of boolean variables. A general engine attempts to find a good satisfying assignment. For example Kautz’s blackbox planner. These abstractions have different tradeoffs betw
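As a concrete instance of the dynamic-programming abstraction listed above (a recursive problem definition solved efficiently via caching), here is a memoized edit-distance sketch; the choice of edit distance as the example problem is mine, since the post names none.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def edit_distance(a, b):
    # Recursive problem definition made efficient via caching, exactly
    # the dynamic-programming pattern described above.
    if not a:
        return len(b)
    if not b:
        return len(a)
    return min(
        edit_distance(a[1:], b) + 1,                    # delete from a
        edit_distance(a, b[1:]) + 1,                    # insert into a
        edit_distance(a[1:], b[1:]) + (a[0] != b[0]),   # match / substitute
    )

print(edit_distance("kitten", "sitting"))  # 3
```

Without the cache the recursion is exponential in the string lengths; with it, each (suffix, suffix) pair is solved once, which is the "caching trick" that makes the abstraction practical.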

5 0.85685921 87 hunch net-2005-06-29-Not EM for clustering at COLT

Introduction: One standard approach for clustering data with a set of gaussians is using EM. Roughly speaking, you pick a set of k random gaussians and then use alternating expectation maximization to (hopefully) find a set of gaussians that “explain” the data well. This process is difficult to work with because EM can become “stuck” in local optima. There are various hacks like “rerun with t different random starting points”. One cool observation is that this can often be solved via other algorithms which do not suffer from local optima. This is an early paper which shows this. Ravi Kannan presented a new paper showing this is possible in a much more adaptive setting. A very rough summary of these papers is that by projecting into a lower dimensional space, it is computationally tractable to pick out the gross structure of the data. It is unclear how well these algorithms work in practice, but they might be effective, especially if used as a subroutine of the form: Projec

6 0.82577646 193 hunch net-2006-07-09-The Stock Prediction Machine Learning Problem

7 0.73689079 133 hunch net-2005-11-28-A question of quantification

8 0.61558849 492 hunch net-2013-12-01-NIPS tutorials and Vowpal Wabbit 7.4

9 0.61282223 245 hunch net-2007-05-12-Loss Function Semantics

10 0.60781944 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

11 0.6039241 9 hunch net-2005-02-01-Watchword: Loss

12 0.60235715 220 hunch net-2006-11-27-Continuizing Solutions

13 0.59795493 14 hunch net-2005-02-07-The State of the Reduction

14 0.59771168 28 hunch net-2005-02-25-Problem: Online Learning

15 0.59604853 5 hunch net-2005-01-26-Watchword: Probability

16 0.5944131 258 hunch net-2007-08-12-Exponentiated Gradient

17 0.5934003 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

18 0.59136724 378 hunch net-2009-11-15-The Other Online Learning

19 0.59050596 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

20 0.58617675 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”