hunch_net hunch_net-2005 hunch_net-2005-77 knowledge-graph by maker-knowledge-mining

77 hunch net-2005-05-29-Maximum Margin Mismatch?


meta info for this blog

Source: html

Introduction: John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. I’m not alone: the paper won the best student paper award at NIPS in 2003. There are some things I find odd about the paper. For instance, it says of probabilistic models “cannot handle high dimensional feature spaces and lack strong theoretical guarantees.” I’m aware of no such limitations. Also: “Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.” This is quite interesting and contradicts my own experience as well as that of a number of people I greatly respect. I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with?


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 John makes a fascinating point about structured classification (and slightly scooped my post!) [sent-1, score-0.194]

2 Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. [sent-3, score-0.194]

3 There are some things I find odd about the paper. [sent-5, score-0.149]

4 For instance, it says of probabilistic models “cannot handle high dimensional feature spaces and lack strong theoretical guarantees.” [sent-6, score-0.186]

5 Also: “Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.” [sent-8, score-0.36]

6 I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with? [sent-10, score-0.32]

7 The elegance of M3N, I think, is unrelated to this probabilistic/margin distinction. [sent-11, score-0.092]

8 M3N provided the first implementation of the margin concept that was computationally efficient for multiple output variables and provided a sample complexity result with a much weaker dependence than previous approaches. [sent-12, score-0.703]

9 Further, the authors carry out some nice experiments that speak well for the practicality of their approach. [sent-13, score-0.192]

10 In particular, M3Ns outperform Conditional Random Fields (CRFs) in terms of per-variable (Hamming) loss. [sent-14, score-0.08]

11 And I think this gets us to the crux of the matter, and ties back to John’s post. [sent-15, score-0.232]

12 CRFs are trained by a MAP approach that is effectively per-sequence, while the loss function we care about at run time is per-variable (the two objectives are contrasted in symbols below this list). [sent-16, score-0.657]

13 The mismatch the post title refers to is that, at test time, M3Ns are Viterbi decoded: a per-sequence decoding. [sent-17, score-0.909]

14 Intuitively, Viterbi is an algorithm that only gets paid for its services when it classifies an entire sequence correctly (see the decoding note below this list). [sent-18, score-0.945]

15 This seems an odd mismatch, and makes one wonder: how well does a per-variable approach, like the marginal likelihood approach of Roweis, Kakade, and Teh mentioned previously, combined with runtime belief propagation, compare with the M3N procedure? (A toy comparison of the two decoding rules is sketched below this list.) [sent-19, score-0.599]

16 Does the mismatch matter, and if so, is there a decoding procedure, such as BP, more appropriate for margin-trained methods? [sent-20, score-0.731]

17 And finally, it seems we need to answer John’s question convincingly: if you really care about per-variable probabilities or classifications, isn’t it possible that structuring the output space actually hurts? [sent-21, score-0.393]

18 (It seems clear to me that it can help when you insist on getting the entire sequence right, although perhaps others don’t concur with that.) [sent-22, score-0.345]
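
A note in symbols on the training-time contrast in sentence 12 (my notation, not the paper’s): a CRF is trained to maximize the per-sequence conditional likelihood log p(y_1,…,y_T | x_1,…,x_T), while the per-variable alternative of sentence 15 maximizes the sum of marginal log-likelihoods sum_{t=1}^T log p(y_t | x_1,…,x_T). The first objective lines up with whole-sequence loss, the second with per-variable (Hamming) loss.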
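
The same contrast holds at decoding time (this is standard decision theory, nothing specific to M3N): Viterbi returns the MAP sequence argmax_{y_1,…,y_T} p(y_1,…,y_T | x), which is Bayes-optimal under whole-sequence 0/1 loss, while per-variable (marginal) decoding returns argmax_{y_t} p(y_t | x) separately for each t, which is Bayes-optimal under Hamming loss. The two can disagree: the MAP sequence may pass through individually improbable states, and the per-variable winners may form a jointly improbable sequence.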
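
The sketch below makes that disagreement concrete on a tiny chain model. It is a minimal illustration, not anything from the post or the paper: the potentials are invented, and forward-backward here plays the role of belief propagation on a chain.

import numpy as np

# Toy 2-state chain: p(y) proportional to prod_t node[t][y_t] * prod_t edge[y_t][y_{t+1}].
# All numbers are invented purely to exhibit the decoding mismatch.
node = np.array([[0.60, 0.40],   # position 0 slightly prefers state 0
                 [0.50, 0.50],   # position 1 is indifferent
                 [0.35, 0.65]])  # position 2 prefers state 1
edge = np.array([[0.8, 0.2],
                 [0.2, 0.8]])    # transitions prefer staying in the same state
T, K = node.shape

def viterbi(node, edge):
    # Per-sequence decoding: argmax_y p(y). Optimal for whole-sequence 0/1 loss.
    delta = np.log(node[0])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(edge) + np.log(node[t])[None, :]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

def marginal_decode(node, edge):
    # Per-variable decoding: argmax_{y_t} p(y_t) for each t, computed by
    # forward-backward (belief propagation on a chain). Optimal for Hamming loss.
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = node[0]
    for t in range(1, T):
        alpha[t] = node[t] * (alpha[t - 1] @ edge)
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = edge @ (node[t + 1] * beta[t + 1])
    return (alpha * beta).argmax(axis=1).tolist()

print("per-sequence (Viterbi): ", viterbi(node, edge))          # -> [1, 1, 1]
print("per-variable (marginal):", marginal_decode(node, edge))  # -> [0, 1, 1]

On this toy model the MAP sequence is all ones, while the marginal at the first position favors state 0, so the two rules genuinely disagree; which answer is “right” depends on whether the loss is per-sequence or per-variable, which is exactly the mismatch at issue.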


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('mismatch', 0.277), ('sequence', 0.225), ('viterbi', 0.185), ('crfs', 0.185), ('trained', 0.174), ('wonder', 0.16), ('john', 0.158), ('odd', 0.149), ('margin', 0.149), ('procedure', 0.145), ('per', 0.139), ('provided', 0.138), ('gets', 0.132), ('entire', 0.12), ('care', 0.116), ('structured', 0.113), ('appropriate', 0.111), ('output', 0.108), ('probabilistic', 0.105), ('matter', 0.105), ('hamming', 0.1), ('bp', 0.1), ('classifies', 0.1), ('practicality', 0.1), ('propagation', 0.1), ('services', 0.1), ('ties', 0.1), ('classifications', 0.092), ('structuring', 0.092), ('carry', 0.092), ('likelihood', 0.092), ('unrelated', 0.092), ('approach', 0.089), ('decoding', 0.087), ('ben', 0.087), ('carlos', 0.087), ('weaker', 0.087), ('convincingly', 0.083), ('root', 0.083), ('concept', 0.083), ('paid', 0.083), ('refers', 0.083), ('classification', 0.081), ('models', 0.081), ('marginal', 0.08), ('alone', 0.08), ('intuitively', 0.08), ('outperform', 0.08), ('probabilities', 0.077), ('cause', 0.077)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 77 hunch net-2005-05-29-Maximum Margin Mismatch?

Introduction: John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. I’m not alone: the paper won the best student paper award at NIPS in 2003. There are some things I find odd about the paper. For instance, it says of probabilistic models “cannot handle high dimensional feature spaces and lack strong theoretical guarantees.” I’m aware of no such limitations. Also: “Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.” This is quite interesting and contradicts my own experience as well as that of a number of people I greatly respect. I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with?

2 0.17821932 189 hunch net-2006-07-05-more icml papers

Introduction: Here are a few other papers I enjoyed from ICML06. Topic Models: Dynamic Topic Models David Blei, John Lafferty A nice model for how topics in LDA type models can evolve over time, using a linear dynamical system on the natural parameters and a very clever structured variational approximation (in which the mean field parameters are pseudo-observations of a virtual LDS). Like all Blei papers, he makes it look easy, but it is extremely impressive. Pachinko Allocation Wei Li, Andrew McCallum A very elegant (but computationally challenging) model which induces correlation amongst topics using a multi-level DAG whose interior nodes are “super-topics” and “sub-topics” and whose leaves are the vocabulary words. Makes the slumbering monster of structure learning stir. Sequence Analysis (I missed these talks since I was chairing another session) Online Decoding of Markov Models with Latency Constraints Mukund Narasimhan, Paul Viola, Michael Shilman An “a

3 0.12893002 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

Introduction: Suppose you are given a sequence of observations x_1,…,x_T from some space and wish to predict a sequence of labels y_1,…,y_T so as to minimize the Hamming loss: sum_{i=1}^T I(y_i != c(x_1,…,x_T)_i), where c(x_1,…,x_T)_i is the i-th predicted component. For simplicity, suppose each label y_i is in {0,1}. We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component y_i independently, since the loss is a linear combination of losses on each individual component i. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of Te. This breakup into T different prediction problems is not the standard form of modularity in structured prediction. A more typical form of modularity is to predict y_i given x_i, y_{i-1}, y_{i+1}, where the circularity (predicting given other

4 0.11750229 280 hunch net-2007-12-20-Cool and Interesting things at NIPS, take three

Introduction: Following up on Hal Daume’s post and John’s post on cool and interesting things seen at NIPS, I’ll post my own little list of neat papers here as well. Of course it’s going to be biased towards what I think is interesting. Also, I have to say that I hadn’t been able to see many papers this year at NIPS due to being too busy, so please feel free to contribute the papers that you liked. 1. P. Mudigonda, V. Kolmogorov, P. Torr. An Analysis of Convex Relaxations for MAP Estimation. A surprising paper which shows that many of the more sophisticated convex relaxations that had been proposed recently turn out to be subsumed by the simplest LP relaxation. Be careful next time you try a cool new convex relaxation! 2. D. Sontag, T. Jaakkola. New Outer Bounds on the Marginal Polytope. The title says it all. The marginal polytope is the set of local marginal distributions over subsets of variables that are globally consistent in the sense that there is at least one distributio

5 0.099507019 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

6 0.09750393 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

7 0.097319178 101 hunch net-2005-08-08-Apprenticeship Reinforcement Learning for Control

8 0.095214911 448 hunch net-2011-10-24-2011 ML symposium and the bears

9 0.093051754 347 hunch net-2009-03-26-Machine Learning is too easy

10 0.091508403 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

11 0.088817313 58 hunch net-2005-04-21-Dynamic Programming Generalizations and Their Use

12 0.086866356 351 hunch net-2009-05-02-Wielding a New Abstraction

13 0.085947484 177 hunch net-2006-05-05-An ICML reject

14 0.085673168 218 hunch net-2006-11-20-Context and the calculation misperception

15 0.08406768 6 hunch net-2005-01-27-Learning Complete Problems

16 0.083787642 49 hunch net-2005-03-30-What can Type Theory teach us about Machine Learning?

17 0.083239689 139 hunch net-2005-12-11-More NIPS Papers

18 0.082692072 151 hunch net-2006-01-25-1 year

19 0.082270637 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

20 0.0821651 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.217), (1, 0.078), (2, 0.028), (3, -0.028), (4, 0.022), (5, 0.025), (6, -0.023), (7, -0.042), (8, 0.054), (9, -0.072), (10, -0.003), (11, -0.036), (12, -0.087), (13, -0.045), (14, 0.025), (15, -0.022), (16, -0.038), (17, 0.045), (18, 0.077), (19, -0.04), (20, -0.015), (21, 0.039), (22, -0.035), (23, -0.023), (24, 0.087), (25, 0.039), (26, -0.068), (27, 0.018), (28, 0.028), (29, 0.018), (30, -0.055), (31, -0.063), (32, -0.034), (33, -0.062), (34, -0.023), (35, 0.11), (36, 0.013), (37, 0.079), (38, 0.007), (39, -0.056), (40, 0.033), (41, -0.048), (42, 0.041), (43, 0.037), (44, -0.006), (45, -0.046), (46, 0.033), (47, 0.031), (48, -0.061), (49, 0.04)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97572035 77 hunch net-2005-05-29-Maximum Margin Mismatch?

Introduction: John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. I’m not alone: the paper won the best student paper award at NIPS in 2003. There are some things I find odd about the paper. For instance, it says of probabilistic models “cannot handle high dimensional feature spaces and lack strong theoretical guarantees.” I’m aware of no such limitations. Also: “Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.” This is quite interesting and contradicts my own experience as well as that of a number of people I greatly respect. I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with?

2 0.73842371 189 hunch net-2006-07-05-more icml papers

Introduction: Here are a few other papers I enjoyed from ICML06. Topic Models: Dynamic Topic Models David Blei, John Lafferty A nice model for how topics in LDA type models can evolve over time, using a linear dynamical system on the natural parameters and a very clever structured variational approximation (in which the mean field parameters are pseudo-observations of a virtual LDS). Like all Blei papers, he makes it look easy, but it is extremely impressive. Pachinko Allocation Wei Li, Andrew McCallum A very elegant (but computationally challenging) model which induces correlation amongst topics using a multi-level DAG whose interior nodes are “super-topics” and “sub-topics” and whose leaves are the vocabulary words. Makes the slumbering monster of structure learning stir. Sequence Analysis (I missed these talks since I was chairing another session) Online Decoding of Markov Models with Latency Constraints Mukund Narasimhan, Paul Viola, Michael Shilman An “a

3 0.71716523 140 hunch net-2005-12-14-More NIPS Papers II

Introduction: I thought this was a very good NIPS with many excellent papers. The following are a few NIPS papers which I liked and I hope to study more carefully when I get the chance. The list is not exhaustive and in no particular order… Preconditioner Approximations for Probabilistic Graphical Models. Pradeep Ravikumar and John Lafferty. I thought the use of preconditioner methods from solving linear systems in the context of approximate inference was novel and interesting. The results look good and I’d like to understand the limitations. Rodeo: Sparse nonparametric regression in high dimensions. John Lafferty and Larry Wasserman. A very interesting approach to feature selection in nonparametric regression from a frequentist framework. The use of lengthscale variables in each dimension reminds me a lot of ‘Automatic Relevance Determination’ in Gaussian process regression — it would be interesting to compare Rodeo to ARD in GPs. Interpolating between types and tokens by estimating

4 0.67934972 280 hunch net-2007-12-20-Cool and Interesting things at NIPS, take three

Introduction: Following up on Hal Daume’s post and John’s post on cool and interesting things seen at NIPS, I’ll post my own little list of neat papers here as well. Of course it’s going to be biased towards what I think is interesting. Also, I have to say that I hadn’t been able to see many papers this year at NIPS due to being too busy, so please feel free to contribute the papers that you liked. 1. P. Mudigonda, V. Kolmogorov, P. Torr. An Analysis of Convex Relaxations for MAP Estimation. A surprising paper which shows that many of the more sophisticated convex relaxations that had been proposed recently turn out to be subsumed by the simplest LP relaxation. Be careful next time you try a cool new convex relaxation! 2. D. Sontag, T. Jaakkola. New Outer Bounds on the Marginal Polytope. The title says it all. The marginal polytope is the set of local marginal distributions over subsets of variables that are globally consistent in the sense that there is at least one distributio

5 0.66872758 185 hunch net-2006-06-16-Regularization = Robustness

Introduction: The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math, argmax_p H(p) s.t. E_p[f_i] = c_i is given by e^{\sum \lambda_i f_i}/Z. (Z here is the necessary normalization constant, and the lambdas are free parameters we set to meet the expectation constraints.) A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (Motivating work in models in text learning and Conditional Random Fields, for instance.) The result has been demonstrated a number of ways. One of the most elegant is the “geometric” version here. In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the expone

6 0.66862816 139 hunch net-2005-12-11-More NIPS Papers

7 0.63350314 144 hunch net-2005-12-28-Yet more nips thoughts

8 0.62354356 101 hunch net-2005-08-08-Apprenticeship Reinforcement Learning for Control

9 0.61376524 192 hunch net-2006-07-08-Some recent papers

10 0.57510293 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

11 0.57503337 440 hunch net-2011-08-06-Interesting thing at UAI 2011

12 0.54090625 330 hunch net-2008-12-07-A NIPS paper

13 0.53972137 97 hunch net-2005-07-23-Interesting papers at ACL

14 0.53438711 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

15 0.52429914 188 hunch net-2006-06-30-ICML papers

16 0.50629938 301 hunch net-2008-05-23-Three levels of addressing the Netflix Prize

17 0.50190002 58 hunch net-2005-04-21-Dynamic Programming Generalizations and Their Use

18 0.4999339 444 hunch net-2011-09-07-KDD and MUCMD 2011

19 0.48357964 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

20 0.48113209 102 hunch net-2005-08-11-Why Manifold-Based Dimension Reduction Techniques?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.021), (3, 0.041), (27, 0.196), (30, 0.013), (38, 0.049), (43, 0.272), (53, 0.088), (55, 0.149), (67, 0.014), (94, 0.054), (95, 0.029)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.89555246 241 hunch net-2007-04-28-The Coming Patent Apocalypse

Introduction: Many people in computer science believe that patents are problematic. The truth is even worse—the patent system in the US is fundamentally broken in ways that will require much more significant reform than is being considered now. The myth of the patent is the following: Patents are a mechanism for inventors to be compensated according to the value of their inventions while making the invention available to all. This myth sounds pretty desirable, but the reality is a strange distortion slowly leading towards collapse. There are many problems associated with patents, but I would like to focus on just two of them: Patent Trolls The way that patents have generally worked over the last several decades is that they were a tool of large companies. Large companies would amass a large number of patents and then cross-license each other’s patents—in effect saying “we agree to owe each other nothing”. Smaller companies would sometimes lose in this game, essentially because they

same-blog 2 0.89190292 77 hunch net-2005-05-29-Maximum Margin Mismatch?

Introduction: John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. I’m not alone: the paper won the best student paper award at NIPS in 2003. There are some things I find odd about the paper. For instance, it says of probabilistic models “cannot handle high dimensional feature spaces and lack strong theoretical guarantees.” I’m aware of no such limitations. Also: “Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.” This is quite interesting and contradicts my own experience as well as that of a number of people I greatly respect. I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with?

3 0.6946556 437 hunch net-2011-07-10-ICML 2011 and the future

Introduction: Unfortunately, I ended up sick for much of this ICML. I did manage to catch one interesting paper: Richard Socher, Cliff Lin, Andrew Y. Ng, and Christopher D. Manning, Parsing Natural Scenes and Natural Language with Recursive Neural Networks. I invited Richard to share his list of interesting papers, so hopefully we’ll hear from him soon. In the meantime, Paul and Hal have posted some lists. the future Joelle and I are program chairs for ICML 2012 in Edinburgh, which I previously enjoyed visiting in 2005. This is a huge responsibility that we hope to accomplish well. A part of this (perhaps the most fun part) is imagining how we can make ICML better. A key and critical constraint is choosing things that can be accomplished. So far we have: Colocation. The first thing we looked into was potential colocations. We quickly discovered that many other conferences precommitted their location. For the future, getting a colocation with ACL or SIGI

4 0.691742 484 hunch net-2013-06-16-Representative Reviewing

Introduction: When thinking about how best to review papers, it seems helpful to have some conception of what good reviewing is. As far as I can tell, this is almost always only discussed in the specific context of a paper (i.e. your rejected paper), or at most an area (i.e. what a “good paper” looks like for that area) rather than general principles. Neither individual papers nor areas are sufficiently general for a large conference—every paper differs in the details, and what if you want to build a new area and/or cross areas? An unavoidable reason for reviewing is that the community of research is too large. In particular, it is not possible for a researcher to read every paper which someone thinks might be of interest. This reason for reviewing exists independent of constraints on rooms or scheduling formats of individual conferences. Indeed, history suggests that physical constraints are relatively meaningless over the long term — growing conferences simply use more rooms and/or change fo

5 0.68780345 452 hunch net-2012-01-04-Why ICML? and the summer conferences

Introduction: Here’s a quick reference for summer ML-related conferences sorted by due date:

Conference   Due date   Location                                     Reviewing
KDD          Feb 10     August 12-16, Beijing, China                 Single Blind
COLT         Feb 14     June 25-June 27, Edinburgh, Scotland         Single Blind? (historically)
ICML         Feb 24     June 26-July 1, Edinburgh, Scotland          Double Blind, author response, zero SPOF
UAI          March 30   August 15-17, Catalina Islands, California   Double Blind, author response

Geographically, this is greatly dispersed and the UAI/KDD conflict is unfortunate. Machine Learning conferences are triannual now, between NIPS, AIStat, and ICML. This has not always been the case: the academic default is annual summer conferences, then NIPS started with a December conference, and now AIStat has grown into an April conference. However, the first claim is not quite correct. NIPS and AIStat have few competing venues while ICML implicitly competes with many other conf

6 0.68672436 116 hunch net-2005-09-30-Research in conferences

7 0.68623883 40 hunch net-2005-03-13-Avoiding Bad Reviewing

8 0.68087918 403 hunch net-2010-07-18-ICML & COLT 2010

9 0.67578882 466 hunch net-2012-06-05-ICML acceptance statistics

10 0.67573875 225 hunch net-2007-01-02-Retrospective

11 0.6755842 204 hunch net-2006-08-28-Learning Theory standards for NIPS 2006

12 0.6721428 134 hunch net-2005-12-01-The Webscience Future

13 0.67110837 343 hunch net-2009-02-18-Decision by Vetocracy

14 0.67056572 454 hunch net-2012-01-30-ICML Posters and Scope

15 0.6702888 89 hunch net-2005-07-04-The Health of COLT

16 0.66906536 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?

17 0.66864979 44 hunch net-2005-03-21-Research Styles in Machine Learning

18 0.66841638 297 hunch net-2008-04-22-Taking the next step

19 0.66727227 461 hunch net-2012-04-09-ICML author feedback is open

20 0.6663692 256 hunch net-2007-07-20-Motivation should be the Responsibility of the Reviewer