knowledge-graph by maker-knowledge-mining

420 hunch net-2010-12-26-NIPS 2010


meta info for this blog

Source: html

Introduction: I enjoyed attending NIPS this year, with several things interesting me. For the conference itself: Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona, The Multidimensional Wisdom of Crowds. This paper is about using Mechanical Turk to get label information, with results superior to a majority vote approach. David McAllester, Tamir Hazan, and Joseph Keshet, Direct Loss Minimization for Structured Prediction. This is about another technique for directly optimizing the loss in structured prediction, with an application to speech recognition. Mohammad Saberian and Nuno Vasconcelos, Boosting Classifier Cascades. This is about an algorithm for simultaneously optimizing loss and computation in a classifier cascade construction. There were several other papers on cascades which are worth looking at if interested. Alan Fern and Prasad Tadepalli, A Computational Decision Theory for Interactive Assistants. This paper carves out some


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This is about another technique for directly optimizing the loss in structured prediction, with an application to speech recognition. [sent-5, score-0.293]

2 This is about an algorithm for simultaneously optimizing loss and computation in a classifier cascade construction. [sent-7, score-0.282]

3 There were several other papers on cascades which are worth looking at if interested. [sent-8, score-0.166]

4 It’s good to see people moving beyond MDPs, which at this point are both well understood and limited. [sent-11, score-0.128]

5 This paper is about a natural and relatively unexplored, and potentially dominating approach for achieving differential privacy and learning. [sent-13, score-0.335]

6 The CtF workshop could have been named “Integrating breadth first search and learning”. [sent-17, score-0.186]

7 I was somewhat (I hope not too) pesky, discussing Searn repeatedly during questions, since it seems quite plausible that a good application of Searn would compete with and plausibly improve on results from several of the talks. [sent-18, score-0.372]

8 Eventually, I hope the conventional wisdom shifts to a belief that search and learning must be integrated for efficiency and robustness reasons. [sent-19, score-0.412]

9 The level of agreement in approaches at the LCCC workshop was much lower, with people discussing many radically different approaches. [sent-22, score-0.33]

10 Should data be organized by feature partition or example partition? [sent-23, score-0.208]

11 Fernando points out that features often scale sublinearly in the number of examples, implying that an example partition addresses scale better. [sent-24, score-0.816]

12 However, basic learning theory tells us that if the number of parameters scales sublinearly in the number of examples, then the value of additional samples asymptotes, implying a mismatched solution design. [sent-25, score-0.275]

13 My experience is that a ‘not enough features’ problem can be dealt with by throwing in all the missing features you couldn’t previously use properly, for example personalization. [sent-26, score-0.188]

14 How can we best leverage existing robust distributed filesystem/MapReduce frameworks? [sent-27, score-0.138]

15 There was near unanimity on the belief that MapReduce itself is of limited value for machine learning, but the step forward is unclear. [sent-28, score-0.17]

16 I liked what Markus said: that no one wants to abandon the ideas of robustly storing data and moving small amounts of code to large amounts of data. [sent-29, score-0.435]

17 The best way to leverage this capability to build great algorithms remains unclear to me. [sent-30, score-0.138]

18 Every speaker was in agreement that their approach was faster, but there was great disagreement about what “fast” meant in an absolute sense. [sent-31, score-0.343]

19 This forced me to think about an absolute measure of (input complexity)/(time) where we see results between 100 features/s and 10*10^6 features/s being considered “fast” depending on who is speaking. [sent-32, score-0.23]

20 I hope we’ll discover convincing answers to these questions in the near future. [sent-35, score-0.172]
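
The sentScore values above suggest an extractive summary driven by per-sentence tfidf weights. The sketch below is a hypothetical reconstruction of that step, not the maker-knowledge-mining pipeline itself: it scores each sentence by the mean tfidf weight of its terms, with idf estimated from the corpus, using scikit-learn. The toy corpus, the sentence list, and the averaging rule are all assumptions for illustration.

```python
# Hypothetical sketch: rank sentences of a post by their average tfidf weight,
# keeping the top-scoring ones as the summary. Assumes scikit-learn and a toy corpus.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I enjoyed attending NIPS this year, with several things interesting me.",
    "The LCCC workshop covered many radically different approaches to large scale learning.",
]
sentences = [
    "This is about another technique for directly optimizing the loss in structured prediction.",
    "There were several other papers on cascades which are worth looking at if interested.",
    "I hope we'll discover convincing answers to these questions in the near future.",
]

vectorizer = TfidfVectorizer(stop_words="english").fit(corpus)  # idf from the corpus
X = vectorizer.transform(sentences)                             # one tfidf row per sentence
# Mean tfidf weight over each sentence's non-zero terms (guard against empty rows).
scores = np.asarray(X.sum(axis=1)).ravel() / (X.getnnz(axis=1) + 1e-9)

for score, sent in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.3f}  {sent}")
```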


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('partition', 0.208), ('sublinearly', 0.187), ('cascades', 0.166), ('lccc', 0.154), ('wisdom', 0.145), ('absolute', 0.145), ('differential', 0.145), ('leverage', 0.138), ('moving', 0.128), ('agreement', 0.121), ('searn', 0.118), ('fast', 0.115), ('amounts', 0.115), ('scale', 0.114), ('discussing', 0.112), ('privacy', 0.107), ('features', 0.105), ('optimizing', 0.1), ('loss', 0.099), ('workshop', 0.097), ('structured', 0.094), ('hope', 0.09), ('search', 0.089), ('belief', 0.088), ('implying', 0.088), ('plausible', 0.085), ('results', 0.085), ('belongie', 0.083), ('branson', 0.083), ('serge', 0.083), ('mdps', 0.083), ('frank', 0.083), ('mcsherry', 0.083), ('personalization', 0.083), ('pesky', 0.083), ('multidimensional', 0.083), ('oliver', 0.083), ('disparity', 0.083), ('dominating', 0.083), ('cascade', 0.083), ('near', 0.082), ('fern', 0.077), ('refine', 0.077), ('unexplored', 0.077), ('couldn', 0.077), ('frameworks', 0.077), ('speaker', 0.077), ('joseph', 0.077), ('robustly', 0.077), ('markus', 0.077)]
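
The wordName/wordTfidf pairs above and the simValue rankings below are consistent with a standard pipeline of tfidf vectors plus cosine similarity. The following sketch shows that approach under stated assumptions only: the titles and texts are placeholders rather than the real hunch.net corpus, and scikit-learn is assumed.

```python
# Hypothetical sketch of the "similar blogs" ranking: represent each post as a
# tfidf vector and rank other posts by cosine similarity (the simValue column).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = {
    "420 NIPS 2010": "loss structured prediction cascades workshop large scale learning",
    "426 The Ideal Large Scale Learning Class": "large scale learning loss functions gradient descent",
    "286 Turing's Club for Machine Learning": "computation big-O analysis learning algorithms scaling",
}
titles = list(posts)
X = TfidfVectorizer().fit_transform(posts.values())

sims = cosine_similarity(X[0], X).ravel()        # similarity of post 420 to every post
for sim, title in sorted(zip(sims, titles), reverse=True):
    print(f"{sim:.4f}  {title}")                 # the query post itself scores ~1.0
```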

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 420 hunch net-2010-12-26-NIPS 2010


2 0.23787352 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

Introduction: At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered. There are many different kinds of scaling. Scaling in examples This is the most basic kind of scaling. Online Gradient Descent This is an old algorithm—I’m not sure if anyone can be credited with it in particular. Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function of which squared loss, logistic loss, Hinge Loss, and Quantile Loss are all worth covering. It’s important to cover the semantics of these loss functions as well. Vowpal Wabbit is a reasonably fast codebase implementing these. Second Order Gradient Descent methods For some problems, methods taking into account second derivative information can be more effective. I’ve seen preconditioned conjugate gradient work well, for which Jonath

3 0.15458025 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

Introduction: Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m^2) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. The above question can also be asked for test cases. In some applications, test-time performance is of great importance. How does the algorithm scale with the number of

4 0.14760044 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

Introduction: Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment. There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n^2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vecto

5 0.1350348 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

Introduction: Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time? When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set. The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that s

6 0.13161792 406 hunch net-2010-08-22-KDD 2010

7 0.13054655 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design

8 0.12830842 235 hunch net-2007-03-03-All Models of Learning have Flaws

9 0.12363747 345 hunch net-2009-03-08-Prediction Science

10 0.12288994 177 hunch net-2006-05-05-An ICML reject

11 0.12240991 279 hunch net-2007-12-19-Cool and interesting things seen at NIPS

12 0.12109784 245 hunch net-2007-05-12-Loss Function Semantics

13 0.12057643 332 hunch net-2008-12-23-Use of Learning Theory

14 0.11442544 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

15 0.11235071 42 hunch net-2005-03-17-Going all the Way, Sometimes

16 0.11221862 454 hunch net-2012-01-30-ICML Posters and Scope

17 0.11081999 343 hunch net-2009-02-18-Decision by Vetocracy

18 0.11034687 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

19 0.10855101 390 hunch net-2010-03-12-Netflix Challenge 2 Canceled

20 0.10523346 444 hunch net-2011-09-07-KDD and MUCMD 2011


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.284), (1, 0.069), (2, -0.053), (3, -0.085), (4, 0.02), (5, 0.125), (6, -0.095), (7, -0.009), (8, 0.021), (9, 0.03), (10, -0.015), (11, -0.007), (12, -0.068), (13, 0.037), (14, -0.09), (15, 0.014), (16, -0.02), (17, 0.075), (18, 0.021), (19, -0.057), (20, 0.029), (21, -0.021), (22, -0.022), (23, 0.016), (24, 0.017), (25, 0.007), (26, 0.034), (27, 0.068), (28, 0.014), (29, -0.052), (30, -0.123), (31, -0.018), (32, -0.054), (33, 0.007), (34, -0.068), (35, -0.081), (36, 0.012), (37, 0.03), (38, 0.075), (39, 0.073), (40, 0.089), (41, -0.04), (42, -0.065), (43, -0.079), (44, 0.091), (45, 0.003), (46, -0.032), (47, -0.035), (48, -0.061), (49, -0.038)]
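
The topicId/topicWeight row above looks like an LSI embedding: tfidf vectors projected onto a few dozen latent dimensions via a truncated SVD, with posts then compared in that topic space. The sketch below shows the general approach under that assumption; the toy corpus and the two-component setting are illustrative only (the listing above has roughly 50 components).

```python
# Hypothetical sketch of the LSI step: truncated SVD over tfidf vectors, then
# cosine similarity between posts in the resulting topic space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "loss structured prediction cascades workshop large scale learning",
    "large scale learning class loss functions gradient descent",
    "privacy preserving data mining differential privacy decision trees",
]

tfidf = TfidfVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0)   # ~50 components in the listing above
topic_weights = lsi.fit_transform(tfidf)             # one row of topic weights per post

print(topic_weights[0])                                        # topic weights for post 0
print(cosine_similarity(topic_weights[:1], topic_weights))     # similarity in topic space
```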

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96256113 420 hunch net-2010-12-26-NIPS 2010


2 0.80308974 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class


3 0.6430496 406 hunch net-2010-08-22-KDD 2010

Introduction: There were several papers that seemed fairly interesting at KDD this year. The ones that caught my attention are: Xin Jin, Mingyang Zhang, Nan Zhang, and Gautam Das, Versatile Publishing For Privacy Preservation. This paper provides a conservative method for safely determining which data is publishable from any complete source of information (for example, a hospital) such that it does not violate privacy rules in a natural language. It is not differentially private, so no external sources of join information can exist. However, it is a mechanism for publishing data rather than (say) the output of a learning algorithm. Arik Friedman and Assaf Schuster, Data Mining with Differential Privacy. This paper shows how to create effective differentially private decision trees. Progress in differentially private datamining is pretty impressive, as it was defined in 2006. David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, Diane Lambert, Evaluating Online Ad Camp

4 0.60861796 310 hunch net-2008-07-15-Interesting papers at COLT (and a bit of UAI & workshops)

Introduction: Here are a few papers from COLT 2008 that I found interesting. Maria-Florina Balcan, Steve Hanneke, and Jenn Wortman, The True Sample Complexity of Active Learning. This paper shows that in an asymptotic setting, active learning is always better than supervised learning (although the gap may be small). This is evidence that the only thing in the way of universal active learning is us knowing how to do it properly. Nir Ailon and Mehryar Mohri, An Efficient Reduction of Ranking to Classification. This paper shows how to robustly rank n objects with n log(n) classifications using a quicksort based algorithm. The result is applicable to many ranking loss functions and has implications for others. Michael Kearns and Jennifer Wortman. Learning from Collective Behavior. This is about learning in a new model, where the goal is to predict how a collection of interacting agents behave. One claim is that learning in this setting can be reduced to IID lear

5 0.60861427 138 hunch net-2005-12-09-Some NIPS papers

Introduction: Here is a set of papers that I found interesting (and why). A PAC-Bayes approach to the Set Covering Machine improves the set covering machine. The set covering machine approach is a new way to do classification characterized by a very close connection between theory and algorithm. At this point, the approach seems to be competing well with SVMs in about all dimensions: similar computational speed, similar accuracy, stronger learning theory guarantees, more general information source (a kernel has strictly more structure than a metric), and more sparsity. Developing a classification algorithm is not very easy, but the results so far are encouraging. Off-Road Obstacle Avoidance through End-to-End Learning and Learning Depth from Single Monocular Images both effectively showed that depth information can be predicted from camera images (using notably different techniques). This ability is strongly enabling because cameras are cheap, tiny, light, and potentially provider lo

6 0.59975928 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design

7 0.59637731 444 hunch net-2011-09-07-KDD and MUCMD 2011

8 0.58411133 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

9 0.57215255 345 hunch net-2009-03-08-Prediction Science

10 0.57133776 334 hunch net-2009-01-07-Interesting Papers at SODA 2009

11 0.56684381 200 hunch net-2006-08-03-AOL’s data drop

12 0.56627703 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

13 0.56427741 348 hunch net-2009-04-02-Asymmophobia

14 0.56241202 364 hunch net-2009-07-11-Interesting papers at KDD

15 0.55993271 177 hunch net-2006-05-05-An ICML reject

16 0.5564962 260 hunch net-2007-08-25-The Privacy Problem

17 0.55211973 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

18 0.54908472 256 hunch net-2007-07-20-Motivation should be the Responsibility of the Reviewer

19 0.54720289 279 hunch net-2007-12-19-Cool and interesting things seen at NIPS

20 0.54699063 423 hunch net-2011-02-02-User preferences for search engines


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(10, 0.027), (27, 0.18), (30, 0.024), (38, 0.043), (49, 0.019), (53, 0.06), (55, 0.083), (64, 0.323), (67, 0.022), (94, 0.097), (95, 0.054)]
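
The lda weights above are a sparse topic distribution over roughly 100 topics. A minimal sketch of how such per-post topic weights might be computed follows, assuming scikit-learn's LatentDirichletAllocation over raw term counts; the toy corpus and the two-topic setting are assumptions, and similar posts are those whose topic distributions overlap.

```python
# Hypothetical sketch of the LDA step: fit a topic model on term counts and
# describe each post by its topic distribution (topicId/topicWeight pairs).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "loss structured prediction cascades workshop large scale learning",
    "fmri competition prediction labels videos scanner subjects",
    "parallel machine learning tutorial survey hadoop large scale",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # ~100 topics above
theta = lda.fit_transform(counts)                 # rows sum to 1: per-post topic weights

print(theta[0])                                   # topic distribution for post 0
print(cosine_similarity(theta[:1], theta))        # overlap with the other posts
```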

similar blogs list:

simIndex simValue blogId blogTitle

1 0.94785869 155 hunch net-2006-02-07-Pittsburgh Mind Reading Competition

Introduction: Francisco Pereira points out a fun Prediction Competition. Francisco says: DARPA is sponsoring a competition to analyze data from an unusual functional Magnetic Resonance Imaging experiment. Subjects watch videos inside the scanner while fMRI data are acquired. Unbeknownst to these subjects, the videos have been seen by a panel of other subjects that labeled each instant with labels in categories such as representation (are there tools, body parts, motion, sound), location, presence of actors, emotional content, etc. The challenge is to predict all of these different labels on an instant-by-instant basis from the fMRI data. A few reasons why this is particularly interesting: This is beyond the current state of the art, but not inconceivably hard. This is a new type of experiment design current analysis methods cannot deal with. This is an opportunity to work with a heavily examined and preprocessed neuroimaging dataset. DARPA is offering prizes!

2 0.94734961 442 hunch net-2011-08-20-The Large Scale Learning Survey Tutorial

Introduction: Ron Bekkerman initiated an effort to create an edited book on parallel machine learning that Misha and I have been helping with. The breadth of efforts to parallelize machine learning surprised me: I was only aware of a small fraction initially. This put us in a unique position, with knowledge of a wide array of different efforts, so it is natural to put together a survey tutorial on the subject of parallel learning for KDD, tomorrow. This tutorial is not limited to the book itself however, as several interesting new algorithms have come out since we started inviting chapters. This tutorial should interest anyone trying to use machine learning on significant quantities of data, anyone interested in developing algorithms for such, and of course who has bragging rights to the fastest learning algorithm on planet earth (Also note the Modeling with Hadoop tutorial just before ours which deals with one way of trying to speed up learning algorithms. We have almost no

3 0.87610584 210 hunch net-2006-09-28-Programming Languages for Machine Learning Implementations

Introduction: Machine learning algorithms have a much better chance of being widely adopted if they are implemented in some easy-to-use code. There are several important concerns associated with machine learning which stress programming languages on the ease-of-use vs. speed frontier. Speed The rate at which data sources are growing seems to be outstripping the rate at which computational power is growing, so it is important that we be able to eke out every bit of computational power. Garbage collected languages (java, ocaml, perl and python) often have several issues here. Garbage collection often implies that floating point numbers are “boxed”: every float is represented by a pointer to a float. Boxing can cause an order of magnitude slowdown because an extra nonlocalized memory reference is made, and accesses to main memory can be many CPU cycles long. Garbage collection often implies that considerably more memory is used than is necessary. This has a variable effect. I

same-blog 4 0.8741836 420 hunch net-2010-12-26-NIPS 2010


5 0.85139722 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design

Introduction: This is a summary of the workshop on Learning Problem Design which Alina and I ran at NIPS this year. The first question many people have is “What is learning problem design?” This workshop is about admitting that solving learning problems does not start with labeled data, but rather somewhere before. When humans are hired to produce labels, this is usually not a serious problem because you can tell them precisely what semantics you want the labels to have, and we can fix some set of features in advance. However, when other methods are used this becomes more problematic. This focus is important for Machine Learning because there are very large quantities of data which are not labeled by a hired human. The title of the workshop was a bit ambitious, because a workshop is not long enough to synthesize a diversity of approaches into a coherent set of principles. For me, the posters at the end of the workshop were quite helpful in getting approaches to gel. Here are some an

6 0.84116256 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC

7 0.79938304 291 hunch net-2008-03-07-Spock Challenge Winners

8 0.65311885 343 hunch net-2009-02-18-Decision by Vetocracy

9 0.61803287 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

10 0.61370975 49 hunch net-2005-03-30-What can Type Theory teach us about Machine Learning?

11 0.61180341 424 hunch net-2011-02-17-What does Watson mean?

12 0.60954857 136 hunch net-2005-12-07-Is the Google way the way for machine learning?

13 0.60794795 351 hunch net-2009-05-02-Wielding a New Abstraction

14 0.60185021 256 hunch net-2007-07-20-Motivation should be the Responsibility of the Reviewer

15 0.60154229 360 hunch net-2009-06-15-In Active Learning, the question changes

16 0.59948534 301 hunch net-2008-05-23-Three levels of addressing the Netflix Prize

17 0.59648132 423 hunch net-2011-02-02-User preferences for search engines

18 0.59581321 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

19 0.59519285 371 hunch net-2009-09-21-Netflix finishes (and starts)

20 0.59417725 141 hunch net-2005-12-17-Workshops as Franchise Conferences