hunch_net hunch_net-2006 hunch_net-2006-164 knowledge-graph by maker-knowledge-mining

164 hunch net-2006-03-17-Multitask learning is Black-Boxable


Meta information for this blog

Source: html

Introduction: Multitask learning is the problem of jointly predicting multiple labels simultaneously with one system. A basic question is whether or not multitask learning can be decomposed into one (or more) single prediction problems. It seems the answer to this is “yes”, in a fairly straightforward manner. The basic idea is that a controlled input feature is equivalent to an extra output. Suppose we have some process generating examples: (x, y_1, y_2) in S where y_1 and y_2 are labels for two different tasks. Then, we could reprocess the data to the form S_b(S) = {((x,i), y_i) : (x, y_1, y_2) in S, i in {1,2}} and then learn a classifier c: X x {1,2} -> Y. Note that (x,i) is the (composite) input. At testing time, given an input x, we can query c for the predicted values of y_1 and y_2 using (x,1) and (x,2). A strong form of equivalence can be stated between these tasks. In particular, suppose we have a multitask learning algorithm ML which learns a multitask predictor m: X -> Y x Y. Then the following theorem can be proved: for all ML and for all S, there exists an inverse reduction S_m such that ML(S) = ML(S_m(S_b(S))). In other words, no information is lost in the transformation S_b, which means everything which was learnable previously remains learnable. This may not be the final answer to the question because there may be some algorithm-dependent (mis)behavior associated with controlled feature i. It may also be the case that single task classification is computationally distinguishable from multitask classification. Certainly, computational concerns are one of the reasons specialized multitask classification algorithms exist.
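A minimal sketch of this reduction (hypothetical code, not from the original post; the function and variable names are my own) showing the S_b transformation, how to query the resulting single-task classifier c for both labels, and the inverse regrouping S_m:

```python
# Sketch of the S_b reduction: each multitask example (x, y1, y2) becomes two
# single-task examples ((x, 1), y1) and ((x, 2), y2), so the task index i is
# just an extra ("controlled") input feature.

def s_b(multitask_data):
    """Reprocess S = [(x, y1, y2), ...] into S_b(S) = [((x, i), y_i), ...]."""
    single_task_data = []
    for x, y1, y2 in multitask_data:
        single_task_data.append(((x, 1), y1))  # task 1, with the task index appended to the input
        single_task_data.append(((x, 2), y2))  # task 2
    return single_task_data

def predict_both(c, x):
    """Recover the multitask prediction (y1, y2) by querying c with (x, 1) and (x, 2)."""
    return c((x, 1)), c((x, 2))

def s_m(single_task_data):
    """Inverse reduction S_m: regroup consecutive pairs back into multitask examples,
    so that S_m(S_b(S)) = S (assuming the ordering produced by s_b above)."""
    return [(x, y1, y2)
            for ((x, _), y1), (_, y2) in zip(single_task_data[0::2], single_task_data[1::2])]
```

Any ordinary single-prediction learner can then be trained on s_b(S); the post's claim is that nothing learnable is lost in the process.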


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Multitask learning is the problem of jointly predicting multiple labels simultaneously with one system. [sent-1, score-0.424]

2 A basic question is whether or not multitask learning can be decomposed into one (or more) single prediction problems. [sent-2, score-0.996]

3 It seems the answer to this is “yes”, in a fairly straightforward manner. [sent-3, score-0.259]

4 The basic idea is that a controlled input feature is equivalent to an extra output. [sent-4, score-0.696]

5 Suppose we have some process generating examples: (x, y_1, y_2) in S where y_1 and y_2 are labels for two different tasks. [sent-5, score-0.241]

6 Then, we could reprocess the data to the form S_b(S) = {((x,i), y_i) : (x, y_1, y_2) in S, i in {1,2}} and then learn a classifier c: X x {1,2} -> Y. [sent-6, score-0.084]

7 At testing time, given an input x, we can query c for the predicted values of y_1 and y_2 using (x,1) and (x,2). [sent-8, score-0.49]

8 A strong form of equivalence can be stated between these tasks. [sent-9, score-0.282]

9 In particular, suppose we have a multitask learning algorithm ML which learns a multitask predictor m: X -> Y x Y. [sent-10, score-1.485]

10 Then the following theorem can be proved: for all ML and for all S, there exists an inverse reduction S_m such that ML(S) = ML(S_m(S_b(S))). [sent-11, score-0.118]

11 In other words, no information is lost in the transformation S_b, which means everything which was learnable previously remains learnable. [sent-12, score-0.547]

12 This may not be the final answer to the question because there may be some algorithm-dependent (mis)behavior associated with controlled feature i . [sent-13, score-0.82]

13 It may also be the case that single task classification is computationally distinguishable from multitask classification. [sent-14, score-1.071]

14 Certainly, computational concerns are one of the reasons specialized multitask classification algorithms exist. [sent-15, score-0.903]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('multitask', 0.605), ('ml', 0.243), ('controlled', 0.216), ('suppose', 0.157), ('labels', 0.15), ('input', 0.14), ('transformation', 0.125), ('composite', 0.125), ('single', 0.118), ('decomposed', 0.118), ('learns', 0.118), ('equivalence', 0.118), ('mis', 0.118), ('jointly', 0.118), ('inverse', 0.118), ('feature', 0.115), ('classification', 0.11), ('learnable', 0.108), ('specialized', 0.101), ('answer', 0.098), ('behavior', 0.093), ('query', 0.093), ('proved', 0.093), ('straightforward', 0.093), ('predicted', 0.091), ('generating', 0.091), ('task', 0.091), ('simultaneously', 0.089), ('concerns', 0.087), ('question', 0.084), ('form', 0.084), ('lost', 0.084), ('testing', 0.084), ('values', 0.082), ('remains', 0.081), ('may', 0.08), ('stated', 0.08), ('equivalent', 0.078), ('words', 0.077), ('previously', 0.077), ('extra', 0.076), ('associated', 0.074), ('final', 0.073), ('yes', 0.072), ('everything', 0.072), ('basic', 0.071), ('fairly', 0.068), ('certainly', 0.068), ('predicting', 0.067), ('computationally', 0.067)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 164 hunch net-2006-03-17-Multitask learning is Black-Boxable


2 0.32959321 149 hunch net-2006-01-18-Is Multitask Learning Black-Boxable?

Introduction: Multitask learning is learning to predict multiple outputs given the same input. Mathematically, we might think of this as trying to learn a function f: X -> {0,1}^n. Structured learning is similar at this level of abstraction. Many people have worked on solving multitask learning (for example Rich Caruana) using methods which share an internal representation. In other words, the computation and learning of the i-th prediction is shared with the computation and learning of the j-th prediction. Another way to ask this question is: can we avoid sharing the internal representation? For example, it might be feasible to solve multitask learning by some process feeding the i-th prediction f(x)_i into the j-th predictor f(x, f(x)_i)_j. If the answer is “no”, then it implies we can not take binary classification as a basic primitive in the process of solving prediction problems. If the answer is “yes”, then we can reuse binary classification algorithms to
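As an illustration of that chained, representation-free alternative (a hypothetical sketch with made-up names, not code from the post), one predictor's output is simply appended to the next predictor's input:

```python
# Hypothetical sketch of chaining without a shared representation: predict y1 from x,
# then predict y2 from (x, y1_hat), i.e. the f(x, f(x)_i)_j idea from the excerpt above.

def chained_predict(predictor_1, predictor_2, x):
    y1_hat = predictor_1(x)            # the i-th prediction f(x)_i
    y2_hat = predictor_2((x, y1_hat))  # the j-th prediction, fed the i-th prediction as input
    return y1_hat, y2_hat
```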

3 0.11693567 161 hunch net-2006-03-05-“Structural” Learning

Introduction: Fernando Pereira pointed out Ando and Zhang’s paper on “structural” learning. Structural learning is multitask learning on subproblems created from unlabeled data. The basic idea is to take a look at the unlabeled data and create many supervised problems. On text data, which they test on, these subproblems might be of the form “Given surrounding words predict the middle word”. The hope here is that successfully predicting on these subproblems is relevant to the prediction of your core problem. In the long run, the precise mechanism used (essentially, linear predictors with parameters tied by a common matrix) and the precise problems formed may not be critical. What seems critical is that the hope is realized: the technique provides a significant edge in practice. Some basic questions about this approach are: Are there effective automated mechanisms for creating the subproblems? Is it necessary to use a shared representation?
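A minimal sketch (hypothetical code, not from the post) of how such auxiliary subproblems can be generated from unlabeled text:

```python
# Turn unlabeled text into auxiliary supervised examples of the form
# "given the surrounding words, predict the middle word".

def make_auxiliary_examples(tokens, window=2):
    examples = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        examples.append((context, tokens[i]))  # (surrounding words, middle word)
    return examples

# e.g. make_auxiliary_examples("the cat sat on the mat".split())
# -> [(['the', 'cat', 'on', 'the'], 'sat'), (['cat', 'sat', 'the', 'mat'], 'on')]
```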

4 0.1164578 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

Introduction: I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be. The Problem: The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions. In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.
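For concreteness, here is a toy sketch of the two-interface oracle described above (hypothetical code; the class and method names are my own, and the toy implementation just tracks a running mean of the labels):

```python
class ToyRegressionOracle:
    """Toy stand-in for the binary regression oracle B: one interface for training,
    B(x', y') with y' in [0, 1] and nothing returned, and one for testing,
    B(x') returning a value in [0, 1]."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def train(self, x, y):     # training interface: y is a scalar in [0, 1]; no value returned
        self.total += y
        self.count += 1

    def predict(self, x):      # testing interface: returns a value in [0, 1]
        return self.total / self.count if self.count else 0.5
```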

5 0.10114197 45 hunch net-2005-03-22-Active learning

Introduction: Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework. Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h_w}: h_w(x) = 1 if x > w, and 0 otherwise. VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (
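To illustrate why adaptive querying helps in this threshold example (a hypothetical sketch extending the truncated excerpt, not code from the post): when the sorted points are separable by some h_w, binary search identifies a consistent threshold with O(log n) label queries instead of n.

```python
# Active learning for threshold classifiers h_w(x) = 1 if x > w, else 0.
# Assumes `points` is sorted and separable by some threshold; `query_label(x)` returns
# the hidden label of x and is charged per call. Uses O(log n) queries via binary search.

def active_learn_threshold(points, query_label):
    lo, hi = 0, len(points)              # invariant: the first label-1 index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(points[mid]) == 1:
            hi = mid                     # boundary is at or before mid
        else:
            lo = mid + 1                 # boundary is strictly after mid
    # any w with points[lo-1] <= w < points[lo] is consistent with all queried labels
    return points[lo - 1] if lo > 0 else float("-inf")
```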

6 0.098771855 253 hunch net-2007-07-06-Idempotent-capable Predictors

7 0.092674427 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

8 0.088870659 332 hunch net-2008-12-23-Use of Learning Theory

9 0.086072087 313 hunch net-2008-08-18-Radford Neal starts a blog

10 0.084706247 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

11 0.082822524 2 hunch net-2005-01-24-Holy grails of machine learning?

12 0.080998629 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

13 0.080380395 90 hunch net-2005-07-07-The Limits of Learning Theory

14 0.080372244 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

15 0.078137167 192 hunch net-2006-07-08-Some recent papers

16 0.077756599 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

17 0.075819761 218 hunch net-2006-11-20-Context and the calculation misperception

18 0.073437288 360 hunch net-2009-06-15-In Active Learning, the question changes

19 0.07198409 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

20 0.069796145 14 hunch net-2005-02-07-The State of the Reduction


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.17), (1, 0.096), (2, -0.0), (3, -0.026), (4, 0.01), (5, -0.032), (6, 0.024), (7, -0.038), (8, 0.005), (9, -0.098), (10, -0.064), (11, -0.033), (12, 0.053), (13, 0.045), (14, -0.048), (15, 0.024), (16, 0.043), (17, 0.004), (18, -0.027), (19, 0.045), (20, -0.006), (21, 0.005), (22, 0.146), (23, 0.021), (24, 0.038), (25, -0.165), (26, 0.015), (27, 0.045), (28, -0.034), (29, -0.036), (30, -0.036), (31, -0.1), (32, -0.024), (33, -0.11), (34, -0.118), (35, 0.107), (36, -0.022), (37, 0.135), (38, -0.085), (39, -0.143), (40, -0.017), (41, -0.03), (42, -0.03), (43, 0.079), (44, 0.094), (45, -0.161), (46, 0.067), (47, -0.017), (48, -0.048), (49, -0.106)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97520828 164 hunch net-2006-03-17-Multitask learning is Black-Boxable


2 0.84445584 149 hunch net-2006-01-18-Is Multitask Learning Black-Boxable?

Introduction: Multitask learning is learning to predict multiple outputs given the same input. Mathematically, we might think of this as trying to learn a function f: X -> {0,1}^n. Structured learning is similar at this level of abstraction. Many people have worked on solving multitask learning (for example Rich Caruana) using methods which share an internal representation. In other words, the computation and learning of the i-th prediction is shared with the computation and learning of the j-th prediction. Another way to ask this question is: can we avoid sharing the internal representation? For example, it might be feasible to solve multitask learning by some process feeding the i-th prediction f(x)_i into the j-th predictor f(x, f(x)_i)_j. If the answer is “no”, then it implies we can not take binary classification as a basic primitive in the process of solving prediction problems. If the answer is “yes”, then we can reuse binary classification algorithms to

3 0.60935563 402 hunch net-2010-07-02-MetaOptimize

Introduction: Joseph Turian creates MetaOptimize for discussion of NLP and ML on big datasets. This includes a blog, but perhaps more importantly a question and answer section. I’m hopeful it will take off.

4 0.5744946 161 hunch net-2006-03-05-“Structural” Learning

Introduction: Fernando Pereira pointed out Ando and Zhang’s paper on “structural” learning. Structural learning is multitask learning on subproblems created from unlabeled data. The basic idea is to take a look at the unlabeled data and create many supervised problems. On text data, which they test on, these subproblems might be of the form “Given surrounding words predict the middle word”. The hope here is that successfully predicting on these subproblems is relevant to the prediction of your core problem. In the long run, the precise mechanism used (essentially, linear predictors with parameters tied by a common matrix) and the precise problems formed may not be critical. What seems critical is that the hope is realized: the technique provides a significant edge in practice. Some basic questions about this approach are: Are there effective automated mechanisms for creating the subproblems? Is it necessary to use a shared representation?

5 0.54817343 90 hunch net-2005-07-07-The Limits of Learning Theory

Introduction: Suppose we had an infinitely powerful mathematician sitting in a room and proving theorems about learning. Could he solve machine learning? The answer is “no”. This answer is both obvious and sometimes underappreciated. There are several ways to conclude that some bias is necessary in order to successfully learn. For example, suppose we are trying to solve classification. At prediction time, we observe some features X and want to make a prediction of either 0 or 1. Bias is what makes us prefer one answer over the other based on past experience. In order to learn we must: Have a bias. Always predicting 0 is as likely as 1 is useless. Have the “right” bias. Predicting 1 when the answer is 0 is also not helpful. The implication of “have a bias” is that we can not design effective learning algorithms with “a uniform prior over all possibilities”. The implication of “have the ‘right’ bias” is that our mathematician fails since “right” is defined wi

6 0.54319441 348 hunch net-2009-04-02-Asymmophobia

7 0.53687108 405 hunch net-2010-08-21-Rob Schapire at NYC ML Meetup

8 0.50484234 253 hunch net-2007-07-06-Idempotent-capable Predictors

9 0.49126557 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

10 0.48294929 168 hunch net-2006-04-02-Mad (Neuro)science

11 0.48128688 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction

12 0.47369143 257 hunch net-2007-07-28-Asking questions

13 0.45327818 143 hunch net-2005-12-27-Automated Labeling

14 0.44983399 313 hunch net-2008-08-18-Radford Neal starts a blog

15 0.4366791 6 hunch net-2005-01-27-Learning Complete Problems

16 0.43369406 77 hunch net-2005-05-29-Maximum Margin Mismatch?

17 0.41638473 102 hunch net-2005-08-11-Why Manifold-Based Dimension Reduction Techniques?

18 0.41314408 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

19 0.40444285 217 hunch net-2006-11-06-Data Linkage Problems

20 0.39901847 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.038), (27, 0.307), (53, 0.051), (55, 0.116), (78, 0.285), (94, 0.088)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.93676925 316 hunch net-2008-09-04-Fall ML Conferences

Introduction: If you are in the New York area and interested in machine learning, consider submitting a 2 page abstract to the ML symposium by tomorrow (Sept 5th) midnight. It’s a fun one day affair on October 10 in an awesome location overlooking the world trade center site. A bit further off (but a real conference) is the AI and Stats deadline on November 5, to be held in Florida April 16-19.

same-blog 2 0.91765654 164 hunch net-2006-03-17-Multitask learning is Black-Boxable


3 0.87508321 441 hunch net-2011-08-15-Vowpal Wabbit 6.0

Introduction: I just released Vowpal Wabbit 6.0. Since the last version: VW is now 2-3 orders of magnitude faster at linear learning, primarily thanks to Alekh. Given the baseline, this is loads of fun, allowing us to easily deal with terafeature datasets, and dwarfing the scale of any other open source project. The core improvement here comes from effective parallelization over kilonode clusters (either Hadoop or not). This code is highly scalable, so it even helps with clusters of size 2 (and doesn’t hurt for clusters of size 1). The core allreduce technique appears widely and easily reused—we’ve already used it to parallelize Conjugate Gradient, LBFGS, and two variants of online learning. We’ll be documenting how to do this more thoroughly, but for now “README_cluster” and associated scripts should provide a good starting point. The new LBFGS code from Miro seems to commonly dominate the existing conjugate gradient code in time/quality tradeoffs. The new matrix factoriz

4 0.85258079 297 hunch net-2008-04-22-Taking the next step

Introduction: At the last ICML, Tom Dietterich asked me to look into systems for commenting on papers. I’ve been slow getting to this, but it’s relevant now. The essential observation is that we now have many tools for online collaboration, but they are not yet much used in academic research. If we can find the right way to use them, then perhaps great things might happen, with extra kudos to the first conference that manages to really create an online community. Various conferences have been poking at this. For example, UAI has set up a wiki, COLT has started using Joomla, with some dynamic content, and AAAI has been setting up a “student blog”. Similarly, Dinoj Surendran set up a twiki for the Chicago Machine Learning Summer School, which was quite useful for coordinating events and other things. I believe the most important thing is a willingness to experiment. A good place to start seems to be enhancing existing conference websites. For example, the ICML 2007 papers pag

5 0.78775001 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation: Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe
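A minimal sketch of those three steps for a finite set of world models (hypothetical code, not from the post): specify a prior, compute the posterior with Bayes’ law, and predict by posterior-weighted averaging.

```python
# Bayesian learning over a finite set of world models:
# prior -> posterior via Bayes' law -> posterior-weighted prediction.

def compute_posterior(prior, likelihood, data):
    """prior: {model: P(model)}; likelihood(model, data) -> P(data | model)."""
    unnormalized = {m: p * likelihood(m, data) for m, p in prior.items()}
    z = sum(unnormalized.values())
    return {m: w / z for m, w in unnormalized.items()}

def posterior_predict(posterior, predict_fn, x):
    """Average each model's prediction for x, weighted by its posterior probability."""
    return sum(p * predict_fn(m, x) for m, p in posterior.items())
```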

6 0.75099391 304 hunch net-2008-06-27-Reviewing Horror Stories

7 0.74732721 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

8 0.74587041 220 hunch net-2006-11-27-Continuizing Solutions

9 0.7447511 293 hunch net-2008-03-23-Interactive Machine Learning

10 0.74439919 183 hunch net-2006-06-14-Explorations of Exploration

11 0.74371195 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

12 0.74366581 351 hunch net-2009-05-02-Wielding a New Abstraction

13 0.74308461 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?

14 0.74252659 252 hunch net-2007-07-01-Watchword: Online Learning

15 0.74250031 258 hunch net-2007-08-12-Exponentiated Gradient

16 0.74198115 378 hunch net-2009-11-15-The Other Online Learning

17 0.74127352 343 hunch net-2009-02-18-Decision by Vetocracy

18 0.74102694 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

19 0.74096644 347 hunch net-2009-03-26-Machine Learning is too easy

20 0.74044448 352 hunch net-2009-05-06-Machine Learning to AI