hunch_net hunch_net-2007 hunch_net-2007-253 knowledge-graph by maker-knowledge-mining

253 hunch net-2007-07-06-Idempotent-capable Predictors


meta information for this blog

Source: html

Introduction: One way to distinguish different learning algorithms is by their ability or inability to easily use an input variable as the predicted output. This is desirable for at least two reasons:

Modularity: If we want to build complex learning systems via reuse of a subsystem, it’s important to have compatible I/O.

“Prior” knowledge: Machine learning is often applied in situations where we do have some knowledge of what the right solution is, often in the form of an existing system. In such situations, it’s good to start with a learning algorithm that can be at least as good as any existing system.

When doing classification, most learning algorithms can do this. For example, a decision tree can split on a feature, and then classify. The real differences come up when we attempt regression. Many of the algorithms we know and commonly use are not idempotent predictors. Logistic regressors can not be idempotent, because all input features are mapped through a nonlinearity.
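A concrete way to see this (an illustrative sketch, not part of the original post; it assumes numpy and scikit-learn are installed) is to make the regression target literally equal to one of the input features. An ordinary linear regressor recovers it exactly by putting weight 1 on that feature and 0 on the rest, while a sigmoid output can only approximate such a target:

import numpy as np
from sklearn.linear_model import LinearRegression

# Build data where the target is literally the third input feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 2]

# An unregularized linear regressor can be idempotent: it recovers
# weight ~1 on the copied feature and ~0 everywhere else.
lin = LinearRegression().fit(X, y)
print(lin.coef_.round(3))                 # approximately [0, 0, 1, 0, 0]
print(np.abs(lin.predict(X) - y).max())   # approximately 0

# A logistic (sigmoid) output cannot match y = x_2 over all real values,
# because its predictions are squashed into (0, 1); any fit is only an
# approximation, which is the nonlinearity issue mentioned above.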


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 One way to distinguish different learning algorithms is by their ability or inability to easily use an input variable as the predicted output. [sent-1, score-0.842]

2 This is desirable for at least two reasons: Modularity If we want to build complex learning systems via reuse of a subsystem, it’s important to have compatible I/O. [sent-2, score-0.682]

3 “Prior” knowledge Machine learning is often applied in situations where we do have some knowledge of what the right solution is, often in the form of an existing system. [sent-3, score-0.46]

4 In such situations, it’s good to start with a learning algorithm that can be at least as good as any existing system. [sent-4, score-0.258]

5 When doing classification, most learning algorithms can do this. [sent-5, score-0.078]

6 For example, a decision tree can split on a feature, and then classify. [sent-6, score-0.26]

7 The real differences come up when we attempt regression. [sent-7, score-0.146]

8 Many of the algorithms we know and commonly use are not idempotent predictors. [sent-8, score-0.789]

9 Logistic regressors can not be idempotent, because all input features are mapped through a nonlinearity. [sent-9, score-0.741]

10 Linear regressors can be idempotent—they just set the weight on one input feature to 1 and other features to 0. [sent-10, score-0.895]

11 Regression trees are not idempotent, or (at least) not easily idempotent. [sent-11, score-0.201]

12 In order to predict the same as an input feature, that input feature must be split many times. [sent-12, score-0.995] (this point and sentence 15 are illustrated in the code sketch after this list)

13 Bayesian approaches may or may not be easily idempotent, depending on the structure of the Bayesian Prior. [sent-13, score-0.481]

14 It isn’t clear how important the idempotent-capable property is. [sent-14, score-0.168]

15 Successive approximation approaches such as boosting can approximate it in a fairly automatic manner. [sent-15, score-0.437]

16 It may be of substantial importance for large modular systems where efficiency is important. [sent-16, score-0.411]
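The points about regression trees (sentences 11-12) and successive approximation (sentence 15) can be checked the same way. The sketch below is illustrative only, not from the original post, and assumes numpy and scikit-learn: a depth-limited regression tree can only track y = x with a staircase of splits, while boosting keeps refining the approximation as stages are added.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Target equals the single input feature, so a perfect predictor is the identity.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))
y = X[:, 0]

# A depth-3 tree has at most 8 leaves, each predicting a constant,
# so it can only approximate the identity with a staircase.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(np.abs(tree.predict(X) - y).max())      # noticeably above 0

# Boosting (successive approximation) keeps adding small trees that fit
# the residual, so the training error shrinks as stages are added.
for n in (10, 100, 500):
    gb = GradientBoostingRegressor(n_estimators=n, max_depth=2,
                                   learning_rate=0.5).fit(X, y)
    print(n, np.abs(gb.predict(X) - y).max())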


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('idempotent', 0.647), ('input', 0.299), ('regressors', 0.23), ('split', 0.201), ('feature', 0.196), ('situations', 0.138), ('easily', 0.134), ('mapped', 0.115), ('knowledge', 0.113), ('least', 0.111), ('modular', 0.106), ('inability', 0.1), ('logistic', 0.1), ('important', 0.099), ('features', 0.097), ('reuse', 0.096), ('bayesian', 0.096), ('existing', 0.096), ('systems', 0.092), ('compatible', 0.092), ('approaches', 0.09), ('modularity', 0.086), ('differences', 0.084), ('distinguish', 0.081), ('efficiency', 0.081), ('algorithms', 0.078), ('predicted', 0.078), ('automatic', 0.076), ('desirable', 0.074), ('weight', 0.073), ('approximation', 0.073), ('approximate', 0.072), ('variable', 0.072), ('property', 0.069), ('may', 0.068), ('boosting', 0.068), ('depending', 0.068), ('trees', 0.067), ('commonly', 0.064), ('importance', 0.064), ('attempt', 0.062), ('regression', 0.062), ('build', 0.062), ('tree', 0.059), ('fairly', 0.058), ('complex', 0.056), ('structure', 0.053), ('prior', 0.052), ('start', 0.051), ('linear', 0.049)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 253 hunch net-2007-07-06-Idempotent-capable Predictors


2 0.1115444 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

Introduction: Let’s suppose that we are trying to create a general purpose machine learning box. The box is fed many examples of the function it is supposed to learn and (hopefully) succeeds. To date, most such attempts to produce a box of this form take a vector as input. The elements of the vector might be bits, real numbers, or ‘categorical’ data (a discrete set of values). On the other hand, there are a number of successful applications of machine learning which do not seem to use a vector representation as input. For example, in vision, convolutional neural networks have been used to solve several vision problems. The input to the convolutional neural network is essentially the raw camera image as a matrix. In learning for natural languages, several people have had success on problems like parts-of-speech tagging using predictors restricted to a window surrounding the word to be predicted. A vector window and a matrix both imply a notion of locality which is being actively and

3 0.10610753 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2. The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x). After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

4 0.098771855 164 hunch net-2006-03-17-Multitask learning is Black-Boxable

Introduction: Multitask learning is the problem of jointly predicting multiple labels simultaneously with one system. A basic question is whether or not multitask learning can be decomposed into one (or more) single prediction problems. It seems the answer to this is “yes”, in a fairly straightforward manner. The basic idea is that a controlled input feature is equivalent to an extra output. Suppose we have some process generating examples: (x, y_1, y_2) in S where y_1 and y_2 are labels for two different tasks. Then, we could reprocess the data to the form S_b(S) = {((x,i), y_i): (x, y_1, y_2) in S, i in {1,2}} and then learn a classifier c: X x {1,2} -> Y. Note that (x,i) is the (composite) input. At testing time, given an input x, we can query c for the predicted values of y_1 and y_2 using (x,1) and (x,2). A strong form of equivalence can be stated between these tasks. In particular, suppose we have a multitask learning algorithm ML which learns a multitask

5 0.098272815 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

Introduction: Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m^2) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. The above question can also be asked for test cases. In some applications, test-time performance is of great importance. How does the algorithm scale with the number of

6 0.09549056 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

7 0.09387996 6 hunch net-2005-01-27-Learning Complete Problems

8 0.093474999 332 hunch net-2008-12-23-Use of Learning Theory

9 0.092243187 235 hunch net-2007-03-03-All Models of Learning have Flaws

10 0.091336653 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

11 0.089618772 95 hunch net-2005-07-14-What Learning Theory might do

12 0.086713612 201 hunch net-2006-08-07-The Call of the Deep

13 0.081118315 218 hunch net-2006-11-20-Context and the calculation misperception

14 0.075722098 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

15 0.075446092 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

16 0.075407803 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

17 0.074789472 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

18 0.074143976 160 hunch net-2006-03-02-Why do people count for learning?

19 0.072177246 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

20 0.071473129 84 hunch net-2005-06-22-Languages of Learning


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.16), (1, 0.097), (2, -0.007), (3, -0.007), (4, 0.05), (5, -0.02), (6, -0.026), (7, 0.025), (8, 0.057), (9, 0.007), (10, -0.063), (11, -0.096), (12, 0.016), (13, -0.044), (14, 0.029), (15, 0.014), (16, -0.003), (17, -0.013), (18, -0.002), (19, 0.024), (20, 0.022), (21, 0.033), (22, 0.084), (23, 0.0), (24, -0.047), (25, -0.036), (26, -0.008), (27, 0.02), (28, -0.018), (29, -0.045), (30, -0.034), (31, -0.084), (32, 0.008), (33, 0.003), (34, -0.044), (35, 0.07), (36, 0.017), (37, 0.011), (38, -0.01), (39, 0.065), (40, 0.027), (41, 0.018), (42, 0.035), (43, -0.015), (44, -0.012), (45, -0.069), (46, 0.032), (47, -0.042), (48, 0.008), (49, -0.083)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.93985868 253 hunch net-2007-07-06-Idempotent-capable Predictors


2 0.74246949 152 hunch net-2006-01-30-Should the Input Representation be a Vector?


3 0.73849589 348 hunch net-2009-04-02-Asymmophobia

Introduction: One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms. feature symmetry Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance. example symmetry Every example is treated the same. Batch learning algorithms are great exemplars of this approach. label symmetry Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to arg max_l w_l x but it occurs in many other places as well. Empirically, breaking symmetry well seems to yield great algorithms. feature asymmetry For those who like t

4 0.63721269 6 hunch net-2005-01-27-Learning Complete Problems

Introduction: Let’s define a learning problem as making predictions given past data. There are several ways to attack the learning problem which seem to be equivalent to solving the learning problem. Find the Invariant This viewpoint says that learning is all about learning (or incorporating) transformations of objects that do not change the correct prediction. The best possible invariant is the one which says “all things of the same class are the same”. Finding this is equivalent to learning. This viewpoint is particularly common when working with image features. Feature Selection This viewpoint says that the way to learn is by finding the right features to input to a learning algorithm. The best feature is the one which is the class to predict. Finding this is equivalent to learning for all reasonable learning algorithms. This viewpoint is common in several applications of machine learning. See Gilad’s and Bianca’s comments. Find the Representation This is almost the same a

5 0.63036716 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

Introduction: Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment. There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n^2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vecto

6 0.62498105 217 hunch net-2006-11-06-Data Linkage Problems

7 0.61999679 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

8 0.6157977 149 hunch net-2006-01-18-Is Multitask Learning Black-Boxable?

9 0.61107576 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

10 0.61042219 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

11 0.60693747 16 hunch net-2005-02-09-Intuitions from applied learning

12 0.58449447 164 hunch net-2006-03-17-Multitask learning is Black-Boxable

13 0.579705 235 hunch net-2007-03-03-All Models of Learning have Flaws

14 0.57685232 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

15 0.5741238 237 hunch net-2007-04-02-Contextual Scaling

16 0.56810302 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

17 0.55620611 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

18 0.55214334 165 hunch net-2006-03-23-The Approximation Argument

19 0.55159581 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

20 0.54962343 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(10, 0.044), (17, 0.265), (27, 0.21), (38, 0.029), (53, 0.107), (55, 0.051), (77, 0.024), (94, 0.14)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.87328625 366 hunch net-2009-08-03-Carbon in Computer Science Research

Introduction: Al Gore’s film and gradually more assertive and thorough science have managed to mostly shift the debate on climate change from “Is it happening?” to “What should be done?” In that context, it’s worthwhile to think a bit about what can be done within computer science research. There are two things we can think about: Doing Research At a cartoon level, computer science research consists of some combination of commuting to and from work, writing programs, running them on computers, writing papers, and presenting them at conferences. A typical computer has a power usage on the order of 100 Watts, which works out to 2.4 kiloWatt-hours/day. Looking up David MacKay’s reference on power usage per person, it becomes clear that this is a relatively minor part of the lifestyle, although it could become substantial if many more computers are required. Much larger costs are associated with commuting (which is in common with many people) and attending conferences. Since local commuti

same-blog 2 0.87130904 253 hunch net-2007-07-06-Idempotent-capable Predictors


3 0.81315863 143 hunch net-2005-12-27-Automated Labeling

Introduction: One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this: Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about. Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way. Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy. It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows: Identify some subset of observed values to predict

4 0.77867866 377 hunch net-2009-11-09-NYAS ML Symposium this year.

Introduction: The NYAS ML symposium grew again this year to 170 participants, despite the need to outsmart or otherwise tunnel through a crowd. Perhaps the most distinct talk was by Bob Bell on various aspects of the Netflix prize competition. I also enjoyed several student posters including Matt Hoffman’s cool examples of blind source separation for music. I’m somewhat surprised how much the workshop has grown, as it is now comparable in size to a small conference, although in style more similar to a workshop. At some point as an event grows, it becomes owned by the community rather than the organizers, so if anyone has suggestions on improving it, speak up and be heard.

5 0.73269361 313 hunch net-2008-08-18-Radford Neal starts a blog

Introduction: here on statistics, ML, CS, and other things he knows well.

6 0.71093971 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

7 0.69688421 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

8 0.69419491 95 hunch net-2005-07-14-What Learning Theory might do

9 0.69217283 332 hunch net-2008-12-23-Use of Learning Theory

10 0.69173348 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

11 0.69081903 237 hunch net-2007-04-02-Contextual Scaling

12 0.6903922 347 hunch net-2009-03-26-Machine Learning is too easy

13 0.68814832 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

14 0.68608963 258 hunch net-2007-08-12-Exponentiated Gradient

15 0.68534672 78 hunch net-2005-06-06-Exact Online Learning for Classification

16 0.68362355 229 hunch net-2007-01-26-Parallel Machine Learning Problems

17 0.68266714 351 hunch net-2009-05-02-Wielding a New Abstraction

18 0.68184483 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

19 0.68133336 177 hunch net-2006-05-05-An ICML reject

20 0.68109947 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning