hunch_net hunch_net-2009 hunch_net-2009-348 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms. feature symmetry: Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance. example symmetry: Every example is treated the same. Batch learning algorithms are great exemplars of this approach. label symmetry: Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to arg max_l w_l x, but it occurs in many other places as well. Empirically, breaking symmetry well seems to yield great algorithms. feature asymmetry: For those who like the “boosting is stepwise additive regression on exponential loss” viewpoint (I don’t entirely), boosting is an example of symmetry breaking on features. …
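The arg max_l w_l x rule above treats every label, and every feature within each w_l, identically. Below is a minimal sketch of such a symmetric multiclass predictor, with a perceptron-style update standing in for the generic gradient update; the shapes, learning rate, and synthetic data are illustrative choices, not from the post.

```python
import numpy as np

def predict(W, x):
    """Multiclass prediction: argmax over labels l of w_l . x.
    Every label's weight vector is consulted in exactly the same way."""
    return int(np.argmax(W @ x))

def perceptron_update(W, x, y, lr=1.0):
    """A symmetric update: the same rule is applied to every feature of x,
    regardless of whether that feature is first or last."""
    y_hat = predict(W, x)
    if y_hat != y:
        W[y] += lr * x      # promote the true label's weights
        W[y_hat] -= lr * x  # demote the predicted label's weights
    return W

# toy usage: 3 labels, 4 features
rng = np.random.default_rng(0)
W = np.zeros((3, 4))
for _ in range(100):
    x = rng.normal(size=4)
    y = int(x[0] > 0) + int(x[1] > 0)   # arbitrary synthetic label in {0, 1, 2}
    W = perceptron_update(W, x, y)
print(predict(W, rng.normal(size=4)))
```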
sentIndex sentText sentNum sentScore
1 One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. [sent-1, score-0.981]
2 feature symmetry Every feature is treated the same. [sent-4, score-1.08]
3 example symmetry Every example is treated the same. [sent-7, score-0.968]
4 label symmetry Every label is treated the same. [sent-9, score-0.982]
5 Empirically, breaking symmetry well seems to yield great algorithms. [sent-11, score-1.241]
6 feature asymmetry For those who like the “boosting is stepwise additive regression on exponential loss” viewpoint (I don’t entirely), boosting is an example of symmetry breaking on features. [sent-12, score-1.788]
7 example asymmetry Online learning introduces an example asymmetry. [sent-13, score-0.315]
8 label asymmetry Tree structured algorithms are good instances of label asymmetry. [sent-15, score-0.342]
9 These approaches are exponentially faster in the number of labels than more symmetric approaches. [sent-18, score-0.268]
10 The examples above are notably important, with good symmetry breaking approaches yielding substantially improved prediction or computational performance. [sent-19, score-1.388]
11 Given such strong evidence that symmetry breaking is a desirable property, a basic question is: Why isn’t it more prevalent, and more thoroughly studied? [sent-20, score-1.195]
12 One reasonable answer is that doing symmetry breaking well requires more serious thought about learning algorithm design, so researchers simply haven’t gotten to it. [sent-21, score-1.333]
13 A more complete answer is that many researchers seem to reflexively avoid symmetry breaking. [sent-23, score-0.812]
14 Matlab is a handy tool for fast prototyping of learning algorithms, but it has an intrinsic language-level bias towards symmetric approaches since there are built-in primitives for matrix operations. [sent-25, score-0.27]
15 Anytime symmetry breaking is undertaken, the method for symmetry breaking is in question, and many people feel uncomfortable without a theorem suggesting the method is the right one. [sent-29, score-2.428]
16 Since there are few theorems motivating symmetry breaking methods, it is often avoided. [sent-30, score-1.235]
17 Arbitrary symmetry breaking is something like random, except there is no randomness—you simply declare this feature/label/example comes first and that one second. [sent-37, score-1.285]
18 Boosting is a good example where a data-driven approach drives symmetry breaking (over features). [sent-40, score-1.336]
19 Data-driven approaches for symmetry breaking seem the most sound, as they can result in improved performance. [sent-41, score-1.335]
20 While there are examples of learning algorithms doing symmetry breaking for features, labels, and examples individually, there aren’t any I know which do all three, well. [sent-42, score-1.363]
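Sentences 18 and 19 point to boosting as a data-driven way to break symmetry over features: the data, rather than the designer, decides which feature is consulted first. A minimal sketch in that spirit is forward stagewise regression, which produces a data-driven feature ordering; this is a simplification for illustration, not the exact boosting procedure the post refers to.

```python
import numpy as np

def greedy_feature_order(X, y, rounds=3):
    """Pick features one at a time, each round choosing the single feature whose
    column best explains the current residual. The resulting ordering is an
    asymmetry over features driven entirely by the data."""
    residual = y.astype(float).copy()
    order, coefs = [], []
    for _ in range(rounds):
        scores = X.T @ residual                      # correlation of each feature with the residual
        j = int(np.argmax(np.abs(scores)))
        beta = scores[j] / (X[:, j] @ X[:, j] + 1e-12)
        residual -= beta * X[:, j]
        order.append(j)
        coefs.append(beta)
    return order, coefs

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 3] - 1.0 * X[:, 0] + 0.1 * rng.normal(size=200)
print(greedy_feature_order(X, y))   # feature 3 should be chosen before feature 0
```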
wordName wordTfidf (topN-words)
[('symmetry', 0.68), ('breaking', 0.515), ('treated', 0.168), ('asymmetry', 0.153), ('feature', 0.116), ('matlab', 0.091), ('symmetric', 0.091), ('approaches', 0.089), ('boosting', 0.08), ('pervasive', 0.073), ('label', 0.067), ('every', 0.066), ('algorithms', 0.062), ('example', 0.06), ('reason', 0.056), ('examples', 0.053), ('update', 0.052), ('entirely', 0.052), ('improved', 0.051), ('labels', 0.05), ('answer', 0.05), ('tree', 0.047), ('seems', 0.046), ('builtin', 0.045), ('gotten', 0.045), ('initialize', 0.045), ('arg', 0.045), ('additive', 0.045), ('admirable', 0.045), ('declare', 0.045), ('prototyping', 0.045), ('like', 0.045), ('researchers', 0.043), ('composed', 0.042), ('designers', 0.042), ('stepwise', 0.042), ('mildly', 0.042), ('unmotivated', 0.042), ('randomness', 0.042), ('introduces', 0.042), ('gymnastics', 0.042), ('approach', 0.041), ('anytime', 0.04), ('max', 0.04), ('drives', 0.04), ('motivating', 0.04), ('avoid', 0.039), ('features', 0.038), ('suggesting', 0.038), ('exponentially', 0.038)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000002 348 hunch net-2009-04-02-Asymmophobia
2 0.16568185 16 hunch net-2005-02-09-Intuitions from applied learning
Introduction: Since learning is far from an exact science, it’s good to pay attention to basic intuitions of applied learning. Here are a few I’ve collected. Integration In Bayesian learning, the posterior is computed by an integral, and the optimal thing to do is to predict according to this integral. This phenomenon seems to be far more general. Bagging, Boosting, SVMs, and Neural Networks all take advantage of this idea to some extent. The phenomenon is more general: you can average over many different classification predictors to improve performance. Sources: Zoubin, Caruana Differentiation Different pieces of an average should differentiate to achieve good performance by different methods. This is known as the ‘symmetry breaking’ problem for neural networks, and it’s why weights are initialized randomly. Boosting explicitly attempts to achieve good differentiation by creating new, different, learning problems. Sources: Yann LeCun, Phil Long Deep Representation Ha
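The neural-network ‘symmetry breaking’ problem mentioned above is concrete: if two hidden units start with identical weights, every gradient step updates them identically and they never differentiate, which is why weights are initialized randomly. A tiny two-layer sketch under squared loss; the architecture, single data point, and step size are illustrative.

```python
import numpy as np

def forward(W1, W2, x):
    h = np.tanh(W1 @ x)          # hidden layer
    return W2 @ h, h

def grad_step(W1, W2, x, y, lr=0.1):
    """One gradient step on squared loss for a tiny two-layer net."""
    pred, h = forward(W1, W2, x)
    err = pred - y
    gW2 = err * h
    gW1 = np.outer(err * W2 * (1 - h**2), x)
    return W1 - lr * gW1, W2 - lr * gW2

x, y = np.array([1.0, -0.5]), 0.7

# identical initialization: the two hidden units receive identical gradients forever
W1, W2 = np.ones((2, 2)), np.ones(2)
for _ in range(50):
    W1, W2 = grad_step(W1, W2, x, y)
print(W1[0], W1[1])              # the two rows remain equal

# random initialization breaks the symmetry, letting the units differentiate
rng = np.random.default_rng(0)
W1, W2 = rng.normal(scale=0.5, size=(2, 2)), rng.normal(scale=0.5, size=2)
for _ in range(50):
    W1, W2 = grad_step(W1, W2, x, y)
print(W1[0], W1[1])              # the two rows now differ
```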
3 0.16217388 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms
Introduction: There are a number of learning algorithms which explicitly incorporate randomness into their execution. This includes, amongst others: Neural Networks. Neural networks use randomization to assign initial weights. Boltzmann Machines / Deep Belief Networks. Boltzmann machines are something like a stochastic version of multinode logistic regression. The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness. Bagging. Bagging is a process where a learning algorithm is run several different times on several different datasets, creating a final predictor which makes a majority vote. Policy descent. Several algorithms in reinforcement learning such as Conservative Policy Iteration use random bits to create stochastic policies. Experts algorithms. Randomized weighted majority uses random bits as a part of the prediction process to achieve better theoretical guarantees. A basic question is: “Should there
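The bagging description above is straightforward to make concrete: run the same base learner on bootstrap resamples of the data and predict by majority vote. A minimal sketch with a decision-stump base learner; the stump, resample count, and synthetic data are illustrative choices.

```python
import numpy as np

def train_stump(X, y):
    """Base learner: a one-feature threshold classifier chosen by training accuracy."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, j] - t) > 0, 1, -1)
                acc = np.mean(pred == y)
                if best is None or acc > best[0]:
                    best = (acc, j, t, sign)
    return best[1:]                              # (feature, threshold, sign)

def stump_predict(stump, X):
    j, t, sign = stump
    return np.where(sign * (X[:, j] - t) > 0, 1, -1)

def bag(X, y, n_models=25, seed=0):
    """Train each stump on a bootstrap resample (rows sampled with replacement)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(train_stump(X[idx], y[idx]))
    return models

def majority_vote(models, X):
    votes = np.sum([stump_predict(m, X) for m in models], axis=0)
    return np.where(votes >= 0, 1, -1)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
models = bag(X, y)
print(np.mean(majority_vote(models, X) == y))    # training accuracy of the vote
```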
4 0.09906929 104 hunch net-2005-08-22-Do you believe in induction?
Introduction: Foster Provost gave a talk at the ICML metalearning workshop on “metalearning” and the “no free lunch theorem” which seems worth summarizing. As a review: the no free lunch theorem is the most complicated way we know of to say that a bias is required in order to learn. The simplest way to see this is in a nonprobabilistic setting. If you are given examples of the form (x,y) and you wish to predict y from x then any prediction mechanism errs half the time in expectation over all sequences of examples. The proof of this is very simple: on every example a predictor must make some prediction and by symmetry over the set of sequences it will be wrong half the time and right half the time. The basic idea of this proof has been applied to many other settings. The simplistic interpretation of this theorem which many people jump to is “machine learning is dead” since there can be no single learning algorithm which can solve all learning problems. This is the wrong way to thi
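The symmetry argument in that proof sketch can be checked by brute force: fix any deterministic predictor, enumerate every possible labeling of a few points, and the error averaged over labelings is exactly one half. A tiny check, where the particular predictor chosen is arbitrary (which is the point).

```python
import itertools

points = [0, 1, 2]                 # three unlabeled inputs

def predictor(x):
    """Any fixed deterministic rule; here, an arbitrary one."""
    return 1 if x % 2 == 0 else 0

total_mistakes = 0
labelings = list(itertools.product([0, 1], repeat=len(points)))
# enumerate every possible assignment of binary labels to the points
for labels in labelings:
    total_mistakes += sum(predictor(x) != y for x, y in zip(points, labels))

avg_error = total_mistakes / (len(labelings) * len(points))
print(avg_error)                   # 0.5: wrong half the time, averaged over all labelings
```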
5 0.090216577 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
Introduction: At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered. There are many different kinds of scaling. Scaling in examples This is the most basic kind of scaling. Online Gradient Descent This is an old algorithm—I’m not sure if anyone can be credited with it in particular. Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function, of which squared loss, logistic loss, hinge loss, and quantile loss are all worth covering. It’s important to cover the semantics of these loss functions as well. Vowpal Wabbit is a reasonably fast codebase implementing these. Second Order Gradient Descent methods For some problems, methods taking into account second derivative information can be more effective. I’ve seen preconditioned conjugate gradient work well, for which Jonath
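Online gradient descent with a pluggable loss, as described above, fits in a few lines. A minimal sketch with squared-loss and hinge-loss gradients; the learning rate, features, and data stream are illustrative, and this is not meant to mirror Vowpal Wabbit's implementation.

```python
import numpy as np

def squared_loss_grad(p, y):      # d/dp of 0.5 * (p - y)^2
    return p - y

def hinge_loss_grad(p, y):        # d/dp of max(0, 1 - y*p), with y in {-1, +1}
    return -y if y * p < 1 else 0.0

def online_gd(stream, n_features, loss_grad, lr=0.1):
    """One pass of online gradient descent over a stream of (x, y) examples."""
    w = np.zeros(n_features)
    for x, y in stream:
        p = w @ x                           # linear prediction
        w -= lr * loss_grad(p, y) * x       # update on this single example, then move on
    return w

rng = np.random.default_rng(0)
stream = []
for _ in range(500):
    x = rng.normal(size=3)
    y = 1.0 if x[0] - x[2] > 0 else -1.0
    stream.append((x, y))

w = online_gd(stream, 3, hinge_loss_grad)   # squared_loss_grad plugs in the same way
acc = np.mean([np.sign(w @ x) == y for x, y in stream])
print(w, acc)
```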
6 0.083180711 237 hunch net-2007-04-02-Contextual Scaling
7 0.081221312 218 hunch net-2006-11-20-Context and the calculation misperception
8 0.073584691 347 hunch net-2009-03-26-Machine Learning is too easy
9 0.071393386 253 hunch net-2007-07-06-Idempotent-capable Predictors
10 0.066913322 179 hunch net-2006-05-16-The value of the orthodox view of Boosting
11 0.066803232 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
12 0.064208336 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design
13 0.064011015 351 hunch net-2009-05-02-Wielding a New Abstraction
14 0.062988043 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?
15 0.062130004 272 hunch net-2007-11-14-BellKor wins Netflix
16 0.06166964 6 hunch net-2005-01-27-Learning Complete Problems
17 0.060988512 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
18 0.059343591 343 hunch net-2009-02-18-Decision by Vetocracy
19 0.058358449 420 hunch net-2010-12-26-NIPS 2010
20 0.057228472 235 hunch net-2007-03-03-All Models of Learning have Flaws
topicId topicWeight
[(0, 0.148), (1, 0.07), (2, -0.001), (3, -0.007), (4, 0.031), (5, -0.001), (6, -0.049), (7, 0.023), (8, 0.031), (9, 0.026), (10, -0.038), (11, -0.039), (12, 0.016), (13, -0.05), (14, 0.008), (15, 0.056), (16, -0.009), (17, 0.001), (18, -0.048), (19, 0.049), (20, 0.003), (21, -0.018), (22, 0.025), (23, 0.019), (24, -0.043), (25, -0.035), (26, 0.054), (27, 0.057), (28, -0.026), (29, -0.031), (30, -0.033), (31, -0.045), (32, -0.003), (33, -0.061), (34, -0.063), (35, 0.002), (36, -0.054), (37, 0.045), (38, -0.042), (39, 0.002), (40, 0.031), (41, 0.056), (42, -0.028), (43, 0.0), (44, -0.03), (45, -0.028), (46, 0.01), (47, -0.013), (48, -0.021), (49, -0.072)]
simIndex simValue blogId blogTitle
same-blog 1 0.91997349 348 hunch net-2009-04-02-Asymmophobia
2 0.74018687 253 hunch net-2007-07-06-Idempotent-capable Predictors
Introduction: One way to distinguish different learning algorithms is by their ability or inability to easily use an input variable as the predicted output. This is desirable for at least two reasons: Modularity If we want to build complex learning systems via reuse of a subsystem, it’s important to have compatible I/O. “Prior” knowledge Machine learning is often applied in situations where we do have some knowledge of what the right solution is, often in the form of an existing system. In such situations, it’s good to start with a learning algorithm that can be at least as good as any existing system. When doing classification, most learning algorithms can do this. For example, a decision tree can split on a feature, and then classify. The real differences come up when we attempt regression. Many of the algorithms we know and commonly use are not idempotent predictors. Logistic regressors can not be idempotent, because all input features are mapped through a nonlinearity.
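A small numeric illustration of the nonlinearity point above: a linear model with an identity output can copy an input feature exactly (set its weights to a coordinate basis vector), while a logistic model squashes w·x through a sigmoid, so its output is confined to (0, 1) and cannot reproduce an unbounded input. The weights and inputs are illustrative.

```python
import numpy as np

def linear_predict(w, x):
    return w @ x                            # identity output: unrestricted range

def logistic_predict(w, x):
    return 1.0 / (1.0 + np.exp(-(w @ x)))   # output squashed into (0, 1)

# try to realize the predictor "output feature 2 exactly" with weights e_2
w = np.array([0.0, 0.0, 1.0])
x = np.array([0.3, -1.2, 4.0])

print(linear_predict(w, x))     # 4.0: the linear model copies the feature exactly
print(logistic_predict(w, x))   # ~0.982: the sigmoid cannot return 4.0
```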
3 0.7155056 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms
4 0.70416051 149 hunch net-2006-01-18-Is Multitask Learning Black-Boxable?
Introduction: Multitask learning is learning to predict multiple outputs given the same input. Mathematically, we might think of this as trying to learn a function f: X -> {0,1}^n. Structured learning is similar at this level of abstraction. Many people have worked on solving multitask learning (for example Rich Caruana) using methods which share an internal representation. In other words, the computation and learning of the i-th prediction is shared with the computation and learning of the j-th prediction. Another way to ask this question is: can we avoid sharing the internal representation? For example, it might be feasible to solve multitask learning by some process feeding the i-th prediction f(x)_i into the j-th predictor f(x, f(x)_i)_j. If the answer is “no”, then it implies we cannot take binary classification as a basic primitive in the process of solving prediction problems. If the answer is “yes”, then we can reuse binary classification algorithms to
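The construction sketched above, feeding the i-th prediction f(x)_i into the j-th predictor as an extra input, can be written as a chain of binary learners. A minimal sketch using plain logistic regression as the black-box binary learner; the two-task setup and training details are illustrative choices, not from the post.

```python
import numpy as np

def add_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])

def train_logistic(X, y, lr=0.5, epochs=500):
    """Plain logistic regression by gradient descent: the black-box binary learner."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict(w, X):
    return (X @ w > 0).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y0 = (X[:, 0] > 0).astype(float)                            # first output
y1 = (X[:, 1] + 4.0 * (2.0 * y0 - 1.0) > 0).astype(float)   # second output depends on the first

# predictor for output 0: trained on x alone
w0 = train_logistic(add_bias(X), y0)
p0 = predict(w0, add_bias(X))

# predictor for output 1: trained on (x, f(x)_0), i.e. output 0's prediction is an input feature
X1 = add_bias(np.hstack([X, p0[:, None]]))
w1 = train_logistic(X1, y1)
p1 = predict(w1, X1)

print(np.mean(p0 == y0), np.mean(p1 == y1))
```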
5 0.69262463 152 hunch net-2006-01-30-Should the Input Representation be a Vector?
Introduction: Let’s suppose that we are trying to create a general purpose machine learning box. The box is fed many examples of the function it is supposed to learn and (hopefully) succeeds. To date, most such attempts to produce a box of this form take a vector as input. The elements of the vector might be bits, real numbers, or ‘categorical’ data (a discrete set of values). On the other hand, there are a number of successful applications of machine learning which do not seem to use a vector representation as input. For example, in vision, convolutional neural networks have been used to solve several vision problems. The input to the convolutional neural network is essentially the raw camera image as a matrix. In learning for natural languages, several people have had success on problems like part-of-speech tagging using predictors restricted to a window surrounding the word to be predicted. A vector window and a matrix both imply a notion of locality which is being actively and
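The window idea mentioned above, predicting a word's tag from a small neighborhood rather than from one global vector, is easy to sketch: slide a fixed-size window over the token sequence and emit one feature set per position. The window size, padding token, and feature naming below are illustrative.

```python
def window_features(tokens, i, size=2, pad="<PAD>"):
    """Features for position i: the tokens in a window of +/- size around it.
    Locality is built into the representation rather than left to the learner."""
    padded = [pad] * size + list(tokens) + [pad] * size
    window = padded[i : i + 2 * size + 1]
    return {f"w[{offset - size}]={tok}": 1
            for offset, tok in enumerate(window)}

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(window_features(sentence, 2))
# {'w[-2]=the': 1, 'w[-1]=cat': 1, 'w[0]=sat': 1, 'w[1]=on': 1, 'w[2]=the': 1}
```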
6 0.66044515 16 hunch net-2005-02-09-Intuitions from applied learning
7 0.65641958 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features
8 0.62258971 164 hunch net-2006-03-17-Multitask learning is Black-Boxable
9 0.62227267 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?
10 0.60558593 308 hunch net-2008-07-06-To Dual or Not
11 0.6023438 179 hunch net-2006-05-16-The value of the orthodox view of Boosting
12 0.60212207 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
13 0.59873939 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
14 0.59845608 217 hunch net-2006-11-06-Data Linkage Problems
15 0.5846234 258 hunch net-2007-08-12-Exponentiated Gradient
16 0.57991725 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity
17 0.56551915 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms
18 0.56333262 6 hunch net-2005-01-27-Learning Complete Problems
19 0.55894995 411 hunch net-2010-09-21-Regretting the dead
20 0.54203391 161 hunch net-2006-03-05-“Structural” Learning
topicId topicWeight
[(3, 0.012), (27, 0.209), (38, 0.052), (49, 0.313), (51, 0.014), (53, 0.05), (55, 0.062), (84, 0.037), (94, 0.112), (95, 0.029)]
simIndex simValue blogId blogTitle
1 0.95694226 224 hunch net-2006-12-12-Interesting Papers at NIPS 2006
Introduction: Here are some papers that I found surprisingly interesting. Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, Greedy Layer-wise Training of Deep Networks. Empirically investigates some of the design choices behind deep belief networks. Long Zhu, Yuanhao Chen, Alan Yuille, Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing. An unsupervised method for detecting objects using simple feature filters that works remarkably well on the (supervised) caltech-101 dataset. Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira, Analysis of Representations for Domain Adaptation. This is the first analysis I’ve seen of learning with respect to samples drawn differently from the evaluation distribution which depends on reasonable measurable quantities. All of these papers turn out to have a common theme—the power of unlabeled data to do generically useful things.
2 0.95645529 122 hunch net-2005-10-13-Site tweak
Introduction: Several people have had difficulty with comments which seem to have an allowed language significantly poorer than posts. The set of allowed html tags has been increased and the markdown filter has been put in place to try to make commenting easier. I’ll put some examples into the comments of this post.
3 0.90751165 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project
Introduction: Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, developing via github. For example, the latest and greatest can be downloaded via: git clone git://github.com/JohnLangford/vowpal_wabbit.git If you aren’t familiar with git, it’s a distributed version control system which supports quick and easy branching, as well as reconciliation. This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes. As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience. The program can effectively scale up to batch-style training on sparse terafeature (i.e. 10^12 sparse feature) size datasets. The limiting factor is typically i/o. I started using the real datasets from the large-scale learning workshop as a conve
4 0.87776804 338 hunch net-2009-01-23-An Active Learning Survey
Introduction: Burr Settles wrote a fairly comprehensive survey of active learning. He intends to maintain and update the survey, so send him any suggestions you have.
5 0.86393899 37 hunch net-2005-03-08-Fast Physics for Learning
Introduction: While everyone is silently working on ICML submissions, I found this discussion about a fast physics simulator chip interesting from a learning viewpoint. In many cases, learning attempts to predict the outcome of physical processes. Access to a fast simulator for these processes might be quite helpful in predicting the outcome. Bayesian learning in particular may directly benefit while many other algorithms (like support vector machines) might have their speed greatly increased. The biggest drawback is that writing software for these odd architectures is always difficult and time consuming, but a several-orders-of-magnitude speedup might make that worthwhile.
same-blog 6 0.852826 348 hunch net-2009-04-02-Asymmophobia
7 0.83487803 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models
8 0.74427819 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
9 0.66432929 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
10 0.64466476 438 hunch net-2011-07-11-Interesting Neural Network Papers at ICML 2011
11 0.64166158 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
12 0.64016223 143 hunch net-2005-12-27-Automated Labeling
13 0.63785017 132 hunch net-2005-11-26-The Design of an Optimal Research Environment
14 0.63767052 237 hunch net-2007-04-02-Contextual Scaling
15 0.62985921 262 hunch net-2007-09-16-Optimizing Machine Learning Programs
16 0.62778497 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
17 0.62758517 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
18 0.62462926 366 hunch net-2009-08-03-Carbon in Computer Science Research
19 0.62377918 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity
20 0.62317926 95 hunch net-2005-07-14-What Learning Theory might do