hunch_net hunch_net-2006 hunch_net-2006-219 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: There are a number of learning algorithms which explicitly incorporate randomness into their execution. This includes, amongst others: Neural Networks. Neural networks use randomization to assign initial weights. Boltzmann Machines / Deep Belief Networks. Boltzmann machines are something like a stochastic version of multinode logistic regression. The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness. Bagging. Bagging is a process where a learning algorithm is run several different times on several different datasets, creating a final predictor which makes a majority vote. Policy descent. Several algorithms in reinforcement learning, such as Conservative Policy Iteration, use random bits to create stochastic policies. Experts algorithms. Randomized weighted majority uses random bits as part of the prediction process to achieve better theoretical guarantees. A basic question is: “Should there be explicit randomization in learning algorithms?”
sentIndex sentText sentNum sentScore
1 There are a number of learning algorithms which explicitly incorporate randomness into their execution. [sent-1, score-0.303]
2 Neural networks use randomization to assign initial weights. [sent-3, score-0.352]
3 The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness. [sent-6, score-0.344]
4 Several algorithms in reinforcement learning, such as Conservative Policy Iteration, use random bits to create stochastic policies (executing such a policy is sketched after this list). [sent-10, score-0.885]
5 Randomized weighted majority uses random bits as part of the prediction process to achieve better theoretical guarantees. [sent-12, score-0.926]
6 It seems perverse to feed extra random bits into your prediction process, since they don’t contain any information about the problem itself. [sent-14, score-0.691]
7 This question is not just philosophy: we might hope that deterministic versions of learning algorithms are both more accurate and faster. [sent-16, score-0.599]
8 In the case of a neural network, if every weight started as 0, the gradient of the loss with respect to every weight would be the same, implying that after updating, all weights remain the same. [sent-19, score-0.441]
9 Using random numbers to initialize weights breaks this symmetry (sketched after this list). [sent-20, score-0.547]
10 It is easy to believe that there are good deterministic methods for symmetry breaking. [sent-21, score-0.481]
11 A basic observation is that deterministic learning algorithms tend to overfit. [sent-23, score-0.539]
12 Bagging avoids this by randomizing the input of these learning algorithms, in the hope that the directions of overfitting of the individual predictors cancel out (sketched after this list). [sent-24, score-0.537]
13 Similarly, using random bits internally, as in a deep belief network, avoids overfitting by forcing the algorithm to learn a set of internal weights that is robust to noise, and therefore robust to overfitting. [sent-25, score-0.941]
14 Large margin learning algorithms and maximum entropy learning algorithms can be understood as deterministic operations attempting to achieve the same goal. [sent-26, score-0.886]
15 A significant gap remains between randomized and deterministic learning algorithms: the deterministic versions only deal with linear predictions, while the randomized techniques seem to yield improvements in general. [sent-27, score-1.338]
16 In reinforcement learning, it’s hard to optimize a policy over multiple timesteps because the optimal decision at timestep 2 is dependent on the decision at timestep 1 and vice versa. [sent-29, score-0.726]
17 PSDP can be understood as a derandomization of CPI, one which trades increased computation (learning a new predictor for each timestep individually) for the removal of the random bits. [sent-31, score-0.351]
18 Some algorithms, such as randomized weighted majority, are designed to work against adversaries who know your algorithm except for its random bits (sketched after this list). [sent-34, score-1.049]
19 The current state-of-the-art is that random bits provide performance (computational and predictive) which we don’t know (or at least can’t prove we know) how to achieve without randomization. [sent-36, score-0.594]
20 Can randomization be removed, or is it essential to good learning algorithms? [sent-37, score-0.317]
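The symmetry-breaking point in sentences 8–10 is easy to see in code. Below is a minimal sketch (my own construction, not from the post, with a toy network and invented data): when every hidden unit starts with the same incoming and outgoing weights, every unit receives exactly the same gradient, so gradient updates can never make the units differ; randomly initializing the incoming weights breaks the tie.

```python
# Minimal symmetry-breaking sketch (toy network and data invented here).
# With identical incoming (and outgoing) weights for every hidden unit,
# all units get exactly the same gradient, so updates keep them identical.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 toy examples, 3 features
y = rng.normal(size=(8, 1))   # toy regression targets

def hidden_gradient(W1, W2):
    """Squared-loss gradient w.r.t. W1 for a one-hidden-layer tanh network."""
    H = np.tanh(X @ W1)                  # hidden activations, shape (8, 4)
    err = H @ W2 - y                     # prediction error, shape (8, 1)
    dH = (err @ W2.T) * (1 - H ** 2)     # backprop through tanh
    return X.T @ dH                      # shape (3, 4): one column per hidden unit

W2 = np.ones((4, 1))                     # identical outgoing weights

g_zero = hidden_gradient(np.zeros((3, 4)), W2)               # symmetric start
g_rand = hidden_gradient(0.1 * rng.normal(size=(3, 4)), W2)  # random start

print(np.allclose(g_zero, g_zero[:, [0]]))   # True: every unit's gradient is equal
print(np.allclose(g_rand, g_rand[:, [0]]))   # False: the symmetry is broken
```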
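Bagging (sentence 12) is also short to write down. The sketch below uses a decision-stump base learner and toy data that are my own invention, purely to keep it self-contained; the essential parts are the bootstrap resampling and the majority vote.

```python
# Minimal bagging sketch: train one base predictor per bootstrap resample,
# then predict by majority vote. The stump learner and data are invented.
import numpy as np

rng = np.random.default_rng(1)

def train_stump(X, y):
    """Exhaustively pick the feature/threshold/sign with lowest training error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, j] - t) > 0, 1, -1)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    _, j, t, sign = best
    return lambda Z: np.where(sign * (Z[:, j] - t) > 0, 1, -1)

def bag(X, y, rounds=25):
    stumps = []
    for _ in range(rounds):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample (the random bits)
        stumps.append(train_stump(X[idx], y[idx]))
    return lambda Z: np.sign(sum(s(Z) for s in stumps))  # majority vote

X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.3 * rng.normal(size=200))
predict = bag(X, y)
print("training accuracy:", np.mean(predict(X) == y))
```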
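Randomized weighted majority (sentences 5 and 18) in one standard form; the constants and the exact update below are my assumptions rather than anything specified in the post. The algorithm follows a randomly chosen expert with probability proportional to its weight and multiplicatively penalizes experts that err; the random choice is the part an adversary who knows the algorithm cannot anticipate.

```python
# Minimal randomized weighted majority sketch (one standard variant, beta = 0.5).
import numpy as np

def rwm(expert_preds, outcomes, beta=0.5, seed=0):
    """expert_preds: (T, n) array of 0/1 predictions; outcomes: (T,) 0/1 labels."""
    rng = np.random.default_rng(seed)
    T, n = expert_preds.shape
    w = np.ones(n)
    mistakes = 0
    for t in range(T):
        i = rng.choice(n, p=w / w.sum())          # random bits: follow expert i this round
        mistakes += int(expert_preds[t, i] != outcomes[t])
        w *= np.where(expert_preds[t] != outcomes[t], beta, 1.0)  # penalize wrong experts
    return mistakes

# Toy run: 4 experts, the first of which is right 90% of the time.
rng = np.random.default_rng(2)
outcomes = rng.integers(0, 2, size=500)
expert_preds = rng.integers(0, 2, size=(500, 4))
expert_preds[:, 0] = np.where(rng.random(500) < 0.9, outcomes, 1 - outcomes)
print("mistakes:", rwm(expert_preds, outcomes))
```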
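Finally, a sketch of where the random bits of sentence 4 get used at execution time. Conservative Policy Iteration produces a stochastic policy that is a mixture of simpler policies; the states, actions, and mixture weights below are invented, but acting with such a mixture necessarily involves sampling.

```python
# Minimal sketch of executing a stochastic (mixture) policy; all numbers invented.
import numpy as np

rng = np.random.default_rng(3)

policy_a = {0: 1, 1: 0, 2: 2}                 # deterministic policy: state -> action
policy_b = {0: 2, 1: 1, 2: 2}
mixture = [(0.7, policy_a), (0.3, policy_b)]  # mixture weights sum to 1

def act(state):
    """Flip a weighted coin to pick a component policy, then follow it."""
    weights = np.array([w for w, _ in mixture])
    _, chosen = mixture[rng.choice(len(mixture), p=weights)]
    return chosen[state]

print([act(0) for _ in range(10)])            # a random interleaving of actions 1 and 2
```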
wordName wordTfidf (topN-words)
[('deterministic', 0.365), ('randomized', 0.269), ('bits', 0.251), ('random', 0.246), ('boltzmann', 0.235), ('randomization', 0.223), ('timestep', 0.209), ('algorithms', 0.174), ('weights', 0.156), ('bagging', 0.139), ('majority', 0.132), ('randomness', 0.129), ('overfit', 0.122), ('symmetry', 0.116), ('policy', 0.109), ('neural', 0.109), ('avoids', 0.101), ('achieve', 0.097), ('essential', 0.094), ('weight', 0.088), ('stochastic', 0.087), ('weighted', 0.081), ('machines', 0.081), ('understood', 0.076), ('network', 0.076), ('numbers', 0.075), ('belief', 0.073), ('predictions', 0.07), ('networks', 0.07), ('internally', 0.07), ('cyclic', 0.07), ('interpolation', 0.07), ('cancel', 0.07), ('perverse', 0.07), ('adversaries', 0.07), ('initialize', 0.07), ('timesteps', 0.07), ('reinforcement', 0.068), ('predictor', 0.066), ('feed', 0.064), ('psdp', 0.064), ('deep', 0.063), ('uses', 0.062), ('forcing', 0.061), ('vice', 0.061), ('logistic', 0.061), ('process', 0.06), ('avoid', 0.06), ('version', 0.06), ('use', 0.059)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000004 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms
2 0.21816406 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
Introduction: “Deep learning” is used to describe learning architectures which have significant depth (as a circuit). One claim is that shallow architectures (one or two layers) can not concisely represent some functions while a circuit with more depth can concisely represent these same functions. Proving lower bounds on the size of a circuit is substantially harder than upper bounds (which are constructive), but some results are known. Luca Trevisan’s class notes detail how XOR is not concisely representable by “AC0” (= constant depth unbounded fan-in AND, OR, NOT gates). This doesn’t quite prove that depth is necessary for the representations commonly used in learning (such as a thresholded weighted sum), but it is strongly suggestive that this is so. Examples like this are a bit disheartening because existing algorithms for deep learning (deep belief nets, gradient descent on deep neural networks, and perhaps decision trees depending on who you ask) can’t learn XOR very easily.
3 0.16217388 348 hunch net-2009-04-02-Asymmophobia
Introduction: One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms. feature symmetry Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance. example symmetry Every example is treated the same. Batch learning algorithms are great exemplars of this approach. label symmetry Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to arg max_l w_l x but it occurs in many other places as well. Empirically, breaking symmetry well seems to yield great algorithms. feature asymmetry For those who like t
4 0.14076078 16 hunch net-2005-02-09-Intuitions from applied learning
Introduction: Since learning is far from an exact science, it’s good to pay attention to basic intuitions of applied learning. Here are a few I’ve collected. Integration In Bayesian learning, the posterior is computed by an integral, and the optimal thing to do is to predict according to this integral. This phenomenon seems to be far more general. Bagging, Boosting, SVMs, and Neural Networks all take advantage of this idea to some extent. The phenomenon is more general: you can average over many different classification predictors to improve performance. Sources: Zoubin, Caruana Differentiation Different pieces of an average should differentiate to achieve good performance by different methods. This is known as the ‘symmetry breaking’ problem for neural networks, and it’s why weights are initialized randomly. Boosting explicitly attempts to achieve good differentiation by creating new, different, learning problems. Sources: Yann LeCun, Phil Long Deep Representation Ha
5 0.13478552 201 hunch net-2006-08-07-The Call of the Deep
Introduction: Many learning algorithms used in practice are fairly simple. Viewed representationally, many prediction algorithms either compute a linear separator of basic features (perceptron, winnow, weighted majority, SVM) or perhaps a linear separator of slightly more complex features (2-layer neural networks or kernelized SVMs). Should we go beyond this, and start using “deep” representations? What is deep learning? Intuitively, deep learning is about learning to predict in ways which can involve complex dependencies between the input (observed) features. Specifying this more rigorously turns out to be rather difficult. Consider the following cases: SVM with Gaussian Kernel. This is not considered deep learning, because an SVM with a gaussian kernel can’t succinctly represent certain decision surfaces. One of Yann LeCun’s examples is recognizing objects based on pixel values. An SVM will need a new support vector for each significantly different background. Since the number
6 0.13228545 220 hunch net-2006-11-27-Continuizing Solutions
7 0.12663096 388 hunch net-2010-01-24-Specializations of the Master Problem
8 0.124346 248 hunch net-2007-06-19-How is Compressed Sensing going to change Machine Learning ?
9 0.12384157 205 hunch net-2006-09-07-Objective and subjective interpretations of probability
10 0.11708339 152 hunch net-2006-01-30-Should the Input Representation be a Vector?
11 0.11285871 163 hunch net-2006-03-12-Online learning or online preservation of learning?
12 0.10515977 438 hunch net-2011-07-11-Interesting Neural Network Papers at ICML 2011
13 0.098688528 131 hunch net-2005-11-16-The Everything Ensemble Edge
14 0.098561779 19 hunch net-2005-02-14-Clever Methods of Overfitting
15 0.098116621 258 hunch net-2007-08-12-Exponentiated Gradient
16 0.092409812 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
17 0.091078438 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
18 0.090635329 235 hunch net-2007-03-03-All Models of Learning have Flaws
19 0.090292126 183 hunch net-2006-06-14-Explorations of Exploration
20 0.089779384 407 hunch net-2010-08-23-Boosted Decision Trees for Deep Learning
topicId topicWeight
[(0, 0.205), (1, 0.128), (2, -0.002), (3, -0.015), (4, 0.079), (5, -0.032), (6, -0.031), (7, 0.026), (8, 0.049), (9, 0.014), (10, -0.049), (11, -0.031), (12, 0.008), (13, -0.096), (14, 0.007), (15, 0.182), (16, 0.035), (17, -0.043), (18, -0.1), (19, 0.078), (20, -0.03), (21, -0.025), (22, 0.017), (23, 0.015), (24, -0.1), (25, 0.047), (26, -0.008), (27, 0.037), (28, 0.059), (29, 0.024), (30, -0.035), (31, -0.048), (32, 0.07), (33, -0.045), (34, 0.051), (35, 0.017), (36, -0.075), (37, 0.077), (38, 0.025), (39, 0.032), (40, -0.012), (41, 0.123), (42, 0.028), (43, -0.005), (44, -0.028), (45, -0.053), (46, 0.098), (47, 0.016), (48, -0.033), (49, 0.007)]
simIndex simValue blogId blogTitle
same-blog 1 0.94243658 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms
2 0.72426063 152 hunch net-2006-01-30-Should the Input Representation be a Vector?
Introduction: Let’s suppose that we are trying to create a general purpose machine learning box. The box is fed many examples of the function it is supposed to learn and (hopefully) succeeds. To date, most such attempts to produce a box of this form take a vector as input. The elements of the vector might be bits, real numbers, or ‘categorical’ data (a discrete set of values). On the other hand, there are a number of successful applications of machine learning which do not seem to use a vector representation as input. For example, in vision, convolutional neural networks have been used to solve several vision problems. The input to the convolutional neural network is essentially the raw camera image as a matrix. In learning for natural languages, several people have had success on problems like parts-of-speech tagging using predictors restricted to a window surrounding the word to be predicted. A vector window and a matrix both imply a notion of locality which is being actively and
3 0.70533103 16 hunch net-2005-02-09-Intuitions from applied learning
4 0.6774407 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
5 0.66689026 201 hunch net-2006-08-07-The Call of the Deep
6 0.646447 348 hunch net-2009-04-02-Asymmophobia
7 0.59891731 131 hunch net-2005-11-16-The Everything Ensemble Edge
8 0.58119208 438 hunch net-2011-07-11-Interesting Neural Network Papers at ICML 2011
9 0.58000505 407 hunch net-2010-08-23-Boosted Decision Trees for Deep Learning
10 0.57918549 179 hunch net-2006-05-16-The value of the orthodox view of Boosting
11 0.57625073 253 hunch net-2007-07-06-Idempotent-capable Predictors
12 0.56510425 388 hunch net-2010-01-24-Specializations of the Master Problem
13 0.5401479 197 hunch net-2006-07-17-A Winner
14 0.53954804 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?
15 0.53409195 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features
16 0.53171092 248 hunch net-2007-06-19-How is Compressed Sensing going to change Machine Learning ?
17 0.52953279 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms
18 0.51728332 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
19 0.50994295 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
20 0.50052524 258 hunch net-2007-08-12-Exponentiated Gradient
topicId topicWeight
[(3, 0.018), (27, 0.211), (38, 0.016), (53, 0.155), (55, 0.029), (75, 0.01), (84, 0.337), (94, 0.099), (95, 0.021)]
simIndex simValue blogId blogTitle
1 0.91081023 200 hunch net-2006-08-03-AOL’s data drop
Introduction: AOL has released several large search engine related datasets. This looks like a pretty impressive data release, and it is a big opportunity for people everywhere to worry about search engine related learning problems, if they want.
2 0.90899807 121 hunch net-2005-10-12-The unrealized potential of the research lab
Introduction: I attended the IBM research 60th anniversary. IBM research is, by any reasonable account, the industrial research lab which has managed to bring the most value to its parent company over the long term. This can be seen by simply counting the survivors: IBM research is the only older research lab which has not gone through a period of massive firing. (Note that there are also new research labs.) Despite this impressive record, IBM research has failed, by far, to achieve its potential. Examples which came up in this meeting include: It took about a decade to produce DRAM after it was invented in the lab. (In fact, Intel produced it first.) Relational databases and SQL were invented and then languished. It was only under external competition that IBM released its own relational database. Why didn’t IBM grow an Oracle division? An early lead in IP networking hardware did not result in IBM growing a Cisco division. Why not? And remember … IBM research is a s
3 0.89738822 383 hunch net-2009-12-09-Inherent Uncertainty
Introduction: I’d like to point out Inherent Uncertainty , which I’ve added to the ML blog post scanner on the right. My understanding from Jake is that the intention is to have a multiauthor blog which is more specialized towards learning theory/game theory than this one. Nevertheless, several of the posts seem to be of wider interest.
4 0.8726151 467 hunch net-2012-06-15-Normal Deviate and the UCSC Machine Learning Summer School
Introduction: Larry Wasserman has started the Normal Deviate blog which I added to the blogroll on the right. Manfred Warmuth points out the UCSC machine learning summer school running July 9-20 which may be of particular interest to those in silicon valley.
same-blog 5 0.86848313 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms
6 0.83316344 411 hunch net-2010-09-21-Regretting the dead
7 0.7929787 142 hunch net-2005-12-22-Yes , I am applying
8 0.72227764 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity
9 0.61840481 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?
10 0.61694103 201 hunch net-2006-08-07-The Call of the Deep
11 0.60708833 132 hunch net-2005-11-26-The Design of an Optimal Research Environment
12 0.6050173 152 hunch net-2006-01-30-Should the Input Representation be a Vector?
13 0.6044569 347 hunch net-2009-03-26-Machine Learning is too easy
14 0.60428554 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning
15 0.60226399 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
16 0.60027879 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
17 0.59990108 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
18 0.59966242 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class
19 0.59932828 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning
20 0.59928757 370 hunch net-2009-09-18-Necessary and Sufficient Research