hunch_net hunch_net-2007 hunch_net-2007-227 knowledge-graph by maker-knowledge-mining

227 hunch net-2007-01-10-A Deep Belief Net Learning Problem


meta info for this blog

Source: html

Introduction: “Deep learning” is used to describe learning architectures which have significant depth (as a circuit). One claim is that shallow architectures (one or two layers) can not concisely represent some functions while a circuit with more depth can concisely represent these same functions. Proving lower bounds on the size of a circuit is substantially harder than upper bounds (which are constructive), but some results are known. Luca Trevisan’s class notes detail how XOR is not concisely representable by “AC0” (= constant depth unbounded fan-in AND, OR, NOT gates). This doesn’t quite prove that depth is necessary for the representations commonly used in learning (such as a thresholded weighted sum), but it is strongly suggestive that this is so. Examples like this are a bit disheartening because existing algorithms for deep learning (deep belief nets, gradient descent on deep neural networks, and perhaps decision trees, depending on who you ask) can’t learn XOR very easily.
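A hedged, minimal illustration of the depth claim above (my own sketch, not code from the post): a brute-force search finds no single thresholded weighted sum of the two inputs that labels the four XOR points correctly, while a two-layer threshold circuit computes XOR exactly.

```python
# Sketch: XOR resists a single linear threshold unit, but a depth-2
# threshold circuit handles it. Illustrative only; not from the post.
import itertools

XOR_POINTS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def threshold(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Brute-force over a coarse weight grid: no single unit fits XOR.
grid = [i / 2 for i in range(-4, 5)]
single_unit_fits = any(
    all(threshold((w1, w2), b, x) == y for x, y in XOR_POINTS)
    for w1, w2, b in itertools.product(grid, repeat=3)
)
print("single threshold unit fits XOR:", single_unit_fits)  # False

# Depth 2: an OR-like unit and a NAND-like unit feed an AND-like output unit.
def xor_depth2(x):
    h1 = threshold((1, 1), -0.5, x)    # OR
    h2 = threshold((-1, -1), 1.5, x)   # NAND
    return threshold((1, 1), -1.5, (h1, h2))  # AND of the two

print("depth-2 circuit fits XOR:", all(xor_depth2(x) == y for x, y in XOR_POINTS))  # True
```

(No grid choice can rescue the single unit: XOR is not linearly separable, so the first check fails for any weights.)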


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 One claim is that shallow architectures (one or two layers) can not concisely represent some functions while a circuit with more depth can concisely represent these same functions. [sent-2, score-0.887]

2 Examples like this are a bit disheartening because existing algorithms for deep learning (deep belief nets, gradient descent on deep neural networks, and perhaps decision trees, depending on who you ask) can’t learn XOR very easily. [sent-6, score-0.969]

3 It turns out that we can define deep learning problems which are solvable by deep belief net style algorithms. [sent-10, score-0.903]

4 Some definitions: Learning Problem A learning problem is defined by probability distribution D(x,y) over features x which are a vector of bits and a label y which is either 0 or 1. [sent-11, score-0.946]

5 Shallow Learning Problem A shallow learning problem is a learning problem where the label y can be predicted with error rate at most e < 0.5… [sent-12, score-0.947]

6 Deep Learning Problem A deep learning problem is a learning problem with a solution representable by a circuit of weighted linear sums with O(number of input features) gates. [sent-14, score-1.067]

7 Theorem There exists a deep learning problem for which: A deep belief net (like) learning algorithm can achieve error rate 0 with probability 1-d for any d > 0 in the limit as the number of IID samples goes to infinity. [sent-19, score-1.505]

8 For each hidden bit h_i, we have 4 output bits x_i1, x_i2, x_i3, x_i4 (implying a total of 4k output bits). [sent-36, score-1.358]

9 …0.5 probability we set one of the output bits to 1 and the rest to 0, and with 0.… [sent-38, score-0.859]

10 …0.5 probability we set one of the output bits to 0 and the rest to 1, and with 0.… [sent-41, score-0.859]

11 The deep belief net-like algorithm we consider is the algorithm which: Builds a threshold weighted sum predictor for every feature x_ij using weights = the probability of agreement between the features minus 0.5 (a simulation sketch of this step appears after this summary list). [sent-45, score-1.704]

12 For each output feature x_ij, the values of output features corresponding to other hidden bits are uncorrelated since by construction Pr(h_i = h_i’) = 0.5. [sent-49, score-1.498]

13 For output features which share a hidden bit, the probability of agreement in value between two bits j, j’ is 0.… [sent-52, score-1.381]

14 For the prediction of each feature, when n = 512 k^4 log((4k)^2/d), the sum of the weights on the 4(k-1) features corresponding to other hidden bits is bounded by 4(k-1) * 1/(32 k^2) <= 1/(8k) (a hedged reconstruction of this bound appears after this summary list). [sent-56, score-0.875]

15 Consequently, the predicted value is the majority of the 3 other features, which is always the value of the hidden bit. [sent-59, score-0.767]

16 The above analysis (sketchily) shows that the predicted value for each output bit is the hidden bit used to generate it. [sent-60, score-0.974]

17 The same style of analysis shows that given the hidden bits, the output bit can be predicted perfectly. [sent-61, score-0.975]

18 In this case, the value of each hidden bit provides a slight consistent edge in predicting the value of the output bit, implying that the learning algorithm converges to uniform weighting over the predicted hidden bit values. [sent-62, score-2.089]

19 To prove the second part of the theorem, we can first show that a uniform weight over all features is the optimal predictor, and then show that the error rate of this predictor converges to 1/2 as k -> infinity. [sent-63, score-0.749]

20 Essentially, the noise in the observed bits given the hidden bits kills prediction performance. [sent-67, score-0.914]
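To make the construction and first-layer analysis in items 8-16 above concrete, here is a small simulation sketch. It is not code from the post: the exact noise probabilities are truncated in the extracted sentences, so the sketch assumes one plausible reading (for each hidden bit h_i, with probability 0.5 all four output bits equal h_i, and with probability 0.5 a single uniformly chosen output bit is flipped) and omits the label y entirely. Under that assumption, the agreement-minus-0.5 weights of item 11 recover each hidden bit essentially perfectly, matching items 15 and 16.

```python
# Simulation sketch of the construction above (NOT code from the post).
# Assumed noise model (the post's exact probabilities are truncated in the
# extraction): for each hidden bit h_i, with probability 0.5 all four output
# bits equal h_i, and with probability 0.5 one uniformly chosen output bit is
# flipped. The label y is omitted; only hidden-bit recovery is checked.
import random

def sample_x(h, rng):
    """Generate the 4k output bits from the k hidden bits."""
    x = []
    for hi in h:
        block = [hi] * 4
        if rng.random() < 0.5:          # assumed: flip exactly one bit half the time
            block[rng.randrange(4)] ^= 1
        x.extend(block)
    return x

def agreement_weights(samples, n_features):
    """First layer (item 11): weight between two features equals the
    empirical probability of agreement minus 0.5."""
    n = len(samples)
    w = [[0.0] * n_features for _ in range(n_features)]
    for p in range(n_features):
        for q in range(p + 1, n_features):
            agree = sum(1 for x in samples if x[p] == x[q])
            w[p][q] = w[q][p] = agree / n - 0.5
    return w

def predict_hidden(x, w, k):
    """Threshold weighted sum over the other features (in +/-1 encoding).
    Per items 14-16, this returns the hidden bit behind each block."""
    preds = []
    for i in range(k):
        p = 4 * i                        # one representative feature per block
        s = sum(w[p][q] * (2 * x[q] - 1) for q in range(4 * k) if q != p)
        preds.append(1 if s > 0 else 0)
    return preds

rng = random.Random(0)
k, n = 8, 5000
hidden = [[rng.randint(0, 1) for _ in range(k)] for _ in range(n)]
samples = [sample_x(h, rng) for h in hidden]
w = agreement_weights(samples, 4 * k)

correct = sum(
    ph == hi
    for h, x in zip(hidden, samples)
    for ph, hi in zip(predict_hidden(x, w, k), h)
)
print("hidden-bit recovery rate:", correct / (n * k))   # close to 1.0
```

The sample-size requirement quoted in item 14 is consistent with a Hoeffding bound plus a union bound over the at most (4k)^2 feature pairs. The following arithmetic is my reconstruction under that standard argument, not a quote from the post:

$$\Pr\!\left(\left|\hat p_{pq} - p_{pq}\right| > \tfrac{1}{32k^2}\right) \le 2\exp\!\left(-\frac{2n}{(32k^2)^2}\right) = 2\exp\!\left(-\frac{n}{512k^4}\right),$$

so with $n = 512\,k^4 \log\big((4k)^2/d\big)$ each empirical agreement probability is within $1/(32k^2)$ of its mean except with probability on the order of $d/(4k)^2$; a union bound over all pairs then keeps every cross-block weight below $1/(32k^2)$ in magnitude, which gives the stated total of at most $4(k-1)\cdot\frac{1}{32k^2} \le \frac{1}{8k}$.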


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('bits', 0.323), ('deep', 0.298), ('output', 0.291), ('hidden', 0.268), ('bit', 0.185), ('probability', 0.182), ('features', 0.179), ('xor', 0.178), ('shallow', 0.178), ('predicted', 0.156), ('circuit', 0.154), ('depth', 0.142), ('problem', 0.139), ('sum', 0.139), ('weighted', 0.134), ('belief', 0.122), ('net', 0.119), ('definition', 0.117), ('concisely', 0.115), ('weights', 0.115), ('uniform', 0.104), ('theorem', 0.092), ('ij', 0.087), ('weighting', 0.086), ('value', 0.082), ('analysis', 0.075), ('error', 0.074), ('predictor', 0.073), ('rate', 0.072), ('builds', 0.071), ('representable', 0.071), ('algorithm', 0.069), ('learning', 0.066), ('expectations', 0.064), ('rest', 0.063), ('proof', 0.062), ('show', 0.062), ('threshold', 0.062), ('represent', 0.062), ('converges', 0.062), ('statement', 0.061), ('prove', 0.061), ('architectures', 0.059), ('corresponding', 0.059), ('label', 0.057), ('definitions', 0.056), ('agreement', 0.056), ('bounds', 0.052), ('evidence', 0.052), ('random', 0.051)]
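As context for the simValue and wordTfidf numbers on this page, here is a hedged sketch of one standard way a similarity score between two posts could be computed from tf-idf weight vectors (cosine similarity). The actual maker-knowledge-mining pipeline is not documented here, so treat this as an assumption-laden illustration rather than the method that produced the numbers below.

```python
# Hedged sketch: cosine similarity between tf-idf weight vectors, one plausible
# way a "simValue" between two posts could be computed. The real pipeline that
# produced the numbers on this page is not specified and may differ.
import math

def cosine_similarity(weights_a, weights_b):
    """weights_a, weights_b: dicts mapping word -> tf-idf weight."""
    shared = set(weights_a) & set(weights_b)
    dot = sum(weights_a[w] * weights_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in weights_a.values()))
    norm_b = math.sqrt(sum(v * v for v in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy usage with a few of the top-weighted words listed above; the second
# vector is made up purely for illustration.
this_post = {'bits': 0.323, 'deep': 0.298, 'output': 0.291, 'hidden': 0.268}
other_post = {'deep': 0.35, 'learning': 0.20, 'networks': 0.15}
print(cosine_similarity(this_post, other_post))
```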

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000005 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem


2 0.28592381 201 hunch net-2006-08-07-The Call of the Deep

Introduction: Many learning algorithms used in practice are fairly simple. Viewed representationally, many prediction algorithms either compute a linear separator of basic features (perceptron, winnow, weighted majority, SVM) or perhaps a linear separator of slightly more complex features (2-layer neural networks or kernelized SVMs). Should we go beyond this, and start using “deep” representations? What is deep learning? Intuitively, deep learning is about learning to predict in ways which can involve complex dependencies between the input (observed) features. Specifying this more rigorously turns out to be rather difficult. Consider the following cases: SVM with Gaussian Kernel. This is not considered deep learning, because an SVM with a gaussian kernel can’t succinctly represent certain decision surfaces. One of Yann LeCun’s examples is recognizing objects based on pixel values. An SVM will need a new support vector for each significantly different background. Since the number

3 0.21816406 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

Introduction: There are a number of learning algorithms which explicitly incorporate randomness into their execution. This includes, amongst others: Neural Networks. Neural networks use randomization to assign initial weights. Boltzmann Machines/Deep Belief Networks. Boltzmann machines are something like a stochastic version of multinode logistic regression. The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness. Bagging. Bagging is a process where a learning algorithm is run several different times on several different datasets, creating a final predictor which makes a majority vote. Policy descent. Several algorithms in reinforcement learning such as Conservative Policy Iteration use random bits to create stochastic policies. Experts algorithms. Randomized weighted majority use random bits as a part of the prediction process to achieve better theoretical guarantees. A basic question is: “Should there

4 0.18473278 258 hunch net-2007-08-12-Exponentiated Gradient

Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it. The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n) (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reve

5 0.18280581 407 hunch net-2010-08-23-Boosted Decision Trees for Deep Learning

Introduction: About 4 years ago, I speculated that decision trees qualify as a deep learning algorithm because they can make decisions which are substantially nonlinear in the input representation. Ping Li has proved this correct, empirically at UAI by showing that boosted decision trees can beat deep belief networks on versions of Mnist which are artificially hardened so as to make them solvable only by deep learning algorithms. This is an important point, because the ability to solve these sorts of problems is probably the best objective definition of a deep learning algorithm we have. I’m not that surprised. In my experience, if you can accept the computational drawbacks of a boosted decision tree, they can achieve pretty good performance. Geoff Hinton once told me that the great thing about deep belief networks is that they work. I understand that Ping had very substantial difficulty in getting this published, so I hope some reviewers step up to the standard of valuing wha

6 0.15299278 149 hunch net-2006-01-18-Is Multitask Learning Black-Boxable?

7 0.14978217 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

8 0.14230771 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

9 0.13955849 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

10 0.12651449 332 hunch net-2008-12-23-Use of Learning Theory

11 0.12063706 14 hunch net-2005-02-07-The State of the Reduction

12 0.11852598 5 hunch net-2005-01-26-Watchword: Probability

13 0.11849312 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

14 0.11779842 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression

15 0.11066179 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

16 0.11045702 34 hunch net-2005-03-02-Prior, “Prior” and Bias

17 0.10923713 351 hunch net-2009-05-02-Wielding a New Abstraction

18 0.10757519 477 hunch net-2013-01-01-Deep Learning 2012

19 0.10755506 143 hunch net-2005-12-27-Automated Labeling

20 0.10651304 438 hunch net-2011-07-11-Interesting Neural Network Papers at ICML 2011


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.249), (1, 0.184), (2, 0.039), (3, -0.074), (4, 0.069), (5, -0.056), (6, 0.024), (7, 0.031), (8, 0.035), (9, -0.079), (10, -0.109), (11, -0.059), (12, -0.021), (13, -0.17), (14, 0.029), (15, 0.19), (16, -0.069), (17, 0.025), (18, -0.072), (19, 0.136), (20, -0.05), (21, 0.105), (22, 0.147), (23, 0.013), (24, -0.066), (25, -0.016), (26, -0.005), (27, -0.038), (28, 0.024), (29, 0.021), (30, 0.007), (31, 0.009), (32, -0.0), (33, 0.044), (34, 0.024), (35, -0.078), (36, 0.008), (37, 0.104), (38, -0.016), (39, -0.014), (40, -0.076), (41, 0.077), (42, -0.019), (43, -0.013), (44, -0.031), (45, 0.015), (46, -0.041), (47, 0.029), (48, -0.0), (49, -0.032)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95641905 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem


2 0.77746499 201 hunch net-2006-08-07-The Call of the Deep

Introduction: Many learning algorithms used in practice are fairly simple. Viewed representationally, many prediction algorithms either compute a linear separator of basic features (perceptron, winnow, weighted majority, SVM) or perhaps a linear separator of slightly more complex features (2-layer neural networks or kernelized SVMs). Should we go beyond this, and start using “deep” representations? What is deep learning? Intuitively, deep learning is about learning to predict in ways which can involve complex dependencies between the input (observed) features. Specifying this more rigorously turns out to be rather difficult. Consider the following cases: SVM with Gaussian Kernel. This is not considered deep learning, because an SVM with a gaussian kernel can’t succinctly represent certain decision surfaces. One of Yann LeCun’s examples is recognizing objects based on pixel values. An SVM will need a new support vector for each significantly different background. Since the number

3 0.72575593 407 hunch net-2010-08-23-Boosted Decision Trees for Deep Learning

Introduction: About 4 years ago, I speculated that decision trees qualify as a deep learning algorithm because they can make decisions which are substantially nonlinear in the input representation. Ping Li has proved this correct, empirically at UAI by showing that boosted decision trees can beat deep belief networks on versions of Mnist which are artificially hardened so as to make them solvable only by deep learning algorithms. This is an important point, because the ability to solve these sorts of problems is probably the best objective definition of a deep learning algorithm we have. I’m not that surprised. In my experience, if you can accept the computational drawbacks of a boosted decision tree, they can achieve pretty good performance. Geoff Hinton once told me that the great thing about deep belief networks is that they work. I understand that Ping had very substantial difficulty in getting this published, so I hope some reviewers step up to the standard of valuing wha

4 0.67735791 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

Introduction: Let’s suppose that we are trying to create a general purpose machine learning box. The box is fed many examples of the function it is supposed to learn and (hopefully) succeeds. To date, most such attempts to produce a box of this form take a vector as input. The elements of the vector might be bits, real numbers, or ‘categorical’ data (a discrete set of values). On the other hand, there are a number of successful applications of machine learning which do not seem to use a vector representation as input. For example, in vision, convolutional neural networks have been used to solve several vision problems. The input to the convolutional neural network is essentially the raw camera image as a matrix. In learning for natural languages, several people have had success on problems like parts-of-speech tagging using predictors restricted to a window surrounding the word to be predicted. A vector window and a matrix both imply a notion of locality which is being actively and

5 0.67609692 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

Introduction: There are a number of learning algorithms which explicitly incorporate randomness into their execution. This includes, amongst others: Neural Networks. Neural networks use randomization to assign initial weights. Boltzmann Machines/Deep Belief Networks. Boltzmann machines are something like a stochastic version of multinode logistic regression. The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness. Bagging. Bagging is a process where a learning algorithm is run several different times on several different datasets, creating a final predictor which makes a majority vote. Policy descent. Several algorithms in reinforcement learning such as Conservative Policy Iteration use random bits to create stochastic policies. Experts algorithms. Randomized weighted majority use random bits as a part of the prediction process to achieve better theoretical guarantees. A basic question is: “Should there

6 0.61625487 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

7 0.60475022 224 hunch net-2006-12-12-Interesting Papers at NIPS 2006

8 0.59338593 16 hunch net-2005-02-09-Intuitions from applied learning

9 0.58832848 477 hunch net-2013-01-01-Deep Learning 2012

10 0.58582491 438 hunch net-2011-07-11-Interesting Neural Network Papers at ICML 2011

11 0.57612228 431 hunch net-2011-04-18-A paper not at Snowbird

12 0.56574446 348 hunch net-2009-04-02-Asymmophobia

13 0.55541784 253 hunch net-2007-07-06-Idempotent-capable Predictors

14 0.54955643 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression

15 0.53903443 218 hunch net-2006-11-20-Context and the calculation misperception

16 0.5277347 118 hunch net-2005-10-07-On-line learning of regular decision rules

17 0.51900095 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting

18 0.51337719 258 hunch net-2007-08-12-Exponentiated Gradient

19 0.50985104 131 hunch net-2005-11-16-The Everything Ensemble Edge

20 0.50947791 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.016), (3, 0.068), (27, 0.257), (38, 0.106), (40, 0.127), (49, 0.029), (53, 0.115), (55, 0.064), (61, 0.021), (94, 0.048), (95, 0.029)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9369092 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem


2 0.89844882 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

Introduction: There are several different flavors of Machine Learning classes. Many classes are of the ‘zoo’ sort: many different learning algorithms are presented. Others avoid the zoo by not covering the full scope of machine learning. This is my view of what makes a good machine learning class, along with why. I’d like to specifically invite comment on whether things are missing, misemphasized, or misplaced. Phase Subject Why? Introduction What is a machine learning problem? A good understanding of the characteristics of machine learning problems seems essential. Characteristics include: a data source, some hope the data is predictive, and a need for generalization. This is probably best taught in a case study manner: lay out the specifics of some problem and then ask “Is this a machine learning problem?” Introduction Machine Learning Problem Identification Identification and recognition of the type of learning problems is (obviously) a very important step i

3 0.89421058 131 hunch net-2005-11-16-The Everything Ensemble Edge

Introduction: Rich Caruana, Alexandru Niculescu, Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. Run the methods on each problem of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression. For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…} evaluate the average performance of the method. A series of conclusions can be drawn from the observations. (Calibrated) boosted decision trees appear to perform best, in general although support v

4 0.88710362 19 hunch net-2005-02-14-Clever Methods of Overfitting

Introduction: “Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets. We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid. Name Method Explanation Remedy Traditional overfitting Train a complex predictor on too-few examples. Hold out pristine examples for testing. Use a simpler predictor. Get more training examples. Integrate over many predictors. Reject papers which do this. Parameter twe

5 0.88668126 12 hunch net-2005-02-03-Learning Theory, by assumption

Introduction: One way to organize learning theory is by assumption (in the assumption = axiom sense), from no assumptions to many assumptions. As you travel down this list, the statements become stronger, but the scope of applicability decreases. No assumptions Online learning There exists a meta prediction algorithm which competes well with the best element of any set of prediction algorithms. Universal Learning Using a “bias” of 2^-(description length of turing machine) in learning is equivalent to all other computable biases up to some constant. Reductions The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems. Independent and Identically Distributed (IID) Data Performance Prediction Based upon past performance, you can predict future performance. Uniform Convergence Performance prediction works even after choosing classifiers based on the data from large sets of classifiers.

6 0.87776244 14 hunch net-2005-02-07-The State of the Reduction

7 0.87694913 41 hunch net-2005-03-15-The State of Tight Bounds

8 0.87425047 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

9 0.87391829 201 hunch net-2006-08-07-The Call of the Deep

10 0.87260514 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

11 0.87219453 194 hunch net-2006-07-11-New Models

12 0.87156361 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

13 0.87017769 259 hunch net-2007-08-19-Choice of Metrics

14 0.86974007 183 hunch net-2006-06-14-Explorations of Exploration

15 0.86748153 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class

16 0.86541444 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

17 0.86538643 370 hunch net-2009-09-18-Necessary and Sufficient Research

18 0.86467272 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

19 0.86374164 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

20 0.86339873 289 hunch net-2008-02-17-The Meaning of Confidence