hunch_net hunch_net-2006 hunch_net-2006-150 knowledge-graph by maker-knowledge-mining

150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets


meta info for this blog

Source: html

Introduction: Say we have two random variables X,Y with mutual information I(X,Y). Let’s say we want to represent them with a Bayes net of the form X <- M -> Y, such that the entropy of M equals the mutual information, i.e. H(M)=I(X,Y). Intuitively, we would like our hidden state to be as simple as possible (entropy-wise). The data processing inequality means that H(M) >= I(X,Y), so the mutual information is a lower bound on how simple M could be. Furthermore, if such a construction existed it would have a nice coding interpretation — one could jointly code X and Y by first coding the mutual information, then coding X with this mutual info (without Y) and coding Y with this mutual info (without X). It turns out that such a construction does not exist in general (thanks to Alina Beygelzimer for a counterexample! see below for the sketch). What are the implications of this? Well, it’s hard for me to say, but it does suggest to me that the ‘generative’ model philosophy might be burdened with a harder modeling task.
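A short derivation of both claims (my own sketch, not part of the post): in a net of the form X <- M -> Y, the variables X and Y are conditionally independent given M, so the data processing inequality and the chain rule for entropy give

\begin{align*}
I(X,Y) &\le I(X,M) \le H(M), \\
H(M) + H(X \mid M) + H(Y \mid M) &= H(M) + H(X,Y \mid M) = H(X,Y) + H(M \mid X,Y) \ge H(X,Y).
\end{align*}

The first line is the lower bound quoted above; the second says the proposed code (send M, then X given M, then Y given M) costs at least H(X,Y) bits, with equality when M is a deterministic function of (X,Y).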


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Say we have two random variables X,Y with mutual information I(X,Y). [sent-1, score-0.826]

2 Let’s say we want to represent them with a Bayes net of the form X <- M -> Y, such that the entropy of M equals the mutual information, i.e. H(M)=I(X,Y). [sent-2, score-1.217]

3 Intuitively, we would like our hidden state to be as simple as possible (entropy-wise). [sent-5, score-0.37]

4 The data processing inequality means that H(M) >= I(X,Y), so the mutual information is a lower bound on how simple M could be. [sent-6, score-0.893]

5 Furthermore, if such a construction existed it would have a nice coding interpretation — one could jointly code X and Y by first coding the mutual information, then coding X with this mutual info (without Y) and coding Y with this mutual info (without X). [sent-7, score-3.594]

6 It turns out that such a construction does not exist in general (thanks to Alina Beygelzimer for a counterexample! see below for the sketch). [sent-8, score-0.139]

7 Well, it’s hard for me to say, but it does suggest to me that the ‘generative’ model philosophy might be burdened with a harder modeling task. [sent-11, score-0.236]

8 If all we care about is an information-theoretic, compact hidden state, then constructing an accurate Bayes net might be harder, due to the fact that it takes more bits to specify the distribution of the hidden state. [sent-12, score-1.27]

9 In fact, since we usually condition on the data, it seems odd that we should bother specifying a (potentially more complex) generative model. [sent-13, score-0.515]

10 The information bottleneck seems interesting, though this has peculiarities of its own. [sent-15, score-0.245]

11 Alina’s counterexample: Here is the joint distribution P(X,Y). [sent-16, score-0.21]

12 Now choose Y to be the OR function of X (itself a uniform random bit) and some other ‘hidden’ random bit (uniform). [sent-18, score-0.082]

13 So the joint is: P(0,0)=1/4, P(0,1)=1/4, P(1,0)=0, P(1,1)=1/2. Note P(X=1)=1/2 and P(Y=1)=3/4. [sent-19, score-0.132]

14 The rest of the proof showing that this is not achievable in a ‘compact’ Bayes net is in a comment. [sent-21, score-0.294]
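To make the numbers in the counterexample concrete, here is a small Python check (my own sketch, not from the post; the variable and function names are made up) that computes the entropies and the mutual information for the joint above:

import math

# Alina's counterexample: X is a uniform random bit and Y = OR(X, Z) for an
# independent uniform hidden bit Z, giving the joint distribution stated above.
P = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.5}

def entropy(dist):
    # Shannon entropy in bits of a dict mapping outcomes to probabilities.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distributions of X and Y.
PX = {x: sum(p for (xv, _), p in P.items() if xv == x) for x in (0, 1)}
PY = {y: sum(p for (_, yv), p in P.items() if yv == y) for y in (0, 1)}

print("H(X)   =", round(entropy(PX), 4))                              # 1.0
print("H(Y)   =", round(entropy(PY), 4))                              # 0.8113
print("H(X,Y) =", round(entropy(P), 4))                               # 1.5
print("I(X,Y) =", round(entropy(PX) + entropy(PY) - entropy(P), 4))   # 0.3113

So any hidden M in an X <- M -> Y net must satisfy H(M) >= 0.3113 bits; the post’s claim is that no M achieves this bound exactly while making X and Y conditionally independent, with the proof sketch in the comment referenced above.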


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('mutual', 0.517), ('coding', 0.31), ('hidden', 0.217), ('net', 0.216), ('info', 0.209), ('counterexample', 0.186), ('information', 0.164), ('compact', 0.155), ('bayes', 0.153), ('generative', 0.149), ('construction', 0.139), ('entropy', 0.132), ('joint', 0.132), ('say', 0.125), ('alina', 0.116), ('harder', 0.098), ('existed', 0.093), ('wise', 0.093), ('fact', 0.089), ('state', 0.087), ('bother', 0.086), ('alternatives', 0.086), ('random', 0.082), ('bottleneck', 0.081), ('beygelzimer', 0.081), ('condition', 0.081), ('jointly', 0.081), ('theoretic', 0.081), ('distribution', 0.078), ('achievable', 0.078), ('philosophy', 0.074), ('unbiased', 0.074), ('sketch', 0.074), ('intuitively', 0.074), ('represent', 0.074), ('inequality', 0.074), ('without', 0.073), ('processing', 0.072), ('interpretation', 0.072), ('constructing', 0.07), ('odd', 0.07), ('usually', 0.068), ('simple', 0.066), ('accurate', 0.064), ('modeling', 0.064), ('variables', 0.063), ('uniform', 0.063), ('specifying', 0.061), ('implications', 0.061), ('furthermore', 0.06)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999976 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets


2 0.14978217 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

Introduction: “Deep learning” is used to describe learning architectures which have significant depth (as a circuit). One claim is that shallow architectures (one or two layers) cannot concisely represent some functions while a circuit with more depth can concisely represent these same functions. Proving lower bounds on the size of a circuit is substantially harder than upper bounds (which are constructive), but some results are known. Luca Trevisan’s class notes detail how XOR is not concisely representable by “AC0” (= constant depth unbounded fan-in AND, OR, NOT gates). This doesn’t quite prove that depth is necessary for the representations commonly used in learning (such as a thresholded weighted sum), but it is strongly suggestive that this is so. Examples like this are a bit disheartening because existing algorithms for deep learning (deep belief nets, gradient descent on deep neural networks, and perhaps decision trees depending on who you ask) can’t learn XOR very easily.

3 0.12219325 160 hunch net-2006-03-02-Why do people count for learning?

Introduction: This post is about a confusion of mine with respect to many commonly used machine learning algorithms. A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express: P(A,B,C) = P(A | B,C)P(B)P(C) What I don’t understand is the mechanism commonly used to estimate P(A | B, C) . If we let N(A,B,C) be the number of instances of A,B,C then people sometimes form an estimate according to: P’(A | B,C) = N(A,B,C) / N /[N(B)/N * N(C)/N] = N(A,B,C) N /[N(B) N(C)] … in other words, people just estimate P’(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size space A x B x C , but it (nat

4 0.095532604 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

Introduction: I have recently completed a 500+ page-book on MDL , the first comprehensive overview of the field (yes, this is a sneak advertisement ). Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are: In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron , MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes! In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by somethin

5 0.091858216 8 hunch net-2005-02-01-NIPS: Online Bayes

Introduction: One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms . From the paper: The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner —usually, Nature could provide examples in an adversarial manner. We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting). It’s a very nice analysis of some of my favorite algorithms that all hinges around a beautiful theorem: Let Q be any distribution over parameters theta. Then for all sequences S: L_{Bayes}(S) <= L_Q(S)

6 0.088324189 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

7 0.081699938 12 hunch net-2005-02-03-Learning Theory, by assumption

8 0.076350771 237 hunch net-2007-04-02-Contextual Scaling

9 0.07507737 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

10 0.070873015 45 hunch net-2005-03-22-Active learning

11 0.06716273 169 hunch net-2006-04-05-What is state?

12 0.064811416 248 hunch net-2007-06-19-How is Compressed Sensing going to change Machine Learning ?

13 0.064508319 27 hunch net-2005-02-23-Problem: Reinforcement Learning with Classification

14 0.064436935 185 hunch net-2006-06-16-Regularization = Robustness

15 0.063776411 165 hunch net-2006-03-23-The Approximation Argument

16 0.063158318 280 hunch net-2007-12-20-Cool and Interesting things at NIPS, take three

17 0.062394261 98 hunch net-2005-07-27-Not goal metrics

18 0.059972331 19 hunch net-2005-02-14-Clever Methods of Overfitting

19 0.059252266 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

20 0.059026018 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.121), (1, 0.063), (2, 0.022), (3, -0.012), (4, 0.022), (5, -0.031), (6, 0.034), (7, 0.015), (8, 0.073), (9, -0.057), (10, -0.017), (11, -0.012), (12, -0.011), (13, -0.026), (14, 0.035), (15, -0.043), (16, -0.059), (17, 0.02), (18, 0.064), (19, -0.042), (20, -0.04), (21, -0.013), (22, 0.072), (23, -0.023), (24, -0.0), (25, 0.077), (26, 0.005), (27, 0.019), (28, 0.026), (29, 0.006), (30, 0.012), (31, -0.054), (32, 0.058), (33, 0.003), (34, 0.088), (35, -0.068), (36, -0.051), (37, 0.072), (38, 0.051), (39, 0.01), (40, -0.025), (41, 0.098), (42, 0.049), (43, -0.017), (44, -0.003), (45, -0.014), (46, -0.004), (47, 0.09), (48, -0.017), (49, -0.067)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97538495 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets


2 0.68680751 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

Introduction: I have recently completed a 500+ page-book on MDL , the first comprehensive overview of the field (yes, this is a sneak advertisement ). Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are: In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron , MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes! In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by somethin

3 0.62317073 248 hunch net-2007-06-19-How is Compressed Sensing going to change Machine Learning ?

Introduction: Compressed Sensing (CS) is a new framework developed by Emmanuel Candes , Terry Tao and David Donoho . To summarize, if you acquire a signal in some basis that is incoherent with the basis in which you know the signal to be sparse in, it is very likely you will be able to reconstruct the signal from these incoherent projections. Terry Tao, the recent Fields medalist , does a very nice job at explaining the framework here . He goes further in the theory description in this post where he mentions the central issue of the Uniform Uncertainty Principle. It so happens that random projections are on average incoherent, within the UUP meaning, with most known basis (sines, polynomials, splines, wavelets, curvelets …) and are therefore an ideal basis for Compressed Sensing. [ For more in-depth information on the subject, the Rice group has done a very good job at providing a central library of papers relevant to the growing subject: http://www.dsp.ece.rice.edu/cs/ ] The Machine

4 0.59672254 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * Product features i Pr’(feature i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is

5 0.56681657 160 hunch net-2006-03-02-Why do people count for learning?

Introduction: This post is about a confusion of mine with respect to many commonly used machine learning algorithms. A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express: P(A,B,C) = P(A | B,C)P(B)P(C) What I don’t understand is the mechanism commonly used to estimate P(A | B, C) . If we let N(A,B,C) be the number of instances of A,B,C then people sometimes form an estimate according to: P’(A | B,C) = N(A,B,C) / N /[N(B)/N * N(C)/N] = N(A,B,C) N /[N(B) N(C)] … in other words, people just estimate P’(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size space A x B x C , but it (nat

6 0.55309063 102 hunch net-2005-08-11-Why Manifold-Based Dimension Reduction Techniques?

7 0.54664791 165 hunch net-2006-03-23-The Approximation Argument

8 0.51623005 185 hunch net-2006-06-16-Regularization = Robustness

9 0.50459129 217 hunch net-2006-11-06-Data Linkage Problems

10 0.4904995 312 hunch net-2008-08-04-Electoralmarkets.com

11 0.47841287 413 hunch net-2010-10-08-An easy proof of the Chernoff-Hoeffding bound

12 0.47561961 140 hunch net-2005-12-14-More NIPS Papers II

13 0.46614966 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

14 0.45756876 87 hunch net-2005-06-29-Not EM for clustering at COLT

15 0.44978911 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

16 0.44351909 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

17 0.43123567 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

18 0.4235993 34 hunch net-2005-03-02-Prior, “Prior” and Bias

19 0.42045808 237 hunch net-2007-04-02-Contextual Scaling

20 0.41480359 61 hunch net-2005-04-25-Embeddings: what are they good for?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.033), (27, 0.194), (38, 0.082), (53, 0.019), (55, 0.045), (63, 0.417), (94, 0.079), (95, 0.015)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.87424296 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets


2 0.86360782 99 hunch net-2005-08-01-Peekaboom

Introduction: Luis has released Peekaboom, a successor to ESPgame ( game site ). The purpose of the game is similar—using the actions of people playing a game to gather data helpful in solving AI. Peekaboom gathers more detailed, and perhaps more useful, data about vision. For ESPgame, the byproduct of the game was mutually agreed upon labels for common images. For Peekaboom, the location of the subimage generating the label is revealed by the game as well. Given knowledge about what portion of the image is related to a label it may be more feasible to learn to recognize the appropriate parts. There isn’t a dataset yet available for this game as there is for ESPgame, but hopefully a significant number of people will play and we’ll have one to work with soon.

3 0.73756742 102 hunch net-2005-08-11-Why Manifold-Based Dimension Reduction Techniques?

Introduction: Manifold-based dimension-reduction algorithms share the following general outline. Given: a metric d() and a set of points S Construct a graph with a point in every node and every edge connecting to the node of one of the k-nearest neighbors. Associate with the edge a weight which is the distance between the points in the connected nodes. Digest the graph. This might include computing the shortest path between all points or figuring out how to linearly interpolate the point from its neighbors. Find a set of points in a low dimensional space which preserve the digested properties. Examples include LLE, Isomap (which I worked on), Hessian-LLE, SDE, and many others. The hope with these algorithms is that they can recover the low dimensional structure of point sets in high dimensional spaces. Many of them can be shown to work in interesting ways producing various compelling pictures. Despite doing some early work in this direction, I suffer from a motivational

4 0.72992349 260 hunch net-2007-08-25-The Privacy Problem

Introduction: Machine Learning is rising in importance because data is being collected for all sorts of tasks where it either wasn’t previously collected, or for tasks that did not previously exist. While this is great for Machine Learning, it has a downside—the massive data collection which is so useful can also lead to substantial privacy problems. It’s important to understand that this is a much harder problem than many people appreciate. The AOL data release is a good example. To those doing machine learning, the following strategies might be obvious: Just delete any names or other obviously personally identifiable information. The logic here seems to be “if I can’t easily find the person then no one can”. That doesn’t work as demonstrated by the people who were found circumstantially from the AOL data. … then just hash all the search terms! The logic here is “if I can’t read it, then no one can”. It’s also trivially broken by a dictionary attack—just hash all the strings

5 0.47338447 259 hunch net-2007-08-19-Choice of Metrics

Introduction: How do we judge success in Machine Learning? As Aaron notes , the best way is to use the loss imposed on you by the world. This turns out to be infeasible sometimes for various reasons. The ones I’ve seen are: The learned prediction is used in some complicated process that does not give the feedback necessary to understand the prediction’s impact on the loss. The prediction is used by some other system which expects some semantics to the predicted value. This is similar to the previous example, except that the issue is design modularity rather than engineering modularity. The correct loss function is simply unknown (and perhaps unknowable, except by experimentation). In these situations, it’s unclear what metric for evaluation should be chosen. This post has some design advice for this murkier case. I’m using the word “metric” here to distinguish the fact that we are considering methods for evaluating predictive systems rather than a loss imposed by the real wor

6 0.46805403 14 hunch net-2005-02-07-The State of the Reduction

7 0.46585229 351 hunch net-2009-05-02-Wielding a New Abstraction

8 0.46516499 95 hunch net-2005-07-14-What Learning Theory might do

9 0.46515495 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

10 0.4650934 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

11 0.46443072 41 hunch net-2005-03-15-The State of Tight Bounds

12 0.46339396 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

13 0.46330729 235 hunch net-2007-03-03-All Models of Learning have Flaws

14 0.46259007 143 hunch net-2005-12-27-Automated Labeling

15 0.46239856 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

16 0.46182084 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

17 0.46048689 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

18 0.46013737 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

19 0.45960897 131 hunch net-2005-11-16-The Everything Ensemble Edge

20 0.45918116 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design