brendan_oconnor_ai brendan_oconnor_ai-2011 knowledge-graph by maker-knowledge-mining
1 brendan oconnor ai-2011-11-13-Bayes update view of pointwise mutual information
Introduction: This is fun. Pointwise Mutual Information (e.g. Church and Hanks 1990) between two variable outcomes \(x\) and \(y\) is \[ PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)} \] It’s called “pointwise” because Mutual Information, between two (discrete) variables X and Y, is the expectation of PMI over possible outcomes of X and Y: \( MI(X,Y) = \sum_{x,y} p(x,y)\, PMI(x,y) \). One interpretation of PMI is as a measure of deviation from independence: if X and Y were independent we would have \(p(x,y)=p(x)p(y)\), so the ratio measures how non-independent the two outcomes are. You can get another interpretation of this quantity if you switch into conditional probabilities. Looking just at the ratio, apply the definition of conditional probability: \[ \frac{p(x,y)}{p(x)p(y)} = \frac{p(x|y)}{p(x)} \] Think about doing a Bayes update for your belief about \(x\): start with the prior \(p(x)\), then learn \(y\) and update to the posterior belief \(p(x|y)\). How much your belief changes, on the log scale, is exactly the PMI: \( PMI(x,y) = \log \frac{p(x|y)}{p(x)} \).
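A quick numeric check of that identity, as a minimal sketch in Python (the 2x2 joint distribution below is made up purely for illustration):

# Verify PMI(x,y) = log p(x,y)/(p(x)p(y)) = log p(x|y)/p(x)
# on a small made-up joint distribution (values are illustrative only).
import math

joint = {
    ("x0", "y0"): 0.4, ("x0", "y1"): 0.1,
    ("x1", "y0"): 0.2, ("x1", "y1"): 0.3,
}
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in ("x0", "x1")}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in ("y0", "y1")}

x, y = "x0", "y0"
pmi_independence_view = math.log(joint[(x, y)] / (px[x] * py[y]))
p_x_given_y = joint[(x, y)] / py[y]
pmi_bayes_update_view = math.log(p_x_given_y / px[x])
print(pmi_independence_view, pmi_bayes_update_view)  # identical up to floating point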
2 brendan oconnor ai-2011-11-11-Memorizing small tables
Introduction: Lately, I’ve been trying to memorize very small tables, especially for better intuitions and rule-of-thumb calculations. At the moment I have these above my desk: The first one is a few entries in a natural logarithm table. There are all these stories about how, in the slide-rule era, people developed better intuitions about the scale of logarithms because they physically engaged with them all the time. I spend lots of time looking at log-likelihoods, log-odds-ratios, and logistic regression coefficients, so I think it would be nice to have quick intuitions about what they are. (Though the Gelman and Hill textbook has an interesting argument against odds-scale interpretations of logistic regression coefficients.) The second one is some zsh filename manipulation shortcuts. OK, this is narrower than the others, but pretty useful for me at least. The third one is rough unit equivalencies for data rates over time. I find this very important for quickly determ
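For what it's worth, a small script can regenerate a table like the first one (a sketch; the post doesn't say exactly which entries are on the card, so these are just common reference values):

# Print a small natural-log table to memorize (illustrative entries only).
import math

for v in [1.5, 2, math.e, 3, 5, 10, 100]:
    print(f"ln({v:g}) = {math.log(v):.3f}")
# e.g. ln(2) ~ 0.693, ln(10) ~ 2.303 -- handy for log-likelihoods and log-odds.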
3 brendan oconnor ai-2011-10-05-Be careful with dictionary-based text analysis
Introduction: OK, everyone loves to run dictionary methods for sentiment and other text analysis — counting words from a predefined lexicon in a big corpus, in order to explore or test hypotheses about the corpus. In particular, this is often done for sentiment analysis: count positive and negative words (according to a sentiment polarity lexicon, which was derived from human raters or previous researchers’ intuitions), and then proclaim that the output gives the sentiment levels of the documents. More and more papers come out every day that do this. I’ve done this myself. It’s interesting and fun, but it’s easy to get a bunch of meaningless numbers if you don’t carefully validate what’s going on. There are certainly good studies in this area that do further validation and analysis, but it’s hard to trust a study that just presents a graph with a few overly strong speculative claims as to its meaning. This happens more than it ought to. I was happy to see a similarly critical view in a nice workin
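For concreteness, here is a minimal sketch of the kind of dictionary counting being described (the tiny lexicon and scoring rule are made up for illustration, not taken from any real study):

# Sketch of the dictionary method: count lexicon hits per document.
# Real studies use lexicons like LIWC or OpinionFinder and should validate
# the resulting scores against human judgments.
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "awful", "sad"}

def dictionary_sentiment(text):
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / max(len(tokens), 1)   # crude per-token polarity score

print(dictionary_sentiment("what a great and happy day"))
print(dictionary_sentiment("this is bad , just awful"))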
4 brendan oconnor ai-2011-09-25-Information theory stuff
Introduction: Actually this post is mainly to test the MathJax installation I put into WordPress via this plugin. But information theory is great, why not? The probability of a symbol is \(p\). It takes \(\log \frac{1}{p} = -\log p\) bits to encode one symbol — sometimes called its “surprisal”. Surprisal is 0 for a 100%-probable symbol, and ranges up to \(\infty\) for extremely low-probability symbols. This is because you use a coding scheme that encodes common symbols as very short strings and less common symbols as longer ones (e.g. Huffman or arithmetic coding). We should say logarithms are base-2 so information is measured in bits.\(^*\) If you have a stream of such symbols and a probability distribution \(\vec{p}\) for them, where symbol \(i\) occurs with probability \(p_i\), then the average message size is the expected surprisal: \[ H(\vec{p}) = \sum_i p_i \log \frac{1}{p_i} \] This is the Shannon entropy of the probability distribution \( \vec{p} \), which is a me
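A small Python sketch of surprisal and entropy as defined above (the example distribution is made up):

# Surprisal and Shannon entropy for a small symbol distribution.
import math

def surprisal_bits(p):
    return -math.log2(p)           # bits needed to encode a symbol of probability p

def entropy_bits(probs):
    return sum(p * surprisal_bits(p) for p in probs if p > 0)   # expected surprisal

probs = [0.5, 0.25, 0.125, 0.125]
for p in probs:
    print(f"p={p}: surprisal = {surprisal_bits(p):.2f} bits")
print(f"entropy H = {entropy_bits(probs):.2f} bits")   # 1.75 bits for this example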
5 brendan oconnor ai-2011-09-19-End-to-end NLP packages
Introduction: What freely available end-to-end natural language processing (NLP) systems are out there that start with raw text and output parses and semantic structures? Lots of NLP research focuses on single tasks at a time, and thus produces software that does a single task at a time. But for various applications, it is nicer to have a full end-to-end system that just runs on whatever text you give it. If you believe this is a worthwhile goal (see caveat at bottom), I will postulate there aren’t a ton of such end-to-end, multilevel systems. Here are ones I can think of. Corrections and clarifications welcome. Stanford CoreNLP. Raw text to rich syntactic dependencies (LFG-inspired). Also POS, NER, coreference. C&C tools. From (sentence-segmented, tokenized?) text to rich syntactic dependencies (CCG-based) and also a semantic representation. POS and chunks on the way. Does anyone use this much? It seems underappreciated relative to its richness. Senna. Sentence-se
6 brendan oconnor ai-2011-08-27-CMU Twitter Part-of-Speech tagger 0.2
Introduction: Announcement: We recently released a new version (0.2) of our part-of-speech tagger for English Twitter messages , along with annotations and interface. See the link for more details.
7 brendan oconnor ai-2011-06-26-Good linguistic semantics textbook?
Introduction: I’m looking for recommendations for a good textbook/handbook/reference on (non-formal) linguistic semantics. My undergrad semantics course was almost entirely focused on logical/formal semantics, which is fine, but I don’t feel familiar with the breadth of substantive issues — for example, I’d be hard-pressed to explain why something like semantic/thematic role labeling should be useful for anything at all. I somewhat randomly stumbled upon Frawley 1992 (review) in a used bookstore and it seemed pretty good — in particular, it cleanly separates itself from the philosophical study of semantics, and thus identifies issues that seem amenable to computational modeling. I’m wondering what else is out there? Here’s a comparison of three textbooks.
8 brendan oconnor ai-2011-06-14-How much text versus metadata is in a tweet?
Introduction: This should have been a blog post, but I got lazy and wrote a plaintext document instead. Link For Twitter, context matters: 90% of a tweet is metadata and 10% is text. That’s measured by (an approximation of) information content; by raw data size, it’s 95/5.
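A rough sketch of what the raw-size comparison might look like (not the document's actual methodology; the tweet payload below is a tiny made-up stand-in for a real Twitter API object):

# Compare the bytes in a tweet's text field to the bytes in the full payload.
import json

tweet = {
    "id": 71234567890123456,
    "text": "Example tweet text goes here",
    "created_at": "Tue Jun 14 20:00:00 +0000 2011",
    "user": {"id": 12345, "screen_name": "example", "followers_count": 42},
    "entities": {"hashtags": [], "urls": [], "user_mentions": []},
}
total_bytes = len(json.dumps(tweet).encode("utf-8"))
text_bytes = len(tweet["text"].encode("utf-8"))
print(f"text: {text_bytes} B, total: {total_bytes} B, text share: {text_bytes/total_bytes:.0%}")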
9 brendan oconnor ai-2011-05-21-iPhone autocorrection error analysis
Introduction: re @andrewparker: My iPhone auto-corrected “Harvard” to “Garbage”. Well played, Apple engineers. I was wondering how this would happen, and then noticed that each character pair has 0 to 2 distance on the QWERTY keyboard. Perhaps their model is eager to allow QWERTY-local character substitutions. >>> zip('harvard', 'garbage') [('h', 'g'), ('a', 'a'), ('r', 'r'), ('v', 'b'), ('a', 'a'), ('r', 'g'), ('d', 'e')] And then most any language model thinks p(“garbage”) > p(“harvard”), at the very least in a unigram model with a broad-domain corpus. So if it’s a noisy channel-style model, they’re underpenalizing the edit distance relative to the LM prior. (Reference: Norvig’s noisy channel spelling correction article.) On the other hand, given how insane iPhone autocorrections are, and from the number of times I’ve seen it delete a quite reasonable word I wrote, I’d bet “harvard” isn’t even in their LM. (Where the LM is more like just a dictionary; call it quantizin
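A toy sketch of the noisy-channel scoring being described (the unigram probabilities, penalty weights, and edit-cost function are all made up for illustration; a real system would use a keyboard-aware edit distance):

# Noisy-channel intuition: score(candidate) = log p(candidate) - penalty * edit_cost.
import math

log_p = {"garbage": math.log(1e-5), "harvard": math.log(1e-6)}   # hypothetical LM

def edit_cost(typed, candidate):
    # stand-in for a keyboard-aware edit distance: count differing letters
    return sum(a != b for a, b in zip(typed, candidate)) + abs(len(typed) - len(candidate))

def score(typed, candidate, penalty):
    return log_p[candidate] - penalty * edit_cost(typed, candidate)

typed = "harvard"
for penalty in (0.5, 3.0):   # a small penalty under-weights the edits, as in the post
    best = max(log_p, key=lambda c: score(typed, c, penalty))
    print(penalty, best)     # 0.5 -> "garbage", 3.0 -> "harvard"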
10 brendan oconnor ai-2011-05-20-Log-normal and logistic-normal terminology
Introduction: I was cleaning my office and found a back-of-envelope diagram Shay drew me once, so I’m writing it up to not forget. The definitions of the logistic-normal and log-normal distributions are a little confusing with regard to their relationship to the normal distribution. If you draw samples from one, the arrows below show the transformation to apply so that you have samples from another. For example, if x ~ Normal, then transforming as y=exp(x) implies y ~ LogNormal. The adjective terminology is inverted: the logistic function goes from normal to logistic-normal, but the log function goes from log-normal to normal (the other way!). The log of the log-normal is normal, but it’s the logit of the logistic-normal that’s normal. Here are densities of these different distributions via transformations from a standard normal. In R: x=rnorm(1e6); hist(x); hist(exp(x)/(1+exp(x))); hist(exp(x)) Just to make things more confusing, note the logistic-normal distributi
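The same transformations, sketched in Python rather than R (assumes numpy; the inverse transforms confirm the naming: log undoes the log-normal, logit undoes the logistic-normal):

# normal --exp--> log-normal        (log of a log-normal is normal)
# normal --logistic--> logistic-normal  (logit of a logistic-normal is normal)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)           # x ~ Normal
y_lognormal = np.exp(x)                  # y ~ LogNormal
y_logisticnormal = 1 / (1 + np.exp(-x))  # y ~ LogisticNormal

back_via_log = np.log(y_lognormal)
back_via_logit = np.log(y_logisticnormal / (1 - y_logisticnormal))
print(np.allclose(back_via_log, x), np.allclose(back_via_logit, x))  # True True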
11 brendan oconnor ai-2011-05-05-Shalizi’s review of NKS
Introduction: I laugh out loud every time I reread Cosma Shalizi’s review of “New Kind of Science” (2005). I remember reading it back in college when everyone was talking about the book, when I was just losing my naivete about the popular science treatments of complex systems and such. I must be getting more cynical as I get older because I keep liking the review more. This time my favorite line was Wolfram even goes on to refute post-modernism on this basis; I won’t touch that except to say that I’d have paid a lot to see Wolfram and Jacques Derrida go one-on-one. And on the issue of running your own conventions and citing yourself, he compares it to … the way George Lakoff uses “as cognitive science shows” to mean “as I claimed in my earlier books” These quotes are funnier in context .
12 brendan oconnor ai-2011-04-08-Rough binomial confidence intervals
Introduction: I made this table a while ago and find it handy: for example, looking at a table of percentages and trying to figure out what’s meaningful or not. Why run a test if you can estimate it in your head? References: Wikipedia, binom.test
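The table itself isn't reproduced in this excerpt, but the standard normal-approximation interval behind such rules of thumb is easy to sketch (illustrative code, not the post's actual table):

# Rough 95% binomial confidence interval via the normal approximation.
import math

def rough_ci_95(p_hat, n):
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Handy special case: near p = 0.5, the 95% half-width is about 1/sqrt(n).
for n in (100, 1000, 10000):
    lo, hi = rough_ci_95(0.5, n)
    print(f"n={n}: {lo:.3f} .. {hi:.3f}  (~ +/- {1/math.sqrt(n):.3f})")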
13 brendan oconnor ai-2011-03-02-Poor man’s linear algebra textbook
Introduction: I keep learning new bits of linear algebra all the time, but I’m always hurting for a useful reference. I probably should get a good book (which?), but in the meantime I’m collecting several nice online sources that ML researchers seem to often recommend: The Matrix Cookbook, plus a few more tutorial/introductory pieces, aimed at an intermediate-ish level. Main reference: The Matrix Cookbook – 71 pages of identities and such. This seems to be really popular. Tutorials/introductions: CS229 linear algebra review – from Stanford’s ML course. It seems to introduce all the essentials, and it’s vaguely familiar for me. (26 pages) Minka’s Old and New Matrix Algebra Useful for Statistics – has a great part on how to do derivatives. (19 pages) MacKay’s The Humble Gaussian – OK, not really pure linear algebra anymore, but quite enlightening. (12 pages) After studying for this last stats/ML midterm, I’ve now printed them out and stuck them in a binder. A poor
14 brendan oconnor ai-2011-02-19-Move to brenocon.com
Introduction: I’ve changed my website and blog URL from anyall.org to brenocon.com . The former was supposed to be a reference to first-order logic: the existential and universal quantifiers are fundamental to relational reasoning, and as testament to that, they are enshrined as “any()” and “all()” in wise programming languages like Python and R. Or something like that. It turns out this was obvious only to me :) I tried to set up everything to automatically redirect, so no links should be broken. Hopefully.
15 brendan oconnor ai-2011-01-11-Please report your SVM’s kernel!
Introduction: I’m tired of reading papers that use an SVM but don’t say which kernel they used. (There are tons of such papers in NLP and, I think, other areas that do applied machine learning.) I suspect a lot of these papers are actually using a linear kernel. An un-kernelized, linear SVM is nearly the same as logistic regression — every feature independently increases or decreases the classifier’s output prediction. But a quadratic-kernel SVM is much more like boosted depth-2 decision trees. It can do automatic combinations of pairs of features — a potentially very different thing, since you can start throwing in features that don’t do anything on their own but might have useful interactions with others. (And of course, more complicated kernels do progressively more complicated and non-linear things.) I have heard people say they download an SVM package, try a bunch of different kernels, and find the linear kernel is the best. In such cases they could have just used a logistic regr
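A sketch of that contrast on a toy interaction-only problem (illustrative data; assumes scikit-learn is available, which the post doesn't mention):

# On XOR-like data, where only the feature *interaction* is informative,
# a linear-kernel SVM behaves like logistic regression, while a quadratic
# kernel can pick up the pairwise combination.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # label depends only on the interaction

for name, clf in [("logistic regression", LogisticRegression()),
                  ("linear-kernel SVM", SVC(kernel="linear")),
                  ("quadratic-kernel SVM", SVC(kernel="poly", degree=2))]:
    print(name, clf.fit(X, y).score(X, y))
# The two linear models hover near chance; the quadratic kernel fits the XOR pattern.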
Introduction: I wrote an interactive visualization for Gaussian mixtures and some probability laws, using the excellent Protovis library. It helped me build intuition for the law of total variance. Link
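As a companion to the visualization, here is a quick numeric check of the law of total variance on a two-component Gaussian mixture (the mixture parameters are made up):

# Law of total variance: Var(X) = E[Var(X|Z)] + Var(E[X|Z]).
import numpy as np

weights = np.array([0.3, 0.7])
means = np.array([-2.0, 1.0])
sds = np.array([1.0, 0.5])

within = np.sum(weights * sds**2)                        # E[Var(X|Z)]
overall_mean = np.sum(weights * means)
between = np.sum(weights * (means - overall_mean)**2)    # Var(E[X|Z])

# Compare against the empirical variance of samples from the mixture.
rng = np.random.default_rng(0)
z = rng.choice(2, size=1_000_000, p=weights)
x = rng.normal(means[z], sds[z])
print(within + between, x.var())   # these should agree closely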