
brendan_oconnor_ai 2013 knowledge graph



Blog posts:

1 brendan oconnor ai-2013-10-31-tanh is a rescaled logistic sigmoid function

Introduction: This confused me for a while when I first learned it, so in case it helps anyone else: The logistic sigmoid function, a.k.a. the inverse logit function, is \[ g(x) = \frac{ e^x }{1 + e^x} \] Its outputs range from 0 to 1, and are often interpreted as probabilities (in, say, logistic regression). The tanh function, a.k.a. hyperbolic tangent function, is a rescaling of the logistic sigmoid, such that its outputs range from -1 to 1. (There’s horizontal stretching as well.) \[ \tanh(x) = 2 g(2x) - 1 \] It’s easy to show the above leads to the standard definition \( \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \). The (-1,+1) output range tends to be more convenient for neural networks, so tanh functions show up there a lot. The two functions are plotted below. Blue is the logistic function, and red is tanh.
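
A quick numerical check and plot of the rescaling identity, as a sketch in R (not code from the post), with g the logistic sigmoid defined above:

g <- function(x) exp(x) / (1 + exp(x))    # logistic sigmoid, a.k.a. inverse logit
x <- seq(-4, 4, by = 0.01)
max(abs(tanh(x) - (2 * g(2 * x) - 1)))    # ~1e-16: the identity holds up to rounding
plot(x, g(x), type = "l", col = "blue", ylim = c(-1, 1), ylab = "")   # blue: logistic
lines(x, tanh(x), col = "red")                                        # red: tanh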

2 brendan oconnor ai-2013-09-13-Response on our movie personas paper

Introduction: Update (2013-09-17): See David Bamman's great guest post on Language Log on our latent personas paper, and the big picture of interdisciplinary collaboration. I’ve been informed that an interesting critique of my, David Bamman’s and Noah Smith’s ACL paper on movie personas has appeared on the Language Log, a guest post by Hannah Alpert-Abrams and Dan Garrette. I posted the following as a comment on LL. Thanks everyone for the interesting comments. Scholarship is an ongoing conversation, and we hope our work might contribute to it. Responding to the concerns about our paper: we did not try to make a contribution to contemporary literary theory. Rather, we focus on developing a computational linguistic research method of analyzing characters in stories. We hope there is a place for both the development of new research methods, as well as actual new substantive findings. If you think about the tremendous possibilities for computer science and humanities collabor

3 brendan oconnor ai-2013-08-31-Probabilistic interpretation of the B3 coreference resolution metric

Introduction: Here is an intuitive justification for the B3 evaluation metric often used in coreference resolution, based on whether mention pairs are coreferent. If a mention from the document is chosen at random, B3-Recall is the (expected) proportion of its actual coreferents that the system thinks are coreferent with it. B3-Precision is the (expected) proportion of its system-hypothesized coreferents that are actually coreferent with it. Does this look correct to people? Details below: In B3's basic form, it’s a clustering evaluation metric, to evaluate a gold-standard clustering of mentions against a system-produced clustering of mentions. Let \(G\) mean a gold-standard entity and \(S\) mean a system-predicted entity, where an entity is a set of mentions. \(i\) refers to a mention; there are \(n\) mentions in the document. \(G_i\) means the gold entity that contains mention \(i\); and \(S_i\) means the system entity that has \(i\). The B3 precision and recall for a document
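
A small sketch (mine, not from the post) that computes mention-level B3 precision and recall from gold and system clusterings, each given as a vector of cluster ids, one per mention, following the definitions above:

b3 <- function(gold, sys) {
  n <- length(gold)
  prec <- rec <- numeric(n)
  for (i in 1:n) {
    G <- which(gold == gold[i])        # mentions in i's gold entity G_i
    S <- which(sys == sys[i])          # mentions in i's system entity S_i
    overlap <- length(intersect(G, S))
    prec[i] <- overlap / length(S)     # fraction of i's system coreferents that are correct
    rec[i]  <- overlap / length(G)     # fraction of i's gold coreferents that are recovered
  }
  c(precision = mean(prec), recall = mean(rec))
}
b3(gold = c(1, 1, 1, 2, 2), sys = c(1, 1, 2, 2, 2))   # toy document with 5 mentions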

4 brendan oconnor ai-2013-08-20-Some analysis of tweet shares and “predicting” election outcomes

Introduction: Everyone recently seems to be talking about this newish paper by Digrazia, McKelvey, Bollen, and Rojas (pdf here) that examines the correlation of Congressional candidate name mentions on Twitter against whether the candidate won the race. One of the coauthors also wrote a Washington Post Op-Ed about it. I read the paper and I think it’s reasonable, but their op-ed overstates their results. It claims: “In the 2010 data, our Twitter data predicted the winner in 404 out of 435 competitive races” But this analysis is nowhere in their paper. Fabio Rojas has now posted errata/rebuttals about the op-ed and described this analysis they did here. There are several major issues off the bat: They didn’t ever predict 404/435 races; they only analyzed 406 races they call “competitive,” getting 92.5% (in-sample) accuracy, then extrapolated to all races to get the 435 number. They’re reporting about in-sample predictions, which is really misleading to a non-scientific audi

5 brendan oconnor ai-2013-06-17-Confusion matrix diagrams

Introduction: I wrote a little note and diagrams on confusion matrix metrics: Precision, Recall, F, Sensitivity, Specificity, ROC, AUC, PR Curves, etc. brenocon.com/confusion_matrix_diagrams.pdf (also, graffle source).
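
As a companion to the note, a sketch (mine, not the PDF's contents) of the point metrics computed from the four cells of a binary confusion matrix:

confusion_metrics <- function(tp, fp, fn, tn) {
  precision   <- tp / (tp + fp)       # of predicted positives, how many are truly positive
  recall      <- tp / (tp + fn)       # a.k.a. sensitivity, true positive rate
  specificity <- tn / (tn + fp)       # true negative rate
  f1          <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall,
    sensitivity = recall, specificity = specificity, F1 = f1)
}
confusion_metrics(tp = 40, fp = 10, fn = 5, tn = 45)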

6 brendan oconnor ai-2013-05-08-Movie summary corpus and learning character personas

Introduction: Here is one of our exciting just-finished ACL papers. David and I designed an algorithm that learns different types of character personas — “Protagonist”, “Love Interest”, etc — that are used in movies. To do this we collected a brand new dataset: 42,306 plot summaries of movies from Wikipedia, along with metadata like box office revenue and genre. We ran these through parsing and coreference analysis to also create a dataset of movie characters, linked with Freebase records of the actors who portray them. Did you see that NYT article on quantitative analysis of film scripts? This dataset could answer all sorts of things they assert in that article — for example, do movies with bowling scenes really make less money? We have released the data here. Our focus, though, is on narrative analysis. We investigate character personas: familiar character types that are repeated over and over in stories, like “Hero” or “Villain”; maybe grand mythical archetypes like “Trick

7 brendan oconnor ai-2013-04-21-What inputs do Monte Carlo algorithms need?

Introduction: Monte Carlo sampling algorithms (MCMC or otherwise) aim to draw samples from a distribution. They can be organized by what inputs or prior knowledge about the distribution they require. This ranges from a low amount of knowledge, as in slice sampling (just give it an unnormalized density function), to a high amount, as in Gibbs sampling (you have to decompose your distribution into individual conditionals). Typical inputs include \(f(x)\), an unnormalized density or probability function for the target distribution, which returns a real number for a variable value. \(g()\) and \(g(x)\) represent sample generation procedures (that output a variable value); some generators require an input, some do not. Here are the required inputs for a few algorithms. (For an overview, see e.g. Ch 29 of MacKay.) There are many more out there of course. I’m leaving off tuning parameters. Black-box samplers: Slice sampling, Affine-invariant ensemble - unnorm density \(f(x)\
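
To make the "low amount of knowledge" end concrete: slice sampling needs nothing but an unnormalized density \(f(x)\). Below is a minimal univariate slice sampler with stepping-out and shrinkage, a sketch roughly following Neal (2003) rather than code from the post:

slice_sample <- function(f, x0, n_iter = 1000, w = 1) {
  out <- numeric(n_iter)
  x <- x0
  for (t in 1:n_iter) {
    y <- runif(1, 0, f(x))                  # auxiliary slice height under f(x)
    L <- x - runif(1) * w; R <- L + w       # stepping out: grow [L, R] to cover the slice
    while (f(L) > y) L <- L - w
    while (f(R) > y) R <- R + w
    repeat {                                # shrinkage: sample until a point lands in the slice
      x1 <- runif(1, L, R)
      if (f(x1) > y) break
      if (x1 < x) L <- x1 else R <- x1
    }
    x <- x1
    out[t] <- x
  }
  out
}
draws <- slice_sample(function(x) exp(-x^2 / 2), x0 = 0, n_iter = 5000)   # unnormalized N(0,1)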

8 brendan oconnor ai-2013-04-16-Rise and fall of Dirichlet process clusters

Introduction: Here’s Gibbs sampling for a Dirichlet process 1-d mixture of Gaussians. On 1000 data points that look like this. I gave it a fixed variance and concentration parameter, and over MCMC iterations it looks like this. The top is the number of points in a cluster. The bottom are the cluster means. During MCMC, clusters are created and destroyed. Every cluster has a unique color; when a cluster dies, its color is never reused. I’m showing clusters every 100 iterations. If there is a single point, that cluster was alive at that iteration but not before or after. If there is a line, the cluster lived for at least 100 iterations. Some clusters live long, some live short, but all eventually die. Usually the model likes to think there are about two clusters, occupying positions at the two modes in the data distribution. It also entertains the existence of several much more minor ones. Usually these are short-lived clusters that die away. But
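
For readers who want to poke at the birth-and-death behavior themselves, here is a compact sketch (mine, not the post's code) of collapsed Gibbs sampling for a DP mixture of 1-d Gaussians, assuming a fixed known component variance sigma2, a Normal(0, tau2) prior on cluster means, and concentration alpha (roughly Neal 2000, Algorithm 3):

dpmm_gibbs <- function(x, alpha = 1, sigma2 = 1, tau2 = 10, n_iter = 200) {
  n <- length(x)
  z <- rep(1, n)                        # cluster assignments; start with one big cluster
  sizes <- n; sums <- sum(x)            # per-cluster counts and sums of x
  for (t in 1:n_iter) {
    for (i in 1:n) {
      k <- z[i]
      sizes[k] <- sizes[k] - 1; sums[k] <- sums[k] - x[i]     # hold out point i
      if (sizes[k] == 0) {                                    # a cluster dies
        sizes <- sizes[-k]; sums <- sums[-k]; z[z > k] <- z[z > k] - 1
      }
      post_prec <- 1 / tau2 + sizes / sigma2                  # posterior precision of each cluster mean
      post_mean <- (sums / sigma2) / post_prec
      p_exist <- sizes * dnorm(x[i], post_mean, sqrt(sigma2 + 1 / post_prec))
      p_new   <- alpha * dnorm(x[i], 0, sqrt(sigma2 + tau2))  # a brand-new cluster is born
      k_new <- sample(length(p_exist) + 1, 1, prob = c(p_exist, p_new))
      if (k_new > length(sizes)) { sizes <- c(sizes, 0); sums <- c(sums, 0) }
      z[i] <- k_new
      sizes[k_new] <- sizes[k_new] + 1; sums[k_new] <- sums[k_new] + x[i]
    }
  }
  z
}
x <- c(rnorm(500, -3), rnorm(500, 3))          # two modes, like the data described above
table(dpmm_gibbs(x, n_iter = 50))              # usually ~2 big clusters plus small stragglers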

9 brendan oconnor ai-2013-03-18-Correlation picture

Introduction: Paul Moore posted a comment pointing out this great discussion of the correlation coefficient: Joseph Lee Rodgers and W. Alan Nicewander. “Thirteen Ways to Look at the Correlation Coefficient.” The American Statistician, Vol. 42, No. 1 (Feb. 1988), pp. 59-66. It’s related to the post on cosine similarity, correlation and OLS. Anyway, I was just struck by the following diagram. It almost has a pop-art feel.

10 brendan oconnor ai-2013-03-14-R scan() for quick-and-dirty checks

Introduction: One of my favorite R tricks is scan(). I was using it recently to check a sampler I wrote, which was supposed to output numbers uniformly between 1 and 100 into a logfile; this loads the logfile, counts the different outcomes, and plots. plot(table(scan("log"))) As the logfile was growing, I kept replotting it and found it oddly compelling. This was useful: in fact, an early version had an off-by-one bug, immediately obvious from the plot. And of course, chisq.test(table(scan("log"))) does a null-hypothesis test to check uniformity.
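
A self-contained way to try the trick (a sketch; the filename "log" and the uniform sampler here are just stand-ins for whatever your program writes):

writeLines(as.character(sample(1:100, 10000, replace = TRUE)), "log")   # fake the sampler's logfile
plot(table(scan("log")))         # counts per outcome; should look roughly flat
chisq.test(table(scan("log")))   # chi-squared goodness-of-fit test against uniformity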

11 brendan oconnor ai-2013-02-23-Wasserman on Stats vs ML, and previous comparisons

Introduction: Larry Wasserman has a new position paper (forthcoming 2013) with a great comparison of the Statistics and Machine Learning research cultures, “Rise of the Machines”. He has a very conciliatory view in terms of intellectual content, and a very pro-ML take on the research cultures. Central to his argument is that ML has recently adopted rigorous statistical concepts, and that the fast-moving conference culture (and heavy publishing by its grad students) has helped with this and other good innovations. (I agree with a comment from Sinead that he’s going a little easy on ML, but it’s certainly worth a read.) There’s now a little history of “Statistics vs Machine Learning” position papers that this can be compared to. A classic is Leo Breiman (2001), “Statistical Modeling: The Two Cultures”, which isn’t exactly about stats vs. ML, but is about the focus on modeling vs algorithms, and maybe about description vs. prediction. It’s been a while since I’ve looked at it, but I’ve also enjoye

12 brendan oconnor ai-2013-01-07-Perplexity as branching factor; as Shannon diversity index

Introduction: A language model’s perplexity is exponentiated negative average log-likelihood, $$\exp\left( -\frac{1}{N} \log p(x) \right)$$ where the inner term usually decomposes into a sum over individual items; for example, as \(\sum_i \log p(x_i | x_1..x_{i-1})\) or \(\sum_i \log p(x_i)\) depending on independence assumptions, where for language modeling word tokens are usually taken as the individual units. (In which case perplexity is the geometric mean of the per-token inverse probabilities \(1/p(x_i)\).) It’s equivalent to exponentiated cross-entropy between the model and the empirical data distribution, since \(-\frac{1}{N} \sum_i^N \log p(x_i) = -\sum_k^K \hat{p}_k \log p_k = H(\hat{p};p)\) where \(N\) is the number of items and \(K\) is the number of discrete classes (e.g. word types for language modeling) and \(\hat{p}_k\) is the proportion of data having class \(k\). A nice interpretation of any exponentiated entropy measure is as branching factor: entropy measures uncertainty in bits or nats, but in exponentiated f
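
A tiny sketch (mine) of the computation and the branching-factor reading; a model that assigns uniform probability 1/K to every token has perplexity exactly K:

perplexity <- function(p_tokens) exp(-mean(log(p_tokens)))   # exp of mean negative log-likelihood
K <- 1000
perplexity(rep(1 / K, 50))    # = 1000: uniform over K word types behaves like K-way branching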