
brendan_oconnor_ai 2012 knowledge graph


blogs list:

1 brendan oconnor ai-2012-11-24-Graphs for SANCL-2012 web parsing results

Introduction: I was just looking at some papers from the SANCL-2012 workshop on web parsing from June this year, which are very interesting to those of us who wish we had good parsers for non-newspaper text. The shared task focus was on domain adaptation from a setting of lots of Wall Street Journal annotated data and very little in-domain training data. (Previous discussion here; see Ryan McDonald’s detailed comment.) Here are some graphs of the results (last page in the Petrov & McDonald overview). I was most interested in whether parsing accuracy on the WSJ correlates to accuracy on web text. Fortunately, it does. They evaluated all systems on four evaluation sets: (1) Text from a question/answer site, (2) newsgroups, (3) reviews, and (4) Wall Street Journal PTB. Here is a graph across system entries, with the x-axis being the labeled dependency parsing accuracy on WSJPTB, and the y-axis the average accuracy on the three web evaluation sets. Note the axis scales are different: web
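
A minimal sketch of the kind of plot described above, with made-up accuracy numbers standing in for the per-system scores (the real figures are in the Petrov & McDonald overview):

    # Hypothetical sketch, not the SANCL data: the numbers below are invented.
    import matplotlib.pyplot as plt

    # one entry per (imaginary) system: labeled accuracy on WSJ/PTB, and the
    # average labeled accuracy over the three web evaluation sets
    wsj_las = [88.0, 89.5, 90.2, 91.0, 91.8]
    web_las = [80.1, 81.0, 81.9, 82.4, 83.3]

    fig, ax = plt.subplots()
    ax.scatter(wsj_las, web_las)
    ax.set_xlabel("WSJ/PTB labeled accuracy")
    ax.set_ylabel("average web labeled accuracy")
    # the axis ranges differ: web accuracies sit well below WSJ accuracies
    plt.show()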

2 brendan oconnor ai-2012-10-02-Powerset’s natural language search system

Introduction: There’s a lot to say about Powerset, the short-lived natural language search company (2005-2008) where I worked after college. AI overhype, flying too close to the sun, the psychology of tech journalism and venture capitalism, etc. A year or two ago I wrote the following bit about Powerset’s technology in response to a question on Quora. I’m posting a revised version here. Question: What was Powerset’s core innovation in search? As far as I can tell, they licensed an NLP engine. They did not have a question answering system or any system for information extraction. How was Powerset’s search engine different than Google’s? My answer: Powerset built a system vaguely like a question-answering system on top of Xerox PARC’s NLP engine. The output is better described as query-focused summarization rather than question answering; primarily, it matched semantic fragments of the user query against indexed semantic relations, with lots of keyword/ngram-matching fallback for when

3 brendan oconnor ai-2012-09-21-CMU ARK Twitter Part-of-Speech Tagger – v0.3 released

Introduction: We’re pleased to announce a new release of the CMU ARK Twitter Part-of-Speech Tagger, version 0.3. The new version is much faster (40x) and more accurate (89.2 -> 92.8) than before. We also have released new POS-annotated data, including a dataset of one tweet for each of 547 days. We have made available large-scale word clusters from unlabeled Twitter data (217k words, 56m tweets, 847m tokens). Tools, data, and a new technical report describing the release are available at: www.ark.cs.cmu.edu/TweetNLP. 0100100 a 1111100101110 111100000011, Brendan

4 brendan oconnor ai-2012-08-21-Berkeley SDA and the General Social Survey

Introduction: It is worth contemplating how grand the General Social Survey is. When playing around with the Statwing YC demo (which is very cool!) I was reminded of the very old-school SDA web tool for exploratory cross-tabulation analyses. They have the GSS loaded and running here. The GSS is so large you can analyze really weird combinations of variables. For example, here is one I just did: How much good versus evil is there in the world (on a 7 point scale, of course!), versus age.

5 brendan oconnor ai-2012-07-17-p-values, CDF’s, NLP etc.

Introduction: Update Aug 10: THIS IS NOT A SUMMARY OF THE WHOLE PAPER! It’s whining about one particular method of analysis before talking about other things further down. A quick note on Berg-Kirkpatrick et al. EMNLP-2012, “An Empirical Investigation of Statistical Significance in NLP”. They make lots of graphs of p-values against observed magnitudes and talk about “curves”, e.g., “We see the same curve-shaped trend we saw for summarization and dependency parsing. Different group comparisons, same group comparisons, and system combination comparisons form distinct curves.” For example, Figure 2. I fear they made 10 graphs to rediscover a basic statistical fact: a p-value comes from a null hypothesis CDF. That’s what these “curve-shaped trends” are in all their graphs. They are CDFs. To back up, the statistical significance testing question is whether, in their notation, the observed dataset performance difference \(\delta(x)\) is “real” or not: if you were to resample the data,
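
A minimal sketch of the CDF point, using a simulated normal null as a stand-in (nothing here comes from the paper): plotting the p-value against the observed difference just traces out one minus the null distribution’s CDF.

    # Sketch only: the null distribution here is a simulated stand-in, not the
    # paper's bootstrap. The point is that p-value vs. observed difference
    # traces 1 - CDF of whatever the null distribution is.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    null_deltas = rng.normal(loc=0.0, scale=1.0, size=100_000)  # simulated null

    observed = np.linspace(0.0, 3.0, 50)  # hypothetical observed differences
    # one-sided p-value: fraction of null draws at least as large as observed
    p_values = [(null_deltas >= d).mean() for d in observed]

    plt.plot(observed, p_values, marker=".")
    plt.xlabel("observed difference delta(x)")
    plt.ylabel("p-value (equals 1 - null CDF at delta)")
    plt.show()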

6 brendan oconnor ai-2012-07-04-The $60,000 cat: deep belief networks make less sense for language than vision

Introduction: There was an interesting ICML paper this year about very large-scale training of deep belief networks (a.k.a. neural networks) for unsupervised concept extraction from images. They (Quoc V. Le and colleagues at Google/Stanford) have a cute example of learning very high-level features that are evoked by images of cats (from YouTube still-image training data); one is shown below. For those of us who work on machine learning and text, the question always comes up, why not DBNs for language? Many shallow latent-space text models have been quite successful (LSI, LDA, HMM, LPCFG…); there is hope that some sort of “deeper” concepts could be learned. I think this is one of the most interesting areas for unsupervised language modeling right now. But note it’s a bad idea to directly analogize results from image analysis to language analysis. The problems have radically different levels of conceptual abstraction baked-in. Consider the problem of detecting the concept of a cat; i.e.

7 brendan oconnor ai-2012-04-11-F-scores, Dice, and Jaccard set similarity

Introduction: The Dice similarity is the same as F1-score; and they are monotonic in Jaccard similarity. I worked this out recently but couldn’t find anything about it online so here’s a writeup. Let \(A\) be the set of found items, and \(B\) the set of wanted items. \(Prec=|AB|/|A|\), \(Rec=|AB|/|B|\). Their harmonic mean, the \(F1\)-measure, is the same as the Dice coefficient: \begin{align*} F1(A,B) &= \frac{2}{1/P + 1/R} = \frac{2}{|A|/|AB| + |B|/|AB|} \\ Dice(A,B) &= \frac{2|AB|}{ |A| + |B| } \\ &= \frac{2 |AB|}{ (|AB| + |A \setminus B|) + (|AB| + |B \setminus A|)} \\ &= \frac{|AB|}{|AB| + \frac{1}{2}|A \setminus B| + \frac{1}{2} |B \setminus A|} \end{align*} It’s nice to characterize the set comparison in terms of the three mutually exclusive partitions \(AB\), \(A \setminus B\), and \(B \setminus A\). This illustrates Dice’s close relationship to the Jaccard metric, \begin{align*} Jacc(A,B) &= \frac{|AB|}{|A \cup B|} \\ &= \frac{|AB|}{|AB| + |A \setminus B| + |B \setminus A|} \end{align*}
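
A small numerical check of the identities above on two toy sets (not from the post): F1 equals Dice, and the monotone relationship to Jaccard follows from the two formulas as \(Dice = 2J/(1+J)\).

    # Toy sets, purely for checking the algebra above.
    A = {"a", "b", "c", "d"}        # found items
    B = {"b", "c", "d", "e", "f"}   # wanted items

    inter = len(A & B)
    prec = inter / len(A)
    rec = inter / len(B)

    f1 = 2 / (1 / prec + 1 / rec)           # harmonic mean of precision and recall
    dice = 2 * inter / (len(A) + len(B))    # Dice coefficient
    jacc = inter / len(A | B)               # Jaccard similarity

    assert abs(f1 - dice) < 1e-12                     # F1 == Dice
    assert abs(dice - 2 * jacc / (1 + jacc)) < 1e-12  # Dice = 2J / (1 + J)
    print(f1, dice, jacc)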

8 brendan oconnor ai-2012-03-13-Cosine similarity, Pearson correlation, and OLS coefficients

Introduction: Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product, tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that). Details: You have two vectors \(x\) and \(y\) and want to measure similarity between them. A basic similarity function is the inner product \[ Inner(x,y) = \sum_i x_i y_i = \langle x, y \rangle \] If x tends to be high where y is also high, and low where y is low, the inner product will be high; the vectors are more similar. The inner product is unbounded. One way to make it bounded between -1 and 1 is to divide by the vectors’ L2 norms, giving the cosine similarity \[ CosSim(x,y) = \frac{\sum_i x_i y_i}{ \sqrt{ \sum_i x_i^2} \sqrt{ \sum_i y_i^2 } } = \frac{ \langle x,y \rangle }{ ||x||\ ||y|| } \] This is actually bounded between 0 and 1 if x and y are non-negative. Cosine similarity has an interpretation as the cosine of the angle between the two vectors.
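
A short sketch of the "variants on the inner product" view on toy vectors (the data are made up): cosine divides the inner product by both L2 norms, Pearson correlation is the cosine of the mean-centered vectors, and the OLS slope (fit with an intercept) divides the centered inner product by \(||x_c||^2\) only.

    # Toy vectors; nothing here comes from the post.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 1.0, 4.0, 6.0])

    inner = x @ y                                           # plain inner product
    cos = inner / (np.linalg.norm(x) * np.linalg.norm(y))   # scale by both norms

    xc, yc = x - x.mean(), y - y.mean()                     # center both vectors
    pearson = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

    ols_slope = (xc @ yc) / (xc @ xc)                       # slope of y ~ x with intercept

    assert np.isclose(pearson, np.corrcoef(x, y)[0, 1])
    assert np.isclose(ols_slope, np.polyfit(x, y, 1)[0])
    print(inner, cos, pearson, ols_slope)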

9 brendan oconnor ai-2012-03-09-I don’t get this web parsing shared task

Introduction: The idea for a shared task on web parsing is really cool. But I don’t get this one: Shared Task – SANCL 2012 (First Workshop on Syntactic Analysis of Non-Canonical Language). They’re explicitly banning (1) manually annotating in-domain (web) sentences and (2) creating new word clusters, or anything, from as much text data as possible, instead restricting participants to the data sets they release. Isn’t a cycle of annotation, error analysis, and new annotations (a self-training + active-learning loop, with smarter decisions through error analysis) the hands-down best way to make an NLP tool for a new domain? Are people scared of this reality? Am I off-base? I am, of course, just advocating for our Twitter POS tagger approach, where we annotated some data, made a supervised tagger, and iterated on features. The biggest weakness in that paper is we didn’t have additional iterations of error analysis. Our lack of semi-supervised learning was not a weakness.

10 brendan oconnor ai-2012-02-14-Save Zipf’s Law (new anti-credulous-power-law article)

Introduction: To the delight of those of us enjoying the ride on the anti-power-law bandwagon (bandwagons are ok if it’s a backlash to another bandwagon), Cosma links to a new article in Science, “Critical Truths About Power Laws,” by Stumpf and Porter. Since it’s behind a paywall you might as well go read the Clauset/Shalizi/Newman paper on the topic, and since you won’t be bothered to read the paper, see the blogpost entitled “So You Think You Have a Power Law — Well Isn’t That Special?” Anyway, the Science article is nice (it amusingly refers to certain statistical tests as “epically fail[ing]”) and it’s on the side of truth and goodness so it should be supported, BUT, it has one horrendous figure. I just love that, in this of all articles that should be harping on deeply flawed uses of (log-log) plots, they use one of those MBA-style bozo plots with unlabeled axes, one of which is viciously, unapologetically subjective: If there is one power law I may single out for mercy in

11 brendan oconnor ai-2012-02-02-Histograms — matplotlib vs. R

Introduction: When possible, I like to use R for its really, really good statistical visualization capabilities. I’m doing a modeling project in Python right now (R is too slow, bad at large data, bad at structured data, etc.), and in comparison to base R, the matplotlib library is just painful. I wrote a toy Metropolis sampler for a triangle distribution and all I want to see is whether it looks like it’s working. For the same dataset, here are histograms with default settings. (Python: pylab.hist(d), R: hist(d)) I want to know whether my Metropolis sampler is working; those two plots give a very different idea. Of course, you could say this is an unfair comparison, since matplotlib is only using 10 bins, while R is using 18 here, and it’s always important to vary the bin size a few times when looking at histograms. But R’s defaults really are better: it actually uses an adaptive bin size, and the heuristic worked, choosing a reasonable number for the data. The hist() manu
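
A minimal sketch of the comparison, assuming a recent matplotlib/numpy where hist() accepts bins="auto" (an adaptive rule closer in spirit to R's default than the fixed 10 bins); the data here are a stand-in triangular sample, not the post's Metropolis output.

    # Stand-in data: a triangular sample instead of the post's Metropolis draws.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    d = rng.triangular(left=0.0, mode=0.3, right=1.0, size=2000)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(d)               # matplotlib default: 10 equal-width bins
    ax1.set_title("default (10 bins)")
    ax2.hist(d, bins="auto")  # adaptive rule via numpy, closer to R's behavior
    ax2.set_title("bins='auto'")
    plt.tight_layout()
    plt.show()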