brendan_oconnor_ai 2009 knowledge graph (maker-knowledge-mining)

Blog posts:

1 brendan oconnor ai-2009-12-31-List of probabilistic model mini-language toolkits

Introduction: There are an increasing number of systems that attempt to allow the user to specify a probabilistic model in a high-level language — for example, declare a (Bayesian) generative model as a hierarchy of various distributions — then automatically run training and inference algorithms on a data set. Now, you could always learn a good math library, and implement every model from scratch, but the motivation for this approach is that you’ll avoid doing lots of repetitive and error-prone programming. I’m not yet convinced that any of them completely achieve this goal, but it would be great if they succeeded and we could use high-level frameworks for everything. Everyone seems to know about only a few of them, so here’s a meager attempt to list a bunch that can be freely downloaded. There is one package that is far more mature and has been around much longer than the rest, so let’s start with: BUGS – Bayesian inference Using Gibbs Sampling. Specify a generative model, then it doe
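
As a rough illustration of what these toolkits automate, here is a hand-rolled Gibbs sampler for a toy model — normal data with unknown mean and precision, semi-conjugate priors. The priors and the synthetic data are invented for illustration; a BUGS-style system would generate this kind of sampling loop for you from a model declaration.

# Hand-rolled Gibbs sampler for: y_i ~ Normal(mu, 1/tau),
# mu ~ Normal(m0, s0_sq), tau ~ Gamma(a0, rate=b0).
# Priors and data below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=200)   # fake observed data
n = len(y)

m0, s0_sq = 0.0, 100.0   # prior on mu
a0, b0 = 1.0, 1.0        # prior on tau (shape, rate)

mu, tau = 0.0, 1.0
samples = []
for step in range(5000):
    # mu | tau, y : normal full conditional
    v_n = 1.0 / (1.0 / s0_sq + n * tau)
    m_n = v_n * (m0 / s0_sq + tau * y.sum())
    mu = rng.normal(m_n, np.sqrt(v_n))
    # tau | mu, y : gamma full conditional (numpy takes shape, scale = 1/rate)
    tau = rng.gamma(a0 + n / 2.0, 1.0 / (b0 + 0.5 * ((y - mu) ** 2).sum()))
    samples.append((mu, tau))

post = np.array(samples[1000:])   # drop burn-in
print("posterior mean of mu:", post[:, 0].mean())
print("posterior mean of sigma:", (1.0 / np.sqrt(post[:, 1])).mean())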

2 brendan oconnor ai-2009-09-26-Seeing how “art” and “pharmaceuticals” are linguistically similar in web text

Introduction: Earlier this week I asked the question, How are “art” and “pharmaceuticals” similar? People sent me lots of submissions! Some are great, some are a bit of a stretch. Overpriced by an order of magnitude. The letters of “art” are found embedded, in order, in “pharmaceuticals”. Search keywords that cost the most to advertise on? “Wyeth”: I think this means this, and this. “Romeo and Juliet” famously includes both “art” (wherefore art thou) and pharmaceuticals (poison!). Some art has been created out of pharmaceuticals. Some art has been created under the influence of pharmaceuticals. I was asking because I was playing around with a dataset of 100,000 noun phrases’ appearances on the web, from the Read the Web project at CMU. That is, for a noun like “art”, this data has a large list of phrases in which the word “art” is used, across some 200 million web pages. For two noun concepts, we can see what they have in common and what’s different by looking at
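
A rough sketch of the kind of comparison the post describes: represent each noun by counts of the contexts it appears in, then measure overlap with cosine similarity. The toy context counts below are invented; the real data came from the CMU Read the Web noun-phrase corpus.

from collections import Counter
import math

# invented context counts ("_" marks where the noun appears)
contexts = {
    "art": Counter({"the _ of": 50, "works of _": 30, "_ galleries": 20,
                    "price of _": 10}),
    "pharmaceuticals": Counter({"the _ of": 25, "price of _": 40,
                                "_ companies": 35, "cost of _": 15}),
}

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

# Shared contexts ("the _ of", "price of _") are what the two nouns have in
# common; contexts unique to each are what's different.
print(cosine(contexts["art"], contexts["pharmaceuticals"]))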

3 brendan oconnor ai-2009-09-20-Quiz: “art” and “pharmaceuticals”

Introduction: A lexical semantics question: How are “art” and “pharmaceuticals” similar? I have a data-driven answer, but am curious how easy it is to guess it, and in what sense it’s valid. I’ll post my answer and supporting evidence on Tuesday.

4 brendan oconnor ai-2009-09-10-Don’t MAWK AWK – the fastest and most elegant big data munging language!

Introduction: update 2012-10-25 : I’ve been informed there is a new maintainer for Mawk, who has probably fixed the bugs I’ve been seeing. From: Gert Hulselmans [The bugs you have found are] indeed true with mawk v1.3.3 which comes standard with Debian/Ubuntu. This version is almost not developed the last 10 years. I now already use mawk v1.3.4 maintained by another developer (Thomas E. Dickey) for more than a year on huge datafiles (sometimes several GB). The problems/wrong results I had with mawk v1.3.3 sometimes are gone. In his version, normally all open/known bugs are fixed. This version can be downloaded from: http://invisible-island.net/mawk/ update 2010-04-30 : I have since found large datasets where mawk is buggy and gives the wrong result. nawk seems safe. When one of these newfangled “Big Data” sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like. Once you’re dealing with hun
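
The post is about awk one-liners; as a rough Python stand-in for the same kind of munging task — summing a numeric column grouped by a key column in a tab-separated stream — something like the sketch below works. The column positions are hypothetical, and the awk/mawk equivalent is shorter and, per the post, often much faster on multi-gigabyte files.

import sys
from collections import defaultdict

# Sum field 3, grouped by field 1, from tab-separated lines on stdin
# (hypothetical column positions, just to show the shape of the task).
totals = defaultdict(float)
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    totals[fields[0]] += float(fields[2])

for key, total in sorted(totals.items()):
    print(f"{key}\t{total}")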

5 brendan oconnor ai-2009-09-08-Patches to Rainbow, the old text classifier that won’t go away

Introduction: I’ve been reading several somewhat recent finance papers (Antweiler and Frank 2005, Das and Chen 2007) that use Rainbow, the text classification software originally written by Andrew McCallum back in 1996. The last version is from 2002 and the homepage announces he isn’t really supporting it any more. However, as far as I can tell, it might still be the easiest-to-use text classifier package out there. You don’t have to program — just invoke commandline arguments — and it can accommodate reasonably sized datasets, does tokenization, stopword filtering, etc. for you, and has some useful feature selection and other options. Based on my limited usage, it seems well-implemented. If anyone knows of a better one I’d love to hear it. I once looked at, among other things, GATE and UIMA, and they seemed too hard to use if you wanted to download something that did simple text classification; or else, maybe they didn’t have documentation on how to use them in that manner. R

6 brendan oconnor ai-2009-09-08-Another R flashmob today

Introduction: Dan Goldstein sends word they’re doing another Stackoverflow R flashmob today . It’s a neat trick. The R tag there is becoming pretty useful.

7 brendan oconnor ai-2009-08-12-Beautiful Data book chapter

Introduction: Today I received my copy of Beautiful Data, a just-released anthology of articles about, well, working with data. Lukas and I contributed a chapter on analyzing social perceptions in web data. See it here. After a long process of drafting, proofreading, re-drafting, and bothering the publishers under rather sudden deadlines, I’ve resolved to never use graphics again in anything I write :) Here’s our final figure, a k-means clustering of face photos via perceived social attributes (social concepts/types? with exemplars?): I just started reading the rest of the book and it’s very fun. Peter Norvig’s chapter on language models is gripping. (It does word segmentation, ciphers, and more, in that lovely Python-centric tutorial style extending his previous spell correction article.) There are also chapters by many other great researchers and practitioners (some of whom you may have seen around this blog or its neighborhood) like Andrew Gelman, Hadley Wickham,
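
A minimal sketch of the clustering step mentioned above, assuming each face photo has already been turned into a vector of perceived-attribute scores; the random matrix here is stand-in data, not the chapter's actual dataset.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))   # 300 faces x 12 perceived attributes (fake data)

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
for c in range(6):
    members = np.where(km.labels_ == c)[0]
    # the face nearest the centroid serves as the cluster's "exemplar"
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    print(f"cluster {c}: {len(members)} faces, exemplar photo #{members[dists.argmin()]}")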

8 brendan oconnor ai-2009-08-08-Haghighi and Klein (2009): Simple Coreference Resolution with Rich Syntactic and Semantic Features

Introduction: I haven’t done a paper review on this blog for a while, so here we go. Coreference resolution is an interesting NLP problem.  ( Examples. )  It involves honest-to-goodness syntactic, semantic, and discourse phenomena, but still seems like a real cognitive task that humans have to solve when reading text [1].  I haven’t read the whole literature, but I’ve always been puzzled by the crop of papers on it I’ve seen in the last year or two.  There’s a big focus on fancy graph/probabilistic/constrained optimization algorithms, but often these papers gloss over the linguistic features — the core information they actually make their decisions with [2].  I never understood why the latter isn’t the most important issue.  Therefore, it was a joy to read Aria Haghighi and Dan Klein, EMNLP-2009.   “Simple Coreference Resolution with Rich Syntactic and Semantic Features.” They describe a simple, essentially non-statistical system that outperforms previous unsupervised systems, and compa

9 brendan oconnor ai-2009-08-04-Blogger to WordPress migration helper

Introduction: A while ago I moved my blog from Blogger (socialscienceplusplus.blogspot.com) to a custom WordPress installation here (anyall.org/blog). WordPress has a nice Blogger import feature, but I also wanted all the old URLs to redirect to their new equivalents. This is tricky because Blogger doesn’t give you much control over their system. I only found pretty hacky solutions online, so I wrote a new one that’s slightly better, and posted it here if anyone’s interested: gist.github.com/15594

10 brendan oconnor ai-2009-07-23-R questions on StackOverflow

Introduction: R is notoriously hard to learn, but there was just an effort [1] [2] to populate the programming question-and-answer website StackOverflow with content for the R language. Amusingly, one of the most useful intro questions is: How to search for “R” materials? Mike Driscoll (who organized an in-person conference event to get this bootstrapped) pointed out that in many ways StackOverflow is a nicer forum for help than a mailing list (i.e., the impressive but hard-to-approach R-help). It’s more organized, easier to browse, and repetition and wrong answers can get downvoted. (And more thoughts from John Cook.)

11 brendan oconnor ai-2009-07-22-FFT: Friedman + Fortran + Tricks

Introduction: …is a tongue-in-cheek phrase from Trevor Hastie’s very fun to read useR-2009 presentation, from the merry trio of Hastie, Friedman, and Tibshirani, who brought us, among other things, the excellent Elements of Statistical Learning textbook. It’s a joy to read sophisticated but well-presented work like this. This comes from a slide explaining the impressive speed results for their glmnet regression package. Substantively, I’m interested in their observation that coordinate descent works well for sparse data — if you’re optimizing one feature at a time, and that feature is used in only a small percentage of instances, there are some neat optimizations! But mostly, I had a fun time skimming the glmnet code. It was written in 2008, but, yes, the core algorithm is written entirely in Fortran, complete with punchcard-style, fixed-width formatting! (This seems gratuitous to me — I thought modern Fortran 90 had done away with such things?) I’ve felt clever enough making
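
To make the sparse-data observation concrete, here is a tiny coordinate-descent lasso solver (soft-thresholding updates) that only ever touches the nonzero entries of the feature it is currently updating. It is a sketch of the idea, not glmnet's actual Fortran implementation, and it assumes the columns of X are standardized so that (1/n) * sum(x_ij^2) = 1.

import numpy as np
from scipy import sparse

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    X = sparse.csc_matrix(X)
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()      # residual r = y - X @ beta
    for _ in range(n_iters):
        for j in range(p):
            col = X.getcol(j)           # only the nonzero rows of feature j
            rows, vals = col.indices, col.data
            # correlation of feature j with the partial residual
            rho = (vals @ resid[rows]) / n + beta[j]
            new_bj = soft_threshold(rho, lam)
            if new_bj != beta[j]:
                resid[rows] -= vals * (new_bj - beta[j])   # cheap when sparse
                beta[j] = new_bj
    return beta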

12 brendan oconnor ai-2009-07-15-Beta conjugate explorer

Introduction: Here’s a little interactive explorer for the beta probability distribution, a conjugate prior for the Bernoulli under Bayesian inference… Ack, too much jargon. Simply press the right arrow every time you see the sun rise, the up arrow when it doesn’t, and the opposite directions for amnesia. I’ve wanted this for a while: an interface that lets you directly control a learning process / play with parameters, and see the effect on posterior beliefs, because I have a poor intuition for all these probability distributions. However, it was never worth actually making this until I tried out Processing, an amazingly easy-to-use visualization development tool. This is my first Processing app and it was extremely easy to develop — easier than any other graphic/GUI framework I can think of. (Source.) If only Java applets didn’t horribly lock up a browser when you open the page…
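
A minimal sketch of the update the explorer animates: a Beta(a, b) belief about the probability of sunrise, nudged one Bernoulli observation at a time (conjugacy means the posterior is just another Beta with bumped counts). The starting prior and the observation sequence are made up for illustration.

a, b = 1.0, 1.0   # uniform Beta(1, 1) prior

def observe(sunrise):
    """Right/up arrow in the applet: add one observation."""
    global a, b
    if sunrise:
        a += 1.0    # posterior becomes Beta(a+1, b)
    else:
        b += 1.0    # posterior becomes Beta(a, b+1)

def forget(sunrise):
    """Opposite arrows ('amnesia'): remove one counted observation."""
    global a, b
    if sunrise:
        a = max(a - 1.0, 1e-9)
    else:
        b = max(b - 1.0, 1e-9)

for _ in range(30):
    observe(True)
observe(False)
print("posterior mean P(sunrise) =", a / (a + b))   # 31/33, about 0.94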

13 brendan oconnor ai-2009-06-26-Michael Jackson in Persepolis

Introduction: Michael Jackson just died while Iran is in turmoil. I am reminded of a passage in Marjane Satrapi’s wonderful graphic novel Persepolis, a memoir of growing up in revolutionary Iran in the ’80s. (Read the book to see how it ends.) I wonder how much coincidences of news event timing can influence perceptions. Clearly, large news stories can crowd out other ones. Are there any other effects of joint appearances? Celebrity deaths are fairly exogenous shocks — there might be a nice natural experiment somewhere here.

14 brendan oconnor ai-2009-06-14-Psychometrics quote

Introduction: It is rather surprising that systematic studies of human abilities were not undertaken until the second half of the last century… An accurate method was available for measuring the circumference of the earth 2,000 years before the first systematic measures of human ability were developed. –Jum Nunnally, Psychometric Theory (1967) (Social science textbooks from the ’60s and ’70s are rad.)

15 brendan oconnor ai-2009-06-04-June 4

Introduction: BBC News – June 4, 1989, Tiananmen Square Massacre Also worth reading: Nicholas Kristof’s riveting firsthand account .

16 brendan oconnor ai-2009-05-27-Where tweets get sent from

Introduction: Playing around with stream.twitter.com/spritzer, ggplot2, and maps/mapdata: I think I like the top better, without the map lines, like those night satellite photos: pointwise ghosts of high-end human economic development. This data is a fairly extreme sample of convenience: I’m only looking at tweets posted by certain types of iPhone clients, because they conveniently report exact GPS-derived latitude/longitude numbers. (search.twitter.com has geographic proximity operators — which are very cool! — but they seem to usually use zip codes or other user information that’s not available in the per-tweet API data.) So there are only 30,000 messages out of 1.2 million spritzer tweets over ~3 days (itself only a small single-digit percentage sample of Twitter).
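
The post used R (ggplot2 plus the maps/mapdata packages); as a comparable sketch, and assuming you have already parsed (longitude, latitude) pairs out of the streaming-API JSON, a matplotlib version of the no-map-lines "night satellite" look would be roughly:

import matplotlib.pyplot as plt

def plot_tweet_points(lons, lats, out_path="tweet_map.png"):
    # small white points on a black background, no axes or map outlines
    fig, ax = plt.subplots(figsize=(10, 5), facecolor="black")
    ax.set_facecolor("black")
    ax.scatter(lons, lats, s=0.5, c="white", alpha=0.5, linewidths=0)
    ax.set_axis_off()
    fig.savefig(out_path, dpi=200, facecolor="black", bbox_inches="tight")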

17 brendan oconnor ai-2009-05-24-Zipf’s law and world city populations

Introduction: Will Fitzgerald just wrote about an excellent article by Steven Strogatz on Zipf’s Law for the populations of cities. If you look at the biggest city, then the next biggest city, etc., there tends to be an exponential fall-off in size. I was wondering what this looks like, so here’s the classic Zipfian plot (log-size vs. log-rank) for city population data from populationdata.net: If you fit a power law — that is, a line on the above log-size vs. log-rank plot — you can use rank to predict the sizes of smaller cities very accurately, according to Will’s analysis. Larger cities are more problematic, lying off the line. I was curious whether the power law holds within countries as well. The above plot was only for the countries that had more than 10 cities in the dataset — just eight countries. So here are those same cities again, but plotted against ranks within their respective countries. The answer is — usually, yes, the power law looks like it holds within
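
A rough sketch of the rank-size plot and power-law fit described above: sort the city populations, plot log size against log rank, and fit a line; the slope of that line is the fitted exponent. The population numbers below are placeholders, not the populationdata.net figures.

import numpy as np
import matplotlib.pyplot as plt

# stand-in populations, largest to smallest
city_populations = [9_500_000, 4_800_000, 3_200_000, 2_400_000, 1_900_000,
                    1_600_000, 1_350_000, 1_200_000, 1_050_000, 950_000]

sizes = np.array(sorted(city_populations, reverse=True), dtype=float)
ranks = np.arange(1, len(sizes) + 1)

log_rank, log_size = np.log(ranks), np.log(sizes)
slope, intercept = np.polyfit(log_rank, log_size, 1)   # power-law exponent ~ slope

plt.scatter(log_rank, log_size, s=10)
plt.plot(log_rank, intercept + slope * log_rank, color="red")
plt.xlabel("log rank"); plt.ylabel("log population")
plt.show()
print("fitted power-law exponent:", slope)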

18 brendan oconnor ai-2009-05-18-Announcing TweetMotif for summarizing twitter topics

Introduction: Update (3/14/2010): There is now a TweetMotif paper. Last week, I, with my awesome friends David Ahn and Mike Krieger, finished hacking together an experimental prototype, TweetMotif, for exploratory search on Twitter. If you want to know what people are thinking about something, the normal search interface search.twitter.com gives really cool information, but it’s hard to wade through hundreds or thousands of results. We take tweets matching a query and group together similar messages, showing significant terms and phrases that co-occur with the user query. Try it out at tweetmotif.com. Here’s an example for a current hot topic, #WolframAlpha: It’s currently showing tweets that match both #WolframAlpha as well as two interesting bigrams: “queries failed” and “google killer”. TweetMotif doesn’t attempt to derive the meaning or sentiment toward the phrases — NLP is hard, and doing this much is hard enough! — but it’s easy for you to look at the tweet
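
This is not the actual TweetMotif algorithm — just a toy illustration of the idea: take the tweets matching a query, count the phrases (here, bigrams) that co-occur with it, and surface the most frequent ones as candidate themes. The tokenization and example tweets are deliberately naive.

from collections import Counter

def candidate_motifs(tweets, query, top_k=10):
    query = query.lower()
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        if query not in tokens:
            continue
        for bigram in zip(tokens, tokens[1:]):
            if query not in bigram:      # skip phrases containing the query itself
                counts[" ".join(bigram)] += 1
    return counts.most_common(top_k)

tweets = ["#wolframalpha is a google killer maybe",
          "half my queries failed on #wolframalpha today",
          "queries failed again, #wolframalpha not a google killer yet"]
print(candidate_motifs(tweets, "#wolframalpha"))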

19 brendan oconnor ai-2009-04-22-Performance comparison: key-value stores for language model counts

Introduction: I’m doing word and bigram counts on a corpus of tweets. I want to store and rapidly retrieve them later for language model purposes. So there’s a big table of counts that get incremented many times. The easiest way to get something running is to use an open-source key/value store; but which? There’s recently been some development in this area, so I thought it would be good to revisit and evaluate some options. Here are timings for a single counting process: iterate over 45,000 short text messages, tokenize them, then increment counters for their unigrams and bigrams. (The speed of the data store is only one component of performance.) There are about 17 increments per tweet: 400k unique terms and 750k total count. This is substantially smaller than what I need, but it’s small enough to easily test. I used several very different architectures and packages, explained below.
architecture | name | speed
in-memory, within-process | python dictionary | 2700 tweets/sec
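
The "in-memory, within-process python dictionary" row of the table boils down to something like the sketch below; the whitespace tokenizer is a stand-in for whatever the post actually used.

from collections import defaultdict

counts = defaultdict(int)   # one table holding both unigram and bigram counts

def count_tweet(text):
    tokens = text.lower().split()
    for tok in tokens:
        counts[tok] += 1                 # unigram increment
    for bigram in zip(tokens, tokens[1:]):
        counts[bigram] += 1              # bigram increment

for tweet in ["just setting up my twttr", "my twttr is set up"]:
    count_tweet(tweet)
print(counts[("my", "twttr")])   # -> 2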

20 brendan oconnor ai-2009-04-17-1 billion web page dataset from CMU

Introduction: This is fun — Jamie Callan’s group at CMU LTI just finished a crawl of 1 billion web pages. It’s 5 terabytes compressed — big enough that they have to send it to you by mailing hard drives. Link: ClueWeb09. One of their motivations was to have a corpus large enough that research results on it would be taken seriously by search engine companies. To my mind, this raises the question of whether academics should try to innovate in web search, when it’s a research area incredibly dependent on really large, expensive-to-acquire datasets. And what’s the point? To slightly improve Google someday? Don’t they do that pretty well themselves? On the other hand, having a billion web pages around sounds like a lot of fun. Someone should get Amazon to add this to the AWS Public Datasets. Then, to process the data, instead of paying to get 5 TB of data shipped to you, you instead pay Amazon to rent virtual computers that can access the data. This costs less only to a certain point,

21 brendan oconnor ai-2009-04-15-Pirates killed by President

22 brendan oconnor ai-2009-04-01-Binary classification evaluation in R via ROCR

23 brendan oconnor ai-2009-02-23-Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata

24 brendan oconnor ai-2009-01-30-“Logic Bomb”

25 brendan oconnor ai-2009-01-23-SF conference for data mining mercenaries

26 brendan oconnor ai-2009-01-07-Love it and hate it, R has come of age