brendan_oconnor_ai brendan_oconnor_ai-2013 brendan_oconnor_ai-2013-199 knowledge-graph by maker-knowledge-mining

199 brendan oconnor ai-2013-08-31-Probabilistic interpretation of the B3 coreference resolution metric


meta info for this blog

Source: html

Introduction: Here is an intuitive justification for the B3 evaluation metric often used in coreference resolution, based on whether mention pairs are coreferent. If a mention from the document is chosen at random, B3-Recall is the (expected) proportion of its actual coreferents that the system thinks are coreferent with it. B3-Precision is the (expected) proportion of its system-hypothesized coreferents that are actually coreferent with it. Does this look correct to people? Details below: In B3's basic form, it is a clustering evaluation metric: it evaluates a gold-standard clustering of mentions against a system-produced clustering of mentions. Let \(G\) mean a gold-standard entity and \(S\) mean a system-predicted entity, where an entity is a set of mentions. \(i\) refers to a mention; there are \(n\) mentions in the document. \(G_i\) means the gold entity that contains mention \(i\); and \(S_i\) means the system entity that contains \(i\). The B3 precision and recall for a document


Summary: the most important sentences, as ranked by the tfidf model

sentIndex sentText sentNum sentScore

1 Here is an intuitive justification for the B3 evaluation metric often used in coreference resolution, based on whether mention pairs are coreferent. [sent-1, score-0.922]

2 If a mention from the document is chosen at random, B3-Recall is the (expected) proportion of its actual coreferents that the system thinks are coreferent with it. [sent-2, score-1.223]

3 B3-Precision is the (expected) proportion of its system-hypothesized coreferents that are actually coreferent with it. [sent-3, score-0.591]

4 Details below: In B3's basic form, it is a clustering evaluation metric: it evaluates a gold-standard clustering of mentions against a system-produced clustering of mentions. [sent-5, score-0.699]

5 Let \(G\) mean a gold-standard entity and \(S\) mean a system-predicted entity, where an entity is a set of mentions. [sent-6, score-0.436]

6 \(G_i\) means the gold entity that contains mention \(i\); and \(S_i\) means the system entity that contains \(i\). [sent-8, score-1.289]

7 Think about it like this: \begin{align} B3Prec &= E_{ment}\left[ \frac{ |G_i \cap S_i| }{ |S_i| } \right] \\ &= E_{ment}\left[ P(G_j = G_i \mid j \in S_i) \right] \end{align} The first step is the expectation under the distribution of “pick a mention \(i\) at random from the document”. [sent-10, score-0.551]
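The mention-level expectation above can be computed directly. Here is a minimal sketch (the function name and toy clusterings are illustrative, not from the post), assuming the gold and system clusterings cover the same mentions:

```python
def b3_scores(gold_clusters, sys_clusters):
    """B3 precision and recall as mention-level averages.

    Each clustering is a list of sets of mention ids; both are assumed
    to cover the same mentions (no twinless mentions).
    """
    gold_of = {m: c for c in gold_clusters for m in c}  # G_i: gold entity containing mention i
    sys_of = {m: c for c in sys_clusters for m in c}    # S_i: system entity containing mention i
    mentions = list(gold_of)
    # Prec = E_i[ |G_i ∩ S_i| / |S_i| ],  Rec = E_i[ |G_i ∩ S_i| / |G_i| ]
    prec = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    rec = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    return prec, rec

gold = [{1, 2, 3}, {4, 5}]
sys_ = [{1, 2}, {3, 4, 5}]
p, r = b3_scores(gold, sys_)  # both 11/15 ≈ 0.733 for this toy example
```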

8 The second step is from restating \(|G_i \cap S_i|\) as: out of the system-hypothesized coreferents of \(i\), how many are in the same gold cluster as \(i\)? [sent-11, score-0.699]

9 Thus \(|G_i \cap S_i|/|S_i|\) is: if you choose a mention \(j\) randomly out of \(S_i\), how often does it have the same gold cluster as \(i\)? [sent-12, score-0.888]
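The sampling restatement can be checked numerically: draw a mention \(i\) at random, then draw \(j\) uniformly from \(S_i\), and count how often \(j\) lands in \(G_i\). A minimal Monte Carlo sketch (function name and toy clusterings are mine):

```python
import random

def b3_prec_mc(gold_clusters, sys_clusters, trials=200_000, seed=0):
    """Monte Carlo estimate of B3 precision:
    E_i[ P(G_j = G_i | j drawn uniformly from S_i) ]."""
    gold_of = {m: c for c in gold_clusters for m in c}  # G_i
    sys_of = {m: c for c in sys_clusters for m in c}    # S_i
    mentions = list(gold_of)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        i = rng.choice(mentions)           # pick a mention at random
        j = rng.choice(sorted(sys_of[i]))  # pick j uniformly from S_i
        hits += j in gold_of[i]            # is j in the same gold cluster as i?
    return hits / trials

b3_prec_mc([{1, 2, 3}, {4, 5}], [{1, 2}, {3, 4, 5}])  # ≈ 11/15 ≈ 0.733
```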

10 This is why I like B3: I can explain it in terms of mention pairs. [sent-17, score-0.373]

11 I think this also gives an additional justification for Cai and Strube (2010)’s proposal to handle divergent gold versus system mentions. [sent-18, score-0.707]

12 So say the system produces a spurious mention \(i\) that isn’t part of the gold standard’s mentions (a “twinless” mention). [sent-19, score-1.04]

13 If you assume that mentions not in the gold standard should be considered to have no coreferents, then all of \(i\)’s system-hypothesized coreferents are false positives. [sent-20, score-0.775]

14 Therefore, to think about precision under this assumption, the system’s non-gold mentions should be added to the gold standard as singleton entities before computing precision. [sent-21, score-0.449]

15 And analogously for recall (add gold-only mentions as system-side singletons: the system has failed to find any coreference links to them). [sent-22, score-0.529]
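That alignment step can be sketched as a small helper; this is a simplification of the idea, not the full Cai and Strube algorithm, and the function name is mine:

```python
def pad_with_singletons(clusters, other_clusters):
    """Add every mention that appears only in other_clusters to
    clusters as a singleton entity; return the padded clustering."""
    covered = {m for c in clusters for m in c}
    missing = {m for c in other_clusters for m in c} - covered
    return list(clusters) + [{m} for m in missing]

# For precision: pad the gold side with the system's twinless mentions.
gold_for_prec = pad_with_singletons([{1, 2}], [{1, 2, 3}])      # [{1, 2}, {3}]
# For recall: pad the system side with the gold-only mentions.
sys_for_rec = pad_with_singletons([{1, 2, 3}], [{1, 2}, {4}])   # [{1, 2, 3}, {4}]
```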

16 I also like the pairwise linking metric since it’s defined only in terms of mentions; to be analogous to the presentation of B3 here, Pairwise-Prec: choose a pair of mentions the system thinks are coreferent; how often are they actually coreferent? [sent-26, score-1.067]

17 Pairwise-Rec: choose a pair of coreferent mentions; how often does the system think they are coreferent? [sent-28, score-0.396]

18 Or algorithmically: take all entities to be fully connected mention graphs and compute link recovery precision/recall. [sent-30, score-0.399]
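That link-recovery view is easy to write down; a minimal sketch (function names and toy clusterings are mine):

```python
from itertools import combinations

def links(clusters):
    """Treat each entity as a fully connected graph over its mentions
    and return the set of undirected coreference links."""
    return {frozenset(pair) for c in clusters for pair in combinations(c, 2)}

def pairwise_pr(gold_clusters, sys_clusters):
    """Pairwise linking precision and recall via link recovery."""
    gold, sys_ = links(gold_clusters), links(sys_clusters)
    tp = len(gold & sys_)
    prec = tp / len(sys_) if sys_ else 1.0
    rec = tp / len(gold) if gold else 1.0
    return prec, rec

pairwise_pr([{1, 2, 3}, {4, 5}], [{1, 2}, {3, 4, 5}])  # (0.5, 0.5)
```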

19 It is apparent, though, that the Cai and Strube method can be adapted to pairwise metrics, maybe including BLANC, under the same justification given here for why it should apply to B3. [sent-33, score-0.372]

20 (As far as I know, B3 hasn’t been proposed before as a pure clustering metric … you could actually think of it in comparison to the Rand index, VI, etc.) [sent-34, score-0.43]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('mention', 0.319), ('gold', 0.279), ('cap', 0.251), ('coreferents', 0.251), ('mentions', 0.245), ('coreferent', 0.218), ('entity', 0.218), ('justification', 0.175), ('frac', 0.153), ('system', 0.153), ('ment', 0.151), ('vi', 0.151), ('metric', 0.149), ('pairwise', 0.149), ('document', 0.149), ('align', 0.14), ('clustering', 0.133), ('choose', 0.111), ('cluster', 0.105), ('cai', 0.1), ('mid', 0.1), ('strube', 0.1), ('think', 0.1), ('random', 0.088), ('entities', 0.08), ('defined', 0.08), ('intuitive', 0.08), ('expectation', 0.08), ('entropies', 0.08), ('proportion', 0.074), ('often', 0.074), ('begin', 0.07), ('precision', 0.07), ('coreference', 0.07), ('pair', 0.067), ('step', 0.064), ('expected', 0.064), ('recall', 0.061), ('thinks', 0.059), ('though', 0.059), ('evaluation', 0.055), ('correct', 0.054), ('left', 0.054), ('terms', 0.054), ('means', 0.051), ('end', 0.049), ('given', 0.048), ('actually', 0.048), ('part', 0.044), ('consideration', 0.044)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999964 199 brendan oconnor ai-2013-08-31-Probabilistic interpretation of the B3 coreference resolution metric


2 0.16725591 150 brendan oconnor ai-2009-08-08-Haghighi and Klein (2009): Simple Coreference Resolution with Rich Syntactic and Semantic Features

Introduction: I haven’t done a paper review on this blog for a while, so here we go. Coreference resolution is an interesting NLP problem.  ( Examples. )  It involves honest-to-goodness syntactic, semantic, and discourse phenomena, but still seems like a real cognitive task that humans have to solve when reading text [1].  I haven’t read the whole literature, but I’ve always been puzzled by the crop of papers on it I’ve seen in the last year or two.  There’s a big focus on fancy graph/probabilistic/constrained optimization algorithms, but often these papers gloss over the linguistic features — the core information they actually make their decisions with [2].  I never understood why the latter isn’t the most important issue.  Therefore, it was a joy to read Aria Haghighi and Dan Klein, EMNLP-2009.   “Simple Coreference Resolution with Rich Syntactic and Semantic Features.” They describe a simple, essentially non-statistical system that outperforms previous unsupervised systems, and compa

3 0.095950752 183 brendan oconnor ai-2012-04-11-F-scores, Dice, and Jaccard set similarity

Introduction: The Dice similarity is the same as F1-score ; and they are monotonic in Jaccard similarity . I worked this out recently but couldn’t find anything about it online so here’s a writeup. Let \(A\) be the set of found items, and \(B\) the set of wanted items. \(Prec=|AB|/|A|\), \(Rec=|AB|/|B|\). Their harmonic mean, the \(F1\)-measure, is the same as the Dice coefficient: \begin{align*} F1(A,B) &= \frac{2}{1/P+ 1/R} = \frac{2}{|A|/|AB| + |B|/|AB|} \\ Dice(A,B) &= \frac{2|AB|}{ |A| + |B| } \\ &= \frac{2 |AB|}{ (|AB| + |A \setminus B|) + (|AB| + |B \setminus A|)} \\ &= \frac{|AB|}{|AB| + \frac{1}{2}|A \setminus B| + \frac{1}{2} |B \setminus A|} \end{align*} It’s nice to characterize the set comparison into the three mutually exclusive partitions \(AB\), \(A \setminus B\), and \(B \setminus A\). This illustrates Dice’s close relationship to the Jaccard metric, \begin{align*} Jacc(A,B) &= \frac{|AB|}{|A \cup B|} \\ &= \frac{|AB|}{|AB| + |A \setminus B| + |B \setminus

4 0.092214592 108 brendan oconnor ai-2008-07-01-Bias correction sneak peek!

Introduction: (Update 10/2008: actually this model doesn’t work in all cases.  In the final paper we use an (even) simpler model.) I really don’t have time to write up an explanation for what this is so I’ll just post the graph instead. Each box is a scatterplot of an AMT worker’s responses versus a gold standard. Drawn are attempts to fit linear models to each worker. The idea is to correct for the biases of each worker. With a linear model y ~ ax+b, the correction is correction(y) = (y-b)/a. Arrows show such corrections. Hilariously bad “corrections” happen. *But*, there is also weighting: to get the “correct” answer (maximum likelihood) from several workers, you weight by a^2/stddev^2. Despite the sometimes odd corrections, the cross-validated results from this model correlate better with the gold than the raw averaging of workers. (Raw averaging is the maximum likelihood solution for a fixed noise model: a=1, b=0, and each worker’s variance is equal). Much better explanation is c

5 0.089759588 131 brendan oconnor ai-2008-12-27-Facebook sentiment mining predicts presidential polls

Introduction: I’m a bit late blogging this, but here’s a messy, exciting — and statistically validated! — new online data source. My friend Roddy at Facebook wrote a post describing their sentiment analysis system , which can evaluate positive or negative sentiment toward a particular topic by looking at a large number of wall messages. (I’d link to it, but I can’t find the URL anymore — here’s the Lexicon , but that version only gets term frequencies but no sentiment.) How they constructed their sentiment detector is interesting.  Starting with a list of positive and negative terms, they had a lexical acquisition step to gather many more candidate synonyms and misspellings — a necessity in this social media domain, where WordNet ain’t gonna come close!  After manually filtering these candidates, they assess the sentiment toward a mention of a topic by looking for instances of these positive and negative words nearby, along with “negation heuristics” and a few other features. He describ

6 0.088724308 194 brendan oconnor ai-2013-04-16-Rise and fall of Dirichlet process clusters

7 0.079577476 106 brendan oconnor ai-2008-06-17-Pairwise comparisons for relevance evaluation

8 0.07329125 178 brendan oconnor ai-2011-11-13-Bayes update view of pointwise mutual information

9 0.070624843 198 brendan oconnor ai-2013-08-20-Some analysis of tweet shares and “predicting” election outcomes

10 0.070122041 175 brendan oconnor ai-2011-09-25-Information theory stuff

11 0.064136922 182 brendan oconnor ai-2012-03-13-Cosine similarity, Pearson correlation, and OLS coefficients

12 0.059231408 176 brendan oconnor ai-2011-10-05-Be careful with dictionary-based text analysis

13 0.05667121 185 brendan oconnor ai-2012-07-17-p-values, CDF’s, NLP etc.

14 0.055986729 136 brendan oconnor ai-2009-04-01-Binary classification evaluation in R via ROCR

15 0.055917952 203 brendan oconnor ai-2014-02-19-What the ACL-2014 review scores mean

16 0.051384233 47 brendan oconnor ai-2007-01-02-The Jungle Economy

17 0.049815532 53 brendan oconnor ai-2007-03-15-Feminists, anarchists, computational complexity, bounded rationality, nethack, and other things to do

18 0.048892442 171 brendan oconnor ai-2011-06-14-How much text versus metadata is in a tweet?

19 0.048869088 74 brendan oconnor ai-2007-08-08-When’s the last time you dug through 19th century English mortuary records

20 0.048082843 129 brendan oconnor ai-2008-12-03-Statistics vs. Machine Learning, fight!


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, -0.181), (1, -0.089), (2, 0.09), (3, -0.049), (4, -0.024), (5, 0.002), (6, -0.007), (7, -0.086), (8, -0.132), (9, -0.024), (10, -0.048), (11, -0.036), (12, -0.009), (13, 0.176), (14, -0.046), (15, -0.057), (16, -0.101), (17, -0.091), (18, 0.093), (19, -0.088), (20, -0.077), (21, -0.069), (22, 0.083), (23, -0.087), (24, -0.003), (25, -0.011), (26, 0.164), (27, -0.129), (28, -0.017), (29, -0.006), (30, -0.065), (31, -0.087), (32, -0.03), (33, -0.116), (34, -0.073), (35, 0.094), (36, -0.094), (37, -0.017), (38, 0.053), (39, -0.066), (40, 0.037), (41, -0.184), (42, -0.003), (43, 0.063), (44, -0.094), (45, 0.094), (46, -0.044), (47, -0.087), (48, -0.041), (49, 0.006)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98752421 199 brendan oconnor ai-2013-08-31-Probabilistic interpretation of the B3 coreference resolution metric


2 0.63304383 183 brendan oconnor ai-2012-04-11-F-scores, Dice, and Jaccard set similarity


3 0.5784992 178 brendan oconnor ai-2011-11-13-Bayes update view of pointwise mutual information

Introduction: This is fun. Pointwise Mutual Information (e.g. Church and Hanks 1990 ) between two variable outcomes \(x\) and \(y\) is \[ PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)} \] It’s called “pointwise” because Mutual Information , between two (discrete) variables X and Y, is the expectation of PMI over possible outcomes of X and Y: \( MI(X,Y) = \sum_{x,y} p(x,y) PMI(x,y) \). One interpretation of PMI is it’s measuring how much deviation from independence there is — since \(p(x,y)=p(x)p(y)\) if X and Y were independent, so the ratio is how non-independent they (the outcomes) are. You can get another interpretation of this quantity if you switch into conditional probabilities. Looking just at the ratio, apply the definition of conditional probability: \[ \frac{p(x,y)}{p(x)p(y)} = \frac{p(x|y)}{p(x)} \] Think about doing a Bayes update for your belief about \(x\). Start with the prior \(p(x)\), then learn \(y\) and you update to the posterior belief \(p(x|y)\). How much your belief

4 0.57614291 106 brendan oconnor ai-2008-06-17-Pairwise comparisons for relevance evaluation

Introduction: Not much on this blog lately, so I’ll repost a comment I just wrote on whether to use pairwise vs. absolute judgments for relevance quality evaluation. (A fun one I know!) From this post on the Dolores Labs blog . The paper being talked about is Here or There: Preference Judgments for Relevance by Carterette et al. I skimmed through the Carterette paper and it’s interesting. My concern with pairwise setup is, in order to get comparability among query-result pairs, you need to get annotators to do an O(N^2) amount of work. (Unless you do something horribly complicated with partial orders.) The absolute judgment task scales linearly, of course. Given the AMT environment and a fixed budget, if I stay in the smaller-volume task, instead of spending a lot on a quadratic taskload, I can simply get a higher number of workers per result and boil out more noise. Of course, if it’s true the pairwise judgment task is easier — as the paper claims — that might make my spending more effic

5 0.54480374 150 brendan oconnor ai-2009-08-08-Haghighi and Klein (2009): Simple Coreference Resolution with Rich Syntactic and Semantic Features


6 0.52119297 195 brendan oconnor ai-2013-04-21-What inputs do Monte Carlo algorithms need?

7 0.47221982 131 brendan oconnor ai-2008-12-27-Facebook sentiment mining predicts presidential polls

8 0.44967097 176 brendan oconnor ai-2011-10-05-Be careful with dictionary-based text analysis

9 0.44495431 194 brendan oconnor ai-2013-04-16-Rise and fall of Dirichlet process clusters

10 0.37930471 185 brendan oconnor ai-2012-07-17-p-values, CDF’s, NLP etc.

11 0.35311112 182 brendan oconnor ai-2012-03-13-Cosine similarity, Pearson correlation, and OLS coefficients

12 0.34388348 198 brendan oconnor ai-2013-08-20-Some analysis of tweet shares and “predicting” election outcomes

13 0.33570164 108 brendan oconnor ai-2008-07-01-Bias correction sneak peek!

14 0.32080635 136 brendan oconnor ai-2009-04-01-Binary classification evaluation in R via ROCR

15 0.28980845 111 brendan oconnor ai-2008-08-16-A better Obama vs McCain poll aggregation

16 0.28719488 74 brendan oconnor ai-2007-08-08-When’s the last time you dug through 19th century English mortuary records

17 0.28212354 174 brendan oconnor ai-2011-09-19-End-to-end NLP packages

18 0.27342841 175 brendan oconnor ai-2011-09-25-Information theory stuff

19 0.27045673 68 brendan oconnor ai-2007-07-08-Game outcome graphs — prisoner’s dilemma with FUN ARROWS!!!

20 0.26495031 203 brendan oconnor ai-2014-02-19-What the ACL-2014 review scores mean


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(44, 0.099), (48, 0.039), (57, 0.021), (61, 0.433), (62, 0.011), (70, 0.026), (74, 0.126), (80, 0.064), (83, 0.01), (86, 0.025), (89, 0.037)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.95257664 172 brendan oconnor ai-2011-06-26-Good linguistic semantics textbook?

Introduction: I’m looking for recommendations for a good textbook/handbook/reference on (non-formal) linguistic semantics.  My undergrad semantics course was almost entirely focused on logical/formal semantics, which is fine, but I don’t feel familiar with the breadth of substantive issues — for example, I’d be hard-pressed to explain why something like semantic/thematic role labeling should be useful for anything at all. I somewhat randomly stumbled upon Frawley 1992 ( review ) in a used bookstore and it seemed pretty good — in particular, it cleanly separates itself from the philosophical study of semantics, and thus identifies issues that seem amenable to computational modeling. I’m wondering what else is out there?  Here’s a comparison of three textbooks .

same-blog 2 0.93931729 199 brendan oconnor ai-2013-08-31-Probabilistic interpretation of the B3 coreference resolution metric


3 0.90596694 68 brendan oconnor ai-2007-07-08-Game outcome graphs — prisoner’s dilemma with FUN ARROWS!!!

Introduction: I think game theory could benefit immensely from better presentation. Its default presentation is pretty mathematical. This is good because it treats social interactions in an abstract way, highlighting their essential properties, but is bad because it’s hard to understand, especially at first. However, I think I have a visualization that can sometimes capture the same abstract properties of the mathematics. Here’s a stab at using it to explain everyone’s favorite game, the prisoner’s dilemma. THE PD: Two players each choose whether to play nice, or be mean — Cooperate or Defect. Then they simultaneously play their actions, and get payoffs depending on what both played. If both cooperated, they help each other and do well; if both defect, they do quite poorly. But if one tries to cooperate and the other defects, then the defector gets a big win, and the cooperator gets a crappy “sucker’s payoff”. The formal PD definition looks like this: where each of the four pairs

4 0.36362574 129 brendan oconnor ai-2008-12-03-Statistics vs. Machine Learning, fight!

Introduction: 10/1/09 update — well, it’s been nearly a year, and I should say not everything in this rant is totally true, and I certainly believe much less of it now. Current take: Statistics , not machine learning, is the real deal, but unfortunately suffers from bad marketing. On the other hand, to the extent that bad marketing includes misguided undergraduate curriculums, there’s plenty of room to improve for everyone. So it’s pretty clear by now that statistics and machine learning aren’t very different fields. I was recently pointed to a very amusing comparison by the excellent statistician — and machine learning expert — Robert Tibshiriani . Reproduced here: Glossary Machine learning Statistics network, graphs model weights parameters learning fitting generalization test set performance supervised learning regression/classification unsupervised learning density estimation, clustering large grant = $1,000,000

5 0.36175472 150 brendan oconnor ai-2009-08-08-Haghighi and Klein (2009): Simple Coreference Resolution with Rich Syntactic and Semantic Features


6 0.33730191 138 brendan oconnor ai-2009-04-17-1 billion web page dataset from CMU

7 0.32762194 53 brendan oconnor ai-2007-03-15-Feminists, anarchists, computational complexity, bounded rationality, nethack, and other things to do

8 0.32563987 86 brendan oconnor ai-2007-12-20-Data-driven charity

9 0.32350343 140 brendan oconnor ai-2009-05-18-Announcing TweetMotif for summarizing twitter topics

10 0.31923202 200 brendan oconnor ai-2013-09-13-Response on our movie personas paper

11 0.3190164 174 brendan oconnor ai-2011-09-19-End-to-end NLP packages

12 0.31420338 188 brendan oconnor ai-2012-10-02-Powerset’s natural language search system

13 0.31409714 123 brendan oconnor ai-2008-11-12-Disease tracking with web queries and social messaging (Google, Twitter, Facebook…)

14 0.31355521 203 brendan oconnor ai-2014-02-19-What the ACL-2014 review scores mean

15 0.30791217 184 brendan oconnor ai-2012-07-04-The $60,000 cat: deep belief networks make less sense for language than vision

16 0.30083823 153 brendan oconnor ai-2009-09-08-Patches to Rainbow, the old text classifier that won’t go away

17 0.299833 198 brendan oconnor ai-2013-08-20-Some analysis of tweet shares and “predicting” election outcomes

18 0.29839113 2 brendan oconnor ai-2004-11-24-addiction & 2 problems of economics

19 0.29714146 55 brendan oconnor ai-2007-03-27-Seth Roberts and academic blogging

20 0.29681641 63 brendan oconnor ai-2007-06-10-Freak-Freakonomics (Ariel Rubinstein is the shit!)