
196 brendan oconnor ai-2013-05-08-Movie summary corpus and learning character personas


meta info for this blog

Source: html

Introduction: Here is one of our exciting just-finished ACL papers. David and I designed an algorithm that learns different types of character personas — “Protagonist”, “Love Interest”, etc. — that are used in movies. To do this we collected a brand new dataset: 42,306 plot summaries of movies from Wikipedia, along with metadata like box office revenue and genre. We ran these through parsing and coreference analysis to also create a dataset of movie characters, linked with Freebase records of the actors who portray them. Did you see that NYT article on quantitative analysis of film scripts? This dataset could answer all sorts of things they assert in that article — for example, do movies with bowling scenes really make less money? We have released the data here. Our focus, though, is on narrative analysis. We investigate character personas: familiar character types that are repeated over and over in stories, like “Hero” or “Villain”; maybe grand mythical archetypes like “Trickster”…
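The released data is the CMU Movie Summary Corpus. A minimal loading sketch, assuming (as I recall of the released archive) a plot_summaries.txt with two tab-separated fields per line, the Wikipedia movie ID and the summary text; the character-level data such as actor gender, age, and Freebase links lives in separate metadata TSVs:

```python
# Minimal sketch for loading the released corpus. Assumes plot_summaries.txt
# has two tab-separated fields per line: Wikipedia movie ID, summary text.
summaries = {}
with open("plot_summaries.txt", encoding="utf-8") as f:
    for line in f:
        movie_id, text = line.rstrip("\n").split("\t", 1)
        summaries[movie_id] = text

print(len(summaries))  # should be 42306 if the download matches the paper
```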


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 David  and I designed an algorithm that learns different types of character personas — “Protagonist”, “Love Interest”, etc — that are used in movies. [sent-2, score-0.674]

2 To do this we collected a  brand new dataset : 42,306 plot summaries of movies from Wikipedia, along with metadata like box office revenue and genre. [sent-3, score-1.044]

3 We ran these through parsing and coreference analysis to also create a dataset of movie characters, linked with Freebase records of the actors who portray them. [sent-4, score-0.574]

4 Did you see that NYT article on quantitative analysis of film scripts ? [sent-5, score-0.297]

5 This dataset could answer all sorts of things they assert in that article — for example, do movies with bowling scenes really make less money? [sent-6, score-0.527]

6 They are defined in part by what they do and who they are — which we can glean from their actions and descriptions in plot summaries. [sent-10, score-0.38]

7 Our model clusters movie characters, learning posteriors like this:   Each box is one automatically learned persona cluster, along with actions and attribute words that pertain to it. [sent-11, score-1.024]

8 For example, characters like Dracula and The Joker are always “hatching” things (hatching plans, presumably). [sent-12, score-0.578]

9 One of our models takes the metadata features, like movie genre and gender and age of an actor, and associates them with different personas. [sent-13, score-0.53]

10 For example, we learn the types of characters in romantic comedies versus action movies. [sent-14, score-0.554]

11 Here are a few examples of my favorite learned personas: One of the best things I learned about during this project was the website TVTropes (which we use to compare our model against). [sent-15, score-0.515]

12 We’ll be at ACL this summer to present the paper. [sent-16, score-0.15]
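The [sent-N, score] annotations above suggest a simple extractive summarizer: score each sentence by the tf-idf weights of its words and keep the top-ranked ones. The page does not name its toolkit; its (id, weight) output format matches gensim's, so here is a sketch in that style over toy stand-in sentences:

```python
from gensim import corpora, models

# Toy stand-in: the post split into tokenized sentences.
sentences = [
    "david and i designed an algorithm that learns character personas".split(),
    "we collected a brand new dataset of 42306 plot summaries".split(),
    "we will be at acl this summer to present the paper".split(),
]

dictionary = corpora.Dictionary(sentences)
bows = [dictionary.doc2bow(s) for s in sentences]
tfidf = models.TfidfModel(bows)

# Score each sentence by the summed tf-idf weight of its words;
# the highest-scoring sentences become the extractive summary.
scores = [sum(w for _, w in tfidf[bow]) for bow in bows]
ranked = sorted(enumerate(scores), key=lambda x: -x[1])
print(ranked)  # [(sentIndex, sentScore), ...] highest first
```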


similar blogs computed by the tf-idf model

tf-idf for this blog:

wordName wordTfidf (topN-words)

[('characters', 0.411), ('personas', 0.246), ('movie', 0.21), ('character', 0.21), ('dataset', 0.207), ('acl', 0.198), ('hatching', 0.189), ('movies', 0.164), ('metadata', 0.164), ('learned', 0.143), ('types', 0.143), ('box', 0.14), ('film', 0.14), ('david', 0.12), ('actions', 0.115), ('plot', 0.115), ('along', 0.093), ('like', 0.086), ('actors', 0.082), ('killed', 0.082), ('attribute', 0.082), ('mythical', 0.082), ('connor', 0.082), ('scripts', 0.082), ('august', 0.082), ('pertain', 0.082), ('freebase', 0.082), ('bamman', 0.082), ('noah', 0.082), ('things', 0.081), ('present', 0.075), ('descriptions', 0.075), ('interest', 0.075), ('collected', 0.075), ('etc', 0.075), ('scenes', 0.075), ('defined', 0.075), ('quantitative', 0.075), ('records', 0.075), ('narrative', 0.075), ('wise', 0.075), ('summer', 0.075), ('brendan', 0.075), ('latent', 0.075), ('best', 0.075), ('model', 0.073), ('example', 0.073), ('age', 0.07), ('smith', 0.07), ('investigate', 0.07)]
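The (word, weight) pairs above are a top-N list of the post's highest tf-idf terms. A sketch of extracting such a list, again assuming a gensim-style pipeline and toy stand-in posts:

```python
from gensim import corpora, models

# Toy stand-in corpus: each blog post as a token list.
posts = [
    "movie characters personas dataset acl paper".split(),
    "dirichlet process clusters gibbs sampling mcmc".split(),
    "twitter part of speech tagger release".split(),
]

dictionary = corpora.Dictionary(posts)
bows = [dictionary.doc2bow(p) for p in posts]
tfidf = models.TfidfModel(bows)

def top_words(bow, n=50):
    """Return the n highest-weighted (word, tf-idf) pairs for one post."""
    weighted = sorted(tfidf[bow], key=lambda x: -x[1])[:n]
    return [(dictionary[wid], round(w, 3)) for wid, w in weighted]

print(top_words(bows[0]))
```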

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 196 brendan oconnor ai-2013-05-08-Movie summary corpus and learning character personas


2 0.21709162 200 brendan oconnor ai-2013-09-13-Response on our movie personas paper

Introduction: Update (2013-09-17): See David Bamman ‘s great guest post on Language Log on our latent personas paper, and the big picture of interdisciplinary collaboration. I’ve been informed that an interesting critique of my, David Bamman’s and Noah Smith’s ACL paper on movie personas has appeared on the Language Log, a guest post by Hannah Alpert-Abrams and Dan Garrette . I posted the following as a comment on LL. Thanks everyone for the interesting comments. Scholarship is an ongoing conversation, and we hope our work might contribute to it. Responding to the concerns about our paper , We did not try to make a contribution to contemporary literary theory. Rather, we focus on developing a computational linguistic research method of analyzing characters in stories. We hope there is a place for both the development of new research methods, as well as actual new substantive findings. If you think about the tremendous possibilities for computer science and humanities collabor

3 0.14682123 125 brendan oconnor ai-2008-11-21-Netflix Prize

Introduction: Here’s a fascinating NYT article on the Netflix Prize for a better movie recommendation system.  Tons of great stuff there; here’s a few highlights … First, a good unsupervised learning story: There’s a sort of unsettling, alien quality to their computers’ results. When the teams examine the ways that singular value decomposition is slotting movies into categories, sometimes it makes sense to them — as when the computer highlights what appears to be some essence of nerdiness in a bunch of sci-fi movies. But many categorizations are now so obscure that they cannot see the reasoning behind them. Possibly the algorithms are finding connections so deep and subconscious that customers themselves wouldn’t even recognize them. At one point, Chabbert showed me a list of movies that his algorithm had discovered share some ineffable similarity; it includes a historical movie, “Joan of Arc,” a wrestling video, “W.W.E.: SummerSlam 2004,” the comedy “It Had to Be You” and a version of Charle

4 0.084113568 187 brendan oconnor ai-2012-09-21-CMU ARK Twitter Part-of-Speech Tagger – v0.3 released

Introduction: We’re pleased to announce a new release of the CMU ARK Twitter Part-of-Speech Tagger, version 0.3. The new version is much faster (40x) and more accurate (89.2 -> 92.8) than before. We also have released new POS-annotated data, including a dataset of one tweet for each of 547 days. We have made available large-scale word clusters from unlabeled Twitter data (217k words, 56m tweets, 847m tokens). Tools, data, and a new technical report describing the release are available at: www.ark.cs.cmu.edu/TweetNLP . 0100100 a 1111100101110 111100000011 , Brendan

5 0.084052123 194 brendan oconnor ai-2013-04-16-Rise and fall of Dirichlet process clusters

Introduction: Here’s Gibbs sampling for a Dirichlet process 1-d mixture of Gaussians . On 1000 data points that look like this. I gave it fixed variance and a concentration and over MCMC iterations, and it looks like this. The top is the number of points in a cluster. The bottom are the cluster means. Every cluster has a unique color. During MCMC, clusters are created and destroyed. Every cluster has a unique color; when a cluster dies, its color is never reused. I’m showing clusters every 100 iterations. If there is a single point, that cluster was at that iteration but not before or after. If there is a line, the cluster lived for at least 100 iterations. Some clusters live long, some live short, but all eventually die. Usually the model likes to think there are about two clusters, occupying positions at the two modes in the data distribution. It also entertains the existence of several much more minor ones. Usually these are shortlived clusters that die away. But

6 0.074754469 203 brendan oconnor ai-2014-02-19-What the ACL-2014 review scores mean

7 0.069312133 170 brendan oconnor ai-2011-05-21-iPhone autocorrection error analysis

8 0.063620456 129 brendan oconnor ai-2008-12-03-Statistics vs. Machine Learning, fight!

9 0.062220573 184 brendan oconnor ai-2012-07-04-The $60,000 cat: deep belief networks make less sense for language than vision

10 0.05647672 185 brendan oconnor ai-2012-07-17-p-values, CDF’s, NLP etc.

11 0.056382403 86 brendan oconnor ai-2007-12-20-Data-driven charity

12 0.055792011 150 brendan oconnor ai-2009-08-08-Haghighi and Klein (2009): Simple Coreference Resolution with Rich Syntactic and Semantic Features

13 0.05544759 44 brendan oconnor ai-2006-08-30-A big, fun list of links I’m reading

14 0.054578979 126 brendan oconnor ai-2008-11-21-The Wire: Mr. Nugget

15 0.054469749 92 brendan oconnor ai-2008-01-31-Food Fight

16 0.053511318 179 brendan oconnor ai-2012-02-02-Histograms — matplotlib vs. R

17 0.052571315 171 brendan oconnor ai-2011-06-14-How much text versus metadata is in a tweet?

18 0.052096769 165 brendan oconnor ai-2011-02-19-Move to brenocon.com

19 0.049658667 135 brendan oconnor ai-2009-02-23-Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata

20 0.049483888 95 brendan oconnor ai-2008-03-18-color name study i did
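The simValue column is presumably cosine similarity between tf-idf vectors (note the self-match of ~1.0 on the same-blog row). A sketch with gensim's similarity index; the same machinery would serve the LSI and LDA lists below, with those vectors in place of tf-idf:

```python
from gensim import corpora, models, similarities

# Toy stand-in corpus: each blog post as a token list.
posts = [
    "movie characters personas dataset acl paper".split(),
    "response movie personas paper language log".split(),
    "netflix prize movie recommendation nyt article".split(),
]

dictionary = corpora.Dictionary(posts)
bows = [dictionary.doc2bow(p) for p in posts]
tfidf = models.TfidfModel(bows)

# Dense cosine-similarity index over the tf-idf vectors of all posts.
index = similarities.MatrixSimilarity(tfidf[bows], num_features=len(dictionary))

# Rank every post against post 0; the query matches itself at ~1.0,
# like the "same-blog 1 1.0000001" row above.
sims = sorted(enumerate(index[tfidf[bows[0]]]), key=lambda x: -x[1])
print(sims)
```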


similar blogs computed by the LSI model

LSI for this blog:

topicId topicWeight

[(0, -0.205), (1, -0.078), (2, 0.003), (3, 0.036), (4, 0.003), (5, 0.013), (6, -0.05), (7, 0.007), (8, -0.077), (9, 0.062), (10, -0.003), (11, 0.015), (12, 0.125), (13, 0.026), (14, 0.075), (15, -0.062), (16, 0.069), (17, -0.026), (18, -0.191), (19, 0.035), (20, 0.092), (21, -0.05), (22, 0.068), (23, 0.148), (24, 0.188), (25, -0.166), (26, -0.019), (27, -0.158), (28, 0.065), (29, -0.222), (30, -0.021), (31, -0.031), (32, 0.072), (33, 0.219), (34, 0.046), (35, 0.011), (36, 0.014), (37, 0.1), (38, 0.124), (39, 0.045), (40, 0.069), (41, -0.135), (42, -0.047), (43, 0.089), (44, 0.022), (45, 0.06), (46, 0.036), (47, 0.12), (48, 0.01), (49, -0.003)]
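These (topicId, topicWeight) pairs are dense and signed, which is characteristic of an LSI (truncated SVD) projection rather than a probabilistic topic model. A gensim sketch, assuming roughly 50 topics given the indices 0-49 above:

```python
from gensim import corpora, models

# Toy stand-in corpus: each blog post as a token list.
posts = [
    "movie characters personas dataset acl paper".split(),
    "dirichlet process clusters gibbs sampling".split(),
    "twitter tagger release data tools".split(),
]

dictionary = corpora.Dictionary(posts)
bows = [dictionary.doc2bow(p) for p in posts]
tfidf = models.TfidfModel(bows)

# Project tf-idf vectors into a 50-dimensional latent semantic space.
# LSI coordinates are unconstrained reals, hence the negative weights;
# a real corpus of a few hundred posts would fill all 50 dimensions.
lsi = models.LsiModel(tfidf[bows], id2word=dictionary, num_topics=50)

print(lsi[tfidf[bows[0]]])  # list of (topicId, topicWeight)
```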

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98229879 196 brendan oconnor ai-2013-05-08-Movie summary corpus and learning character personas


2 0.66182065 200 brendan oconnor ai-2013-09-13-Response on our movie personas paper


3 0.49944627 125 brendan oconnor ai-2008-11-21-Netflix Prize


4 0.47259596 84 brendan oconnor ai-2007-11-26-How did Freud become a respected humanist?!

Introduction: Freud Is Widely Taught at Universities, Except in the Psychology Department : PSYCHOANALYSIS and its ideas about the unconscious mind have spread to every nook and cranny of the culture from Salinger to “South Park,” from Fellini to foreign policy. Yet if you want to learn about psychoanalysis at the nation’s top universities, one of the last places to look may be the psychology department. A new report by the American Psychoanalytic Association has found that while psychoanalysis — or what purports to be psychoanalysis — is alive and well in literature, film, history and just about every other subject in the humanities, psychology departments and textbooks treat it as “desiccated and dead,” a historical artifact instead of “an ongoing movement and a living, evolving process.” I’ve been wondering about this for a while, ever since I heard someone describe Freud as “one of the greatest humanists who ever lived.” I’m pretty sure he didn’t think of himself that way. If you’re a

5 0.4441241 194 brendan oconnor ai-2013-04-16-Rise and fall of Dirichlet process clusters


6 0.42340642 203 brendan oconnor ai-2014-02-19-What the ACL-2014 review scores mean

7 0.34223002 128 brendan oconnor ai-2008-11-28-Calculating running variance in Python and C++

8 0.33216923 95 brendan oconnor ai-2008-03-18-color name study i did

9 0.32205665 48 brendan oconnor ai-2007-01-02-funny comic

10 0.32027063 139 brendan oconnor ai-2009-04-22-Performance comparison: key-value stores for language model counts

11 0.31580675 170 brendan oconnor ai-2011-05-21-iPhone autocorrection error analysis

12 0.31308675 101 brendan oconnor ai-2008-04-13-Are women discriminated against in graduate admissions? Simpson’s paradox via R in three easy steps!

13 0.31214288 92 brendan oconnor ai-2008-01-31-Food Fight

14 0.30079716 186 brendan oconnor ai-2012-08-21-Berkeley SDA and the General Social Survey

15 0.2962341 86 brendan oconnor ai-2007-12-20-Data-driven charity

16 0.29580522 192 brendan oconnor ai-2013-03-14-R scan() for quick-and-dirty checks

17 0.29048705 184 brendan oconnor ai-2012-07-04-The $60,000 cat: deep belief networks make less sense for language than vision

18 0.2826739 126 brendan oconnor ai-2008-11-21-The Wire: Mr. Nugget

19 0.27740473 11 brendan oconnor ai-2005-07-01-Modelling environmentalism thinking

20 0.26375219 68 brendan oconnor ai-2007-07-08-Game outcome graphs — prisoner’s dilemma with FUN ARROWS!!!


similar blogs computed by the LDA model

LDA for this blog:

topicId topicWeight

[(16, 0.014), (24, 0.024), (44, 0.109), (55, 0.028), (57, 0.035), (70, 0.644), (74, 0.042), (96, 0.01)]
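Unlike the LSI weights, these are non-negative, near-normalized, and sparse (topics below a probability threshold are dropped), which is what an LDA posterior over topics looks like. A gensim sketch under the same assumptions:

```python
from gensim import corpora, models

# Toy stand-in corpus: each blog post as a token list.
posts = [
    "movie characters personas dataset acl paper".split(),
    "dirichlet process clusters gibbs sampling".split(),
    "twitter tagger release data tools".split(),
]

dictionary = corpora.Dictionary(posts)
bows = [dictionary.doc2bow(p) for p in posts]

# LDA works on raw counts rather than tf-idf. A document's topic weights
# form a probability distribution (non-negative, summing to ~1); gensim
# drops topics below minimum_probability, giving the sparse list above.
lda = models.LdaModel(bows, id2word=dictionary, num_topics=100,
                      passes=10, random_state=0)

print(lda[bows[0]])  # sparse list of (topicId, topicWeight)
```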

similar blogs list:

simIndex simValue blogId blogTitle

1 0.98500156 5 brendan oconnor ai-2005-06-25-1st International Conference on Computational Models of Argument (COMMA06)

Introduction: does this look awesome or what? Imagine if you could computationally model the argumentation and communication in economic and political behavior. To say nothing of the AI applications too! PRELIMINARY ANNOUNCEMENT 1st International Conference on Computational Models of Argument (COMMA06) Organised by the ASPIC project (www.argumentation.org) The University of Liverpool, Liverpool, UK 11th-12th September 2006 (provisional) General Chair: Professor Michael J. Wooldridge Programme Chair: Paul E. Dunne Over the past decade argumentation has become increasingly important in Artificial Intelligence. It has provided a fruitful way of approaching non-monotonic and defeasible reasoning, deliberation about action, and agent communication scenarios such as negotiation. In application domains such as law, medicine and e-democracy it has come to be seen as an essential part of the reasoning. Successful workshops have been associated with major Artificial Intelligence Conferences, notabl

2 0.96831161 116 brendan oconnor ai-2008-10-08-MyDebates.org, online polling, and potentially the coolest question corpus ever

Introduction: MySpace and the Commission on the Presidential Debates put together a neat site, mydebates.org , which presents the candidates’ positions through various mini-polls and such. It even has a cool data exploration tool for the poll results … for example, here are two support maps, one for respondents over 65 and one for 18-24 year olds. Anyway, the site also takes submissions of questions for tonight’s debate. Apparently six million questions were submitted, and moderator Tom Brokaw will of course use only 10 or so. This begs a question, how were they selected? There’s no Digg-like social filtering or anything. You could imagine automatic methods to help narrow down the pool: Topic clustering? Quality ranking on syntax and vocabulary? Eric Fish suggested the obvious: probably someone picked 1000 randomly and sent them to Brokaw. I’d love to see a corpus of 6 million questions on U.S. political subjects, directed at only two different people. Anyone know anyon

same-blog 3 0.966048 196 brendan oconnor ai-2013-05-08-Movie summary corpus and learning character personas


4 0.83535588 44 brendan oconnor ai-2006-08-30-A big, fun list of links I’m reading

Introduction: Since blogging is hard, but reading is easy, lately I’ve taken to bookmarking interesting articles I’m reading, with the plan of blogging about them later. This follow-through has happened a few times, but not that often. In an amazing moment of thesis procrastination, today I sat down and figured out how to turn my del.icio.us bookmarks into a nice blogpost, with the plan that every week a post will appear with links I’ve recently read, or maybe I’ll use the script to generate a draft for myself that I’ll revise, or something. But for this first such link post, I put in a whole bunch of them beyond just the last week — why have just a few when you could have *all* of them? Future link posts will be shorter, I promise. Ariel Rubinstein: Freak-Freakonomics July 2006 posted 8/19 under economics sarcastic, critical review of levitt & dubner’s Freakonomics New Yorker review of Philip Tetlock’s book on political expert judgment posted 8/19 under judgment , psycholo

5 0.40857482 200 brendan oconnor ai-2013-09-13-Response on our movie personas paper


6 0.38391834 8 brendan oconnor ai-2005-06-25-more argumentation & AI-formal modelling links

7 0.36696264 117 brendan oconnor ai-2008-10-11-It is accurate to determine a blog’s bias by what it links to

8 0.34991592 204 brendan oconnor ai-2014-04-26-Replot: departure delays vs flight time speed-up

9 0.34217343 156 brendan oconnor ai-2009-09-26-Seeing how “art” and “pharmaceuticals” are linguistically similar in web text

10 0.33877578 188 brendan oconnor ai-2012-10-02-Powerset’s natural language search system

11 0.33653733 32 brendan oconnor ai-2006-03-26-new kind of science, for real

12 0.30467591 140 brendan oconnor ai-2009-05-18-Announcing TweetMotif for summarizing twitter topics

13 0.29724285 189 brendan oconnor ai-2012-11-24-Graphs for SANCL-2012 web parsing results

14 0.27298647 90 brendan oconnor ai-2008-01-20-Moral psychology on Amazon Mechanical Turk

15 0.26928601 129 brendan oconnor ai-2008-12-03-Statistics vs. Machine Learning, fight!

16 0.26774114 184 brendan oconnor ai-2012-07-04-The $60,000 cat: deep belief networks make less sense for language than vision

17 0.26612037 20 brendan oconnor ai-2005-07-11-guns, germs, & steel pbs show?!

18 0.2658307 133 brendan oconnor ai-2009-01-23-SF conference for data mining mercenaries

19 0.25908202 111 brendan oconnor ai-2008-08-16-A better Obama vs McCain poll aggregation

20 0.25754964 83 brendan oconnor ai-2007-11-15-Actually that 2008 elections voter fMRI study is batshit insane (and sleazy too)