acl acl2013 acl2013-220 knowledge-graph by maker-knowledge-mining

220 acl-2013-Learning Latent Personas of Film Characters


Source: pdf

Author: David Bamman ; Brendan O'Connor ; Noah A. Smith

Abstract: We present two latent variable models for learning character types, or personas, in film, in which a persona is defined as a set of mixtures over latent lexical classes. These lexical classes capture the stereotypical actions of which a character is the agent and patient, as well as attributes by which they are described. As the first attempt to solve this problem explicitly, we also present a new dataset for the text-driven analysis of film, along with a benchmark testbed to help drive future work in this area.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We present two latent variable models for learning character types, or personas, in film, in which a persona is defined as a set of mixtures over latent lexical classes. [sent-4, score-1.149]

2 These lexical classes capture the stereotypical actions of which a character is the agent and patient, as well as attributes by which they are described. [sent-5, score-0.592]

3 As the first attempt to solve this problem explicitly, we also present a new dataset for the text-driven analysis of film, along with a benchmark testbed to help drive future work in this area. [sent-6, score-0.048]

4 1 Introduction Philosophers and dramatists have long argued whether the most important element of narrative is plot or character. [sent-7, score-0.215]

5 , 2010), narrative chains (Chambers and Jurafsky, 2008), and plot structure (Finlayson, 2011; Elsner, 2012; McIntyre and Lapata, 2010; Goyal et al. [sent-9, score-0.184]

6 We present a complementary perspective that addresses the importance of character in defining narrative. [sent-11, score-0.32]

7 is not with a view to the representation of character: character comes in as subsidiary to the actions . [sent-15, score-0.425]

8 2“Aristotle was mistaken in his time, and our scholars are mistaken today when they accept his rulings concerning character. [sent-21, score-0.05]

9 Under this perspective, a character’s latent internal nature drives the action we observe. [sent-26, score-0.134]

10 Articulating narrative in this way leads to a natural generative story: we first decide that we’re going to make a particular kind of movie (e.g., [sent-27, score-0.223]

11 a romantic comedy), then decide on a set of character types, or personas, we want to see involved (the PROTAGONIST, the LOVE INTEREST, the BEST FRIEND). [sent-29, score-0.345]

12 This work is inspired by past approaches that infer typed semantic arguments along with narrative schemas (Chambers and Jurafsky, 2009; Regneri et al. [sent-31, score-0.153]

13 , 2011), but seeks a more holistic view of character, one that learns from stereotypical attributes in addition to plot events. [sent-32, score-0.176]

14 First, can we learn what those standard personas are by how individual characters (who instantiate those types) are portrayed? [sent-37, score-0.551]

15 Second, can we learn the set of attributes and actions by which we recognize those common types? [sent-38, score-0.145]

16 At its most extreme, this perspective reduces to learning the grand archetypes of Joseph Campbell (1949) or Carl Jung (1981), such as the HERO or TRICKSTER. [sent-40, score-0.047]

17 We seek, however, a more fine-grained set that includes not only archetypes, but stereotypes as well: characters defined by a fixed set of actions widely known to be representative of ... [sent-41, score-0.251]

18 This work offers a data-driven method for answering these questions, presenting two probabilistic generative models for inferring latent character types. [sent-44, score-0.418]

19 This is the first work that attempts to learn explicit character personas in detail; as such, we present a new dataset for character type induction in film and a benchmark testbed for evaluating future work. [sent-45, score-1.157]

20 1 Text Our primary source of data comes from 42,306 movie plot summaries extracted from the November 2, 2012 dump of English-language Wikipedia. [sent-47, score-0.26]

21 4 These summaries, which have a median length of approximately 176 words,5 contain a concise synopsis of the movie’s events, along with implicit descriptions of the characters (e. [sent-48, score-0.146]

22 Verbs for which the entity is an agent argument (nsubj or agent). [sent-53, score-0.086]

23 Attributes: adjectives and common noun words that relate to the mention as adjectival modifiers, noun-noun compounds, appositives, or copulas (nsubj or appos governors, or nsubj, appos, amod, nn dependents of an entity mention). [sent-58, score-0.028]

24 More popular movies naturally attract more attention on Wikipedia and hence more detail: the top 1,000 movies by box office revenue have a median length of 715 words. [sent-66, score-0.154]

25 These three roles capture three different ways in which character personas are revealed: the actions they take on others, the actions done to them, and the attributes by which they are described. [sent-70, score-0.975]

26 For every character we thus extract a bag of (r, w) tuples, where w is the word lemma and r is one of {agent verb, patient verb, attribute} as identified by the above rules. [sent-71, score-0.452]
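
As a rough sketch of how such (r, w) tuples might be collected from a dependency parse, the snippet below walks typed dependency triples for one sentence. The data layout, the function signature, the patient-role relation set, and the POS-based handling of copula-like links are illustrative assumptions, not the paper's actual pipeline.

```python
from collections import defaultdict

def extract_tuples(edges, lemmas, pos, mentions):
    """Collect (role, lemma) tuples for each character in one parsed sentence.

    edges:    list of (head_idx, relation, dep_idx) dependency triples
    lemmas:   token index -> lemma
    pos:      token index -> Penn Treebank POS tag
    mentions: character id -> set of token indices referring to that character
    """
    bags = defaultdict(list)
    for char_id, tokens in mentions.items():
        for head, rel, dep in edges:
            if dep in tokens and rel in ("nsubj", "agent"):
                if pos[head].startswith("VB"):
                    # Verb for which the character is the agent argument.
                    bags[char_id].append(("agent verb", lemmas[head]))
                elif pos[head].startswith(("JJ", "NN")):
                    # Copula-like construction: adjective/noun predicated of the character.
                    bags[char_id].append(("attribute", lemmas[head]))
            elif dep in tokens and rel in ("dobj", "nsubjpass", "iobj"):
                # Patient-like relations (assumed set; the sentences above do not list them).
                bags[char_id].append(("patient verb", lemmas[head]))
            elif head in tokens and rel in ("amod", "nn", "appos"):
                # Modifiers, compounds, and appositives attached to the mention.
                bags[char_id].append(("attribute", lemmas[dep]))
    return bags
```

A character's full bag of tuples over a plot summary would then be the concatenation of its per-sentence bags.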

27 2 Metadata Our second source of information consists of character and movie metadata drawn from the November 4, 2012 dump of Freebase. [sent-73, score-0.607]

28 7 At the movie level, this includes data on the language, country, release date and detailed genre (365 non-mutually exclusive categories, including “Epic Western,” “Revenge,” and “Hip Hop Movies”). [sent-74, score-0.195]

29 For all experiments described below, we restrict our dataset to only those events that are among the 1,000 most frequent overall, and only characters with at least 3 events. [sent-80, score-0.175]
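
A minimal sketch of that filtering step is shown below; whether an "event" is a (role, lemma) pair or just a lemma is an assumption here, as is the dictionary-of-bags representation.

```python
from collections import Counter

def filter_dataset(char_tuples, top_k=1000, min_events=3):
    """Keep only events among the top_k most frequent, then drop sparse characters.

    char_tuples: character id -> list of (role, lemma) tuples.
    """
    freq = Counter(t for tuples in char_tuples.values() for t in tuples)
    keep = {t for t, _ in freq.most_common(top_k)}
    filtered = {cid: [t for t in tuples if t in keep]
                for cid, tuples in char_tuples.items()}
    return {cid: tuples for cid, tuples in filtered.items() if len(tuples) >= min_events}
```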

30 120,345 characters meet this criterion; of these, 33,559 can be matched to Freebase actors with a specified gender, and 29,802 can be matched to actors with a given date of birth. [sent-81, score-0.307]

31 Of all actors in the Freebase data whose age is given, the average age at the time of the movie is 37. [sent-82, score-0.315]

32 8 The age distribution is strongly bimodal when conditioning on gender: the average age of a female actress at the time of a movie’s release is 33. [sent-86, score-0.116]

33 3 Personas One way we recognize a character’s latent type is by observing the stereotypical actions they ... [sent-95, score-0.244]

34 Whether this extreme 2:1 male/female ratio reflects an inherent bias in film or a bias in attention on Freebase (or Wikipedia, on which it draws) is an interesting research question in itself. [sent-97, score-0.064]

35 To capture this intuition, we define a persona as a set of three typed distributions: one for the words for which the character is the agent, one for which it is the patient, and one for words by which the character is attributively modified. [sent-103, score-1.303]

36 Each distribution ranges over a fixed set of latent word classes, or topics. [sent-104, score-0.098]

37 Figure 1 illustrates this definition for a toy example: a ZOMBIE persona may be characterized as being the agent of primarily eating and killing actions, the patient of killing actions, and the object of dead attributes. [sent-105, score-0.896]

38 The topic labeled eat may include words like eat, drink, and devour. [sent-106, score-0.069]

39 Figure 1: A persona is a set of three distributions over latent topics. [sent-113, score-0.731]

40 In this toy example, the ZOMBIE persona is primarily characterized by being the agent of words from the eat and kill topics, the patient of kill words, and the object of words from the dead topic. [sent-114, score-0.877]
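
Written out as data, the toy example might look as follows; the numeric weights are invented purely for illustration, and only the qualitative pattern (agent of eat/kill words, patient of kill words, described as dead) comes from Figure 1.

```python
# Hypothetical rendering of the ZOMBIE persona: three distributions over latent topics,
# keyed by the typed role. Weights are made up; the structure mirrors Figure 1.
zombie_persona = {
    "agent verb":   {"eat": 0.55, "kill": 0.35, "other": 0.10},
    "patient verb": {"kill": 0.70, "other": 0.30},
    "attribute":    {"dead": 0.80, "other": 0.20},
}

# Topics are themselves distributions over words; the "eat" topic, for example,
# would place most of its mass on lemmas such as eat, drink, and devour.
eat_topic = {"eat": 0.5, "drink": 0.3, "devour": 0.2}
```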

41 We present the text-only model first. Figure 2: Above: Dirichlet persona model (left) and persona regression model (right). [sent-126, score-1.306]

42 Each latent word cluster φk ∼ Dir(γ) is a multinomial over the V words in the vocabulary, drawn from a Dirichlet parameterized by γ. [sent-137, score-0.128]

43 Next, let a persona p be defined as a set of three multinomials ψp over these K topics, one for each typed role r, each drawn from a Dirichlet with a role-specific hyperparameter (νr). [sent-138, score-0.7]

44 Every document (a movie plot summary) contains a set of characters, each of which is associated with a single latent persona p; for every observed (r, w) tuple associated with the character, we sample a latent topic k from the role-specific ψp,r. [sent-139, score-1.154]

45 Conditioned on this topic assignment, the observed word is drawn from φk. [sent-140, score-0.077]

46 The distribution of these personas for a given document is determined by a document-specific multinomial θ, drawn from a Dirichlet parameterized by α. [sent-141, score-0.442]
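
A compact sketch of that generative story, assuming symmetric Dirichlet priors and my own variable names; this paraphrases the description in sentences 42–46 above and is not the authors' reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, P = 1000, 25, 50                                   # vocabulary, topics, personas
ROLES = ("agent verb", "patient verb", "attribute")
gamma, alpha = 0.1, 1.0
nu = {r: 0.1 for r in ROLES}                             # role-specific hyperparameters

# Topic-word distributions: phi_k ~ Dir(gamma), one per latent word class.
phi = rng.dirichlet([gamma] * V, size=K)
# Personas: for each persona p and role r, a multinomial psi[p][r] over the K topics.
psi = {p: {r: rng.dirichlet([nu[r]] * K) for r in ROLES} for p in range(P)}

def generate_document(char_tuple_counts):
    """Generate one plot summary's character tuples.

    char_tuple_counts: one dict per character, mapping role -> number of tuples to emit.
    """
    theta = rng.dirichlet([alpha] * P)                   # document's persona distribution
    doc = []
    for counts in char_tuple_counts:
        p = rng.choice(P, p=theta)                       # one latent persona per character
        tuples = []
        for r, n in counts.items():
            for _ in range(n):
                k = rng.choice(K, p=psi[p][r])           # role-specific topic
                w = rng.choice(V, p=phi[k])              # word drawn from that topic
                tuples.append((r, w))
        doc.append((p, tuples))
    return doc

example = generate_document([{"agent verb": 3, "patient verb": 2, "attribute": 2}])
```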

47 To simplify inference, we collapse out the persona-topic distributions ψ, the topic-word distributions φ and the persona distribution θ for each document. [sent-143, score-0.633]

48 P(p_e = k | α, ν, z, w) ∝ (c^{−e}_{d,k} + α_k) · ∏_j (c^{−e,j}_{r_j,k,z_j} + ν_{r_j}) / (c^{−e,j}_{r_j,k,·} + K ν_{r_j})   (1) Here, c^{−e}_{d,k} is the count of all characters in document d whose current persona sample is also k (not counting the current character e under consideration);9 j ranges over all (r_j, w_j) tuples associated with character e. [sent-147, score-1.504]

49 Each c^{−e,j}_{r_j,k,z_j} is the count of all tuples with role r_j and current topic z_j used with persona k. [sent-148, score-0.809]
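
The persona resampling step implied by equation (1) and these count definitions could look roughly like this; the count-table layout is my assumption, and for simplicity the sketch excludes all of character e's tuples from the counts at once rather than tuple by tuple.

```python
import numpy as np

def resample_persona(e, c_dk, c_rkz, c_rk, alpha, nu, rng):
    """Resample the persona of one character following equation (1).

    e:     dict with keys 'doc', 'persona', and 'tuples' = list of (role_id, topic_id)
    c_dk:  [D, P] counts of characters per document and persona
    c_rkz: [R, P, K] counts of tuples by role, persona, and topic
    c_rk:  [R, P] totals of c_rkz over topics
    alpha: persona hyperparameter(s); nu: [R] role-specific hyperparameters
    """
    d, p_old = e["doc"], e["persona"]
    P, K = c_dk.shape[1], c_rkz.shape[2]

    # Remove character e's current assignments from the counts (the -e superscript).
    c_dk[d, p_old] -= 1
    for r, z in e["tuples"]:
        c_rkz[r, p_old, z] -= 1
        c_rk[r, p_old] -= 1

    # Unnormalized log-probability of each candidate persona k.
    log_prob = np.log(c_dk[d] + alpha)
    for r, z in e["tuples"]:
        log_prob += np.log(c_rkz[r, :, z] + nu[r]) - np.log(c_rk[r] + K * nu[r])
    prob = np.exp(log_prob - log_prob.max())
    p_new = rng.choice(P, p=prob / prob.sum())

    # Add the counts back under the new assignment.
    c_dk[d, p_new] += 1
    for r, z in e["tuples"]:
        c_rkz[r, p_new, z] += 1
        c_rk[r, p_new] += 1
    e["persona"] = p_new
    return p_new
```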

50 Once all personas have been sampled, we sample the latent topics for each tuple as follows.9 [sent-152, score-0.405]

51 (Footnote 9: the −e superscript denotes counts taken without considering the current sample for character e.) [sent-153, score-0.157]

52 We optimize the values of the Dirichlet hyperparameters α, ν and γ using slice sampling with a uniform prior every 20 iterations for the first 500 iterations, and every 100 iterations thereafter. [sent-159, score-0.091]

53 After a burn-in phase of 10,000 iterations, we collect samples every 10 iterations (to lessen autocorrelation) until a total of 100 have been collected. [sent-160, score-0.032]

54 This captures the increased likelihood, for example, that a 25-year-old male actor in an action movie will play an ACTION HERO than he will play a VALLEY GIRL. [sent-163, score-0.228]

55 Given current values for β, for all characters e in all plot summaries, sample values of pe and zj for all associated tuples. [sent-165, score-0.35]

56 Given input metadata features m and the associated sampled values of p, find the values of β that maximize the standard multiclass logistic regression log likelihood, subject to ℓ2 regularization. [sent-167, score-0.153]

57 As with the Dirichlet persona model, inference on p for step 1 is conducted with collapsed Gibbs sampling; the only difference in the sampling probability from equation 1 is the effect of the prior, which here is deterministically fixed as the output of the regression. [sent-169, score-0.66]

58 P(p_e = k | ·) ∝ exp(β_k^⊤ m_e) / Σ_{k′} exp(β_{k′}^⊤ m_e) · ∏_j (c^{−e,j}_{r_j,k,z_j} + ν_{r_j}) / (c^{−e,j}_{r_j,k,·} + K ν_{r_j})   (4) The sampling equation for the topic assignments z is identical to that in equation 2. [sent-171, score-0.067]

59 In practice we optimize β every 1,000 iterations, until a burn-in phase of 10,000 iterations has been reached; at this point we follow the same sampling regime as for the Dirichlet persona model. [sent-172, score-0.692]
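
Sketched as a training loop, the alternation between the sampling step and the regression step might look like this. `gibbs_sweep` is a stand-in supplied by the caller (for instance built from the collapsed sampler sketched above), and the use of scikit-learn's ℓ2-regularized logistic regression is my substitution for whatever optimizer the authors used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_persona_regression(metadata, characters, gibbs_sweep, n_personas=100,
                           n_iter=20_000, update_every=1_000):
    """Alternate: (1) Gibbs-sample personas/topics given beta, (2) refit beta given personas.

    metadata:    [N, F] feature matrix (genre, character age, gender, ...)
    characters:  list of mutable character states; gibbs_sweep(characters, prior)
                 resamples p and z in place, using prior[i] as the deterministic
                 prior over personas for character i (equation 4).
    """
    n = len(characters)
    prior = np.full((n, n_personas), 1.0 / n_personas)   # uniform before beta is first fit
    reg = LogisticRegression(penalty="l2", C=1.0, max_iter=500)

    for it in range(1, n_iter + 1):
        gibbs_sweep(characters, prior)                   # step 1: sample p_e and z_j
        if it % update_every == 0:
            labels = np.array([c["persona"] for c in characters])
            reg.fit(metadata, labels)                    # step 2: maximize the regression likelihood
            # Assumes every persona index appears among the sampled labels, so that
            # predict_proba's columns line up with persona ids.
            prior = reg.predict_proba(metadata)
    return reg
```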

60 This evaluation also helps offer guidance for model selection (in choosing the number of latent topics and personas) by measuring performance on an objective task. [sent-174, score-0.157]

61 Each of these names is used by at least two different characters; for example, a character named “Jason Bourne” is portrayed in The Bourne Identity, The Bourne Supremacy, and The Bourne Ultimatum. [sent-177, score-0.392]

62 While these characters are certainly free to assume different roles in different movies, we believe that, in the aggregate, they should tend to embody the same character type and thus prove to be a natural clustering to recover. [sent-178, score-0.507]

63 970 character names occur at least twice in our data, and 2,666 individual characters use one of those names. [sent-179, score-0.507]

64 Let those 970 character names define 970 unique gold clusters whose members include the individual characters who use that name. [sent-180, score-0.588]
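
Building those gold clusters is a simple grouping step; a sketch under the assumption that character instances arrive as (id, normalized name) pairs:

```python
from collections import defaultdict

def gold_clusters_from_names(characters):
    """Group character instances that share a name; keep names used at least twice."""
    by_name = defaultdict(list)
    for char_id, name in characters:
        by_name[name].append(char_id)
    return {name: ids for name, ids in by_name.items() if len(ids) >= 2}
```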

65 2 TV Tropes As a second external measure of validation, we consider a manually created clustering presented at the website TV Tropes,10 a wiki that collects user-submitted examples of common tropes (narrative, character and plot devices) found in television, film, and fiction, among other media. [sent-182, score-0.498]

66 While TV Tropes contains a wide range of such conventions, we manually identified a set of 72 tropes that could reasonably be labeled character types, including THE CORRUPT CORPORATE EXECUTIVE, THE HARDBOILED DETECTIVE, THE JERK JOCK, THE KLUTZ and THE SURFER DUDE. [sent-183, score-0.403]

67 We manually aligned user-submitted examples of characters embodying these 72 character types with the canonical references in Freebase to create a test set of 501 individual characters. [sent-184, score-0.466]

68 While the 72 character tropes represented here are a more subjective measure, we expect to be able to at least partially recover this clustering. [sent-185, score-0.403]

69 Low VI indicates that (induced) clusters and (gold) clusters tend to overlap; i.e., [sent-188, score-0.106]

70 knowing a character’s (induced) cluster usually tells us their (gold) cluster, and vice versa. [sent-190, score-0.03]
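
For reference, variation of information can be computed directly from the joint distribution of gold and induced labels; this generic natural-log implementation is mine and is not tied to the paper's evaluation scripts.

```python
import math
from collections import Counter

def variation_of_information(gold, induced):
    """VI = H(gold | induced) + H(induced | gold), for two parallel label lists."""
    n = len(gold)
    p_g = Counter(gold)
    p_c = Counter(induced)
    joint = Counter(zip(gold, induced))
    vi = 0.0
    for (g, c), cnt in joint.items():
        p_gc = cnt / n
        vi -= p_gc * (math.log(p_gc / (p_g[g] / n)) + math.log(p_gc / (p_c[c] / n)))
    return vi

# Identical clusterings have VI 0, e.g.
# variation_of_information(["hero", "hero", "villain"], [0, 0, 1]) == 0.0
```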

71 Table 1: Variation of information between the learned persona clusters and the gold clusters, for the Dirichlet persona and persona regression models. [sent-192, score-0.043]

72 Each purity (↑) score is paired with its improvement over a controlled baseline of permuting the learned labels while keeping the cluster proportions the same. [sent-215, score-0.08]

73 Table 1 presents the VI between the learned persona clusters and gold clusters, for varying numbers of personas (P = {25, 50, 100}) and topics (K = {25, 50, 100}). [sent-217, score-1.145]

74 Over all tests in comparison to both gold clusterings, we see VI improve as both P and, to a lesser extent, K increase. [sent-224, score-0.028]

75 The difference between the persona regression model and the Dirichlet persona model here is not ... (footnote 11: this trend is robust to the choice of cluster metric: here VI and F-score have a correlation of −0.87). [sent-226, score-1.336]

76 As more latent topics and personas are added, clustering improves (causing the F-score to go up and the VI distance to go down). [sent-227, score-0.503]

77 While we would naturally prefer a text-only model to be as expressive as a model that requires potentially hard-to-acquire metadata, we tease apart whether a distinction actually does exist by evaluating the purity of the gold clusters with respect to the labels assigned them. [sent-229, score-0.162]

78 Given gold clusters G = {g_1, ..., g_k} and induced clusters C = {c_1, ..., c_j}, we calculate purity as: Purity = (1/N) Σ_k max_j |g_k ∩ c_j|   (7) While purity cannot be used to compare models of different persona size P, it can help us distinguish between models of the same size. [sent-237, score-0.795]
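
A direct implementation of equation (7), matching the reconstruction above; the parallel-label-list interface is my own choice of representation.

```python
from collections import Counter, defaultdict

def purity(gold, induced):
    """(1/N) * sum over gold clusters of their largest overlap with any induced cluster."""
    by_gold = defaultdict(list)
    for g, c in zip(gold, induced):
        by_gold[g].append(c)
    best_overlaps = (Counter(labels).most_common(1)[0][1] for labels in by_gold.values())
    return sum(best_overlaps) / len(gold)

# purity(["Bourne", "Bourne", "Joker"], [4, 4, 9]) == 1.0
# purity(["Bourne", "Bourne", "Joker"], [4, 9, 9]) == 2 / 3
```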

79 ...2% of the total characters. Figure 3: Dramatis personae of The Dark Knight (2008), illustrating 3 of the 100 character types learned by the persona regression model, along with links from other characters in those latent classes to other movies. [sent-239, score-1.263]

80 Each character type is listed with the top three latent topics with which it is associated. [sent-240, score-0.477]

81 the probability of selecting that persona at random is 3. [sent-241, score-0.633]

82 Table 2 presents each model’s absolute purity score paired with its improvement over its controlled permutation (e. [sent-243, score-0.105]

83 ...partition, the use of metadata yields a substantial improvement over the Dirichlet model, both in terms of absolute purity and in its relative improvement over its size-controlled baseline. [sent-247, score-0.166]

84 In practice, we find that while the Dirichlet model distinguishes between character personas in different movies, the persona regression model helps distinguish between different personas within the same movie. [sent-248, score-1.803]

85 6 Exploratory Data Analysis As with other generative approaches, latent persona models enable exploratory data analysis. [sent-249, score-0.731]

86 To illustrate this, we present results from the persona regression model learned above, with 50 latent lexical classes and 100 latent personas. [sent-250, score-0.895]

87 Figure 3 visualizes this data by focusing on a single movie, The Dark Knight (2008); the movie’s protagonist, Batman, belongs to the same latent persona as Detective Jim Gordon, as well as other action movie protagonists Jason Bourne and Tony Stark (Iron Man). [sent-251, score-0.929]

88 The movie’s antagonist, The Joker, belongs to the same latent persona as Dracula from Van Helsing and Colin Sullivan from The Departed, illustrating the ability of personas to be informed by, but still cut across, different genres. [sent-252, score-1.136]

89 Of note are topics relating to romance (unite, marry, woo, elope, court), commercial transactions (purchase, sign, sell, owe, buy), and the classic criminal schema from Chambers (2011) (sentence, arrest, assign, convict, promote). [sent-254, score-0.059]

90 Table 4 presents the most frequent 14 personas in our dataset, illustrated with characters from the 500 highest grossing movies. [sent-255, score-0.551]

91 The personas learned are each three separate mixtures of the 50 latent topics (one for agent relations, one for patient relations, and one for attributes), as illustrated in figure 1 above. [sent-256, score-0.778]

92 Rather than presenting a 3 × 50 histogram for each persona, we illustrate them by listing the most characteristic topics, movie characters, and metadata features associated with each. [sent-257, score-0.247]

93 Characteristic actions and features are defined as those having the highest smoothed pointwise mutual information with that class; exemplary characters are those with the highest posterior probability of being drawn from that class. [sent-258, score-0.288]
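
One plausible way to score "characteristic" topics and metadata features is additive-smoothed PMI over a feature-by-persona count table; the exact smoothing scheme is a guess, since the sentences above do not specify it.

```python
import numpy as np

def smoothed_pmi(counts, pseudo=1.0):
    """PMI between features (rows) and personas (columns) with an additive pseudo-count."""
    c = np.asarray(counts, dtype=float) + pseudo
    p_xy = c / c.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return np.log(p_xy / (p_x * p_y))

def characteristic_features(counts, persona, feature_names, k=3):
    """Top-k features for one persona, ranked by smoothed PMI."""
    scores = smoothed_pmi(counts)[:, persona]
    return [feature_names[i] for i in np.argsort(-scores)[:k]]
```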

94 topic classes presented in table 3; subscripts denote whether the character is predominantly the agent (a), patient (p) or is modified by an attribute (m). [sent-261, score-0.579]

95 7 Conclusion We present a method for automatically inferring latent character personas from text (and metadata, when available). [sent-262, score-0.823]

96 By examining how any individual character deviates from the behavior indicative of their type, we might be able to paint a more nuanced picture of how a character can embody a specific persona while resisting it at the same time. [sent-266, score-1.314]

97 Plot induction and evolutionary search for story generation. [sent-355, score-0.029]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('persona', 0.633), ('personas', 0.405), ('character', 0.32), ('characters', 0.146), ('movie', 0.134), ('actions', 0.105), ('patient', 0.104), ('latent', 0.098), ('vi', 0.096), ('plot', 0.095), ('bourne', 0.093), ('narrative', 0.089), ('agent', 0.086), ('metadata', 0.085), ('tropes', 0.083), ('purity', 0.081), ('movies', 0.077), ('actors', 0.065), ('film', 0.064), ('dirichlet', 0.062), ('topics', 0.059), ('age', 0.058), ('tuples', 0.057), ('clusterings', 0.057), ('clusters', 0.053), ('regneri', 0.051), ('testbed', 0.048), ('gender', 0.048), ('archetypes', 0.047), ('hero', 0.047), ('rrj', 0.047), ('villains', 0.047), ('freebase', 0.047), ('cr', 0.044), ('pe', 0.043), ('stereotypical', 0.041), ('embody', 0.041), ('names', 0.041), ('rj', 0.041), ('attributes', 0.04), ('regression', 0.04), ('topic', 0.04), ('chambers', 0.039), ('aristotle', 0.038), ('zj', 0.038), ('drawn', 0.037), ('action', 0.036), ('schemas', 0.034), ('iterations', 0.032), ('wj', 0.032), ('dump', 0.031), ('male', 0.031), ('assault', 0.031), ('batman', 0.031), ('darth', 0.031), ('detective', 0.031), ('dramatists', 0.031), ('evil', 0.031), ('flirt', 0.031), ('governors', 0.031), ('kmodelp', 0.031), ('portrayed', 0.031), ('strangle', 0.031), ('supremacy', 0.031), ('vader', 0.031), ('zombie', 0.031), ('date', 0.031), ('cluster', 0.03), ('typed', 0.03), ('genre', 0.03), ('attribute', 0.029), ('events', 0.029), ('story', 0.029), ('eat', 0.029), ('gold', 0.028), ('associated', 0.028), ('appos', 0.028), ('meil', 0.028), ('protagonist', 0.028), ('vee', 0.028), ('mcintyre', 0.028), ('protagonists', 0.028), ('ccr', 0.028), ('comedy', 0.028), ('michaela', 0.028), ('sampling', 0.027), ('nsubj', 0.027), ('actor', 0.027), ('nathanael', 0.026), ('tv', 0.026), ('learned', 0.026), ('mistaken', 0.025), ('dead', 0.025), ('poetics', 0.025), ('romantic', 0.025), ('schank', 0.025), ('goyal', 0.024), ('november', 0.024), ('killing', 0.024), ('controlled', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 220 acl-2013-Learning Latent Personas of Film Characters

Author: David Bamman ; Brendan O'Connor ; Noah A. Smith

Abstract: We present two latent variable models for learning character types, or personas, in film, in which a persona is defined as a set of mixtures over latent lexical classes. These lexical classes capture the stereotypical actions of which a character is the agent and patient, as well as attributes by which they are described. As the first attempt to solve this problem explicitly, we also present a new dataset for the text-driven analysis of film, along with a benchmark testbed to help drive future work in this area.

2 0.11053358 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

Author: Longkai Zhang ; Li Li ; Zhengyan He ; Houfeng Wang ; Ni Sun

Abstract: Micro-blog is a new kind of medium which is short and informal. While no segmented corpus of micro-blogs is available to train Chinese word segmentation model, existing Chinese word segmentation tools cannot perform equally well as in ordinary news texts. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog. In our approach, we incorporate punctuation information of unlabeled micro-blog data by introducing characters behind or ahead of punctuations, for they indicate the beginning or end of words. Meanwhile a self-training framework to incorporate confident instances is also used, which prove to be helpful. Experiments on micro-blog data show that our approach improves performance, especially in OOV-recall.

3 0.10632417 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we performs posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsu- pervised accuracy of 89% across the same set of languages.

4 0.096981473 80 acl-2013-Chinese Parsing Exploiting Characters

Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu

Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.

5 0.08272326 224 acl-2013-Learning to Extract International Relations from Political Context

Author: Brendan O'Connor ; Brandon M. Stewart ; Noah A. Smith

Abstract: We describe a new probabilistic model for extracting events between major political actors from news corpora. Our unsupervised model brings together familiar components in natural language processing (like parsers and topic models) with contextual political information— temporal and dyad dependence—to infer latent event classes. We quantitatively evaluate the model’s performance on political science benchmarks: recovering expert-assigned event class valences, and detecting real-world conflict. We also conduct a small case study based on our model’s inferences. A supplementary appendix, and replication software/data are available online, at: http://brenocon.com/irevents

6 0.072815448 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

7 0.071813233 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users

8 0.065795913 370 acl-2013-Unsupervised Transcription of Historical Documents

9 0.06357526 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

10 0.063463241 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information

11 0.062517345 184 acl-2013-Identification of Speakers in Novels

12 0.059369575 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

13 0.058966536 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

14 0.058144804 318 acl-2013-Sentiment Relevance

15 0.056064203 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation

16 0.055454202 79 acl-2013-Character-to-Character Sentiment Analysis in Shakespeare's Plays

17 0.05460263 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

18 0.054495465 121 acl-2013-Discovering User Interactions in Ideological Discussions

19 0.053408336 257 acl-2013-Natural Language Models for Predicting Programming Comments

20 0.052875757 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.136), (1, 0.046), (2, -0.049), (3, -0.022), (4, 0.083), (5, -0.049), (6, 0.02), (7, 0.024), (8, -0.07), (9, 0.018), (10, 0.006), (11, -0.013), (12, 0.053), (13, 0.023), (14, -0.009), (15, -0.068), (16, -0.018), (17, 0.024), (18, 0.027), (19, 0.031), (20, -0.052), (21, -0.037), (22, 0.032), (23, -0.001), (24, 0.046), (25, -0.015), (26, -0.014), (27, -0.029), (28, 0.018), (29, 0.002), (30, 0.038), (31, -0.021), (32, 0.023), (33, -0.045), (34, -0.032), (35, 0.011), (36, -0.049), (37, -0.017), (38, -0.048), (39, 0.086), (40, -0.055), (41, 0.05), (42, -0.041), (43, -0.063), (44, -0.012), (45, -0.08), (46, 0.08), (47, -0.054), (48, -0.009), (49, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92870933 220 acl-2013-Learning Latent Personas of Film Characters

Author: David Bamman ; Brendan O'Connor ; Noah A. Smith

Abstract: We present two latent variable models for learning character types, or personas, in film, in which a persona is defined as a set of mixtures over latent lexical classes. These lexical classes capture the stereotypical actions of which a character is the agent and patient, as well as attributes by which they are described. As the first attempt to solve this problem explicitly, we also present a new dataset for the text-driven analysis of film, along with a benchmark testbed to help drive future work in this area.

2 0.61720729 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we performs posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsu- pervised accuracy of 89% across the same set of languages.

3 0.60498178 370 acl-2013-Unsupervised Transcription of Historical Documents

Author: Taylor Berg-Kirkpatrick ; Greg Durrett ; Dan Klein

Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 3 1% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system.

4 0.56684268 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks

Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.

5 0.56071079 29 acl-2013-A Visual Analytics System for Cluster Exploration

Author: Andreas Lamprecht ; Annette Hautli ; Christian Rohrdantz ; Tina Bogel

Abstract: This paper offers a new way of representing the results of automatic clustering algorithms by employing a Visual Analytics system which maps members of a cluster and their distance to each other onto a twodimensional space. A case study on Urdu complex predicates shows that the system allows for an appropriate investigation of linguistically motivated data. 1 Motivation In recent years, Visual Analytics systems have increasingly been used for the investigation of linguistic phenomena in a number of different areas, starting from literary analysis (Keim and Oelke, 2007) to the cross-linguistic comparison of language features (Mayer et al., 2010a; Mayer et al., 2010b; Rohrdantz et al., 2012a) and lexical semantic change (Rohrdantz et al., 2011; Heylen et al., 2012; Rohrdantz et al., 2012b). Visualization has also found its way into the field of computational linguistics by providing insights into methods such as machine translation (Collins et al., 2007; Albrecht et al., 2009) or discourse parsing (Zhao et al., 2012). One issue in computational linguistics is the interpretability of results coming from machine learning algorithms and the lack of insight they offer on the underlying data. This drawback often prevents theoretical linguists, who work with computational models and need to see patterns on large data sets, from drawing detailed conclusions. The present paper shows that a Visual Analytics system facilitates “analytical reasoning [...] by an interactive visual interface” (Thomas and Cook, 2006) and helps resolving this issue by offering a customizable, in-depth view on the statistically generated result and simultaneously an at-a-glance overview of the overall data set. In particular, we focus on the visual representa- tion of automatically generated clusters, in itself not a novel idea as it has been applied in other fields like the financial sector, biology or geography (Schreck et al., 2009). But as far as the literature is concerned, interactive systems are still less common, particularly in computational linguistics, and they have not been designed for the specific needs of theoretical linguists. This paper offers a method of visually encoding clusters and their internal coherence with an interactive user interface, which allows users to adjust underlying parameters and their views on the data depending on the particular research question. By this, we partly open up the “black box” of machine learning. The linguistic phenomenon under investigation, for which the system has originally been designed, is the varied behavior of nouns in N+V CP complex predicates in Urdu (e.g., memory+do = ‘to remember’) (Mohanan, 1994; Ahmed and Butt, 2011), where, depending on the lexical semantics of the noun, a set of different light verbs is chosen to form a complex predicate. The aim is an automatic detection of the different groups of nouns, based on their light verb distribution. Butt et al. (2012) present a static visualization for the phenomenon, whereas the present paper proposes an interactive system which alleviates some of the previous issues with respect to noise detection, filtering, data interaction and cluster coherence. For this, we proceed as follows: section 2 explains the proposed Visual Analytics system, followed by the linguistic case study in section 3. Section 4 concludes the paper. 
2 The system The system requires a plain text file as input, where each line corresponds to one data object.In our case, each line corresponds to one Urdu noun (data object) and contains its unique ID (the name of the noun) and its bigram frequencies with the 109 Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 109–1 4, four light verbs under investigation, namely kar ‘do’, ho ‘be’, hu ‘become’ and rakH ‘put’ ; an exemplary input file is shown in Figure 1. From a data analysis perspective, we have four- dimensional data objects, where each dimension corresponds to a bigram frequency previously extracted from a corpus. Note that more than four dimensions can be loaded and analyzed, but for the sake of simplicity we focus on the fourdimensional Urdu example for the remainder of this paper. Moreover, it is possible to load files containing absolute bigram frequencies and relative frequencies. When loading absolute frequencies, the program will automatically calculate the relative frequencies as they are the input for the clustering. The absolute frequencies, however, are still available and can be used for further processing (e.g. filtering). Figure 1: preview of appropriate file structures 2.1 Initial opening and processing of a file It is necessary to define a metric distance function between data objects for both clustering and visualization. Thus, each data object is represented through a high dimensional (in our example fourdimensional) numerical vector and we use the Euclidean distance to calculate the distances between pairs of data objects. The smaller the distance between two data objects, the more similar they are. For visualization, the high dimensional data is projected onto the two-dimensional space of a computer screen using a principal component analysis (PCA) algorithm1 . In the 2D projection, the distances between data objects in the highdimensional space, i.e. the dissimilarities of the bigram distributions, are preserved as accurately as possible. However, when projecting a highdimensional data space onto a lower dimension, some distinctions necessarily level out: two data objects may be far apart in the high-dimensional space, but end up closely together in the 2D projection. It is important to bear in mind that the 2D visualization is often quite insightful, but interpre1http://workshop.mkobos.com/201 1/java-pca- transformation-library/ tations have to be verified by interactively investigating the data. The initial clusters are calculated (in the highdimensional data space) using a default k-Means algorithm2 with k being a user-defined parameter. There is also the option of selecting another clustering algorithm, called the Greedy Variance Minimization3 (GVM), and an extension to include further algorithms is under development. 2.2 Configuration & Interaction 2.2.1 The main window The main window in Figure 2 consists of three areas, namely the configuration area (a), the visualization area (b) and the description area (c). The visualization area is mainly built with the piccolo2d library4 and initially shows data objects as colored circles with a variable diameter, where color indicates cluster membership (four clusters in this example). Hovering over a dot displays information on the particular noun, the cluster membership and the light verb distribution in the de- scription area to the right. 
By using the mouse wheel, the user can zoom in and out of the visualization. A very important feature for the task at hand is the possibility to select multiple data objects for further processing or for filtering, with a list of selected data objects shown in the description area. By right-clicking on these data objects, the user can assign a unique class (and class color) to them. Different clustering methods can be employed using the options item in the menu bar. Another feature of the system is that the user can fade in the cluster centroids (illustrated by a larger dot in the respective cluster color in Figure 2), where the overall feature distribution of the cluster can be examined in a tooltip hovering over the corresponding centroid. 2.2.2 Visually representing data objects To gain further insight into the data distribution based on the 2D projection, the user can choose between several ways to visualize the individual data objects, all of which are shown in Figure 3. The standard visualization type is shown on the left and consists of a circle which encodes cluster membership via color. 2http://java-ml.sourceforge.net/api/0.1.7/ (From the JML library) 3http://www.tomgibara.com/clustering/fast-spatial/ 4http://www.piccolo2d.org/ 110 Figure 2: Overview of the main window of the system, including the configuration area (a), the visualization area (b) and the description area (c). Large circles are cluster centroids. Figure 3: Different visualizations of data points Alternatively, normal glyphs and star glyphs can be displayed. The middle part of Figure 3 shows the data displayed with normal glyphs. In linestarinorthpsiflvtrheinorqsbgnutheviasnemdocwfya,proepfthlpdienaoecsr.nihetloa Titnghve det clockwise around the center according to their occurrence in the input file. This view has the advantage that overall feature dominance in a cluster can be seen at-a-glance. The visualization type on the right in Figure 3 agislnycpaehlxset. dnstHhioe nrset ,oarthngeolyrmlpinhae,l endings are connected, forming a “star”. As in the representation with the glyphs, this makes similar data objects easily recognizable and comparable with each other. 2.2.3 Filtering options Our systems offers options for filtering data ac- cording to different criteria. Filter by means of bigram occurrence By activating the bigram occurrence filtering, it is possible to only show those nouns, which occur in bigrams with a certain selected subset of all features (light verbs) only. This is especially useful when examining possible commonalities. Filter selected words Another opportunity of showing only items of interest is to select and display them separately. The PCA is recalculated for these data objects and the visualization is stretched to the whole area. 111 Filter selected cluster Additionally, the user can visualize a specific cluster of interest. Again, the PCA is recalculated and the visualization stretched to the whole area. The cluster can then be manually fine-tuned and cleaned, for instance by removing wrongly assigned items. 2.2.4 Options to handle overplotting Due to the nature of the data, much overplotting occurs. For example, there are many words, which only occur with one light verb. The PCA assigns the same position to these words and, as a consequence, only the top bigram can be viewed in the visualization. 
In order to improve visual access to overplotted data objects, several methods that allow for a more differentiated view of the data have been included; they are described in the following paragraphs.

Change transparency of data objects. By modifying the transparency with the given slider, areas with a dense data population can be readily identified.

Repositioning of data objects. To reduce the overplotting in densely populated areas, data objects can be repositioned randomly, within a fixed deviation of their initial position. The degree of deviation can be set interactively by the user with the corresponding slider. The user has the option to reposition either all data objects or only those selected in advance.

Frequency filtering. If the initial data contains absolute bigram frequencies, the user can filter the visualized words by frequency. For example, many nouns occur only once and therefore have an observed probability of 100% for co-occurring with one of the light verbs. In most cases it is useful to filter such data out.

Scaling data objects. If the user zooms beyond the maximum zoom factor, the data objects are scaled down. This is especially useful if data objects are only partly covered by many other objects; in this case, they become fully visible.

2.3 Alternative views on the data

In order to enable a holistic analysis, it is often valuable to provide the user with different views on the data. Consequently, we have integrated the option to explore the data with further standard visualization methods.

2.3.1 Correlation matrix

The correlation matrix in Figure 4 shows the correlations between features, visualized by circles using the following encoding: the size of a circle represents the correlation strength, and the color indicates whether the corresponding features are negatively (white) or positively (black) correlated.

Figure 4: Example of a correlation matrix

2.3.2 Parallel coordinates

The parallel coordinates diagram shows the distribution of the bigram frequencies over the different dimensions (Figure 5). Every noun is represented by a line and shows, when hovered over, a tooltip with the most important information. To filter the visualized words, the user can either display previously selected data objects, or restrict the value range for a feature and show only the items which lie within this range.

2.3.3 Scatter plot matrix

To further examine the relation between pairs of features, a scatter plot matrix can be used (Figure 6). The individual scatter plots give further insight into the correlation details of pairs of features.

Figure 5: Parallel coordinates diagram

Figure 6: Example showing a scatter plot matrix

3 Case study

In principle, the Visual Analytics system presented above can be used for any kind of cluster visualization, but the built-in options and add-ons are particularly designed for the type of work that linguists tend to be interested in: on the one hand, the user wants to get a quick overview of the overall patterns in the phenomenon, but at the same time, the system needs to allow for an in-depth inspection of the data. Both are provided by the system: the overall cluster result shown in Figure 2 depicts the coherence of the clusters and therefore the overall pattern of the data set; the different glyph visualizations in Figure 3 illustrate the properties of each cluster; and single data points can be inspected in the description area.
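The overplotting remedies of Section 2.2.4 and the correlation matrix of Section 2.3.1 also reduce to a few lines of array manipulation. The sketch below continues the earlier Python examples under the same assumptions (coords_2d, nouns, absolute and X come from the previous snippets; the function names and default parameters are illustrative, not the system's actual implementation).

```python
# Illustrative sketch of three operations: random repositioning of
# overplotted points, frequency filtering, and the feature correlation matrix.
import numpy as np

def jitter(coords_2d, max_deviation=0.05, rows=None, seed=0):
    """Randomly reposition (all or only selected) 2D points within a fixed
    deviation of their initial position to reduce overplotting."""
    rng = np.random.default_rng(seed)
    jittered = coords_2d.copy()
    rows = np.arange(len(coords_2d)) if rows is None else np.asarray(rows)
    jittered[rows] += rng.uniform(-max_deviation, max_deviation,
                                  size=(len(rows), 2))
    return jittered

def frequency_filter(nouns, absolute, min_total=2):
    """Drop nouns whose total bigram frequency is below a threshold, e.g.
    hapaxes that trivially co-occur with a single light verb 100% of the time."""
    mask = absolute.sum(axis=1) >= min_total
    return [n for n, keep in zip(nouns, mask) if keep], absolute[mask]

def feature_correlations(X):
    """Pairwise correlations between the light-verb dimensions, as encoded by
    the circles in the correlation matrix view (Figure 4)."""
    return np.corrcoef(X, rowvar=False)  # shape: (n_features, n_features)
```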
The randomization of overplotted data points helps to reveal concentrated cluster patterns in which the light verbs behave very similarly across different noun+verb complex predicates. The biggest advantage of the system, however, lies in its interactivity. Figure 7 shows an example of the visualization used in Butt et al. (2012), the input being the same text file as shown in Figure 1. In that system, the relative frequency of each noun with each light verb is encoded by color saturation: the more saturated the color to the right of the noun, the higher the relative frequency of the light verb occurring with it. The number of the cluster (here, 3) and the respective nouns (e.g. kAm 'work') are shown to the left. The user does not get information on the coherence of the cluster, nor does the visualization show prototypical cluster patterns.

Figure 7: Cluster visualization in Butt et al. (2012)

Moreover, the system in Figure 7 offers only a limited set of interaction choices, with the consequence that the user is not able to adjust the underlying data set, e.g. by filtering out noise. However, Butt et al. (2012) report that the Urdu data is indeed very noisy and requires manual cleaning before the actual clustering. In the system presented here, the user simply marks conspicuous regions in the visualization panel and removes the respective data points from the original data set. Other filtering mechanisms are available as well: low-frequency items, which occur due to data sparsity issues, can for instance be removed from the overall data set by adjusting the corresponding parameters.

A linguistically relevant improvement lies in the display of cluster centroids, in other words the typical noun + light verb distribution of a cluster. This is particularly helpful when the linguist wants to pick out prototypical examples of a cluster in order to formulate generalizations over the other cluster members.

4 Conclusion

In this paper, we present a novel visual analytics system that helps to automatically analyze bigrams extracted from corpora. The main purpose is to enable a more informed and steered cluster analysis than is currently possible with standard methods. This includes rich options for interaction, e.g. display configuration or data manipulation. Initially, the approach was motivated by a concrete research problem, but it has much wider applicability, as any kind of high-dimensional numerical data objects can be loaded and analyzed. However, the system still requires some basic understanding of the algorithms applied for clustering and projection in order to prevent the user from drawing wrong conclusions based on artifacts. Bearing this potential pitfall in mind when performing the analysis, the user can nevertheless carry out a much more insightful and informed analysis than with standard non-interactive methods. In the future, we aim to conduct user experiments in order to learn more about how the functionality and usability could be further enhanced.

Acknowledgments

This work was partially funded by the German Research Foundation (DFG) under grant BU 1806/7-1 "Visual Analysis of Language Change and Use Patterns" and by the German Federal Ministry of Education and Research (BMBF) under grant 01461246 "VisArgue".

References

Tafseer Ahmed and Miriam Butt. 2011. Discovering Semantic Classes for Urdu N-V Complex Predicates. In Proceedings of the International Conference on Computational Semantics (IWCS 2011), pages 305–309.

Joshua Albrecht, Rebecca Hwa, and G. Elisabeta Marai. 2009. The Chinese Room: Visualization and Interaction to Understand and Correct Ambiguous Machine Translation. Computer Graphics Forum, 28(3):1047–1054.

Miriam Butt, Tina Bögel, Annette Hautli, Sebastian Sulger, and Tafseer Ahmed. 2012. Identifying Urdu Complex Predication via Bigram Extraction. In Proceedings of COLING 2012, Technical Papers, pages 409–424, Mumbai, India.

Christopher Collins, M. Sheelagh T. Carpendale, and Gerald Penn. 2007. Visualization of Uncertainty in Lattices to Support Decision-Making. In EuroVis 2007, pages 51–58. Eurographics Association.

Kris Heylen, Dirk Speelman, and Dirk Geeraerts. 2012. Looking at word meaning. An interactive visualization of Semantic Vector Spaces for Dutch synsets. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 16–24.

Daniel A. Keim and Daniela Oelke. 2007. Literature Fingerprinting: A New Method for Visual Literary Analysis. In IEEE VAST 2007, pages 115–122. IEEE.

Thomas Mayer, Christian Rohrdantz, Miriam Butt, Frans Plank, and Daniel A. Keim. 2010a. Visualizing Vowel Harmony. Linguistic Issues in Language Technology, 4(2):1–33, December.

Thomas Mayer, Christian Rohrdantz, Frans Plank, Peter Bak, Miriam Butt, and Daniel A. Keim. 2010b. Consonant Co-Occurrence in Stems across Languages: Automatic Analysis and Visualization of a Phonotactic Constraint. In Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground, pages 70–78, Uppsala, Sweden, July. Association for Computational Linguistics.

Tara Mohanan. 1994. Argument Structure in Hindi. Stanford: CSLI Publications.

Christian Rohrdantz, Annette Hautli, Thomas Mayer, Miriam Butt, Frans Plank, and Daniel A. Keim. 2011. Towards Tracking Semantic Change by Visual Analytics. In ACL 2011 (Short Papers), pages 305–310, Portland, Oregon, USA, June. Association for Computational Linguistics.

Christian Rohrdantz, Michael Hund, Thomas Mayer, Bernhard Wälchli, and Daniel A. Keim. 2012a. The World's Languages Explorer: Visual Analysis of Language Features in Genealogical and Areal Contexts. Computer Graphics Forum, 31(3):935–944.

Christian Rohrdantz, Andreas Niekler, Annette Hautli, Miriam Butt, and Daniel A. Keim. 2012b. Lexical Semantics and Distribution of Suffixes - A Visual Analysis. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 7–15, April.

Tobias Schreck, Jürgen Bernard, Tatiana von Landesberger, and Jörn Kohlhammer. 2009. Visual cluster analysis of trajectory data with interactive Kohonen maps. Information Visualization, 8(1):14–29.

James J. Thomas and Kristin A. Cook. 2006. A Visual Analytics Agenda. IEEE Computer Graphics and Applications, 26(1):10–13.

Jian Zhao, Fanny Chevalier, Christopher Collins, and Ravin Balakrishnan. 2012. Facilitating Discourse Analysis with Interactive Visualization. IEEE Transactions on Visualization and Computer Graphics, 18(12):2639–2648.

6 0.55283105 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds

7 0.55051333 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

8 0.53897852 192 acl-2013-Improved Lexical Acquisition through DPP-based Verb Clustering

9 0.53823799 257 acl-2013-Natural Language Models for Predicting Programming Comments

10 0.53532219 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures

11 0.5097627 14 acl-2013-A Novel Classifier Based on Quantum Computation

12 0.50166184 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation

13 0.49313772 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context

14 0.4929285 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?

15 0.48505419 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

16 0.48426956 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

17 0.48175395 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

18 0.4735865 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

19 0.47315422 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

20 0.46873945 346 acl-2013-The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.047), (6, 0.02), (11, 0.042), (15, 0.027), (21, 0.013), (24, 0.041), (26, 0.038), (28, 0.017), (35, 0.087), (42, 0.036), (48, 0.029), (56, 0.011), (70, 0.409), (88, 0.021), (90, 0.02), (95, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98597342 384 acl-2013-Visual Features for Linguists: Basic image analysis techniques for multimodally-curious NLPers

Author: Elia Bruni ; Marco Baroni

Abstract: unknown-abstract

2 0.96122587 296 acl-2013-Recognizing Identical Events with Graph Kernels

Author: Goran Glavas ; Jan Snajder

Abstract: Identifying news stories that discuss the same real-world events is important for news tracking and retrieval. Most existing approaches rely on the traditional vector space model. We propose an approach for recognizing identical real-world events based on a structured, event-oriented document representation. We structure documents as graphs of event mentions and use graph kernels to measure the similarity between document pairs. Our experiments indicate that the proposed graph-based approach can outperform the traditional vector space model, and is especially suitable for distinguishing between topically similar, yet non-identical events.

3 0.95272946 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks

Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.

4 0.94734615 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

Author: Shay B. Cohen ; Mark Johnson

Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.

5 0.93704277 218 acl-2013-Latent Semantic Tensor Indexing for Community-based Question Answering

Author: Xipeng Qiu ; Le Tian ; Xuanjing Huang

Abstract: Retrieving similar questions is very important in community-based question answering (CQA). In this paper, we propose a unified question retrieval model based on latent semantic indexing with tensor analysis, which can capture word associations among different parts of CQA triples simultaneously. Thus, our method can reduce lexical chasm of question retrieval with the help of the information of question content and answer parts. The experimental result shows that our method outperforms the traditional methods.

same-paper 6 0.92630291 220 acl-2013-Learning Latent Personas of Film Characters

7 0.91260713 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

8 0.84844577 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia

9 0.72383976 153 acl-2013-Extracting Events with Informal Temporal References in Personal Histories in Online Communities

10 0.69579327 249 acl-2013-Models of Semantic Representation with Visual Attributes

11 0.68367517 380 acl-2013-VSEM: An open library for visual semantics representation

12 0.67585254 329 acl-2013-Statistical Machine Translation Improves Question Retrieval in Community Question Answering via Matrix Factorization

13 0.64259666 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars

14 0.63783902 167 acl-2013-Generalizing Image Captions for Image-Text Parallel Corpus

15 0.62405449 80 acl-2013-Chinese Parsing Exploiting Characters

16 0.6206879 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

17 0.61916345 168 acl-2013-Generating Recommendation Dialogs by Extracting Information from User Reviews

18 0.59902388 339 acl-2013-Temporal Signals Help Label Temporal Relations

19 0.59611171 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

20 0.59208518 180 acl-2013-Handling Ambiguities of Bilingual Predicate-Argument Structures for Statistical Machine Translation