acl acl2013 acl2013-89 knowledge-graph by maker-knowledge-mining

89 acl-2013-Computerized Analysis of a Verbal Fluency Test


Source: pdf

Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks

Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Ryan1, Serguei Pakhomov1, Susan Marino1, Charles Bernick2, and Sarah Banks2 1 College of Pharmacy, University of Minnesota 2 Lou Ruvo Center for Brain Health, Cleveland Clinic {ryanx765, pakh0002, marin007}@umn. [sent-2, score-0.045]

2 Abstract We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e. [sent-4, score-0.748]

3 Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. [sent-7, score-0.569]

4 These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired. [sent-10, score-0.586]

5 1 Introduction The neuropsychological test of phonemic verbal fluency (PVF) consists of asking the patient to generate as many words as he or she can in a limited time (usually 60 seconds) that begin with a specific letter of the alphabet (Benton et al. [sent-11, score-0.411]

6 This test has been used extensively as part of larger cognitive test batteries to study cognitive impairment resulting from a number of neurological conditions, including Parkinson’s and Huntington’s diseases, various forms of dementia, and traumatic brain injury (Troyer et al. [sent-13, score-0.544]

7 Patients with these disorders tend to generate significantly fewer words on this test than do healthy individuals. [sent-17, score-0.108]

8 Prior studies have also found that clustering (the degree to which patients generate groups of phonetically similar words) and switching (transitioning from one cluster to the next) behaviors are also sensitive to the effects of these neurological conditions. [sent-18, score-0.606]

9 Contact sports such as boxing, mixed martial arts, football, and hockey are well known for high prevalence of repetitive head trauma. [sent-19, score-0.196]

10 In recent years, the long-term effects of repetitive head trauma in athletes have become the subject of intensive research. [sent-20, score-0.289]

11 In general, repetitive head trauma is a known risk factor for chronic traumatic encephalopathy (CTE), a devastating and untreatable condition that ultimately results in permanent disability and premature death (Omalu et al. [sent-21, score-0.49]

12 However, little is currently known about the relationship between the amount of exposure to head injury and the magnitude of risk for developing these conditions. [sent-24, score-0.144]

13 Furthermore, the development of new behavioral methods aimed at detection of subtle early signs of brain impairment is an active area of research. [sent-25, score-0.371]

14 The PVF test is an excellent target for this research because it is very easy to administer and has been shown to be sensitive to the effects of acute traumatic brain injury (Raskin and Rearick, 1996). [sent-26, score-0.475]

15 However, a major obstacle to using this test widely for early detection of brain impairment is that clustering and switching analyses needed to detect these subtle changes have to be done manually. [sent-27, score-0.403]

16 These manual approaches are extremely labor-intensive, and are therefore limited in the types of clustering analyses that can be performed. [sent-28, score-0.238]

17 Manual methods are also not scalable to large numbers of tests and are subject to inter-rater variability, making the results difficult to compare across subjects, as well as across different studies. [sent-29, score-0.038]

18 Moreover, traditional manual clustering and switching analyses rely primarily on word orthography to determine phonetic similarity (e. [sent-30, score-0.66]

19 , by comparing the first two letters of two words), rather than phonetic representations, which would be prohibitively time-consuming. [sent-32, score-0.292]

20 Phonetic similarity has been investigated in application to a number of research areas, including spelling correction (Toutanova and Moore, 2002), machine translation (Knight and Graehl, 1998; Kondrak et al. [sent-36, score-0.038]

21 We first describe the system architecture and our phonetic-similarity computation methods, and then present the results of a pilot study, using data from professional fighters, demonstrating the utility of this system for early detection of subtle signs of brain impairment. [sent-41, score-0.498]

22 CMUdict contains phonetic transcriptions, using a phone set based on ARPABET (Rabiner and Juang, 1993), for North American English word pronunciations (Weide, 1998). [sent-45, score-0.348]

23 From the full set of entries in CMUdict, we removed alternative pronunciations for each word, leaving a single phonetic representation for each heteronymous set. [sent-49, score-0.348]

24 Additionally, all vowel symbols were stripped of numeric stress markings (e. [sent-50, score-0.104]

25 , AH1 → AH), and all multicharacter phone symbols were converted to arbitrary single-character symbols, in lowercase to distinguish these symbols from the original single-character ARPABET symbols (e. [sent-52, score-0.195]

26 Finally, whitespace between the symbols constituting each phonetic representation was removed, yielding compact phonetic-representation strings suitable for computing our similarity measures. [sent-55, score-0.395]

27 To illustrate, the CMUdict pronunciation entry for the word phonetic, [F AH0 N EH1 T IH0 K ] , would be represented as FcNiTmK. [sent-56, score-0.045]
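A minimal Python sketch of this preprocessing step, assuming CMUdict entries have already been parsed into phone lists. Only the three substitutions implied by the FcNiTmK example (AH→c, EH→i, IH→m) come from the text; the fallback for other multi-character phones is an illustrative assumption, not the paper's actual mapping.

```python
import re

# Only the three mappings implied by the FcNiTmK example are known;
# the fallback for other multi-character phones is an assumption.
PHONE_TO_CHAR = {"AH": "c", "EH": "i", "IH": "m"}

def compact_pronunciation(phones):
    """Strip numeric stress markers, map multi-character phones to single
    lowercase characters, and join everything into one compact string."""
    out = []
    for phone in phones:
        phone = re.sub(r"\d", "", phone)  # AH1 -> AH
        if len(phone) > 1:
            # Assumed fallback: lowercased first letter for unmapped phones.
            out.append(PHONE_TO_CHAR.get(phone, phone[0].lower()))
        else:
            out.append(phone)  # single-character ARPABET symbols kept as-is
    return "".join(out)

# compact_pronunciation("F AH0 N EH1 T IH0 K".split()) == "FcNiTmK"
```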

28 2 Similarity Computation Our system uses two methods for determining phonetic similarity: edit distance and a common-biphone check. [sent-58, score-0.434]

29 Each of these methods gives a measure of similarity for a pair of phonetic representations, which we respectively call a phoneticsimilarity score (PSS) and a common-biphone score (CBS). [sent-59, score-0.33]

30 The CBS is binary, with a score of 1 given for two phonetic representations that have a common initial and/or final biphone, and 0 for two strings that have neither in common. [sent-62, score-0.292]
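The two similarity measures could be sketched as follows. The exact edit-distance normalization behind the PSS is not stated in this summary, so the 1 − distance/max-length form below is an assumption; the CBS follows the binary definition above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two compact phonetic strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def phonetic_similarity_score(a, b):
    """PSS: edit distance turned into a similarity score (assumed normalization)."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def common_biphone_score(a, b):
    """CBS: 1 if the strings share their initial and/or final biphone, else 0."""
    if len(a) < 2 or len(b) < 2:
        return 0
    return int(a[:2] == b[:2] or a[-2:] == b[-2:])
```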

31 Figure 2: Phonetic chain and common-biphone chain (below) for an example PVF response. [sent-63, score-0.196]

32 3 Phonetic Clustering We distinguish between two ways of defining phonetic clusters. [sent-65, score-0.292]

33 Traditionally, any sequence of n words in a PVF response is deemed to form a cluster if all pairwise word combinations for that sequence are determined to be phonetically similar by some metric. [sent-66, score-0.153]

34 In addition to this method, we developed a less stringent approach in which we define chains instead of clusters. [sent-67, score-0.105]

35 A chain comprises a sequence for which the phonetic representation of each word is similar to that of the word immediately prior to it in the chain (unless it is chain-initial) and the word subsequent to it (unless it is chain-final). [sent-68, score-0.488]

36 We call chains based on the edit-distance method phonetic chains, and chains based on the common-biphone method common-biphone chains; both are illustrated in Figure 2. [sent-70, score-0.21]
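Chain construction under this definition reduces to a single left-to-right pass over the response; the sketch below assumes a pairwise similarity predicate such as the common-biphone check from the earlier sketch.

```python
def build_chains(reps, is_similar):
    """Segment a sequence of phonetic representations into chains:
    each word must be similar to the word immediately before it in
    its chain; unmatched words become chains of length 1."""
    chains = []
    for rep in reps:
        if chains and is_similar(chains[-1][-1], rep):
            chains[-1].append(rep)
        else:
            chains.append([rep])
    return chains

# Common-biphone chains:
# chains = build_chains(reps, lambda a, b: common_biphone_score(a, b) == 1)
```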

37 We determine the threshold empirically for each letter by taking a random sample of 1000 words starting with that letter in CMUdict, computing PSS scores for each pairwise combination (n = 499, 500), and then setting the threshold as the value separating the upper quintile of these scores. [sent-72, score-0.2]

38 With the common-biphone method, two words are considered phonetically similar simply if their CBS is 1. [sent-73, score-0.209]
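A sketch of the per-letter threshold estimation described above, reusing phonetic_similarity_score from the earlier sketch. "Upper quintile" is interpreted here as the 80th percentile of the sampled pairwise PSS values, and the sampling details are assumptions.

```python
import random
from itertools import combinations

def letter_threshold(letter_reps, sample_size=1000, seed=0):
    """Empirical PSS threshold for one letter: sample words starting with
    that letter, score every pairwise combination, and return the value
    separating the upper quintile of the resulting scores."""
    rng = random.Random(seed)
    sample = rng.sample(letter_reps, min(sample_size, len(letter_reps)))
    scores = sorted(phonetic_similarity_score(a, b)
                    for a, b in combinations(sample, 2))
    return scores[int(0.8 * len(scores))]
```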

39 PVF response for a specific letter and, as a preprocessing step, removes any words that do not begin with that letter. [sent-79, score-0.1]

40 For out-of-dictionary words, we automatically generate a phonetic representation with a decision tree-based grapheme-to-phoneme algorithm trained on the CMUdict (Pagel et al. [sent-81, score-0.292]

41 Next, PSSs and CBSs are computed sequentially for each pair of contiguous phonetic representations, and are used in their respective methods to compute the following measures: mean pairwise similarity score (MPSS), mean chain length (MCL), and maximum chain length (MXCL). [sent-83, score-0.526]

42 Singletons are included in these calculations as chains of length 1. [sent-84, score-0.105]

43 We also calculate equivalent measures for clusters, but do not present these results here due to space limitations, as they are similar to those for chains. [sent-85, score-0.036]

44 In addition to these measures, our system produces a count of the total number of words that start with the letter specified for the PVF test (WCNT), and a count of repeated words (RCNT). [sent-86, score-0.1]
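Putting the pieces together, the per-response measures could be computed roughly as below, reusing the similarity and chain helpers from the earlier sketches. The edit-distance variant of MPSS is shown (the CBS variant would average common_biphone_score instead), and the dictionary keys simply mirror the measure names used above.

```python
def summarize_response(words, reps, target_letter, is_similar):
    """Compute MPSS (mean pairwise similarity of contiguous words),
    MCL/MXCL (mean and maximum chain length, singletons counted as
    length 1), WCNT (words starting with the target letter), and
    RCNT (repeated words)."""
    pairwise = [phonetic_similarity_score(a, b) for a, b in zip(reps, reps[1:])]
    lengths = [len(c) for c in build_chains(reps, is_similar)]
    return {
        "MPSS": sum(pairwise) / len(pairwise) if pairwise else 0.0,
        "MCL": sum(lengths) / len(lengths) if lengths else 0.0,
        "MXCL": max(lengths) if lengths else 0,
        "WCNT": sum(w.lower().startswith(target_letter.lower()) for w in words),
        "RCNT": len(words) - len(set(w.lower() for w in words)),
    }
```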

45 1 Participants We used PVF tests from 55 boxers and mixed martial artists (4 women, 51 men; mean age 27. [sent-88, score-0.202]

46 The PFBH is a longitudinal study of unarmed active professional fighters, retired professional fighters, and age/education matched controls (Bernick et al. [sent-93, score-0.297]

47 It is designed to enroll over 400 participants over the next five years. [sent-95, score-0.057]

48 The 55 participants in our pilot represent a sample from the first wave of assessments, conducted in summer of 2012. [sent-96, score-0.122]

49 All 55 participants were fluent speakers of English and were able to read at at least a 4th-grade level. [sent-97, score-0.057]

50 None of these participants fought in a professional or amateur competition within 45 days prior to testing. [sent-98, score-0.187]

51 2 Methods Each participant’s professional fighting history was used to determine his or her total number of pro fights and number of fights per year. [sent-100, score-0.59]

52 These figures were used to construct a composite fight-exposure index as a summary measure of cumulative traumatic exposure, as follows. [sent-101, score-0.158]

53 Figure 3: Computation-method and exposure-group comparisons showing significant differences between the low- and high-exposure fighter groups on MPSS, MCL, and MXCL measures. Panels: (a) mean pairwise similarity score, (b) mean chain/cluster length, (c) maximum chain/cluster length. [sent-102, score-0.219]

54 Due to the relatively small sample size in our pilot study, we combined groups with scores of 0 and 1 to constitute the low-exposure group (n = 25), and the rest were assigned to the high-exposure group (n = 30). [sent-105, score-0.329]

55 All participants underwent a cognitive test battery that included the PVF test (letter ‘F’). [sent-106, score-0.057]

56 Their responses were processed by our system, and means for our chaining variables of interest, as well as counts of total words and repetitions, were compared across the low- and high-exposure groups. [sent-107, score-0.038]

57 Additionally, all 55 PVF responses were subjected to manual phonetic clustering analysis, following the methodology of Troyer et al. [sent-108, score-0.569]

58 With this approach, clusters are used instead of chains, and two words are considered phonetically similar if they meet any of the following conditions: they begin with the same two orthographic letters; they rhyme; they differ by only a vowel sound (e. [sent-110, score-0.146]

59 For each clustering method, the differences in means between the groups were tested for statistical significance using one-way ANOVA adjusted for the effects of age and years of education. [sent-113, score-0.31]

60 Spearman correlation was used to test for associations between continuous variables, due to nonlinearity, and to directly compare manually determined clustering measures with corresponding automatically determined chain measures. [sent-114, score-0.361]
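A hedged sketch of this statistical comparison using statsmodels and SciPy. The pandas DataFrame and its column names (exposure_group, age, education, MCL, MCL_manual) are hypothetical stand-ins for the study's actual variables, and the ANCOVA-via-OLS formulation is one common way to adjust a one-way comparison for covariates.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import spearmanr

def compare_groups(df):
    """One-way comparison of MCL across exposure groups, adjusted for age
    and years of education, plus a Spearman correlation between the
    automatic chain measure and the manually determined cluster measure."""
    model = smf.ols("MCL ~ C(exposure_group) + age + education", data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    rho, p_value = spearmanr(df["MCL"], df["MCL_manual"])
    return anova_table, rho, p_value
```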

61 4 Results The results of comparisons between the clustering methods, as well as between the low- and highexposure groups, are illustrated in Figure 3. [sent-115, score-0.237]

62 02) in MPSS between the high- and low-exposure groups using the common-biphone method (0. [sent-117, score-0.082]

63 11), while with edit distance the difference was small (0. [sent-120, score-0.04]

64 Mean chain sizes determined by the commonbiphone method correlated with manually determined cluster sizes more strongly than did chain sizes determined by edit distance (ρ = 0. [sent-125, score-0.476]

65 Comparisons of maximum chain and cluster sizes showed a similar pattern (ρ = 0. [sent-131, score-0.098]

66 Both automatic methods showed significant differences (p < 0. [sent-137, score-0.054]

67 01) between the two groups in MCL and MXCL, with each finding longer chains in the high-exposure group (Figure 3b, 3c); however, slightly larger differences were observed using the common-biphone method (MCL: 2. [sent-138, score-0.281]

68 64 by 2Clustering measures rely on chains for our automatic methods, and on clusters for manual analysis. [sent-147, score-0.203]

69 Group differences for manually determined MCL and MXCL were also significant (p < 0. [sent-152, score-0.1]

70 Of the two automatic clustering methods, the common-biphone method, which uses binary similarity values, found greater differences between groups in MPSS, MCL, and MXCL; thus, it appears to be more sensitive than the edit-distance method in detecting group differences. [sent-164, score-0.386]

71 Common-biphone measures were also found to better correlate with manual measures; however, both automated methods disagreed with the manual approach to some extent. [sent-165, score-0.209]

72 The fact that the automated common-biphone method shows significant differences between group means, while having less variability in measurements, suggests that it may be a more suitable measure of phonetic clustering than the traditional manual method. [sent-166, score-0.67]

73 These results are particularly important in light of the difference in WCNT means between low- and high-exposure groups being small and not significant (WCNT: 17. [sent-167, score-0.082]

74 Other studies that used manual clustering and switching analyses reported significantly more switches for healthy controls than for individuals with neurological conditions (Troyer et al. [sent-174, score-0.514]

75 These studies also reported differences in the total number of words produced, likely due to investigating already impaired individuals. [sent-176, score-0.054]

76 Our findings show that the low- and high-exposure groups produced similar numbers of words, but the high-exposure group tended to produce longer sequences of phonetically similar words. [sent-177, score-0.331]

77 The latter phenomenon may be interpreted as a mild form of perseverative (stuck-inset/repetitive) behavior that is characteristic of disorders involving damage to frontal and subcortical brain structures. [sent-178, score-0.27]

78 To test this interpretation, we correlated MCL and MXCL, the two measures with greatest differences between low- and high-exposure fighters, with the count of repeated words (RCNT). [sent-179, score-0.09]

79 Clearly, these findings are preliminary and need to be confirmed in larger samples; however, they plainly demonstrate the utility of our fully automated and quantifiable approach to characterizing and measuring clustering behavior on PVF tests. [sent-185, score-0.184]

80 Pending further clinical validation, this system may be used for large-scale screening for subtle signs of certain types of brain damage or degeneration not only in contact-sports athletes, but also in the general population. [sent-186, score-0.355]

81 Chronic traumatic encephalopathy: A potential late effect of sport-related concussive and subconcussive head trauma. [sent-219, score-0.198]

82 Verbal fluency in Huntington’s disease: A longitudinal analysis of phonemic and semantic clustering and switching. [sent-230, score-0.354]

83 Chronic traumatic encephalopathy, suicides and parasuicides in professional American athletes: The role of the forensic pathologist. [sent-264, score-0.339]

84 Clustering strategies on tasks ofverbal fluency in Parkinson’s disease. [sent-281, score-0.126]

85 Verbal fluency in individuals with mild traumatic brain injury. [sent-287, score-0.501]

86 Clustering and switching as two components of verbal fluency: Evidence from younger and older healthy adults. [sent-305, score-0.233]

87 Clustering and switching on verbal fluency: The effects of focal frontal- and temporal-lobe lesions. [sent-311, score-0.209]

88 Clustering and switching on verbal fluency tests in Alzheimer’s and Parkinson’s disease. [sent-316, score-0.334]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fighters', 0.333), ('phonetic', 0.292), ('pvf', 0.256), ('fights', 0.23), ('mxcl', 0.205), ('mcl', 0.181), ('brain', 0.173), ('traumatic', 0.158), ('cmudict', 0.154), ('troyer', 0.154), ('clustering', 0.135), ('professional', 0.13), ('mpss', 0.128), ('fluency', 0.126), ('phonetically', 0.107), ('chains', 0.105), ('commonbiphone', 0.102), ('highexposure', 0.102), ('letter', 0.1), ('pss', 0.098), ('chain', 0.098), ('switching', 0.092), ('raskin', 0.091), ('cbs', 0.084), ('groups', 0.082), ('repetitive', 0.079), ('verbal', 0.078), ('encephalopathy', 0.077), ('martial', 0.077), ('moscovitch', 0.077), ('neurological', 0.077), ('parkinson', 0.077), ('wcnt', 0.077), ('impairment', 0.068), ('chronic', 0.068), ('injury', 0.068), ('trauma', 0.068), ('subtle', 0.067), ('symbols', 0.065), ('pilot', 0.065), ('morris', 0.065), ('signs', 0.063), ('healthy', 0.063), ('athletes', 0.063), ('manual', 0.062), ('participants', 0.057), ('phonemic', 0.056), ('angela', 0.056), ('pronunciations', 0.056), ('differences', 0.054), ('damage', 0.052), ('arpabet', 0.051), ('benton', 0.051), ('biphone', 0.051), ('forensic', 0.051), ('gavett', 0.051), ('huntington', 0.051), ('neuropsychologia', 0.051), ('neuropsychological', 0.051), ('neuropsychology', 0.051), ('omalu', 0.051), ('pfbh', 0.051), ('rcnt', 0.051), ('winocur', 0.051), ('gordon', 0.05), ('automated', 0.049), ('year', 0.049), ('sarah', 0.046), ('determined', 0.046), ('bernick', 0.045), ('disorders', 0.045), ('boxers', 0.045), ('fighter', 0.045), ('pagel', 0.045), ('rabiner', 0.045), ('raman', 0.045), ('umn', 0.045), ('pronunciation', 0.045), ('individuals', 0.044), ('subjected', 0.042), ('artists', 0.042), ('fujii', 0.042), ('pronouncing', 0.042), ('analyses', 0.041), ('edit', 0.04), ('head', 0.04), ('group', 0.04), ('vowel', 0.039), ('effects', 0.039), ('responses', 0.038), ('tests', 0.038), ('variability', 0.038), ('similarity', 0.038), ('patients', 0.037), ('fight', 0.037), ('longitudinal', 0.037), ('sensitive', 0.037), ('sd', 0.037), ('measures', 0.036), ('exposure', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks

Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.

2 0.091027305 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we performs posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsu- pervised accuracy of 89% across the same set of languages.

3 0.074577138 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

Author: Johann-Mattis List ; Steven Moran

Abstract: Given the increasing interest and development of computational and quantitative methods in historical linguistics, it is important that scholars have a basis for documenting, testing, evaluating, and sharing complex workflows. We present a novel open-source toolkit for quantitative tasks in historical linguistics that offers these features. This toolkit also serves as an interface between existing software packages and frequently used data formats, and it provides implementations of new and existing algorithms within a homogeneous framework. We illustrate the toolkit’s functionality with an exemplary workflow that starts with raw language data and ends with automatically calculated phonetic alignments, cognates and borrowings. We then illustrate evaluation metrics on gold standard datasets that are provided with the toolkit.

4 0.073181823 65 acl-2013-BRAINSUP: Brainstorming Support for Creative Sentence Generation

Author: Gozde Ozbal ; Daniele Pighin ; Carlo Strapparava

Abstract: We present BRAINSUP, an extensible framework for the generation of creative sentences in which users are able to force several words to appear in the sentences and to control the generation process across several semantic dimensions, namely emotions, colors, domain relatedness and phonetic properties. We evaluate its performance on a creative sentence generation task, showing its capability of generating well-formed, catchy and effective sentences that have all the good qualities of slogans produced by human copywriters.

5 0.064606018 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering

Author: Manaal Faruqui ; Chris Dyer

Abstract: We present an information theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence as well as cross-lingual evidence from parallel corpora to learn high quality word clusters jointly in any number of languages. The monolingual component of our objective is the average mutual information of clusters of adjacent words in each language, while the bilingual component is the average mutual information of the aligned clusters. To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters.

6 0.062980041 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?

7 0.059308805 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

8 0.056761347 114 acl-2013-Detecting Chronic Critics Based on Sentiment Polarity and Userâ•Žs Behavior in Social Media

9 0.055594742 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity

10 0.055413499 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks

11 0.051047973 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures

12 0.04910817 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

13 0.04797323 63 acl-2013-Automatic detection of deception in child-produced speech using syntactic complexity features

14 0.046585646 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

15 0.046530098 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis

16 0.045464482 128 acl-2013-Does Korean defeat phonotactic word segmentation?

17 0.043405749 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

18 0.042549733 135 acl-2013-English-to-Russian MT evaluation campaign

19 0.040456083 154 acl-2013-Extracting bilingual terminologies from comparable corpora

20 0.040454477 278 acl-2013-Patient Experience in Online Support Forums: Modeling Interpersonal Interactions and Medication Use


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.114), (1, 0.021), (2, 0.02), (3, -0.034), (4, -0.002), (5, -0.038), (6, -0.005), (7, 0.021), (8, -0.003), (9, -0.021), (10, -0.037), (11, -0.022), (12, -0.021), (13, -0.01), (14, -0.064), (15, -0.041), (16, -0.024), (17, 0.017), (18, 0.021), (19, 0.007), (20, -0.018), (21, -0.019), (22, 0.05), (23, -0.045), (24, -0.02), (25, 0.045), (26, 0.016), (27, 0.017), (28, -0.009), (29, -0.021), (30, -0.002), (31, -0.047), (32, -0.04), (33, -0.078), (34, 0.037), (35, 0.062), (36, -0.07), (37, 0.046), (38, -0.125), (39, 0.086), (40, 0.029), (41, 0.064), (42, 0.016), (43, -0.025), (44, -0.033), (45, -0.052), (46, 0.066), (47, 0.013), (48, 0.029), (49, 0.076)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9073 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks

Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.

2 0.7963708 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds

Author: Thomas Mayer ; Christian Rohrdantz

Abstract: This paper describes the online tool PhonMatrix, which analyzes a word list with respect to the co-occurrence of sounds in a specified context within a word. The cooccurrence counts from the user-specified context are statistically analyzed according to a number of association measures that can be selected by the user. The statistical values then serve as the input for a matrix visualization where rows and columns represent the relevant sounds under investigation and the matrix cells indicate whether the respective ordered pair of sounds occurs more or less frequently than expected. The usefulness of the tool is demonstrated with three case studies that deal with vowel harmony and similar place avoidance patterns.

3 0.7029922 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

Author: Johann-Mattis List ; Steven Moran

Abstract: Given the increasing interest and development of computational and quantitative methods in historical linguistics, it is important that scholars have a basis for documenting, testing, evaluating, and sharing complex workflows. We present a novel open-source toolkit for quantitative tasks in historical linguistics that offers these features. This toolkit also serves as an interface between existing software packages and frequently used data formats, and it provides implementations of new and existing algorithms within a homogeneous framework. We illustrate the toolkit’s functionality with an exemplary workflow that starts with raw language data and ends with automatically calculated phonetic alignments, cognates and borrowings. We then illustrate evaluation metrics on gold standard datasets that are provided with the toolkit.

4 0.68583071 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we performs posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsu- pervised accuracy of 89% across the same set of languages.

5 0.64671922 29 acl-2013-A Visual Analytics System for Cluster Exploration

Author: Andreas Lamprecht ; Annette Hautli ; Christian Rohrdantz ; Tina Bogel

Abstract: This paper offers a new way of representing the results of automatic clustering algorithms by employing a Visual Analytics system which maps members of a cluster and their distance to each other onto a two-dimensional space. A case study on Urdu complex predicates shows that the system allows for an appropriate investigation of linguistically motivated data.

6 0.63799691 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures

7 0.57841426 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?

8 0.53376687 65 acl-2013-BRAINSUP: Brainstorming Support for Creative Sentence Generation

9 0.5147416 220 acl-2013-Learning Latent Personas of Film Characters

10 0.50308877 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

11 0.48907012 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts

12 0.48174354 21 acl-2013-A Statistical NLG Framework for Aggregated Planning and Realization

13 0.47843167 192 acl-2013-Improved Lexical Acquisition through DPP-based Verb Clustering

14 0.46454713 37 acl-2013-Adaptive Parser-Centric Text Normalization

15 0.46001863 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

16 0.45823601 278 acl-2013-Patient Experience in Online Support Forums: Modeling Interpersonal Interactions and Medication Use

17 0.4578968 1 acl-2013-"Let Everything Turn Well in Your Wife": Generation of Adult Humor Using Lexical Constraints

18 0.45290023 337 acl-2013-Tag2Blog: Narrative Generation from Satellite Tag Data

19 0.43885952 14 acl-2013-A Novel Classifier Based on Quantum Computation

20 0.43418935 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.058), (6, 0.016), (11, 0.043), (24, 0.025), (26, 0.022), (35, 0.047), (42, 0.043), (48, 0.018), (70, 0.573), (88, 0.014), (90, 0.021), (95, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97844404 384 acl-2013-Visual Features for Linguists: Basic image analysis techniques for multimodally-curious NLPers

Author: Elia Bruni ; Marco Baroni

Abstract: unknown-abstract

same-paper 2 0.93147349 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks

Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.

3 0.92063373 296 acl-2013-Recognizing Identical Events with Graph Kernels

Author: Goran Glavas ; Jan Snajder

Abstract: Identifying news stories that discuss the same real-world events is important for news tracking and retrieval. Most existing approaches rely on the traditional vector space model. We propose an approach for recognizing identical real-world events based on a structured, event-oriented document representation. We structure documents as graphs of event mentions and use graph kernels to measure the similarity between document pairs. Our experiments indicate that the proposed graph-based approach can outperform the traditional vector space model, and is especially suitable for distinguishing between topically similar, yet non-identical events.

4 0.90770972 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

Author: Shay B. Cohen ; Mark Johnson

Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.

5 0.88788742 218 acl-2013-Latent Semantic Tensor Indexing for Community-based Question Answering

Author: Xipeng Qiu ; Le Tian ; Xuanjing Huang

Abstract: Retrieving similar questions is very important in community-based question answering(CQA) . In this paper, we propose a unified question retrieval model based on latent semantic indexing with tensor analysis, which can capture word associations among different parts of CQA triples simultaneously. Thus, our method can reduce lexical chasm of question retrieval with the help of the information of question content and answer parts. The experimental result shows that our method outperforms the traditional methods.

6 0.88555729 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

7 0.86484253 220 acl-2013-Learning Latent Personas of Film Characters

8 0.79737186 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia

9 0.64391255 153 acl-2013-Extracting Events with Informal Temporal References in Personal Histories in Online Communities

10 0.60486412 249 acl-2013-Models of Semantic Representation with Visual Attributes

11 0.59639233 380 acl-2013-VSEM: An open library for visual semantics representation

12 0.58479553 329 acl-2013-Statistical Machine Translation Improves Question Retrieval in Community Question Answering via Matrix Factorization

13 0.55645716 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars

14 0.54510814 80 acl-2013-Chinese Parsing Exploiting Characters

15 0.54285401 167 acl-2013-Generalizing Image Captions for Image-Text Parallel Corpus

16 0.51721078 168 acl-2013-Generating Recommendation Dialogs by Extracting Information from User Reviews

17 0.51537955 339 acl-2013-Temporal Signals Help Label Temporal Relations

18 0.51342523 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

19 0.51308787 180 acl-2013-Handling Ambiguities of Bilingual Predicate-Argument Structures for Statistical Machine Translation

20 0.50577211 165 acl-2013-General binarization for parsing and translation