acl acl2013 acl2013-279 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas Mayer ; Christian Rohrdantz
Abstract: This paper describes the online tool PhonMatrix, which analyzes a word list with respect to the co-occurrence of sounds in a specified context within a word. The cooccurrence counts from the user-specified context are statistically analyzed according to a number of association measures that can be selected by the user. The statistical values then serve as the input for a matrix visualization where rows and columns represent the relevant sounds under investigation and the matrix cells indicate whether the respective ordered pair of sounds occurs more or less frequently than expected. The usefulness of the tool is demonstrated with three case studies that deal with vowel harmony and similar place avoidance patterns.
Reference: text
sentIndex sentText sentNum sentScore
1 PhonMatrix: Visualizing co-occurrence constraints of sounds Thomas Mayer Research Unit Quantitative Language Comparison Philipps University of Marburg [sent-1, score-0.177]
2 Abstract This paper describes the online tool PhonMatrix, which analyzes a word list with respect to the co-occurrence of sounds in a specified context within a word. [sent-3, score-0.248]
3 The cooccurrence counts from the user-specified context are statistically analyzed according to a number of association measures that can be selected by the user. [sent-4, score-0.098]
4 The statistical values then serve as the input for a matrix visualization where rows and columns represent the relevant sounds under investigation and the matrix cells indicate whether the respective ordered pair of sounds occurs more or less frequently than expected. [sent-5, score-0.935]
5 The usefulness of the tool is demonstrated with three case studies that deal with vowel harmony and similar place avoidance patterns. [sent-6, score-1.017]
6 1 Introduction In this paper, we introduce the PhonMatrix tool, which is designed to visualize co-occurrence constraints of sounds within words given a reasonably sized word list of the language. [sent-7, score-0.196]
7 It is a web-based implementation of the visualization method proposed in (Mayer et al. [sent-8, score-0.268]
8 , 2010a), including some further development such as an interactive component and a range of association measures and sorting methods to choose from. [sent-9, score-0.112]
9 The original motivation for this tool is to give linguists the opportunity to upload their own word lists in order to visually explore co-occurrence constraints in languages. [sent-10, score-0.125]
10 The basic idea behind the visual component of the tool is to provide for a first, at-a-glance mode of analysis which can be used to generate hypotheses about the data by simply looking at the visualization matrices. [sent-11, score-0.388]
11 One of the most well-known and wide-spread constraints is commonly referred to as vowel harmony (van der Hulst and van de Weijer, 1995). [sent-16, score-0.978]
12 In vowel harmony languages, vowels are separated into groups where vowels of the same group tend to co-occur within words, while vowels from different groups rarely co-occur. [sent-17, score-2.187]
13 Likewise, in some languages there are patterns of consonant harmony (Hansson, 2010) that show a similar behavior with respect to consonants. [sent-18, score-0.668]
14 251) where both vowels and consonants form such groups and words usually only contain sounds from the same group (e. [sent-20, score-0.774]
15 Whereas vowel harmony patterns are easily detectable in many harmonic languages due to the harmonic alternants in affixes, other co-occurrence constraints are less obvious. [sent-23, score-1.145]
16 In our view, there are many more phonotactic constraints that wait to be discovered by linguists. [sent-28, score-0.096]
17 The PhonMatrix tool is part of an ongoing effort to integrate methods and techniques from the field of visual analytics (Thomas and Cook, 2005) into linguistic research. [sent-33, score-0.102]
18 2 Related work A related tool that quantifies the co-occurrence of sounds in a given corpus is the Vowel Harmony Calculator (Harrison et al. [sent-35, score-0.213]
19 The Vowel Harmony Calculator quantifies the notion of vowel harmony for the input corpus by giving the percentage of harmonic words and the harmony threshold. [sent-38, score-1.512]
20 The harmony threshold is the percentage of words that would be expected to be harmonic purely by chance. [sent-39, score-0.644]
21 The output of the Vowel Harmony Calculator consists of a list of values (number of polysyllabic words, harmony threshold, percentage of harmonic words, harmony index, among other things) but does not give any information about the harmonic strength of individual vowel pairs. [sent-40, score-1.689]
22 In short, the Vowel Harmony Calculator is a way to quantify the notion of harmony given the harmony classes of the language whereas PhonMatrix is intended to help detect such patterns. [sent-41, score-1.176]
23 3 System overview PhonMatrix is a web-based visualization tool that statistically analyzes sound co-occurrences within words and displays the result in a symmetric sound matrix. [sent-42, score-0.416]
24 The statistical components are written in Python whereas the visualization part is in Javascript, using the D3 library (Bostock et al. [sent-43, score-0.299]
25 In the first step, the user has to upload the text file containing the word list that serves as the input to the analysis process. [sent-46, score-0.167]
26 After the file has been uploaded to the server all symbols in the word list are analyzed according to their unigram and bigram frequencies. [sent-51, score-0.168]
27 These frequencies are used to infer an automatic distinction between vowels, consonants and infrequent symbols. [sent-52, score-0.273]
28 Infrequent symbols are considered to be noise in the data and can be ignored for further processing. [sent-53, score-0.109]
29 A distinction between vowels and consonants is automatically inferred from the word list by means of Sukhotin’s algorithm (Sukhotin, 1962). [sent-54, score-0.678]
30 The results of Sukhotin’s algorithm are presented to the user together with the frequency counts of the individual symbols in the word list. [sent-55, score-0.181]
31 In the third step, the user can make changes to the automatic classification of symbols into vowels and consonants and exclude infrequent symbols from further consideration. [sent-56, score-0.889]
32 The subsequent calculations of co-occurrence values are mostly based on the distinction of input symbols into consonants (C) and vowels (V). [sent-57, score-0.83]
33 Depending on the user’s choice, the co-occurrences in the selected context are calculated and analyzed with respect to a number of statistical association measures from which the user can choose one for the visualization. [sent-60, score-0.125]
34 In the last step, the results of the statistical analysis of the co-occurrence counts are displayed in a quadratic matrix of sounds. [sent-61, score-0.155]
35 The rows and columns of the matrix represent the individual sounds that are relevant for the selected context (e. [sent-62, score-0.37]
36 The rows thereby stand for the first members of the relevant sound pairs, whereas the columns contain the second members. [sent-65, score-0.188]
37 Each cell of the matrix then shows the result for the pair of sounds in the respective row and column. [sent-66, score-0.271]
38 The final result is a visualization of the cooccurrence matrix with rows and columns sorted according to the similarity of the sound vectors and statistical values represented as colors in the matrix cells. [sent-67, score-0.624]
39 The visualization features a number of interactive components that facilitate the detection of potential patterns in the results by the user. [sent-68, score-0.268]
40 Figure 1: The processing pipeline of the PhonMatrix visualization tool. [sent-71, score-0.268]
41 Footnote 2: For information on the minimum amount of data necessary see (Mayer et al. [sent-72, score-0.081]
42 In what follows, we will describe each component in more detail, with special emphasis on the visualization component. [sent-74, score-0.294]
43 1 Vowel-consonant distinction Most of the co-occurrence restrictions that might be of interest make reference to a distinction between vowels and consonants. [sent-76, score-0.517]
44 Since a manual classification of all sounds in the input into vowels and consonants is a tedious task (especially with a larger number of symbols), the first component deals with an automatic inference of such a distinction. [sent-77, score-0.796]
45 Many methods have been discussed in the literature on how to discriminate vowels from consonants on the basis of their distribution in texts. [sent-78, score-0.606]
46 The basic idea of Sukhotin’s algorithm is that vowels and consonants have the tendency not to occur in groups within words but to alternate. [sent-81, score-0.649]
47 Based on the additional assumption that the most frequent symbol in the text is a vowel, the algorithm iteratively selects the symbol which occurs most frequently adjacent to a vowel and determines it to be a consonant. [sent-82, score-0.391]
48 The algorithm stops if no more consonants can be selected because no co-occurrence counts with any remaining vowel are positive. [sent-83, score-0.568]
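The alternation idea behind Sukhotin's algorithm can be sketched in a few lines of Python. This is a minimal sketch of the textbook formulation (symmetric adjacency counts with a zero diagonal, repeated selection of the symbol with the largest remaining sum), which may differ in detail from the wording above and from the PhonMatrix implementation. The function name `sukhotin` and the tie-breaking rule are assumptions made for illustration.

```python
from collections import defaultdict


def sukhotin(words):
    """Return the set of symbols classified as vowels (textbook Sukhotin)."""
    # Symmetric adjacency counts between different symbols within words.
    adj = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for word in words:
        symbols.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:  # the diagonal of the matrix is kept at zero
                adj[a][b] += 1
                adj[b][a] += 1

    # Every symbol starts out as a consonant with its row sum.
    sums = {s: sum(adj[s].values()) for s in symbols}
    vowels = set()
    while True:
        candidates = [s for s in symbols if s not in vowels]
        if not candidates:
            break
        # Assumption: ties are broken by symbol order to stay deterministic.
        best = max(candidates, key=lambda s: (sums[s], s))
        if sums[best] <= 0:
            break  # no remaining symbol has a positive sum
        vowels.add(best)
        # Symbols adjacent to the new vowel become less vowel-like.
        for s in candidates:
            if s != best:
                sums[s] -= 2 * adj[s][best]
    return vowels
```

For a toy list such as `["banana", "patata", "samba"]` the sketch classifies only `a` as a vowel; real word lists need far more data for a reliable split.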
49 2 Co-occurrence statistics With the distinction of symbols into vowels and consonants at hand, the user can then select a relevant context for the co-occurrence counts. [sent-88, score-0.852]
50 Here we will illustrate the statistical analysis with the context of VCV sequences to investigate vowel harmony in Turkish. [sent-90, score-0.971]
51 4 The tool automatically extracts all VCV sequences in the words and counts the co-occurrences of sounds in these sequences. [sent-92, score-0.27]
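The extraction step can be illustrated with a short Python sketch. It assumes that every symbol not in the vowel set counts as a consonant and that overlapping VCV windows are all counted; the function name `vcv_counts` is hypothetical and not taken from the tool.

```python
from collections import Counter


def vcv_counts(words, vowels):
    """Count ordered vowel pairs (v1, v2) found in V-C-V sequences."""
    counts = Counter()
    for word in words:
        # Slide a window of three symbols over the word.
        for first, middle, last in zip(word, word[1:], word[2:]):
            if first in vowels and middle not in vowels and last in vowels:
                counts[(first, last)] += 1
    return counts
```

The resulting counter maps each ordered vowel pair to its frequency and can serve as the input for the association measures discussed next.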
52 The φ value is a normalized χ2 measure which allows for an easier mapping of values to the color scale because it is always between −1 and 1. [sent-95, score-0.083]
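As an illustration, the φ coefficient for one ordered pair can be computed from a 2×2 contingency table with cells a (x followed by y), b (x followed by another sound), c (another sound followed by y) and d (neither), as φ = (ad − bc) / sqrt((a+b)(c+d)(a+c)(b+d)). The Python sketch below assumes the counts are stored as a mapping from ordered pairs to frequencies; the function name `phi` and the choice to return 0.0 for a degenerate table are assumptions.

```python
import math


def phi(counts, x, y):
    """Phi coefficient for the ordered pair (x, y), always within [-1, 1]."""
    n = sum(counts.values())
    a = counts.get((x, y), 0)
    row = sum(v for (p, _), v in counts.items() if p == x)  # x as first member
    col = sum(v for (_, q), v in counts.items() if q == y)  # y as second member
    b = row - a
    c = col - a
    d = n - a - b - c
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    # Assumption: a degenerate table (zero margin) is reported as 0.0.
    return (a * d - b * c) / denom if denom else 0.0
```

Positive values indicate that the pair occurs more often than expected under independence, negative values that it occurs less often.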
53 Apart from probability and φ values, the user can also choose among a number of other association measures such as pointwise mutual information, likelihood ratios or t-scores (Manning and Schütze, 1999). [sent-97, score-0.089]
54 3 Visualization component The input to the visualization component is a matrix of association measures for each sound pair in the relevant context. [sent-99, score-0.529]
55 Two additional steps have to be performed in order to arrive at the final matrix visualization: 1) the rows and columns of the matrix have to be sorted in a meaningful way; 2) the association measures have to be mapped to visual variables. [sent-100, score-0.336]
56 For the matrix arrangement, we decided to have the same order of symbols for the rows and columns. [sent-101, score-0.253]
57 The order of symbols is determined by a clustering of the symbols based on the similarity of their row values. [sent-102, score-0.109]
58 Table 1: φ values of VCV sequences in Turkish. [sent-168, score-0.09]
59 Footnote 4: Turkish orthography represents the modern pronunciation with a high degree of accuracy. [sent-169, score-0.128]
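One way to derive such an ordering is sketched below in Python. The text does not name the clustering method, so the average-linkage agglomerative procedure with Euclidean distance used here is an assumption, as is the function name `cluster_order`; the sketch only shows how a clustering of row vectors can yield a shared row and column order.

```python
import math
from itertools import combinations


def cluster_order(labels, matrix):
    """Order labels so that symbols with similar rows end up next to each other."""
    rows = dict(zip(labels, matrix))

    def linkage(c1, c2):
        # Average Euclidean distance between the row vectors of two clusters.
        total = sum(math.dist(rows[p], rows[q]) for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    clusters = [[label] for label in labels]
    while len(clusters) > 1:
        # Merge the two closest clusters and keep their members adjacent.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0] if clusters else []
```

The returned list can be applied to both rows and columns, so that blocks of mutually similar sounds appear along the main diagonal.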
60 Whereas the preprocessing steps and the data-driven sorting of rows and columns have been written in Python, the actual visualization of the results in the browser is implemented in Javascript using the D3 library (Bostock et al. [sent-174, score-0.378]
61 The association measures and the order of the symbols are referenced as Javascript variables in the visualization document. [sent-176, score-0.407]
62 The mapping from association measures to color values is made with the linear scale method from the d3 . [sent-178, score-0.113]
63 The input domain for the φ values is the interval [−1; 1], while the output range can be given as a color scale ranging from one color to the other. [sent-181, score-0.104]
64 In order to reserve a larger color range for the densely populated area of low values we did not linearly map the numerical association measures but used the square roots of the numerical values as the input for the scale function. [sent-183, score-0.218]
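The effect of the square-root step can be sketched in Python, independently of D3. The sketch assumes a sign-preserving square root for negative φ values and a hypothetical palette (white at 0, blue for positive, red for negative); neither the palette nor the function names come from the tool.

```python
import math

WHITE, RED, BLUE = (255, 255, 255), (255, 0, 0), (0, 0, 255)


def signed_sqrt(value):
    """Square root of the magnitude, keeping the sign of the input."""
    return math.copysign(math.sqrt(abs(value)), value)


def lerp_color(start, end, t):
    """Linear interpolation between two RGB colors for t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))


def phi_to_rgb(phi):
    """Map a phi value in [-1, 1] to a color after square-root scaling."""
    phi = max(-1.0, min(1.0, phi))  # clamp to the input domain
    s = signed_sqrt(phi)
    target = BLUE if s >= 0 else RED
    return lerp_color(WHITE, target, abs(s))
```

Because the square root stretches values close to zero, a small value such as 0.04 is already mapped to a fifth of the way along the color range, which makes weak associations easier to tell apart.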
65 The result of the matrix visualization for the φ values of the vowels in Turkish VCV sequences is shown in Section 5. [sent-185, score-0.501]
66 The matrix visualization also features some interaction to explore the results in more detail. [sent-188, score-0.36]
67 On mouse-over, the respective matrix cell shows the actual values that serve as the input for the data mapping process. [sent-189, score-0.171]
68 Additionally, the row and column labels are highlighted in order to show more clearly which pair of symbols is currently selected (see Figure 2). [sent-190, score-0.128]
69 The size of the matrix can also be adjusted to the user’s needs with the help of a slider above the matrix. [sent-191, score-0.12]
70 Next to the slider is a dropdown menu from which users can choose the association measure that they want to be displayed in the visualization. [sent-192, score-0.078]
71 5 Case studies After the description of the PhonMatrix system we will illustrate the usefulness of the visualization of co-occurrence patterns in sounds with three case studies. [sent-193, score-0.478]
72 They are presented as a proof of concept that the visualization component allows for an at-a-glance exploration of potential patterns. [sent-194, score-0.312]
73 The visualization part is thereby not considered to be a replacement of more detailed linguistic investigations but rather serves as a way to explore a multitude of different contexts and data in a comparatively short period of time. [sent-195, score-0.288]
74 After a suspicious pattern has been detected it is indispensable to look at the actual data to see whether the visualization result is an artifact of the method or data at hand or whether the detected pattern is an interesting phonotactic feature of the language under consideration. [sent-196, score-0.374]
75 1 Turkish vowel harmony The first case study shows the results of the VCV sequences in Turkish described above. [sent-198, score-0.953]
76 For this purpose the vowels a, e, i, o, u, ö, ü, ı are selected as the relevant sounds that are to be compared. Figure 2: The visualization of the φ values of VCV sequences in the Turkish text. [sent-199, score-0.938]
77 Figure 2 shows the results for the φ values that have been computed from the co-occurrence counts of the symbols in VCV sequences. [sent-201, score-0.182]
78 The arrangement of the symbols in the matrix rows and [sent-202, score-0.161]
79 columns already shows a distinction between front (the first four vowels) and back (the last four vowels) vowels, reflecting the palatal harmony in Turkish. [sent-203, score-0.623]
80 This distinction can best be seen when looking at the e- and a-columns where the top four vowels all have positive φ values for e and negative φ values for a, whereas the bottom four vowels show the opposite behavior. [sent-204, score-1.005]
81 On closer inspection, the labial harmony for high vowels can also be seen in the matrix visualization. [sent-205, score-1.1]
82 From top to bottom there are always pairs of vowels that take the same harmonic vowel, starting with (ö, ü) taking ü and followed by (e, i) taking i, (o, u) taking u and finally (a, ı) taking ı. [sent-206, score-0.492]
83 The usefulness of the visualization component to detect such patterns can best be seen when comparing Figure 2 with Table 1, which contains the same information. [sent-207, score-0.38]
84 2 Finnish vowel harmony The second case study shows that the harmonic patterns can also be detected in orthographic words of the Finnish Bible text. [sent-209, score-1.077]
85 Finnish differs from Turkish in having only one type of harmony (palatal harmony) and neutral vowels, i.e. [sent-210, score-0.585]
86 , vowels that do not (directly) participate in the harmony process. [sent-212, score-0.974]
87 As a different underlying association measure for the visualization consider the probability values in Figure 3. [sent-213, score-0.309]
88 For probability values (Figure 3: The visualization of the probabilities of VCV sequences in the Finnish text.) [sent-214, score-0.358]
89 The probability matrix clearly shows the relevant blocks of vowels that mark the harmony groups. [sent-217, score-1.118]
90 The clustering algorithm separates the back vowels (first three vowels o, a, u) from the front vowels (vowels four to six, ö, y, ä) and the neutral vowels (e, i). [sent-218, score-1.692]
91 The blocks along the main diagonal of the matrix show the harmonic pattern among the harmony groups, whereas the neutral vowels do not display any regular behavior. [sent-219, score-1.226]
92 3 Maltese verbal roots PhonMatrix is not only useful to find vowel harmony patterns. [sent-221, score-0.967]
93 To illustrate this, we show the visualization of CC patterns in a comprehensive list of Maltese verbal roots (Spagnol, 2011). [sent-223, score-0.395]
94 The consonant matrix in Figure 4 shows two clusters, with one cluster (the first twelve consonants in the top row) containing labial and dorsal and the other cluster (the last eleven consonants) comprising only coronal consonants. [sent-224, score-0.419]
95 The visualization also reveals that, unlike in vowel harmony, consonants from the same cluster do not occur next to each other in the CC sequences, as shown by the red blocks in the top left and bottom right. [sent-225, score-0.849]
96 Footnote 7: The +/− signs in the matrix are taken from the φ values. [sent-227, score-0.092]
97 Figure 4: The visualization of the φ values of consonant sequences in Maltese verbal roots. Row and column order: q ħ j g għ h m b w f k p n l r ġ d ż z x t s ċ. [sent-230, score-0.438]
98 6 Conclusions In this paper, we have presented PhonMatrix, a web-based, interactive visualization tool for investigating co-occurrence restrictions of sounds within words. [sent-231, score-0.475]
99 The case studies of vowel harmony and SPA have shown that interesting patterns in the data can easily be seen only by looking at the matrix visualizations. [sent-232, score-1.058]
100 Consonant co-occurrence in stems across languages: Automatic analysis and visualization of a phonotactic constraint. [sent-288, score-0.33]
wordName wordTfidf (topN-words)
[('harmony', 0.563), ('vowels', 0.411), ('vowel', 0.341), ('visualization', 0.268), ('phonmatrix', 0.239), ('consonants', 0.195), ('vcv', 0.154), ('sounds', 0.143), ('sukhotin', 0.119), ('symbols', 0.109), ('matrix', 0.092), ('mayer', 0.09), ('calculator', 0.085), ('harmonic', 0.081), ('spa', 0.065), ('phonotactic', 0.062), ('consonant', 0.06), ('turkish', 0.058), ('distinction', 0.053), ('rows', 0.052), ('bostock', 0.051), ('sequences', 0.049), ('tool', 0.046), ('patterns', 0.045), ('rohrdantz', 0.045), ('upload', 0.045), ('maltese', 0.045), ('avoidance', 0.045), ('roots', 0.043), ('finnish', 0.042), ('color', 0.042), ('goldsmith', 0.042), ('values', 0.041), ('sound', 0.04), ('user', 0.04), ('javascript', 0.039), ('columns', 0.039), ('hulst', 0.034), ('labial', 0.034), ('palatal', 0.034), ('pozdniakov', 0.034), ('constraints', 0.034), ('counts', 0.032), ('visual', 0.031), ('whereas', 0.031), ('displayed', 0.031), ('measures', 0.03), ('frans', 0.03), ('cryptologia', 0.03), ('python', 0.029), ('borg', 0.028), ('konstanz', 0.028), ('slider', 0.028), ('harrison', 0.028), ('thomas', 0.027), ('relevant', 0.026), ('plank', 0.026), ('blocks', 0.026), ('front', 0.026), ('component', 0.026), ('symbol', 0.025), ('orthographic', 0.025), ('cc', 0.025), ('analytics', 0.025), ('groups', 0.025), ('infrequent', 0.025), ('butt', 0.024), ('quantifies', 0.024), ('semitic', 0.024), ('guy', 0.022), ('christian', 0.022), ('file', 0.022), ('detected', 0.022), ('neutral', 0.022), ('usefulness', 0.022), ('analyzes', 0.022), ('phonological', 0.022), ('contingency', 0.021), ('der', 0.021), ('input', 0.021), ('miriam', 0.02), ('serves', 0.02), ('verbal', 0.02), ('row', 0.019), ('church', 0.019), ('van', 0.019), ('choose', 0.019), ('visualizing', 0.019), ('list', 0.019), ('cluster', 0.019), ('detect', 0.019), ('sorting', 0.019), ('context', 0.018), ('potential', 0.018), ('tendency', 0.018), ('interactive', 0.018), ('analyzed', 0.018), ('cells', 0.018), ('cell', 0.017), ('looking', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds
Author: Thomas Mayer ; Christian Rohrdantz
Abstract: This paper describes the online tool PhonMatrix, which analyzes a word list with respect to the co-occurrence of sounds in a specified context within a word. The cooccurrence counts from the user-specified context are statistically analyzed according to a number of association measures that can be selected by the user. The statistical values then serve as the input for a matrix visualization where rows and columns represent the relevant sounds under investigation and the matrix cells indicate whether the respective ordered pair of sounds occurs more or less frequently than expected. The usefulness of the tool is demonstrated with three case studies that deal with vowel harmony and similar place avoidance patterns.
2 0.25007063 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsupervised accuracy of 89% across the same set of languages.
3 0.20786881 29 acl-2013-A Visual Analytics System for Cluster Exploration
Author: Andreas Lamprecht ; Annette Hautli ; Christian Rohrdantz ; Tina Bogel
Abstract: This paper offers a new way of representing the results of automatic clustering algorithms by employing a Visual Analytics system which maps members of a cluster and their distance to each other onto a two-dimensional space. A case study on Urdu complex predicates shows that the system allows for an appropriate investigation of linguistically motivated data. 1 Motivation In recent years, Visual Analytics systems have increasingly been used for the investigation of linguistic phenomena in a number of different areas, ranging from literary analysis (Keim and Oelke, 2007) to the cross-linguistic comparison of language features (Mayer et al., 2010a; Mayer et al., 2010b; Rohrdantz et al., 2012a) and lexical semantic change (Rohrdantz et al., 2011; Heylen et al., 2012; Rohrdantz et al., 2012b). Visualization has also found its way into the field of computational linguistics by providing insights into methods such as machine translation (Collins et al., 2007; Albrecht et al., 2009) or discourse parsing (Zhao et al., 2012). One issue in computational linguistics is the interpretability of results coming from machine learning algorithms and the lack of insight they offer on the underlying data. This drawback often prevents theoretical linguists, who work with computational models and need to see patterns on large data sets, from drawing detailed conclusions. The present paper shows that a Visual Analytics system facilitates “analytical reasoning [...] by an interactive visual interface” (Thomas and Cook, 2006) and helps resolve this issue by offering a customizable, in-depth view on the statistically generated result and simultaneously an at-a-glance overview of the overall data set. In particular, we focus on the visual representation of automatically generated clusters, in itself not a novel idea as it has been applied in other fields like the financial sector, biology or geography (Schreck et al., 2009). 
But as far as the literature is concerned, interactive systems are still less common, particularly in computational linguistics, and they have not been designed for the specific needs of theoretical linguists. This paper offers a method of visually encoding clusters and their internal coherence with an interactive user interface, which allows users to adjust underlying parameters and their views on the data depending on the particular research question. By this, we partly open up the “black box” of machine learning. The linguistic phenomenon under investigation, for which the system has originally been designed, is the varied behavior of nouns in N+V CP complex predicates in Urdu (e.g., memory+do = ‘to remember’) (Mohanan, 1994; Ahmed and Butt, 2011), where, depending on the lexical semantics of the noun, a set of different light verbs is chosen to form a complex predicate. The aim is an automatic detection of the different groups of nouns, based on their light verb distribution. Butt et al. (2012) present a static visualization for the phenomenon, whereas the present paper proposes an interactive system which alleviates some of the previous issues with respect to noise detection, filtering, data interaction and cluster coherence. For this, we proceed as follows: section 2 explains the proposed Visual Analytics system, followed by the linguistic case study in section 3. Section 4 concludes the paper. 
2 The system The system requires a plain text file as input, where each line corresponds to one data object. In our case, each line corresponds to one Urdu noun (data object) and contains its unique ID (the name of the noun) and its bigram frequencies with the four light verbs under investigation, namely kar ‘do’, ho ‘be’, hu ‘become’ and rakH ‘put’; an exemplary input file is shown in Figure 1. (Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 109–114, Sofia, Bulgaria, August 4-9 2013.) From a data analysis perspective, we have four-dimensional data objects, where each dimension corresponds to a bigram frequency previously extracted from a corpus. Note that more than four dimensions can be loaded and analyzed, but for the sake of simplicity we focus on the four-dimensional Urdu example for the remainder of this paper. Moreover, it is possible to load files containing absolute bigram frequencies and relative frequencies. When loading absolute frequencies, the program will automatically calculate the relative frequencies as they are the input for the clustering. The absolute frequencies, however, are still available and can be used for further processing (e.g. filtering). Figure 1: preview of appropriate file structures 2.1 Initial opening and processing of a file It is necessary to define a metric distance function between data objects for both clustering and visualization. Thus, each data object is represented through a high-dimensional (in our example four-dimensional) numerical vector and we use the Euclidean distance to calculate the distances between pairs of data objects. The smaller the distance between two data objects, the more similar they are. For visualization, the high-dimensional data is projected onto the two-dimensional space of a computer screen using a principal component analysis (PCA) algorithm (see footnote 1). 
In the 2D projection, the distances between data objects in the high-dimensional space, i.e. the dissimilarities of the bigram distributions, are preserved as accurately as possible. However, when projecting a high-dimensional data space onto a lower dimension, some distinctions necessarily level out: two data objects may be far apart in the high-dimensional space, but end up closely together in the 2D projection. It is important to bear in mind that the 2D visualization is often quite insightful, but interpretations have to be verified by interactively investigating the data. (Footnote 1: http://workshop.mkobos.com/2011/java-pca-transformation-library/) The initial clusters are calculated (in the high-dimensional data space) using a default k-Means algorithm (see footnote 2) with k being a user-defined parameter. There is also the option of selecting another clustering algorithm, called the Greedy Variance Minimization (GVM, see footnote 3), and an extension to include further algorithms is under development. 2.2 Configuration & Interaction 2.2.1 The main window The main window in Figure 2 consists of three areas, namely the configuration area (a), the visualization area (b) and the description area (c). The visualization area is mainly built with the piccolo2d library (see footnote 4) and initially shows data objects as colored circles with a variable diameter, where color indicates cluster membership (four clusters in this example). Hovering over a dot displays information on the particular noun, the cluster membership and the light verb distribution in the description area to the right. By using the mouse wheel, the user can zoom in and out of the visualization. A very important feature for the task at hand is the possibility to select multiple data objects for further processing or for filtering, with a list of selected data objects shown in the description area. By right-clicking on these data objects, the user can assign a unique class (and class color) to them. 
Different clustering methods can be employed using the options item in the menu bar. Another feature of the system is that the user can fade in the cluster centroids (illustrated by a larger dot in the respective cluster color in Figure 2), where the overall feature distribution of the cluster can be examined in a tooltip hovering over the corresponding centroid. 2.2.2 Visually representing data objects To gain further insight into the data distribution based on the 2D projection, the user can choose between several ways to visualize the individual data objects, all of which are shown in Figure 3. The standard visualization type is shown on the left and consists of a circle which encodes cluster membership via color. Footnote 2: http://java-ml.sourceforge.net/api/0.1.7/ (From the JML library) Footnote 3: http://www.tomgibara.com/clustering/fast-spatial/ Footnote 4: http://www.piccolo2d.org/ Figure 2: Overview of the main window of the system, including the configuration area (a), the visualization area (b) and the description area (c). Large circles are cluster centroids. Figure 3: Different visualizations of data points Alternatively, normal glyphs and star glyphs can be displayed. The middle part of Figure 3 shows the data displayed with normal glyphs. […] clockwise around the center according to their occurrence in the input file. This view has the advantage that overall feature dominance in a cluster can be seen at-a-glance. The visualization type on the right in Figure 3 […] endings are connected, forming a “star”. As in the representation with the glyphs, this makes similar data objects easily recognizable and comparable with each other. 2.2.3 Filtering options Our system offers options for filtering data according to different criteria. 
Filter by means of bigram occurrence By activating the bigram occurrence filtering, it is possible to show only those nouns that occur in bigrams with a certain selected subset of all features (light verbs). This is especially useful when examining possible commonalities. Filter selected words Another way of showing only items of interest is to select and display them separately. The PCA is recalculated for these data objects and the visualization is stretched to the whole area. Filter selected cluster Additionally, the user can visualize a specific cluster of interest. Again, the PCA is recalculated and the visualization stretched to the whole area. The cluster can then be manually fine-tuned and cleaned, for instance by removing wrongly assigned items. 2.2.4 Options to handle overplotting Due to the nature of the data, much overplotting occurs. For example, there are many words that occur with only one light verb. The PCA assigns the same position to these words and, as a consequence, only the top bigram can be viewed in the visualization. In order to improve visual access to overplotted data objects, several methods that allow for a more differentiated view of the data have been included and are described in the following paragraphs. Change transparency of data objects By modifying the transparency with the given slider, areas with a dense data population can be readily identified, as shown in the following example: Repositioning of data objects To reduce the overplotting in densely populated areas, data objects can be repositioned randomly having a fixed deviation from their initial position. The degree of deviation can be interactively determined by the user employing the corresponding slider: The user has the option to reposition either all data objects or only those that are selected in advance. Frequency filtering If the initial data contains absolute bigram frequencies, the user can filter the visualized words by frequency. 
For example, many nouns occur only once and therefore have an observed probability of 100% for co-occurring with one of the light verbs. In most cases it is useful to filter such data out.

Scaling data objects. If the user zooms beyond the maximum zoom factor, the data objects are scaled down. This is especially useful if data objects are partly covered by many other objects. In this case, they become fully visible.

2.3 Alternative views on the data

In order to enable a holistic analysis, it is often valuable to provide the user with different views on the data. Consequently, we have integrated the option to explore the data with further standard visualization methods.

2.3.1 Correlation matrix

The correlation matrix in Figure 4 shows the correlations between features, which are visualized by circles using the following encoding: the size of a circle represents the correlation strength and the color indicates whether the corresponding features are negatively (white) or positively (black) correlated.

Figure 4: Example of a correlation matrix.

2.3.2 Parallel coordinates

The parallel coordinates diagram shows the distribution of the bigram frequencies over the different dimensions (Figure 5). Every noun is represented by a line which, when hovered over, shows a tooltip with the most important information. To filter the visualized words, the user can either display previously selected data objects or restrict the value range for a feature and show only the items which lie within this range.

Figure 5: Parallel coordinates diagram.

2.3.3 Scatter plot matrix

To further examine the relation between pairs of features, a scatter plot matrix can be used (Figure 6). The individual scatter plots give further insight into the correlation details of pairs of features.

Figure 6: Example showing a scatter plot matrix.
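The circle encoding described for the correlation matrix in Section 2.3.1 can be illustrated with a minimal sketch. It assumes Pearson correlation between feature columns, which the text does not specify, and the names `pearson`, `correlation_glyphs` and `max_radius` are hypothetical. Treating a correlation of exactly zero as "black" and a constant feature as uncorrelated are likewise conventions chosen only for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (assumed measure)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention for constant features, chosen for this sketch
    return cov / math.sqrt(var_x * var_y)


def correlation_glyphs(rows, max_radius=10.0):
    """Map every feature pair to a (radius, color) circle description.

    rows: one feature vector per data object (e.g. per noun).
    Radius grows with the correlation strength |r|; color is "white"
    for negative and "black" for non-negative correlation.
    """
    columns = list(zip(*rows))  # one tuple of values per feature
    glyphs = []
    for a in columns:
        glyph_row = []
        for b in columns:
            r = pearson(a, b)
            color = "black" if r >= 0 else "white"
            glyph_row.append((abs(r) * max_radius, color))
        glyphs.append(glyph_row)
    return glyphs


# Hypothetical toy data: three nouns, two light-verb counts each.
toy_rows = [[9, 1], [5, 5], [1, 9]]
toy_glyphs = correlation_glyphs(toy_rows)
```

In this toy data the two features are perfectly anti-correlated, so the off-diagonal circles come out white at full radius, while the diagonal circles are black.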
3 Case study

In principle, the Visual Analytics system presented above can be used for any kind of cluster visualization, but the built-in options and add-ons are particularly designed for the type of work that linguists tend to be interested in: on the one hand, the user wants to get a quick overview of the overall patterns in the phenomenon, but at the same time, the system needs to allow for an in-depth data inspection. The system provides both: the overall cluster result shown in Figure 2 depicts the coherence of clusters and therefore the overall pattern of the data set. The different glyph visualizations in Figure 3 illustrate the properties of each cluster. Single data points can be inspected in the description area. The randomization of overplotted data points helps to see concentrated cluster patterns where light verbs behave very similarly in different noun+verb complex predicates.

The biggest advantage of the system lies in its capacity for interaction. Figure 7 shows an example of the visualization used in Butt et al. (2012), the input being the same text file as shown in Figure 1. In that visualization, the relative frequency of each noun with each light verb is encoded by color saturation: the more saturated the color to the right of the noun, the higher the relative frequency of the light verb occurring with it. The number of the cluster (here, 3) and the respective nouns (e.g. kAm ‘work’) are shown to the left. The user does not get information on the coherence of the cluster, nor does the visualization show prototypical cluster patterns.

Figure 7: Cluster visualization in Butt et al. (2012).

Moreover, the system in Figure 7 only has a limited set of interaction choices, with the consequence that the user is not able to adjust the underlying data set, e.g. by filtering out noise. However, Butt et al. (2012) report that the Urdu data is indeed very noisy and requires a manual cleaning of the data set before the actual clustering.
In the system presented here, the user simply marks conspicuous regions in the visualization panel and removes the respective data points from the original data set. Other filtering mechanisms, e.g. the removal of low-frequency items which occur due to data sparsity issues, can be applied to the overall data set by adjusting the parameters.

A linguistically relevant improvement lies in the display of cluster centroids, in other words the typical noun + light verb distribution of a cluster. This is particularly helpful when the linguist wants to pick out prototypical examples for the cluster in order to stipulate generalizations over the other cluster members.

4 Conclusion

In this paper, we present a novel visual analytics system that helps to automatically analyze bigrams extracted from corpora. The main purpose is to enable a more informed and steered cluster analysis than is currently possible with standard methods. This includes rich options for interaction, e.g. display configuration or data manipulation. Initially, the approach was motivated by a concrete research problem, but it has much wider applicability, as any kind of high-dimensional numerical data objects can be loaded and analyzed. However, the system still requires some basic understanding of the algorithms applied for clustering and projection in order to prevent the user from drawing wrong conclusions based on artifacts. Bearing this potential pitfall in mind when performing the analysis, the system enables a much more insightful and informed analysis than standard non-interactive methods. In the future, we aim to conduct user experiments in order to learn more about how the functionality and usability could be further enhanced.
Acknowledgments

This work was partially funded by the German Research Foundation (DFG) under grant BU 1806/7-1 “Visual Analysis of Language Change and Use Patterns” and the German Federal Ministry of Education and Research (BMBF) under grant 01461246 “VisArgue”.

References

Tafseer Ahmed and Miriam Butt. 2011. Discovering Semantic Classes for Urdu N-V Complex Predicates. In Proceedings of the International Conference on Computational Semantics (IWCS 2011), pages 305–309.
Joshua Albrecht, Rebecca Hwa, and G. Elisabeta Marai. 2009. The Chinese Room: Visualization and Interaction to Understand and Correct Ambiguous Machine Translation. Comput. Graph. Forum, 28(3):1047–1054.
Miriam Butt, Tina Bögel, Annette Hautli, Sebastian Sulger, and Tafseer Ahmed. 2012. Identifying Urdu Complex Predication via Bigram Extraction. In Proceedings of COLING 2012, Technical Papers, pages 409–424, Mumbai, India.
Christopher Collins, M. Sheelagh T. Carpendale, and Gerald Penn. 2007. Visualization of Uncertainty in Lattices to Support Decision-Making. In EuroVis 2007, pages 51–58. Eurographics Association.
Kris Heylen, Dirk Speelman, and Dirk Geeraerts. 2012. Looking at word meaning. An interactive visualization of Semantic Vector Spaces for Dutch synsets. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 16–24.
Daniel A. Keim and Daniela Oelke. 2007. Literature Fingerprinting: A New Method for Visual Literary Analysis. In IEEE VAST 2007, pages 115–122. IEEE.
Thomas Mayer, Christian Rohrdantz, Miriam Butt, Frans Plank, and Daniel A. Keim. 2010a. Visualizing Vowel Harmony. Linguistic Issues in Language Technology, 4(Issue 2):1–33, December.
Thomas Mayer, Christian Rohrdantz, Frans Plank, Peter Bak, Miriam Butt, and Daniel A. Keim. 2010b. Consonant Co-Occurrence in Stems across Languages: Automatic Analysis and Visualization of a Phonotactic Constraint. In Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground, pages 70–78, Uppsala, Sweden, July. Association for Computational Linguistics.
Tara Mohanan. 1994. Argument Structure in Hindi. Stanford: CSLI Publications.
Christian Rohrdantz, Annette Hautli, Thomas Mayer, Miriam Butt, Frans Plank, and Daniel A. Keim. 2011. Towards Tracking Semantic Change by Visual Analytics. In ACL 2011 (Short Papers), pages 305–310, Portland, Oregon, USA, June. Association for Computational Linguistics.
Christian Rohrdantz, Michael Hund, Thomas Mayer, Bernhard Wälchli, and Daniel A. Keim. 2012a. The World’s Languages Explorer: Visual Analysis of Language Features in Genealogical and Areal Contexts. Computer Graphics Forum, 31(3):935–944.
Christian Rohrdantz, Andreas Niekler, Annette Hautli, Miriam Butt, and Daniel A. Keim. 2012b. Lexical Semantics and Distribution of Suffixes - A Visual Analysis. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 7–15, April.
Tobias Schreck, Jürgen Bernard, Tatiana von Landesberger, and Jörn Kohlhammer. 2009. Visual cluster analysis of trajectory data with interactive Kohonen maps. Information Visualization, 8(1):14–29.
James J. Thomas and Kristin A. Cook. 2006. A Visual Analytics Agenda. IEEE Computer Graphics and Applications, 26(1):10–13.
Jian Zhao, Fanny Chevalier, Christopher Collins, and Ravin Balakrishnan. 2012. Facilitating Discourse Analysis with Interactive Visualization. IEEE Trans. Vis. Comput. Graph., 18(12):2639–2648.
4 0.061233204 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks
Author: Markus Gartner ; Gregor Thiele ; Wolfgang Seeker ; Anders Bjorkelund ; Jonas Kuhn
Abstract: We present ICARUS, a versatile graphical search tool to query dependency treebanks. Search results can be inspected both quantitatively and qualitatively by means of frequency lists, tables, or dependency graphs. ICARUS also ships with plugins that enable it to interface with tool chains running either locally or remotely.
5 0.057852954 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
Author: Guillaume Wisniewski
Abstract: This paper tackles the problem of collecting reliable human assessments. We show that knowing multiple scores for each example instead of a single score results in a more reliable estimation of a system's quality. To reduce the cost of collecting these multiple ratings, we propose to use matrix completion techniques to predict some scores knowing only scores of other judges and some common ratings. Even if prediction performance is pretty low, decisions made using the predicted score proved to be more reliable than decisions based on a single rating of each example.
6 0.054495241 128 acl-2013-Does Korean defeat phonotactic word segmentation?
7 0.039573532 89 acl-2013-Computerized Analysis of a Verbal Fluency Test
8 0.036723696 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis
10 0.030146476 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
11 0.029115815 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
12 0.029028319 228 acl-2013-Leveraging Domain-Independent Information in Semantic Parsing
13 0.026279811 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations
14 0.026270417 249 acl-2013-Models of Semantic Representation with Visual Attributes
15 0.025760194 65 acl-2013-BRAINSUP: Brainstorming Support for Creative Sentence Generation
16 0.025405338 285 acl-2013-Propminer: A Workflow for Interactive Information Extraction and Exploration using Dependency Trees
17 0.025398783 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit
18 0.024871359 380 acl-2013-VSEM: An open library for visual semantics representation
19 0.023935422 311 acl-2013-Semantic Neighborhoods as Hypergraphs
20 0.023460483 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users
topicId topicWeight
[(0, 0.083), (1, 0.02), (2, -0.004), (3, -0.033), (4, -0.005), (5, -0.05), (6, 0.018), (7, -0.011), (8, -0.018), (9, 0.0), (10, -0.07), (11, -0.065), (12, -0.023), (13, 0.011), (14, -0.039), (15, -0.133), (16, -0.008), (17, -0.006), (18, 0.018), (19, 0.004), (20, -0.017), (21, 0.055), (22, 0.025), (23, -0.009), (24, -0.009), (25, -0.044), (26, 0.025), (27, 0.028), (28, -0.03), (29, 0.009), (30, 0.018), (31, -0.015), (32, -0.055), (33, -0.108), (34, 0.091), (35, 0.062), (36, -0.069), (37, 0.041), (38, -0.158), (39, 0.061), (40, -0.008), (41, 0.017), (42, -0.043), (43, -0.081), (44, -0.087), (45, -0.183), (46, 0.114), (47, 0.127), (48, 0.043), (49, 0.072)]
simIndex simValue paperId paperTitle
same-paper 1 0.93654799 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds
Author: Thomas Mayer ; Christian Rohrdantz
Abstract: This paper describes the online tool PhonMatrix, which analyzes a word list with respect to the co-occurrence of sounds in a specified context within a word. The cooccurrence counts from the user-specified context are statistically analyzed according to a number of association measures that can be selected by the user. The statistical values then serve as the input for a matrix visualization where rows and columns represent the relevant sounds under investigation and the matrix cells indicate whether the respective ordered pair of sounds occurs more or less frequently than expected. The usefulness of the tool is demonstrated with three case studies that deal with vowel harmony and similar place avoidance patterns.
2 0.71142161 29 acl-2013-A Visual Analytics System for Cluster Exploration
Author: Andreas Lamprecht ; Annette Hautli ; Christian Rohrdantz ; Tina Bogel
Abstract: This paper offers a new way of representing the results of automatic clustering algorithms by employing a Visual Analytics system which maps members of a cluster and their distance to each other onto a two-dimensional space. A case study on Urdu complex predicates shows that the system allows for an appropriate investigation of linguistically motivated data.

1 Motivation

In recent years, Visual Analytics systems have increasingly been used for the investigation of linguistic phenomena in a number of different areas, starting from literary analysis (Keim and Oelke, 2007) to the cross-linguistic comparison of language features (Mayer et al., 2010a; Mayer et al., 2010b; Rohrdantz et al., 2012a) and lexical semantic change (Rohrdantz et al., 2011; Heylen et al., 2012; Rohrdantz et al., 2012b). Visualization has also found its way into the field of computational linguistics by providing insights into methods such as machine translation (Collins et al., 2007; Albrecht et al., 2009) or discourse parsing (Zhao et al., 2012).

One issue in computational linguistics is the interpretability of results coming from machine learning algorithms and the lack of insight they offer on the underlying data. This drawback often prevents theoretical linguists, who work with computational models and need to see patterns on large data sets, from drawing detailed conclusions. The present paper shows that a Visual Analytics system facilitates “analytical reasoning [...] by an interactive visual interface” (Thomas and Cook, 2006) and helps to resolve this issue by offering a customizable, in-depth view on the statistically generated result and simultaneously an at-a-glance overview of the overall data set.

In particular, we focus on the visual representation of automatically generated clusters, in itself not a novel idea as it has been applied in other fields like the financial sector, biology or geography (Schreck et al., 2009).
But as far as the literature is concerned, interactive systems are still less common, particularly in computational linguistics, and they have not been designed for the specific needs of theoretical linguists. This paper offers a method of visually encoding clusters and their internal coherence with an interactive user interface, which allows users to adjust underlying parameters and their views on the data depending on the particular research question. By this, we partly open up the “black box” of machine learning. The linguistic phenomenon under investigation, for which the system has originally been designed, is the varied behavior of nouns in N+V CP complex predicates in Urdu (e.g., memory+do = ‘to remember’) (Mohanan, 1994; Ahmed and Butt, 2011), where, depending on the lexical semantics of the noun, a set of different light verbs is chosen to form a complex predicate. The aim is an automatic detection of the different groups of nouns, based on their light verb distribution. Butt et al. (2012) present a static visualization for the phenomenon, whereas the present paper proposes an interactive system which alleviates some of the previous issues with respect to noise detection, filtering, data interaction and cluster coherence. For this, we proceed as follows: section 2 explains the proposed Visual Analytics system, followed by the linguistic case study in section 3. Section 4 concludes the paper. 
2 The system

The system requires a plain text file as input, where each line corresponds to one data object. In our case, each line corresponds to one Urdu noun (data object) and contains its unique ID (the name of the noun) and its bigram frequencies with the four light verbs under investigation, namely kar ‘do’, ho ‘be’, hu ‘become’ and rakH ‘put’; an exemplary input file is shown in Figure 1. From a data analysis perspective, we have four-dimensional data objects, where each dimension corresponds to a bigram frequency previously extracted from a corpus. Note that more than four dimensions can be loaded and analyzed, but for the sake of simplicity we focus on the four-dimensional Urdu example for the remainder of this paper. Moreover, it is possible to load files containing either absolute bigram frequencies or relative frequencies. When loading absolute frequencies, the program will automatically calculate the relative frequencies, as they are the input for the clustering. The absolute frequencies, however, are still available and can be used for further processing (e.g. filtering).

Figure 1: Preview of appropriate file structures.

(Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 109–114, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics)

2.1 Initial opening and processing of a file

It is necessary to define a metric distance function between data objects for both clustering and visualization. Thus, each data object is represented by a high-dimensional (in our example four-dimensional) numerical vector, and we use the Euclidean distance to calculate the distances between pairs of data objects. The smaller the distance between two data objects, the more similar they are. For visualization, the high-dimensional data is projected onto the two-dimensional space of a computer screen using a principal component analysis (PCA) algorithm (footnote 1).
In the 2D projection, the distances between data objects in the high-dimensional space, i.e. the dissimilarities of the bigram distributions, are preserved as accurately as possible. However, when projecting a high-dimensional data space onto a lower dimension, some distinctions necessarily level out: two data objects may be far apart in the high-dimensional space, but end up close together in the 2D projection. It is important to bear in mind that the 2D visualization is often quite insightful, but interpretations have to be verified by interactively investigating the data.

Footnote 1: http://workshop.mkobos.com/2011/java-pca-transformation-library/

The initial clusters are calculated (in the high-dimensional data space) using a default k-Means algorithm (footnote 2), with k being a user-defined parameter. There is also the option of selecting another clustering algorithm, Greedy Variance Minimization (GVM; footnote 3), and an extension to include further algorithms is under development.

2.2 Configuration & Interaction

2.2.1 The main window

The main window in Figure 2 consists of three areas, namely the configuration area (a), the visualization area (b) and the description area (c). The visualization area is mainly built with the piccolo2d library (footnote 4) and initially shows data objects as colored circles with a variable diameter, where color indicates cluster membership (four clusters in this example). Hovering over a dot displays information on the particular noun, the cluster membership and the light verb distribution in the description area to the right. By using the mouse wheel, the user can zoom in and out of the visualization. A very important feature for the task at hand is the possibility to select multiple data objects for further processing or for filtering, with a list of selected data objects shown in the description area. By right-clicking on these data objects, the user can assign a unique class (and class color) to them.
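The preprocessing and initial clustering described in Section 2.1 (relative frequencies, Euclidean distance, k-means in the high-dimensional space) can be sketched as follows. This is only an illustration under stated assumptions, not the system's implementation, which the text says relies on a Java k-means library. The function names, the toy nouns and counts, and the deterministic seeding with the first k points are all hypothetical choices made here. The PCA projection is left out.

```python
import math


def relative_frequencies(counts):
    """Turn absolute bigram counts into relative frequencies summing to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, iterations=100):
    """Plain Lloyd-style k-means.

    The first k points seed the centroids; this keeps the sketch
    deterministic and is an assumption, not the system's strategy.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Hypothetical toy input: noun -> absolute counts with (kar, ho, hu, rakH).
toy_counts = {
    "noun_a": [9, 1, 0, 0],
    "noun_b": [8, 2, 0, 0],
    "noun_c": [0, 1, 9, 0],
    "noun_d": [0, 0, 8, 2],
}
nouns = list(toy_counts)
vectors = [relative_frequencies(toy_counts[n]) for n in nouns]
labels, centroids = k_means(vectors, k=2)
```

With this toy input, the two nouns dominated by the first light verb end up in one cluster and the two dominated by the third in the other. Each resulting centroid is the mean light-verb distribution of its cluster, which corresponds to the "overall feature distribution" that the tool shows for a centroid.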
3 0.70430255 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsupervised accuracy of 89% across the same set of languages.
4 0.62395322 89 acl-2013-Computerized Analysis of a Verbal Fluency Test
Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks
Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.
5 0.5091477 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
Author: Johann-Mattis List ; Steven Moran
Abstract: Given the increasing interest and development of computational and quantitative methods in historical linguistics, it is important that scholars have a basis for documenting, testing, evaluating, and sharing complex workflows. We present a novel open-source toolkit for quantitative tasks in historical linguistics that offers these features. This toolkit also serves as an interface between existing software packages and frequently used data formats, and it provides implementations of new and existing algorithms within a homogeneous framework. We illustrate the toolkit’s functionality with an exemplary workflow that starts with raw language data and ends with automatically calculated phonetic alignments, cognates and borrowings. We then illustrate evaluation metrics on gold standard datasets that are provided with the toolkit.
6 0.49142903 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures
7 0.42448151 220 acl-2013-Learning Latent Personas of Film Characters
8 0.37322736 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks
9 0.35763702 65 acl-2013-BRAINSUP: Brainstorming Support for Creative Sentence Generation
10 0.3530007 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach
11 0.34759626 192 acl-2013-Improved Lexical Acquisition through DPP-based Verb Clustering
12 0.33108416 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison
13 0.31975448 268 acl-2013-PATHS: A System for Accessing Cultural Heritage Collections
14 0.31574458 370 acl-2013-Unsupervised Transcription of Historical Documents
15 0.30301285 128 acl-2013-Does Korean defeat phonotactic word segmentation?
16 0.29697865 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
17 0.29398826 269 acl-2013-PLIS: a Probabilistic Lexical Inference System
18 0.29377294 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors
19 0.29114479 150 acl-2013-Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications
20 0.28536049 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs
topicId topicWeight
[(0, 0.046), (6, 0.02), (11, 0.042), (24, 0.159), (26, 0.055), (35, 0.065), (39, 0.281), (42, 0.036), (48, 0.039), (70, 0.041), (88, 0.028), (90, 0.021), (95, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.81635183 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds
Author: Thomas Mayer ; Christian Rohrdantz
Abstract: This paper describes the online tool PhonMatrix, which analyzes a word list with respect to the co-occurrence of sounds in a specified context within a word. The co-occurrence counts from the user-specified context are statistically analyzed according to a number of association measures that can be selected by the user. The statistical values then serve as the input for a matrix visualization where rows and columns represent the relevant sounds under investigation and the matrix cells indicate whether the respective ordered pair of sounds occurs more or less frequently than expected. The usefulness of the tool is demonstrated with three case studies that deal with vowel harmony and similar place avoidance patterns.
2 0.5913083 303 acl-2013-Robust multilingual statistical morphological generation models
Author: Ondrej Dusek ; Filip Jurcicek
Abstract: We present a novel method of statistical morphological generation, i.e. the prediction of inflected word forms given lemma, part-of-speech and morphological features, aimed at robustness to unseen inputs. Our system uses a trainable classifier to predict “edit scripts” that are then used to transform lemmas into inflected word forms. Suffixes of lemmas are included as features to achieve robustness. We evaluate our system on 6 languages with a varying degree of morphological richness. The results show that the system is able to learn most morphological phenomena and generalize to unseen inputs, producing significantly better results than a dictionary-based baseline.
3 0.58779204 184 acl-2013-Identification of Speakers in Novels
Author: Hua He ; Denilson Barbosa ; Grzegorz Kondrak
Abstract: Speaker identification is the task of attributing utterances to characters in a literary narrative. It is challenging to automate because the speakers of the majority of utterances are not explicitly identified in novels. In this paper, we present a supervised machine learning approach for the task that incorporates several novel features. The experimental results show that our method is more accurate and general than previous approaches to the problem.
4 0.58481997 229 acl-2013-Leveraging Synthetic Discourse Data via Multi-task Learning for Implicit Discourse Relation Recognition
Author: Man Lan ; Yu Xu ; Zhengyu Niu
Abstract: To overcome the shortage of labeled data for implicit discourse relation recognition, previous works attempted to automatically generate training data by removing explicit discourse connectives from sentences and then built models on these synthetic implicit examples. However, a previous study (Sporleder and Lascarides, 2008) showed that models trained on these synthetic data do not generalize very well to natural (i.e. genuine) implicit discourse data. In this work we revisit this issue and present a multi-task learning based system which can effectively use synthetic data for implicit discourse relation recognition. Results on PDTB data show that, under the multi-task learning framework, our models, which use the prediction of explicit discourse connectives as auxiliary learning tasks, can achieve an average F1 improvement of 5.86% over baseline models.
5 0.58434367 128 acl-2013-Does Korean defeat phonotactic word segmentation?
Author: Robert Daland ; Kie Zuraw
Abstract: Computational models of infant word segmentation have not been tested on a wide range of languages. This paper applies a phonotactic segmentation model to Korean. In contrast to the undersegmentation pattern previously found in English and Russian, the model exhibited more oversegmentation errors and more errors overall. Despite the high error rate, analysis suggested that lexical acquisition might not be problematic, provided that infants attend only to frequently segmented items.
6 0.57856333 271 acl-2013-ParaQuery: Making Sense of Paraphrase Collections
7 0.57583266 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model
8 0.57310075 72 acl-2013-Bridging Languages through Etymology: The case of cross language text categorization
9 0.56470311 29 acl-2013-A Visual Analytics System for Cluster Exploration
10 0.541673 244 acl-2013-Mining Opinion Words and Opinion Targets in a Two-Stage Framework
11 0.522681 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization
12 0.52111906 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations
13 0.51243681 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data
14 0.51186532 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks
15 0.50658315 79 acl-2013-Character-to-Character Sentiment Analysis in Shakespeare's Plays
16 0.50612998 85 acl-2013-Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis
17 0.50520664 230 acl-2013-Lightly Supervised Learning of Procedural Dialog Systems
18 0.49947393 318 acl-2013-Sentiment Relevance
19 0.49923256 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation
20 0.49678802 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri