acl acl2013 acl2013-270 knowledge-graph by maker-knowledge-mining

270 acl-2013-ParGramBank: The ParGram Parallel Treebank


Source: pdf

Author: Sebastian Sulger ; Miriam Butt ; Tracy Holloway King ; Paul Meurer ; Tibor Laczko ; Gyorgy Rakosi ; Cheikh Bamba Dione ; Helge Dyvik ; Victoria Rosen ; Koenraad De Smedt ; Agnieszka Patejuk ; Ozlem Cetinoglu ; I Wayan Arka ; Meladel Mistica

Abstract: This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. The treebank is based on deep LFG (Lexical-Functional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. The grammars produce output that is maximally parallelized across languages and language families. This output forms the basis of a parallel treebank covering a diverse set of phenomena. The treebank is publicly available via the INESS treebanking environment, which also allows for the alignment of language pairs. We thus present a unique, multilayered parallel treebank that represents more and different types of languages than are available in other treebanks, that represents deep linguistic knowledge and that allows for the alignment of sentences at several levels: dependency structures, constituency structures and POS information.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. [sent-23, score-0.394]

2 The treebank is based on deep LFG (Lexical-Functional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. [sent-24, score-0.315]

3 The grammars produce output that is maximally parallelized across languages and language families. [sent-25, score-0.185]

4 This output forms the basis of a parallel treebank covering a diverse set of phenomena. [sent-26, score-0.333]

5 The treebank is publicly available via the INESS treebanking environment, which also allows for the alignment of language pairs. [sent-27, score-0.345]

6 We thus present a unique, multilayered parallel treebank that represents more and different types of languages than are available in other treebanks, that represents [sent-28, score-0.394]

7 deep linguistic knowledge and that allows for the alignment of sentences at several levels: dependency structures, constituency structures and POS information. [sent-30, score-0.194]

8 1 Introduction. This paper discusses the construction of a parallel treebank currently involving ten languages that represent several different language families, including non-Indo-European. [sent-31, score-0.394]

9 The treebank is based on the output of individual deep LFG (Lexical-Functional Grammar) grammars that were developed independently at different sites but within the overall framework of ParGram (the Parallel Grammar project) (Butt et al. [sent-32, score-0.315]

10 The aim of ParGram is to produce deep, wide coverage grammars for a variety of languages. [sent-35, score-0.08]

11 Deep grammars provide detailed syntactic analysis and encode grammatical functions as well as other grammatical information. [sent-36, score-0.08]

12 The ParGram grammars are couched within the linguistic framework of LFG (Bresnan, 2001 ; Dalrymple, 2001) and are constructed with a set of grammatical features that have been commonly agreed upon within the ParGram group. [sent-39, score-0.112]

13 ParGram grammars are implemented using XLE, an efficient, industrial-strength grammar development platform that includes a parser, a generator and a transfer system (Crouch et al. [sent-40, score-0.163]

14 Over the years, ParGram has continuously grown and includes grammars for Arabic, Chinese, English, French, German, Georgian, Hungarian, Indonesian, Irish, Japanese, Malagasy, Murrinh-Patha, Norwegian, Polish, Spanish, Tigrinya, Turkish, Urdu, Welsh and Wolof. [sent-43, score-0.08]

15 ParGram grammars produce output that has been parallelized maximally across languages according to a set of commonly agreed upon universal prototype analyses and feature values. [sent-44, score-0.261]

16 This output forms the basis of the ParGramBank parallel treebank discussed here. [sent-45, score-0.333]

17 , 2009) in which grammar parallelism is presupposed to propagate alignment across different projections (section 6). [sent-47, score-0.283]

18 In recent years, parallel treebanking1 has gained in importance within NLP. [sent-52, score-0.147]

19 An obvious application for parallel treebanking is machine translation, where treebank size is a deciding factor for whether a particular treebank can support a particular kind of research project. [sent-53, score-0.616]

20 The treebanking effort reported on in this paper supports work of the latter focus, including efforts at multilingual dependency parsing (Naseem et al. [sent-55, score-0.132]

21 We have created a parallel treebank whose prototype includes ten typologically diverse languages and reflects a diverse set of phenomena. [sent-58, score-0.464]

22 We thus present a unique, multilayered parallel treebank that represents more languages than are currently available in other treebanks, and different types of languages as well. [sent-59, score-0.455]

23 It contains deep linguistic knowledge and allows for the parallel and simultaneous alignment of sentences at several levels. [sent-60, score-0.258]

24 0 license via the INESS treebanking environment and comes in two formats: a Prolog format and an XML format. [sent-65, score-0.097]

25 Section 3 presents ParGram and its approach to parallel treebanking. [sent-68, score-0.147]

26 Section 4 focuses on the treebank design and its construction. [sent-69, score-0.186]

27 Section 6 elaborates on the mechanisms for parallel alignment of the treebank. [sent-71, score-0.209]

28 2 Related Work. There have been several efforts in parallel treebanking across theories and annotation schemes. [sent-72, score-0.244]

29 Kuhn and Jellinghaus (2006) take a minimal approach towards multilingual parallel treebanking. [sent-73, score-0.182]

30 They bootstrap phrasal alignments over a sentence-aligned parallel corpus of English, French, German and Spanish and report concrete treebank annotation work on a sample of sentences from the Europarl corpus. [sent-74, score-0.368]

31 Klyueva and Mareček (2010) present a small parallel treebank using data and tools from two existing treebanks. [sent-83, score-0.333]

32 They take a syntactically annotated gold standard text for one language and run an automated annotation on the parallel text for the other language. [sent-84, score-0.147]

33 Manually annotated Russian data are taken from the SynTagRus treebank (Nivre et al. [sent-85, score-0.186]

34 The SMULTRON project is concerned with constructing a parallel treebank of English, German and Swedish. [sent-87, score-0.333]

35 Additionally, the German and Swedish monolingual treebanks contain lemma information. [sent-90, score-0.066]

36 The treebank is distributed in TIGERXML format (Volk et al. [sent-91, score-0.186]

37 A further parallel treebanking effort is ParTUT, a parallel treebank (Sanguinetti and Bosco, 2011; Bosco et al. [sent-96, score-0.577]

38 Closest to our work is the ParDeepBank, which is engaged in the creation of a highly parallel treebank of English, Portuguese and Bulgarian. [sent-98, score-0.333]

39 ParDeepBank is couched within the linguistic framework of HPSG (Head-Driven Phrase Structure Grammar) and uses parallel automatic HPSG grammars, employing the same tools and implementation strategies across languages (Flickinger et al. [sent-99, score-0.24]

40 The parallel treebank is aligned on the sentence, phrase and word level. [sent-101, score-0.386]

41 In sum, parallel treebanks have so far focused exclusively on Indo-European languages. (A footnote adds that the ParTUT paper mentions Hindi as the fourth language, but this is not yet available.) [sent-102, score-0.274]

42 In contrast, our ParGramBank treebank currently includes ten typologically different languages from six different language families (Altaic, Austronesian, Indo-European, Kartvelian, Niger-Congo, Uralic). [sent-108, score-0.317]

43 However, with the methodology developed within XPAR, alignments can easily be recomputed from f-structure alignments in case of grammar or feature changes, so that we also have the flexible capability of allowing ParGramBank to include dynamic treebanks. [sent-114, score-0.118]

44 3 ParGram and its Feature Space. The ParGram grammars use the LFG formalism, which produces c(onstituent)-structures (trees) and f(unctional)-structures as the syntactic analysis. [sent-115, score-0.08]

45 Within LFG, f-structures encode a language universal level of syntactic analysis, allowing for crosslinguistic parallelism at this level of abstraction. [sent-117, score-0.138]

46 ParGram tests the LFG formalism for its universality and coverage limitations to see how far parallelism can be maintained across languages. [sent-121, score-0.138]

47 Where possible, analyses produced by the grammars for similar constructions in each language are parallel, with the computational advantage that the grammars can be used in similar applications and that machine translation can be simplified. [sent-122, score-0.283]

48 Adherence to feature committee decisions is supported technically by a routine that checks the grammars for compatibility with a feature declaration (King et al. [sent-125, score-0.08]

49 , 2005); the feature space for each grammar is included in ParGramBank. [sent-126, score-0.083]

50 ParGram also conducts regular meetings to discuss constructions, analyses and features. [sent-127, score-0.076]

51 The f-structures, in contrast, are parallel aside from grammar-specific characteristics such as the absence of grammatical gender marking in English and the absence of articles in Urdu. [sent-134, score-0.147]

52 With parallel analyses and parallel features, maximal parallelism across typologically different languages is maintained. [sent-159, score-0.639]

53 The Urdu ParGram grammar makes use of a transliteration scheme that abstracts away from the Arabic-based script; the transliteration scheme is detailed in Malik et al. [sent-161, score-0.083]

54 In Figure 1, the NP "the farmer" and the VP "sell his tractor" belong to different f-structures: the former maps onto the SUBJ f-structure, while the latter maps onto the topmost f-structure (Dyvik et al. [sent-165, score-0.322]

55 Figure 1: English and Urdu c-structures. We emphasize the fact that ParGramBank is characterized by a maximally reliable, human-controlled and linguistically deep parallelism across aligned sentences. [sent-169, score-0.284]

56 Generally, the results of automatic sentence alignment procedures are parallel corpora where the corresponding sentences normally have the same purported meaning as intended by the translator, but they do not necessarily match in terms of structural expression. [sent-170, score-0.209]

57 In building ParGramBank, conscious attention is paid to maintaining semantic and constructional parallelism as much as possible. [sent-171, score-0.138]

58 This design feature renders our treebank reliable in cases where the constructional parallelism is reduced even at f-structure. [sent-172, score-0.324]

59 For example, typological variation in the presence or absence of finite passive constructions represents a case of potential mismatch. [sent-173, score-0.1]

60 Hungarian, one of the treebank languages, has no productive finite passives. [sent-174, score-0.186]

61 In this case, a topicalized object in Hungarian has to be aligned with a (topical) subject in English. [sent-177, score-0.085]

62 Given that both the sentence level and the phrase level alignments are human-controlled in the treebank (see sections 4 and 6), the greatest possible parallelism is reliably captured even in such cases of relative grammatical divergence. [sent-178, score-0.359]

63 (2011) in using coverage of grammatical constructions as a key component for grammar development. [sent-183, score-0.13]

64 My neighbor was given an old tractor by the farmer. [sent-210, score-0.215]

65 The sentences were translated from English into the other treebank languages. [sent-225, score-0.186]

66 The translations were done by ParGram grammar developers (i. [sent-227, score-0.083]

67 The sentences were automatically parsed with ParGram grammars using XLE. [sent-230, score-0.08]

68 Since the parsing was performed sentence by sentence, our resulting treebank is automatically aligned at the sentence level. [sent-231, score-0.239]

69 The banked analyses can be exported and downloaded in a Prolog format using the LFG Parsebanker interface. [sent-235, score-0.108]

70 5 Challenges for Parallelism. We detail some challenges in maintaining parallelism across typologically distinct languages. [sent-241, score-0.208]

71 For example, Urdu uses a combination of predicates to express concepts that in languages like English are expressed with a single verb, e. [sent-244, score-0.106]

72 The strategy within ParGram is to abstract away from the particular surface morphosyntactic expression and aim at parallelism at the level of f-structure. [sent-248, score-0.138]

73 The f-structure analysis of complex predicates is thus similar to that of languages which do not use complex predicates, resulting in a strong syntactic parallelism at this level, even across typologically diverse languages. [sent-283, score-0.314]

74 The languages in ParGramBank differ with respect to their negation strategies. [sent-286, score-0.113]

75 Other languages employ non-independent, morphological negation techniques; Turkish, for instance, uses an affix on the verb, as in (6). [sent-288, score-0.145]

76 3 Copula Constructions. Another challenge to parallelism comes from copula constructions. [sent-296, score-0.226]

77 The possible analyses are demonstrated here with respect to the sentence The tractor is red. [sent-301, score-0.249]

78 The English grammar (Figure 7) uses a raising approach that reflects the earliest treatments of copulas in LFG (Bresnan, 1982). [sent-302, score-0.083]

79 The copula takes a non-finite complement whose subject is raised to the matrix clause as a non-thematic subject of the copula. [sent-303, score-0.088]

80 In contrast, in Urdu (Figure 8), the copula is a two-place predicate, assigning SUBJ and PREDLINK functions. [sent-304, score-0.128]

81 Figure 6: Different f-structural analyses for negation (English vs. Turkish). [sent-305, score-0.088]

82 Finally, in languages like Indonesian (Figure 9), there is no overt copula and the adjective is the main predicational element of the clause. [sent-307, score-0.149]

83 4 Summary. This section discussed some challenges for maintaining parallel analyses across typologically diverse languages. [sent-309, score-0.293]

84 A further extension to the capabilities of the treebank could be the addition of pointers from the alternative structure used in the translation to the parallel aligned set of sentences that correspond to this alternative structure. [sent-314, score-0.386]

85 6 Linguistically Motivated Alignment. The treebank is automatically aligned on the sentence level, the top level of alignment within ParGramBank. [sent-315, score-0.301]

86 The tool automatically computes the alignment of c-structure nodes on the basis of the manually aligned corresponding fstructures. [sent-320, score-0.115]

87 Within a source and a target functional domain, two nodes are automatically aligned only if they dominate corresponding word forms. [sent-325, score-0.107]

88 In Figure 10 the nodes in each functional domain in the trees are connected by whole lines while dotted lines connect different functional domains. [sent-326, score-0.108]

89 Within a functional domain, thick whole lines connect the nodes that share alignment; for simplicity the alignment is only indicated for the top nodes. [sent-327, score-0.116]

90 The alignment information is stored as an additional layer and can be used to explore alignments at the string (word), phrase (c)structure, and functional (f-)structure levels. [sent-329, score-0.151]

91 We have so far aligned the treebank pairs English-Urdu, English-German, English-Polish and Norwegian-Georgian. [sent-330, score-0.239]

92 As Figure 10 illustrates for (7) in an English-Urdu pairing, the English object neighbor is aligned with the Urdu indirect object (OBJ-GO) hamsAyA ‘neighbor’, while the English indirect object (OBJ-TH) tractor is aligned with the Urdu object TrEkTar ‘tractor’ . [sent-331, score-0.321]

93 case we will measure IAA for this. Figure 10: Phrase-aligned treebank example English-Urdu: The farmer gave his neighbor an old tractor. [sent-369, score-0.344]

94 quent inspection of parallel treebanks which contain highly complex linguistic structures. [sent-370, score-0.213]

95 7 Discussion and Future Work. We have discussed the construction of ParGramBank, a parallel treebank for ten typologically different languages. [sent-371, score-0.403]

96 The analyses in ParGramBank are the output of computational LFG ParGram grammars. [sent-372, score-0.076]

97 As a result of ParGram’s centrally agreed upon feature sets and prototypical analyses, the representations are not only deep in nature, but maximally parallel. [sent-373, score-0.093]

98 Third, the treebank will be expanded to include 100 more sentences within the next year. [sent-381, score-0.186]

99 We also plan to include more languages as other ParGram groups contribute structures to ParGramBank. [sent-382, score-0.103]

100 0 license via the INESS platform, which supports alignment methodology developed in the XPAR project and provides search and visualization methods for parallel treebanks. [sent-384, score-0.209]
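Sentences 86 and 87 above describe the core alignment rule: starting from the manually aligned f-structures, two c-structure nodes are aligned only if they dominate corresponding word forms. The following Python sketch is a minimal illustration of that rule only; it is not the XPAR or LFG Parsebanker implementation, and the Node class, the word-level alignment dictionary and the exact-yield-match criterion are assumptions made for the example.

# Minimal sketch of the c-structure alignment rule from sentences 86-87.
# Not the XPAR / LFG Parsebanker implementation: the data structures and the
# exact-yield-match criterion are illustrative assumptions. The restriction to
# corresponding functional domains is assumed to be folded into the word-level
# alignment derived from the manually aligned f-structures.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    word_index: Optional[int] = None        # set on terminal nodes only

    def yield_indices(self) -> set:
        """Indices of the word forms this node dominates."""
        if self.word_index is not None:
            return {self.word_index}
        indices = set()
        for child in self.children:
            indices |= child.yield_indices()
        return indices

def walk(node):
    """Iterate over a node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)

def align_nodes(src_root, tgt_root, word_alignment):
    """word_alignment maps source word index -> target word index and is assumed
    to be derived from the manual f-structure alignment."""
    pairs = []
    for s in walk(src_root):
        projected = {word_alignment[i] for i in s.yield_indices() if i in word_alignment}
        for t in walk(tgt_root):
            # Align two nodes only if they dominate corresponding word forms.
            if projected and projected == t.yield_indices():
                pairs.append((s.label, t.label))
    return pairs

A real implementation would additionally restrict the comparison to nodes within corresponding functional domains, as sentence 87 describes; the sketch leaves that restriction implicit in the word alignment.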


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pargram', 0.457), ('pargrambank', 0.315), ('lfg', 0.276), ('urdu', 0.21), ('treebank', 0.186), ('tractor', 0.173), ('parallel', 0.147), ('holloway', 0.145), ('tracy', 0.138), ('parallelism', 0.138), ('farmer', 0.116), ('crouch', 0.115), ('dyvik', 0.111), ('parsebanker', 0.11), ('king', 0.106), ('csli', 0.101), ('turkish', 0.099), ('iness', 0.098), ('treebanking', 0.097), ('xle', 0.095), ('butt', 0.088), ('copula', 0.088), ('grammar', 0.083), ('grammars', 0.08), ('pardeepbank', 0.079), ('ros', 0.077), ('dalrymple', 0.077), ('analyses', 0.076), ('typologically', 0.07), ('dick', 0.07), ('helge', 0.07), ('treebanks', 0.066), ('miriam', 0.066), ('koenraad', 0.063), ('trektar', 0.063), ('alignment', 0.062), ('languages', 0.061), ('flickinger', 0.058), ('joan', 0.058), ('functional', 0.054), ('aligned', 0.053), ('typological', 0.053), ('negation', 0.052), ('bender', 0.051), ('meurer', 0.051), ('indonesian', 0.051), ('hungarian', 0.05), ('deep', 0.049), ('victoria', 0.048), ('bresnan', 0.048), ('bosco', 0.047), ('georgian', 0.047), ('kqk', 0.047), ('laczk', 0.047), ('sanguinetti', 0.047), ('xpar', 0.047), ('constructions', 0.047), ('uni', 0.046), ('predicates', 0.045), ('maximally', 0.044), ('neighbor', 0.042), ('structures', 0.042), ('sulger', 0.042), ('smedt', 0.042), ('prolog', 0.042), ('constituency', 0.041), ('polish', 0.04), ('driver', 0.039), ('german', 0.036), ('norwegian', 0.036), ('multilingual', 0.035), ('alignments', 0.035), ('chomsky', 0.035), ('sell', 0.033), ('affix', 0.032), ('arka', 0.032), ('banked', 0.032), ('bobrow', 0.032), ('couched', 0.032), ('declaratives', 0.032), ('dione', 0.032), ('fouvry', 0.032), ('frederik', 0.032), ('fstructures', 0.032), ('interrogatives', 0.032), ('kartvelian', 0.032), ('kisan', 0.032), ('klyueva', 0.032), ('malik', 0.032), ('manuela', 0.032), ('nordlinger', 0.032), ('norway', 0.032), ('onstituent', 0.032), ('predlink', 0.032), ('smultron', 0.032), ('topicalized', 0.032), ('unctional', 0.032), ('uralic', 0.032), ('volk', 0.032), ('wayan', 0.032)]
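The (word, weight) pairs above are tf-idf scores over this paper's vocabulary, and the sentence scores in the summary section appear to aggregate weights of this kind. The exact pipeline behind this page is not documented, so the following scikit-learn sketch is illustrative only: the three-document stand-in corpus, the preprocessing choices and the sum-of-weights sentence score are assumptions.

# Illustrative sketch only: per-word tf-idf weights and a simple per-sentence
# score of the kind shown on this page. The toy corpus and the scoring function
# are assumptions, not the pipeline that actually generated these numbers.
from sklearn.feature_extraction.text import TfidfVectorizer

papers = {                                    # stand-in texts; real input would be full papers
    "acl-2013-270": "parallel treebank lfg pargram grammars alignment urdu english",
    "acl-2013-368": "universal dependency annotation multilingual parsing treebanks",
    "acl-2013-357": "transfer learning constituency grammars hpsg ccg lfg parsing",
}
paper_ids = list(papers)

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(papers[p] for p in paper_ids)

row = matrix[paper_ids.index("acl-2013-270")].toarray().ravel()
weights = {term: row[idx] for term, idx in vectorizer.vocabulary_.items()}
top_words = sorted(weights.items(), key=lambda kv: -kv[1])[:10]   # cf. the topN-words list

def sentence_score(sentence):
    """Score a sentence as the sum of the tf-idf weights of its tokens."""
    tokens = vectorizer.build_analyzer()(sentence)
    return sum(weights.get(t, 0.0) for t in tokens)

print(top_words)
print(sentence_score("The treebank is based on deep LFG grammars."))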

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 270 acl-2013-ParGramBank: The ParGram Parallel Treebank

Author: Sebastian Sulger ; Miriam Butt ; Tracy Holloway King ; Paul Meurer ; Tibor Laczko ; Gyorgy Rakosi ; Cheikh Bamba Dione ; Helge Dyvik ; Victoria Rosen ; Koenraad De Smedt ; Agnieszka Patejuk ; Ozlem Cetinoglu ; I Wayan Arka ; Meladel Mistica

Abstract: This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. The treebank is based on deep LFG (Lexical-Functional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. The grammars produce output that is maximally parallelized across languages and language families. This output forms the basis of a parallel treebank covering a diverse set of phenomena. The treebank is publicly available via the INESS treebanking environment, which also allows for the alignment of language pairs. We thus present a unique, multilayered parallel treebank that represents more and different types of languages than are available in other treebanks, that represents deep linguistic knowledge and that allows for the alignment of sentences at several levels: dependency structures, constituency structures and POS information.

2 0.16371195 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

Author: Ryan McDonald ; Joakim Nivre ; Yvonne Quirmbach-Brundage ; Yoav Goldberg ; Dipanjan Das ; Kuzman Ganchev ; Keith Hall ; Slav Petrov ; Hao Zhang ; Oscar Tackstrom ; Claudia Bedini ; Nuria Bertomeu Castello ; Jungmee Lee

Abstract: We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean. To show the usefulness of such a resource, we present a case study of crosslingual transfer parsing with more reliable evaluation than has been possible before. This ‘universal’ treebank is made freely available in order to facilitate research on multilingual dependency parsing.1

3 0.14928752 357 acl-2013-Transfer Learning for Constituency-Based Grammars

Author: Yuan Zhang ; Regina Barzilay ; Amir Globerson

Abstract: In this paper, we consider the problem of cross-formalism transfer in parsing. We are interested in parsing constituency-based grammars such as HPSG and CCG using a small amount of data specific for the target formalism, and a large quantity of coarse CFG annotations from the Penn Treebank. While all of the target formalisms share a similar basic syntactic structure with Penn Treebank CFG, they also encode additional constraints and semantic features. To handle this apparent discrepancy, we design a probabilistic model that jointly generates CFG and target formalism parses. The model includes features of both parses, allowing transfer between the formalisms, while preserving parsing efficiency. We evaluate our approach on three constituency-based grammars (CCG, HPSG, and LFG), augmented with the Penn Treebank-1. Our experiments show that across all three formalisms, the target parsers significantly benefit from the coarse annotations.

4 0.12311864 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

Author: Xiang Li ; Wenbin Jiang ; Yajuan Lu ; Qun Liu

Abstract: This paper presents an effective algorithm of annotation adaptation for constituency treebanks, which transforms a treebank from one annotation guideline to another with an iterative optimization procedure, thus to build a much larger treebank to train an enhanced parser without increasing model complexity. Experiments show that the transformed Tsinghua Chinese Treebank as additional training data brings significant improvement over the baseline trained on Penn Chinese Treebank only.

5 0.089851215 29 acl-2013-A Visual Analytics System for Cluster Exploration

Author: Andreas Lamprecht ; Annette Hautli ; Christian Rohrdantz ; Tina Bogel

Abstract: This paper offers a new way of representing the results of automatic clustering algorithms by employing a Visual Analytics system which maps members of a cluster and their distance to each other onto a two-dimensional space. A case study on Urdu complex predicates shows that the system allows for an appropriate investigation of linguistically motivated data.

6 0.086394623 94 acl-2013-Coordination Structures in Dependency Treebanks

7 0.084824659 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

8 0.083237633 28 acl-2013-A Unified Morpho-Syntactic Scheme of Stanford Dependencies

9 0.081784844 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

10 0.08024203 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

11 0.079087578 240 acl-2013-Microblogs as Parallel Corpora

12 0.068367876 372 acl-2013-Using CCG categories to improve Hindi dependency parsing

13 0.067616023 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

14 0.06331566 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

15 0.063250929 186 acl-2013-Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach

16 0.058245298 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

17 0.055450756 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning

18 0.054265622 311 acl-2013-Semantic Neighborhoods as Hypergraphs

19 0.053880114 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

20 0.052760322 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.147), (1, -0.057), (2, -0.039), (3, -0.013), (4, -0.091), (5, -0.03), (6, -0.016), (7, 0.012), (8, 0.104), (9, -0.075), (10, 0.01), (11, -0.029), (12, 0.024), (13, 0.076), (14, -0.12), (15, -0.061), (16, 0.027), (17, -0.027), (18, -0.035), (19, -0.013), (20, -0.057), (21, -0.013), (22, -0.057), (23, -0.02), (24, -0.013), (25, -0.034), (26, -0.036), (27, 0.025), (28, 0.021), (29, -0.018), (30, 0.001), (31, 0.012), (32, -0.045), (33, -0.015), (34, 0.065), (35, -0.009), (36, -0.046), (37, -0.025), (38, 0.084), (39, -0.033), (40, 0.055), (41, -0.024), (42, -0.044), (43, -0.018), (44, 0.041), (45, -0.102), (46, 0.106), (47, -0.052), (48, -0.004), (49, 0.025)]
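The topicId/topicWeight pairs above place the paper in a latent semantic (LSI) space, and the simIndex/simValue list that follows ranks papers by similarity in that space. The sketch below shows one way such a ranking could be computed with truncated SVD and cosine similarity; the stand-in corpus and the number of components are assumptions (the page lists roughly 100 topic ids, so a real run would use a correspondingly larger value).

# Sketch of LSI-style paper similarity: tf-idf vectors reduced with truncated SVD,
# then ranked by cosine similarity. Corpus, preprocessing and n_components are
# assumptions; the configuration actually used for this page is not documented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

papers = {                                    # stand-in texts; real input would be full papers
    "acl-2013-270": "parallel treebank lfg pargram grammars alignment urdu english",
    "acl-2013-368": "universal dependency annotation multilingual parsing treebanks",
    "acl-2013-357": "transfer learning constituency grammars hpsg ccg lfg parsing",
}
paper_ids = list(papers)

tfidf = TfidfVectorizer(stop_words="english").fit_transform(papers[p] for p in paper_ids)
lsi = TruncatedSVD(n_components=2, random_state=0)   # 2 for the toy corpus; ~100 for real data
topic_weights = lsi.fit_transform(tfidf)              # rows play the role of topicId/topicWeight vectors

query = paper_ids.index("acl-2013-270")
sims = cosine_similarity(topic_weights[query:query + 1], topic_weights).ravel()
ranking = sorted(zip(paper_ids, sims), key=lambda pair: -pair[1])  # cf. simIndex/simValue
print(ranking)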

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9337855 270 acl-2013-ParGramBank: The ParGram Parallel Treebank

Author: Sebastian Sulger ; Miriam Butt ; Tracy Holloway King ; Paul Meurer ; Tibor Laczko ; Gyorgy Rakosi ; Cheikh Bamba Dione ; Helge Dyvik ; Victoria Rosen ; Koenraad De Smedt ; Agnieszka Patejuk ; Ozlem Cetinoglu ; I Wayan Arka ; Meladel Mistica

Abstract: This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. The treebank is based on deep LFG (Lexical-Functional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. The grammars produce output that is maximally parallelized across languages and language families. This output forms the basis of a parallel treebank covering a diverse set of phenomena. The treebank is publicly available via the INESS treebanking environment, which also allows for the alignment of language pairs. We thus present a unique, multilayered parallel treebank that represents more and different types of languages than are available in other treebanks, that represents deep linguistic knowledge and that allows for the alignment of sentences at several levels: dependency structures, constituency structures and POS information.

2 0.72502756 94 acl-2013-Coordination Structures in Dependency Treebanks

Author: Martin Popel ; David Mareček ; Jan Štěpánek ; Daniel Zeman ; Zdeněk Žabokrtský

Abstract: Paratactic syntactic structures are notoriously difficult to represent in dependency formalisms. This has painful consequences such as high frequency of parsing errors related to coordination. In other words, coordination is a pending problem in dependency analysis of natural languages. This paper tries to shed some light on this area by bringing a systematizing view of various formal means developed for encoding coordination structures. We introduce a novel taxonomy of such approaches and apply it to treebanks across a typologically diverse range of 26 languages. In addition, empirical observations on convertibility between selected styles of representations are shown too.

3 0.72392458 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

Author: Ryan McDonald ; Joakim Nivre ; Yvonne Quirmbach-Brundage ; Yoav Goldberg ; Dipanjan Das ; Kuzman Ganchev ; Keith Hall ; Slav Petrov ; Hao Zhang ; Oscar Tackstrom ; Claudia Bedini ; Nuria Bertomeu Castello ; Jungmee Lee

Abstract: We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean. To show the usefulness of such a resource, we present a case study of crosslingual transfer parsing with more reliable evaluation than has been possible before. This ‘universal’ treebank is made freely available in order to facilitate research on multilingual dependency parsing.1

4 0.66184437 28 acl-2013-A Unified Morpho-Syntactic Scheme of Stanford Dependencies

Author: Reut Tsarfaty

Abstract: Stanford Dependencies (SD) provide a functional characterization of the grammatical relations in syntactic parse-trees. The SD representation is useful for parser evaluation, for downstream applications, and, ultimately, for natural language understanding, however, the design of SD focuses on structurally-marked relations and under-represents morphosyntactic realization patterns observed in Morphologically Rich Languages (MRLs). We present a novel extension of SD, called Unified-SD (U-SD), which unifies the annotation of structurally- and morphologically-marked relations via an inheritance hierarchy. We create a new resource composed of U-SDannotated constituency and dependency treebanks for the MRL Modern Hebrew, and present two systems that can automatically predict U-SD annotations, for gold segmented input as well as raw texts, with high baseline accuracy.

5 0.64406067 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

Author: Markus Gartner ; Gregor Thiele ; Wolfgang Seeker ; Anders Bjorkelund ; Jonas Kuhn

Abstract: We present ICARUS, a versatile graphical search tool to query dependency treebanks. Search results can be inspected both quantitatively and qualitatively by means of frequency lists, tables, or dependency graphs. ICARUS also ships with plugins that enable it to interface with tool chains running either locally or remotely.

6 0.60079885 367 acl-2013-Universal Conceptual Cognitive Annotation (UCCA)

7 0.59375018 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

8 0.55539149 372 acl-2013-Using CCG categories to improve Hindi dependency parsing

9 0.543001 357 acl-2013-Transfer Learning for Constituency-Based Grammars

10 0.53865349 186 acl-2013-Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach

11 0.5375424 161 acl-2013-Fluid Construction Grammar for Historical and Evolutionary Linguistics

12 0.52734876 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

13 0.50950861 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations

14 0.5061962 331 acl-2013-Stop-probability estimates computed on a large corpus improve Unsupervised Dependency Parsing

15 0.50328761 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study

16 0.49783829 335 acl-2013-Survey on parsing three dependency representations for English

17 0.4889653 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations

18 0.48270977 51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP

19 0.48165897 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages

20 0.46736768 280 acl-2013-Plurality, Negation, and Quantification:Towards Comprehensive Quantifier Scope Disambiguation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.045), (6, 0.025), (11, 0.046), (14, 0.012), (24, 0.055), (26, 0.066), (28, 0.012), (35, 0.059), (42, 0.064), (48, 0.03), (52, 0.01), (53, 0.346), (61, 0.015), (70, 0.029), (88, 0.031), (90, 0.019), (95, 0.049)]
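The lda topic weights above form a sparse, non-negative topic distribution for the paper, and the list below ranks papers by similarity under that model. The following sketch uses scikit-learn's LatentDirichletAllocation; as with the LSI sketch, the corpus, the number of topics and the cosine similarity measure are assumptions rather than the page's documented setup.

# Sketch of LDA-based paper similarity: per-paper topic distributions inferred
# from raw term counts, then compared with cosine similarity. Corpus, number of
# topics and similarity measure are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

papers = {                                    # stand-in texts; real input would be full papers
    "acl-2013-270": "parallel treebank lfg pargram grammars alignment urdu english",
    "acl-2013-368": "universal dependency annotation multilingual parsing treebanks",
    "acl-2013-357": "transfer learning constituency grammars hpsg ccg lfg parsing",
}
paper_ids = list(papers)

counts = CountVectorizer(stop_words="english").fit_transform(papers[p] for p in paper_ids)
lda = LatentDirichletAllocation(n_components=3, random_state=0)  # 3 topics for the toy corpus
theta = lda.fit_transform(counts)             # rows are per-paper topic distributions

query = paper_ids.index("acl-2013-270")
sims = cosine_similarity(theta[query:query + 1], theta).ravel()
ranking = sorted(zip(paper_ids, sims), key=lambda pair: -pair[1])
print(ranking)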

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78766769 270 acl-2013-ParGramBank: The ParGram Parallel Treebank

Author: Sebastian Sulger ; Miriam Butt ; Tracy Holloway King ; Paul Meurer ; Tibor Laczko ; Gyorgy Rakosi ; Cheikh Bamba Dione ; Helge Dyvik ; Victoria Rosen ; Koenraad De Smedt ; Agnieszka Patejuk ; Ozlem Cetinoglu ; I Wayan Arka ; Meladel Mistica

Abstract: This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. The treebank is based on deep LFG (Lexical-Functional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. The grammars produce output that is maximally parallelized across languages and language families. This output forms the basis of a parallel treebank covering a diverse set of phenomena. The treebank is publicly available via the INESS treebanking environment, which also allows for the alignment of language pairs. We thus present a unique, multilayered parallel treebank that represents more and different types of languages than are available in other treebanks, that represents deep linguistic knowledge and that allows for the alignment of sentences at several levels: dependency structures, constituency structures and POS information.

2 0.77720058 370 acl-2013-Unsupervised Transcription of Historical Documents

Author: Taylor Berg-Kirkpatrick ; Greg Durrett ; Dan Klein

Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system.

3 0.44941282 80 acl-2013-Chinese Parsing Exploiting Characters

Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu

Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.

4 0.40357447 225 acl-2013-Learning to Order Natural Language Texts

Author: Jiwei Tan ; Xiaojun Wan ; Jianguo Xiao

Abstract: Ordering texts is an important task for many NLP applications. Most previous works on summary sentence ordering rely on the contextual information (e.g. adjacent sentences) of each sentence in the source document. In this paper, we investigate a more challenging task of ordering a set of unordered sentences without any contextual information. We introduce a set of features to characterize the order and coherence of natural language texts, and use the learning to rank technique to determine the order of any two sentences. We also propose to use the genetic algorithm to determine the total order of all sentences. Evaluation results on a news corpus show the effectiveness of our proposed method.

5 0.39933199 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

Author: Ryan McDonald ; Joakim Nivre ; Yvonne Quirmbach-Brundage ; Yoav Goldberg ; Dipanjan Das ; Kuzman Ganchev ; Keith Hall ; Slav Petrov ; Hao Zhang ; Oscar Tackstrom ; Claudia Bedini ; Nuria Bertomeu Castello ; Jungmee Lee

Abstract: We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean. To show the usefulness of such a resource, we present a case study of crosslingual transfer parsing with more reliable evaluation than has been possible before. This ‘universal’ treebank is made freely available in order to facilitate research on multilingual dependency parsing.1

6 0.39740267 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

7 0.39702505 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

8 0.39693633 318 acl-2013-Sentiment Relevance

9 0.39575827 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

10 0.3941282 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

11 0.39343217 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification

12 0.39288786 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

13 0.39262068 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

14 0.39197817 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

15 0.39112949 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

16 0.39032283 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

17 0.38952014 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

18 0.38907272 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

19 0.38822272 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

20 0.38768667 172 acl-2013-Graph-based Local Coherence Modeling