acl acl2013 acl2013-271 knowledge-graph by maker-knowledge-mining

271 acl-2013-ParaQuery: Making Sense of Paraphrase Collections


Source: pdf

Author: Lili Kotlerman ; Nitin Madnani ; Aoife Cahill

Abstract: Pivoting on bilingual parallel corpora is a popular approach for paraphrase acquisition. Although such pivoted paraphrase collections have been successfully used to improve the performance of several different NLP applications, it is still difficult to get an intrinsic estimate of the quality and coverage of the paraphrases contained in these collections. We present ParaQuery, a tool that helps a user interactively explore and characterize a given pivoted paraphrase collection, analyze its utility for a particular domain, and compare it to other popular lexical similarity resources, all within a single interface.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Pivoting on bilingual parallel corpora is a popular approach for paraphrase acquisition. [sent-3, score-0.496]

2 Although such pivoted paraphrase collections have been successfully used to improve the performance of several different NLP applications, it is still difficult to get an intrinsic estimate of the quality and coverage of the paraphrases contained in these collections. [sent-4, score-1.119]

3 We present ParaQuery, a tool that helps a user interactively explore and characterize a given pivoted paraphrase collection, analyze its utility for a particular domain, and compare it to other popular lexical similarity resources all within a single interface. [sent-5, score-1.054]

4 1 Introduction Paraphrases are widely used in many Natural Language Processing (NLP) tasks, such as information retrieval, question answering, recognizing textual entailment, text simplification, etc. [sent-6, score-0.031]

5 For example, a question answering system facing a question “Who invented bifocals and lightning rods? [sent-7, score-0.31]

6 ” could retrieve the correct answer from the text “Benjamin Franklin invented strike termination devices and bifocal reading glasses” given the information that “bifocal reading glasses” is a paraphrase of “bifocals” and “strike termination devices” is a paraphrase of “lightning rods”. [sent-8, score-1.229]

7 There are numerous approaches for automatically extracting paraphrases from text (Madnani and Dorr, 2010). [sent-9, score-0.15]

8 We focus on generating paraphrases by pivoting on bilingual parallel corpora as originally suggested by Bannard and CallisonBurch (2005). [sent-10, score-0.271]

9 This technique operates by attempting to infer semantic equivalence between phrases in the same language by using a second language as a bridge. [sent-11, score-0.088]

10 This requires a bilingual phrase table, a tabulation of correspondences between phrases in the source language and phrases in the target language. [sent-14, score-0.328]

11 These tables are usually extracted by inducing word alignments between sentence pairs in a parallel training corpus and then incrementally building longer phrasal correspondences from individual words and shorter phrases. [sent-15, score-0.108]

12 Once such a tabulation of bilingual correspondences is available, correspondences between phrases in one language may be inferred simply by using the phrases in the other language as pivots, e. [sent-16, score-0.401]

13 Each paraphrase pair (rule) in a pivoted paraphrase collection is defined by a source phrase e1, the target phrase e2 that has been inferred as its paraphrase, and a probability score p(e2|e1) obtained from the probability values in the bilingual phrase table. [sent-19, score-1.529]
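
The pivoting computation behind these scores can be sketched in a few lines. The following is a minimal illustration of the Bannard and Callison-Burch (2005) estimate, not the authors' code; the phrase-table probabilities are invented toy values.

```python
from collections import defaultdict

# Toy phrase-table probabilities (invented values):
# p(f|e) for English->French and p(e|f) for French->English.
p_f_given_e = {"man": {"homme": 0.7, "l": 0.1}}
p_e_given_f = {"homme": {"man": 0.6, "guy": 0.2},
               "l": {"the": 0.4, "guy": 0.05}}

def pivot_paraphrases(e1):
    """Return {e2: p(e2|e1)} by summing over all shared pivot phrases f."""
    scores = defaultdict(float)
    for f, p_f in p_f_given_e.get(e1, {}).items():
        for e2, p_e2 in p_e_given_f.get(f, {}).items():
            if e2 != e1:                  # skip the trivial identity rule
                scores[e2] += p_e2 * p_f  # p(e2|f) * p(f|e1)
    return dict(scores)

print(pivot_paraphrases("man"))  # {'guy': 0.145, 'the': 0.04} (approximately)
```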

14 1 Pivoted paraphrase collections have been successfully used in different NLP tasks including automated document summarization (Zhou et al. [sent-20, score-0.588]

15 Yet, it is still difficult to get an estimate of the intrinsic quality and coverage of the paraphrases contained in these collections. [sent-23, score-0.226]

16 To remedy this, we propose ParaQuery, a tool that can help explore and analyze pivoted paraphrase collections. [sent-24, score-0.869]

17 1) and then demonstrate its use for interactively exploring and characterizing a paraphrase collection, analyzing its utility for a particular domain, and comparing it with other word-similarity resources (§2.2). [sent-26, score-0.543]

18 This format is commonly used in the machine translation and paraphrase generation community. [sent-36, score-0.429]

19 , 2012) toolkits to generate a pivoted paraphrase collection using the English-French EuroParl parallel corpus, which we use as our example collection for demonstrating ParaQuery. [sent-38, score-1.085]

20 Once a pivoted collection is generated, ParaQuery needs to convert it into an SQLite database against which queries can be run. [sent-39, score-0.574]

21 This is done by issuing the index command at the ParaQuery command-line interface (described in §2. [sent-40, score-0.241]
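
ParaQuery's internal schema is not described here, so the following is only a plausible sketch of what the index command might build: an SQLite table with one row per paraphrase rule and an index on the source phrase for fast lookup. All table and column names are our assumptions.

```python
import sqlite3

conn = sqlite3.connect("paraphrases.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS rules (
        source     TEXT,     -- source phrase e1
        target     TEXT,     -- target phrase e2
        score      REAL,     -- p(e2|e1) from the pivoted collection
        same_pos   INTEGER,  -- do e1 and e2 share a (simple) part of speech?
        src_len    INTEGER,  -- length of e1
        tgt_len    INTEGER,  -- length of e2
        len_diff   INTEGER,  -- difference in lengths
        in_wordnet INTEGER   -- are both e1 and e2 found in WordNet?
    )""")
rows = [("man", "guy", 0.145, 1, 1, 1, 0, 1)]  # toy data
conn.executemany("INSERT INTO rules VALUES (?,?,?,?,?,?,?,?)", rows)
conn.execute("CREATE INDEX IF NOT EXISTS idx_source ON rules(source)")
conn.commit()
```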

22 2 Exploration and Analysis In order to provide meaningful exploration and analysis, we studied various scenarios in which paraphrase collections are used, and found that the following issues typically interest the developers and users of such collections: 1. [sent-44, score-0.671]

23 Semantic relations between the paraphrases in the collection (e. [sent-45, score-0.371]

24 The frequency of inaccurate paraphrases, possible ways of de-noising the collection, and the meaningfulness of scores (better paraphrases should be scored higher). [sent-49, score-0.15]

25 The utility of the collection for a specific domain, i. [sent-51, score-0.222]

26 whether domain terms of interest are present in the collection. [sent-53, score-0.066]

27 Comparison of different collections based on the above dimensions. [sent-55, score-0.159]

28 We note that paraphrase collections are used in many tasks with different acceptability thresholds for semantic relations, noisy paraphrases, etc. [sent-56, score-0.861]

29 We do not intend to provide an exhaustive judgment of paraphrase quality, but instead allow users to characterize a collection, enabling an analysis of the aforesaid issues and providing information for them to decide whether a given collection is suitable for their specific task and/or domain. [sent-57, score-0.67]

30 1 Command line interface ParaQuery allows interactive exploration and analysis via a simple command-line interface, by processing user-issued queries such as: show : display the rules which satisfy the conditions of the given query. [sent-60, score-0.555]

31 explain : display information about the pivots which yielded each of these rules. [sent-62, score-0.437]

32 analyze : display statistics about these rules and save a report to an output file. [sent-63, score-0.251]

33 The following information is stored in the SQLite database for each paraphrase rule:2 • The source and the target phrases, and the probability score of the rule. [sent-64, score-0.686]

34 • Do the source and the target have the same part of speech?3 [sent-66, score-0.109]

35 • Length of the source and the target, and the difference in their lengths. [sent-67, score-0.132]

36 • Are both the source and the target found in WordNet (WN)? [sent-69, score-0.075]

37 Therefore, all of the above can be used, alone or in combination, to constrain the queries and define the rule(s) of interest. [sent-71, score-0.076]

38 Figure 1 presents simple queries processed by the show command: the first query displays the top-scoring rules with “man” as their source phrase, while the second adds a restriction on the rules’ score. [sent-72, score-0.329]

39 By default, the tool displays the 10 best-scoring rules per query, but this limit can be changed as shown. [sent-73, score-0.199]
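
A show query with a source-phrase condition, a score restriction, and a result limit maps naturally onto a single SQL statement over the table sketched earlier. This translation is our guess at the mechanics, not ParaQuery's actual implementation:

```python
import sqlite3

conn = sqlite3.connect("paraphrases.db")  # the database built above

def show(source, min_score=0.0, limit=10):
    """Hypothetical SQL backing a 'show' query with a score restriction."""
    return conn.execute(
        "SELECT source, target, score FROM rules "
        "WHERE source = ? AND score >= ? "
        "ORDER BY score DESC LIMIT ?",
        (source, min_score, limit)).fetchall()

print(show("man"))                 # the 10 best-scoring rules for "man"
print(show("man", min_score=0.1))  # ... restricted to scores >= 0.1
```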

40 2Although some of this information is available in the paraphrase collection that was indexed, the rest is automatically computed and injected into the database during the indexing process. [sent-75, score-0.689]

41 Indexing the French-pivoted paraphrase collection (containing 3,633,015 paraphrase rules) used in this paper took about 6 hours. [sent-76, score-1.02]

42 3We use the simple parts of speech provided by WordNet (nouns, verbs, adjectives and adverbs). [sent-77, score-0.041]

43 The queries provide a flexible way to define and work with the rule set of interest, ranging from filtering out low-scoring rules to extracting specific semantic relations or constraining the number of pivots. [sent-78, score-0.396]

44 The tool also enables filtering out target terms with a recurrent lemma, as illustrated in the same figure. [sent-80, score-0.108]
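
The recurrent-lemma filter could work roughly as follows. The lemmatizer here is a deliberately crude placeholder (the lemmatization ParaQuery actually uses is not described), and keeping only the best-scoring rule per target lemma is our assumption about the filter's policy:

```python
def lemma(word):
    # Placeholder lemmatizer: strips a plural "s". A real implementation
    # would use a resource such as WordNet's morphological analyzer.
    return word[:-1] if word.endswith("s") else word

def unique_lemma_targets(rules):
    """Keep only the best-scoring rule per target lemma."""
    seen, kept = set(), []
    for src, tgt, score in sorted(rules, key=lambda r: -r[2]):
        if lemma(tgt) not in seen:
            seen.add(lemma(tgt))
            kept.append((src, tgt, score))
    return kept

rules = [("man", "guy", 0.2), ("man", "guys", 0.1), ("man", "person", 0.05)]
print(unique_lemma_targets(rules))  # "guys" is dropped (same lemma as "guy")
```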

45 Note that ParaQuery also contains a batch mode (in addition to the interactive mode illustrated so far) to automatically extract the output for a set of queries contained in a batch script. [sent-81, score-0.268]

46 Figure 1: Examples of the show command and the probability constraint. [sent-82, score-0.206]

47 2 Analyzing pivot information It is well known that pivoted paraphrase collections contain a lot of noisy rules. [sent-85, score-1.03]

48 To understand the origins of such rules, an explain query can be used, which displays the pivots that yielded each paraphrase rule, and the probability share of each pivot in the final probability score. [sent-86, score-1.064]
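
The probability share follows directly from the pivoting formula: each pivot f contributes p(e2|f) * p(f|e1) to p(e2|e1), so its share is that contribution divided by the total. A small worked example with invented numbers:

```python
# Invented numbers: two pivots contribute to the rule man -> guy.
contributions = {"homme": 0.2 * 0.7,   # p(guy|homme) * p(homme|man)
                 "l":     0.05 * 0.1}  # p(guy|l)     * p(l|man)
total = sum(contributions.values())    # p(guy|man) = 0.145
for pivot, c in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{pivot}: contribution={c:.3f}, share={c / total:.1%}")
# homme: contribution=0.140, share=96.6%
# l: contribution=0.005, share=3.4%
```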

49 We see that noisy rules can originate from stopword pivots, e. [sent-88, score-0.213]

50 It is common to filter out rules containing stop-words, yet it may also be important to exclude stop-word pivots, a step that has not been considered in the past. [sent-91, score-0.134]

51 We can use ParaQuery to further explore whether discarding stopword pivots is a good idea. [sent-92, score-0.412]

52 Figure 4 presents a more complex query showing paraphrase rules that were extracted via a single pivot “l”. [sent-93, score-0.669]

53 We see that the top 5 such rules are indeed noisy, indicating that perhaps all of the 5,360 rules satisfying the query can be filtered out. [sent-94, score-0.296]

54 3 Analysis of rule sets In order to provide an overall analysis of a rule set or a complete collection, ParaQuery includes the analyze command. Figure 2: Restricting the output of the show command using WordNet relations and distance, and the unique lemma constraint. [sent-97, score-0.448]

55 In addition, a report is written to a file, including the analysis information for the whole rule set and for its three parts: top, middle and bottom, as defined by the scores of the rules in the set. [sent-101, score-0.239]
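
The text does not spell out how the top, middle and bottom parts are delimited; a simple reading, assumed in this sketch, is equal thirds after sorting the rules by descending score:

```python
def split_by_score(rules):
    """Split a rule set into top/middle/bottom thirds by descending score."""
    ranked = sorted(rules, key=lambda r: -r[2])
    n = len(ranked)
    return ranked[:n // 3], ranked[n // 3:2 * n // 3], ranked[2 * n // 3:]

rules = [("man", t, s) for t, s in
         [("guy", 0.9), ("person", 0.6), ("men", 0.5),
          ("the", 0.1), ("of", 0.05), ("one", 0.02)]]
top, middle, bottom = split_by_score(rules)
print([t for _, t, _ in top])     # ['guy', 'person']
print([t for _, t, _ in bottom])  # ['of', 'one']
```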

56 The output to the file is more detailed and expands on the information presented in Figure 5. [sent-102, score-0.033]

57 For example, it also includes, for each part, rule samples and score distributions for each semantic relation and different WordNet distances. [sent-103, score-0.093]

58 The information contained in the report can be … Figure 4: Exploring French stop-word pivots using the pivots condition of the show command. [sent-104, score-0.634]

59 Figure 5: An example of the analyze command (full output not shown for space reasons). [sent-105, score-0.24]

60 … rules from our collection’s top and bottom parts. [sent-106, score-0.17]

61 For example, Figure 6 shows the distribution of semantic relations in the three parts of our example paraphrase collection. [sent-108, score-0.529]

62 Among other conclusions, the figure shows that discarding the lower-scoring middle and bottom parts of the collection would allow retaining almost all the synonyms and derivations, while filtering out most of the co-hyponyms and a considerable number of undefined relations. [sent-110, score-0.517]

63 Yet from Figure 6 we see that undefined relations constitute the majority of the rules in the collection. [sent-111, score-0.219]

64 To better understand this, random rule samples provided in the analysis output can be used, as shown in Table 1. [sent-112, score-0.093]

65 From this table, we see that the top-part rules are indeed mostly valid for paraphrasing, unlike the noisy bottom-part rules. [sent-113, score-0.166]

66 The score distributions reported as part of the analysis can be used to further explore the collection and set sound thresholds suitable for different tasks and needs. [sent-114, score-0.27]

67 4 Analysis of domain utility One of the frequent questions of interest is whether a given collection is suitable for a specific domain. [sent-117, score-0.329]

68 To answer this question, ParaQuery allows the user to run the analysis from §2. [sent-118, score-0.033]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('paraquery', 0.499), ('paraphrase', 0.429), ('pivoted', 0.305), ('pivots', 0.295), ('command', 0.175), ('collection', 0.162), ('collections', 0.159), ('paraphrases', 0.15), ('undefined', 0.112), ('rules', 0.107), ('rule', 0.093), ('correspondences', 0.081), ('madnani', 0.081), ('display', 0.079), ('pivot', 0.078), ('bifocal', 0.077), ('bifocals', 0.077), ('lightning', 0.077), ('rods', 0.077), ('queries', 0.076), ('glasses', 0.068), ('sqlite', 0.068), ('tabulation', 0.068), ('analyze', 0.065), ('bottom', 0.063), ('termination', 0.063), ('utility', 0.06), ('relations', 0.059), ('invented', 0.059), ('strike', 0.059), ('noisy', 0.059), ('query', 0.055), ('displays', 0.054), ('interactively', 0.054), ('pivoting', 0.054), ('phrases', 0.052), ('devices', 0.05), ('stopword', 0.047), ('exploration', 0.047), ('contained', 0.044), ('batch', 0.042), ('suitable', 0.041), ('parts', 0.041), ('bilingual', 0.04), ('middle', 0.039), ('discarding', 0.038), ('interface', 0.038), ('tool', 0.038), ('characterize', 0.038), ('target', 0.038), ('source', 0.037), ('wordnet', 0.036), ('interest', 0.036), ('operates', 0.036), ('derivations', 0.035), ('thresholds', 0.035), ('answering', 0.035), ('explain', 0.034), ('injected', 0.034), ('tear', 0.034), ('franklin', 0.034), ('holonym', 0.034), ('coef', 0.034), ('hofe', 0.034), ('man', 0.034), ('paraphrasing', 0.033), ('user', 0.033), ('file', 0.033), ('indexing', 0.033), ('filtering', 0.032), ('intrinsic', 0.032), ('mode', 0.032), ('explore', 0.032), ('tihn', 0.031), ('database', 0.031), ('probability', 0.031), ('question', 0.031), ('synonyms', 0.03), ('domain', 0.03), ('eth', 0.03), ('constraining', 0.029), ('kotlerman', 0.029), ('lili', 0.029), ('meronym', 0.029), ('mto', 0.029), ('aoife', 0.029), ('acceptability', 0.029), ('wheh', 0.029), ('yielded', 0.029), ('stored', 0.028), ('lemma', 0.028), ('bannard', 0.028), ('origins', 0.028), ('issuing', 0.028), ('hyponymy', 0.028), ('cahill', 0.028), ('tohfe', 0.028), ('inferred', 0.027), ('parallel', 0.027), ('perhaps', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 271 acl-2013-ParaQuery: Making Sense of Paraphrase Collections

Author: Lili Kotlerman ; Nitin Madnani ; Aoife Cahill

Abstract: Pivoting on bilingual parallel corpora is a popular approach for paraphrase acquisition. Although such pivoted paraphrase collections have been successfully used to improve the performance of several different NLP applications, it is still difficult to get an intrinsic estimate of the quality and coverage of the paraphrases contained in these collections. We present ParaQuery, a tool that helps a user interactively explore and characterize a given pivoted paraphrase collection, analyze its utility for a particular domain, and compare it to other popular lexical similarity resources, all within a single interface.

2 0.24590181 273 acl-2013-Paraphrasing Adaptation for Web Search Ranking

Author: Chenguang Wang ; Nan Duan ; Ming Zhou ; Ming Zhang

Abstract: Mismatch between queries and documents is a key issue for the web search task. In order to narrow down such mismatch, in this paper, we present an in-depth investigation on adapting a paraphrasing technique to web search from three aspects: a search-oriented paraphrasing model; an NDCG-based parameter optimization algorithm; an enhanced ranking model leveraging augmented features computed on paraphrases of original queries. Experiments performed on the large-scale query-document data set show that the search performance can be significantly improved, with +3.28% and +1.14% NDCG gains on dev and test sets respectively.

3 0.16678755 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering

Author: Anthony Fader ; Luke Zettlemoyer ; Oren Etzioni

Abstract: We study question answering as a machine learning problem, and induce a function that maps open-domain questions to queries over a database of web extractions. Given a large, community-authored, question-paraphrase corpus, we demonstrate that it is possible to learn a semantic lexicon and linear ranking function without manually annotating questions. Our approach automatically generalizes a seed lexicon and includes a scalable, parallelized perceptron parameter estimation scheme. Experiments show that our approach more than quadruples the recall of the seed lexicon, with only an 8% loss in precision.

4 0.11477876 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit

Author: Vasile Rus ; Mihai Lintean ; Rajendra Banjade ; Nobal Niraula ; Dan Stefanescu

Abstract: We present in this paper SEMILAR, the SEMantic simILARity toolkit. SEMILAR implements a number of algorithms for assessing the semantic similarity between two texts. It is available as a Java library and as a Java standalone application offering GUI-based access to the implemented semantic similarity methods. Furthermore, it offers facilities for manual semantic similarity annotation by experts through its component SEMILAT (a SEMantic simILarity Annotation Tool).

5 0.086405501 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation

Author: Ahmed El Kholy ; Nizar Habash ; Gregor Leusch ; Evgeny Matusov ; Hassan Sawaf

Abstract: An important challenge to statistical machine translation (SMT) is the lack of parallel data for many language pairs. One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages. Although pivoting is a robust technique, it introduces some low quality translations. In this paper, we present two language-independent features to improve the quality of phrase-pivot based SMT. The features, source connectivity strength and target connectivity strength reflect the quality of projected alignments between the source and target phrases in the pivot phrase table. We show positive results (0.6 BLEU points) on Persian-Arabic SMT as a case study.

6 0.085914001 311 acl-2013-Semantic Neighborhoods as Hypergraphs

7 0.085381828 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

8 0.082103632 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

9 0.076272994 285 acl-2013-Propminer: A Workflow for Interactive Information Extraction and Exploration using Dependency Trees

10 0.071720392 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models

11 0.064823486 314 acl-2013-Semantic Roles for String to Tree Machine Translation

12 0.064328946 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules

13 0.055937413 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

14 0.054525327 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation

15 0.054454397 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context

16 0.051497024 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation

17 0.051360745 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner

18 0.050265398 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

19 0.049792439 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

20 0.048924003 290 acl-2013-Question Analysis for Polish Question Answering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.138), (1, -0.003), (2, 0.072), (3, -0.065), (4, -0.007), (5, 0.013), (6, 0.0), (7, -0.11), (8, 0.048), (9, 0.013), (10, 0.013), (11, 0.045), (12, 0.038), (13, 0.038), (14, 0.024), (15, -0.018), (16, 0.043), (17, -0.01), (18, -0.011), (19, 0.045), (20, -0.066), (21, -0.005), (22, -0.03), (23, 0.059), (24, 0.003), (25, -0.012), (26, -0.03), (27, 0.107), (28, -0.059), (29, 0.014), (30, -0.067), (31, 0.087), (32, -0.178), (33, -0.052), (34, 0.028), (35, -0.075), (36, 0.092), (37, -0.058), (38, 0.078), (39, 0.075), (40, -0.029), (41, -0.031), (42, -0.017), (43, -0.024), (44, 0.025), (45, -0.017), (46, 0.057), (47, 0.022), (48, 0.003), (49, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92253524 271 acl-2013-ParaQuery: Making Sense of Paraphrase Collections

Author: Lili Kotlerman ; Nitin Madnani ; Aoife Cahill

Abstract: Pivoting on bilingual parallel corpora is a popular approach for paraphrase acquisition. Although such pivoted paraphrase collections have been successfully used to improve the performance of several different NLP applications, it is still difficult to get an intrinsic estimate of the quality and coverage of the paraphrases contained in these collections. We present ParaQuery, a tool that helps a user interactively explore and characterize a given pivoted paraphrase collection, analyze its utility for a particular domain, and compare it to other popular lexical similarity resources, all within a single interface.

2 0.81606501 273 acl-2013-Paraphrasing Adaptation for Web Search Ranking

Author: Chenguang Wang ; Nan Duan ; Ming Zhou ; Ming Zhang

Abstract: Mismatch between queries and documents is a key issue for the web search task. In order to narrow down such mismatch, in this paper, we present an in-depth investigation on adapting a paraphrasing technique to web search from three aspects: a search-oriented paraphrasing model; an NDCG-based parameter optimization algorithm; an enhanced ranking model leveraging augmented features computed on paraphrases of original queries. Ex- periments performed on the large scale query-document data set show that, the search performance can be significantly improved, with +3.28% and +1.14% NDCG gains on dev and test sets respectively.

3 0.63185984 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering

Author: Anthony Fader ; Luke Zettlemoyer ; Oren Etzioni

Abstract: We study question answering as a machine learning problem, and induce a function that maps open-domain questions to queries over a database of web extractions. Given a large, community-authored, question-paraphrase corpus, we demonstrate that it is possible to learn a semantic lexicon and linear ranking function without manually annotating questions. Our approach automatically generalizes a seed lexicon and includes a scalable, parallelized perceptron parameter estimation scheme. Experiments show that our approach more than quadruples the recall of the seed lexicon, with only an 8% loss in precision.

4 0.6253258 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

Author: Markus Gartner ; Gregor Thiele ; Wolfgang Seeker ; Anders Bjorkelund ; Jonas Kuhn

Abstract: We present ICARUS, a versatile graphical search tool to query dependency treebanks. Search results can be inspected both quantitatively and qualitatively by means of frequency lists, tables, or dependency graphs. ICARUS also ships with plugins that enable it to interface with tool chains running either locally or remotely.

5 0.58468688 158 acl-2013-Feature-Based Selection of Dependency Paths in Ad Hoc Information Retrieval

Author: K. Tamsin Maxwell ; Jon Oberlander ; W. Bruce Croft

Abstract: Techniques that compare short text segments using dependency paths (or simply, paths) appear in a wide range of automated language processing applications including question answering (QA). However, few models in ad hoc information retrieval (IR) use paths for document ranking due to the prohibitive cost of parsing a retrieval collection. In this paper, we introduce a flexible notion of paths that describe chains of words on a dependency path. These chains, or catenae, are readily applied in standard IR models. Informative catenae are selected using supervised machine learning with linguistically informed features and compared to both non-linguistic terms and catenae selected heuristically with filters derived from work on paths. Automatically selected catenae of 1-2 words deliver significant performance gains on three TREC collections.

6 0.54380858 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context

7 0.49743247 290 acl-2013-Question Analysis for Polish Question Answering

8 0.49458992 285 acl-2013-Propminer: A Workflow for Interactive Information Extraction and Exploration using Dependency Trees

9 0.49129197 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

10 0.47651759 215 acl-2013-Large-scale Semantic Parsing via Schema Matching and Lexicon Extension

11 0.464075 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

12 0.46111584 311 acl-2013-Semantic Neighborhoods as Hypergraphs

13 0.42450744 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

14 0.41721591 268 acl-2013-PATHS: A System for Accessing Cultural Heritage Collections

15 0.41654304 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models

16 0.41621491 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction

17 0.40357786 202 acl-2013-Is a 204 cm Man Tall or Small ? Acquisition of Numerical Common Sense from the Web

18 0.39482546 270 acl-2013-ParGramBank: The ParGram Parallel Treebank

19 0.3856672 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

20 0.38058162 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.06), (6, 0.011), (11, 0.053), (14, 0.011), (24, 0.533), (26, 0.026), (35, 0.071), (42, 0.029), (48, 0.024), (70, 0.043), (88, 0.014), (95, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9760024 29 acl-2013-A Visual Analytics System for Cluster Exploration

Author: Andreas Lamprecht ; Annette Hautli ; Christian Rohrdantz ; Tina Bogel

Abstract: This paper offers a new way of representing the results of automatic clustering algorithms by employing a Visual Analytics system which maps members of a cluster and their distance to each other onto a two-dimensional space. A case study on Urdu complex predicates shows that the system allows for an appropriate investigation of linguistically motivated data.

same-paper 2 0.95154905 271 acl-2013-ParaQuery: Making Sense of Paraphrase Collections

Author: Lili Kotlerman ; Nitin Madnani ; Aoife Cahill

Abstract: Pivoting on bilingual parallel corpora is a popular approach for paraphrase acquisition. Although such pivoted paraphrase collections have been successfully used to improve the performance of several different NLP applications, it is still difficult to get an intrinsic estimate of the quality and coverage of the paraphrases contained in these collections. We present ParaQuery, a tool that helps a user interactively explore and characterize a given pivoted paraphrase collection, analyze its utility for a particular domain, and compare it to other popular lexical similarity resources, all within a single interface.

3 0.94967759 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

Author: Zede Zhu ; Miao Li ; Lei Chen ; Zhenxin Yang

Abstract: Comparable corpora are important basic resources in cross-language information processing. However, the existing methods of building comparable corpora, which use intertranslate words and relative features, cannot evaluate the topical relation between document pairs. This paper adopts the bilingual LDA model to predict the topical structures of the documents and proposes three algorithms of document similarity in different languages. Experiments show that the novel method can obtain similar documents with consistent topics own better adaptability and stability performance.

4 0.92974919 184 acl-2013-Identification of Speakers in Novels

Author: Hua He ; Denilson Barbosa ; Grzegorz Kondrak

Abstract: Speaker identification is the task of attributing utterances to characters in a literary narrative. It is challenging to automate because the speakers of the majority of utterances are not explicitly identified in novels. In this paper, we present a supervised machine learning approach for the task that incorporates several novel features. The experimental results show that our method is more accurate and general than previous approaches to the problem.

5 0.91169918 128 acl-2013-Does Korean defeat phonotactic word segmentation?

Author: Robert Daland ; Kie Zuraw

Abstract: Computational models of infant word segmentation have not been tested on a wide range of languages. This paper applies a phonotactic segmentation model to Korean. In contrast to the undersegmentation pattern previously found in English and Russian, the model exhibited more oversegmentation errors and more errors overall. Despite the high error rate, analysis suggested that lexical acquisition might not be problematic, provided that infants attend only to frequently segmented items.

6 0.90481693 229 acl-2013-Leveraging Synthetic Discourse Data via Multi-task Learning for Implicit Discourse Relation Recognition

7 0.84959579 72 acl-2013-Bridging Languages through Etymology: The case of cross language text categorization

8 0.79632646 244 acl-2013-Mining Opinion Words and Opinion Targets in a Two-Stage Framework

9 0.68083972 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds

10 0.60692859 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

11 0.59633553 79 acl-2013-Character-to-Character Sentiment Analysis in Shakespeare's Plays

12 0.57390034 140 acl-2013-Evaluating Text Segmentation using Boundary Edit Distance

13 0.56966782 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

14 0.56794292 230 acl-2013-Lightly Supervised Learning of Procedural Dialog Systems

15 0.56611747 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

16 0.55987358 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

17 0.55936021 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

18 0.55474579 85 acl-2013-Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis

19 0.54676199 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

20 0.53001899 360 acl-2013-Translating Italian connectives into Italian Sign Language