acl acl2013 acl2013-281 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jose G. Moreno ; Gael Dias ; Guillaume Cleuziou
Abstract: Post-retrieval clustering is the task of clustering Web search results. Within this context, we propose a new methodology that adapts the classical K-means algorithm to a third-order similarity measure initially developed for NLP tasks. Results obtained with the definition of a new stopping criterion over the ODP-239 and the MORESQUE gold standard datasets evidence that our proposal outperforms all reported text-based approaches.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Post-retrieval clustering is the task of clustering Web search results. [sent-4, score-0.479]
2 Within this context, we propose a new methodology that adapts the classical K-means algorithm to a third-order similarity measure initially developed for NLP tasks. [sent-5, score-0.419]
3 Results obtained with the definition of a new stopping criterion over the ODP-239 and the MORESQUE gold standard datasets evidence that our proposal outperforms all reported text-based approaches. [sent-6, score-0.26]
4 1 Introduction Post-retrieval clustering (PRC), also known as search results clustering or ephemeral clustering, is the task of clustering Web search results. [sent-7, score-0.694]
5 For a given query, the retrieved Web snippets are automatically clustered and presented to the user with meaningful labels in order to minimize the information search process. [sent-8, score-0.232]
6 Indeed, as opposed to classical text clustering, PRC must deal with small collections of short text fragments (Web snippets) and be processed at run time. [sent-11, score-0.166]
7 As a consequence, most of the successful methodologies follow a monothetic approach (Zamir and Etzioni, 1998; Ferragina and Gulli, 2008; Carpineto and Romano, 2010; Navigli and Crisafulli, 2010; Scaiella et al. [sent-12, score-0.056]
8 The underlying idea is to discover the most discriminant topical words in the collection and group together Web snippets containing these relevant terms. [sent-14, score-0.239]
9 On the other hand, the polythetic approach, whose main idea is to represent Web snippets as word feature vectors, has received less attention, the only relevant work being (Osinski and Weiss, 2005). [sent-15, score-0.319]
10 This paper is motivated by the fact that the polythetic approach should lead to improved results if correctly applied to small collections of short text fragments. [sent-21, score-0.035]
11 For that purpose, we propose a new methodology that adapts the classical K-means algorithm to a third-order similarity measure initially developed for Topic Segmentation (Dias et al. [sent-22, score-0.419]
12 Moreover, the adapted K-means algorithm allows each cluster to be labeled directly from its centroid, thus avoiding the above-mentioned extra task. [sent-24, score-0.264]
13 Finally, the evolution of the objective function of the adapted K-means is modeled to automatically define the “best” number of clusters. [sent-25, score-0.039]
14 A new evaluation measure called the b-cubed F-measure (Fb3) and defined in (Amigó et al. [sent-27, score-0.077]
15 , 2009) is then calculated to evaluate both cluster homogeneity and completeness. [sent-28, score-0.204]
16 Results evidence that our proposal outperforms all state-of-the-art approaches with a maximum Fb3 = 0. [sent-29, score-0.077]
17 2 Polythetic Post-Retrieval Clustering The K-means is a geometric clustering algorithm (Lloyd, 1982). [sent-36, score-0.23]
18 Given a set of n data points, the algorithm uses a local search approach to partition the points into K clusters. [sent-37, score-0.087]
19 Each point is then assigned to the center closest to it and the centers are recomputed as centers of mass of their assigned points. [sent-39, score-0.174]
20 To assure convergence, an objective function Q is defined which decreases at each processing step. [sent-41, score-0.037]
21 The classical objective function is defined in Equation (1), Q = Σ_k Σ_{xi∈πk} E(xi, mπk), where πk is a cluster labeled k, xi ∈ πk is an object in the cluster, mπk is the centroid of the cluster πk, and E(.,.) is the Euclidean distance. [sent-42, score-0.429]
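To make the baseline concrete, here is a minimal sketch of the objective in Equation (1); the function name and the use of NumPy are assumptions of this illustration, and E(.,.) is taken to be the squared Euclidean distance.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Classical K-means objective Q (Equation 1): for every cluster
    pi_k, sum the squared Euclidean distances E(x_i, m_pi_k) between
    each assigned point and the cluster centroid. X is an (n, d) array,
    labels an (n,) integer array, centroids a (K, d) array."""
    labels = np.asarray(labels)
    return sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))
```

The PRC adaptation described next replaces E with a third-order similarity and maximizes the objective instead of minimizing it.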
22 Within the context of PRC, the K-means algorithm needs to be adapted to integrate third-order similarity measures (Mihalcea et al. [sent-46, score-0.204]
23 Third-order similarity measures, also called weighted second-order similarity measures, do not rely on exact matches of word features as classical second-order similarity measures do (e.g., [sent-49, score-0.369]
24 the cosine metric), but rather evaluate similarity based on related matches. [sent-51, score-0.071]
25 In this paper, we propose to use the third-order similarity measure called InfoSimba introduced in (Dias et al. [sent-52, score-0.145]
26 Given two Web snippets Xi and Xj, their similarity is evaluated by the similarity of their constituents based on any symmetric similarity measure S(.,.); Equation (2) takes the form S3s(Xi, Xj) = (1/p²) Σk Σl Xik · Xjl · S(Wik, Wjl), where Wik is the k-th word of Xi's vector and Xik its weight. [sent-55, score-0.398]
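As a hedged illustration of Equation (2), the sketch below scores two snippet vectors through a word-word similarity S(.,.) instead of exact feature matches; the names and the plain 1/p² normalization are assumptions of this sketch.

```python
def infosimba(words_i, words_j, x_i, x_j, S):
    """Third-order similarity between two Web snippets (Equation 2).
    words_i/words_j are the p-word vectors W_i, W_j of snippets X_i and
    X_j; x_i/x_j are the corresponding word weights X_ik, X_jl; S is
    any symmetric word-word similarity, e.g. an association measure
    estimated from the snippet collection."""
    p = len(words_i)
    total = 0.0
    for k in range(p):
        for l in range(p):
            total += x_i[k] * x_j[l] * S(words_i[k], words_j[l])
    return total / (p * p)
```

A classical second-order cosine would only let identical words contribute; here, related words contribute in proportion to S.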
27 A direct consequence of the change in similarity measure is the definition of a new objective function QS3s to ensure convergence. [sent-63, score-0.163]
28 This function is defined in Equation (3) and must be maximized. [sent-64, score-0.037]
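Under the same assumptions, the adapted objective of Equation (3) is the cumulative third-order similarity between each snippet and its cluster centroid; since S3s is a similarity rather than a distance, it is maximized.

```python
def q_s3s(snippets, labels, centroids, s3):
    """Adapted objective Q_S3s (Equation 3): the sum of the third-order
    similarities s3(x_i, m_pi_k) between each snippet and the centroid
    of its assigned cluster; to be maximized. `snippets` and `centroids`
    hold the word-vector representations used by s3 (e.g. infosimba)."""
    return sum(s3(x, centroids[k]) for x, k in zip(snippets, labels))
```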
29 A cluster centroid mπk is defined by a vector of p words (w1πk, ..., wpπk). [sent-66, score-0.267]
30 As a consequence, each cluster centroid must be instantiated in such a way that QS3s increases at each step of the clustering process. [sent-70, score-0.426]
31 The choice of the best p words representing each cluster is a way of ensuring convergence. [sent-71, score-0.148]
32 So, for each word w ∈ V and any symmetric similarity measure S(.,.), [sent-74, score-0.111]
33 its interestingness λk(w) is computed with regard to cluster πk. [sent-76, score-0.204]
34 This operation is defined in Equation (4) where si ∈ πk is any Web snippet from cluster πk. [sent-77, score-0.242]
35 Finally, the p words with the highest λk(w) are selected to construct the cluster centroid. [sent-78, score-0.148]
36 Note that a word which is not part of cluster πk may be part of the centroid mπk . [sent-80, score-0.23]
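A sketch of this centroid instantiation; the unweighted sum used for λk(w) is an assumption, as Equation (4) may weight the words of each snippet si by their relevance scores.

```python
def centroid_words(cluster_snippets, vocabulary, S, p):
    """Instantiate a cluster centroid as the p words of the vocabulary
    V with the highest interestingness lambda_k(w) (Equation 4), taken
    here as the cumulative similarity of w to the words of every
    snippet s_i in cluster pi_k. Note that the selected words need not
    occur in the cluster itself."""
    def interestingness(w):
        return sum(S(w, word)
                   for snippet in cluster_snippets
                   for word in snippet)
    return sorted(vocabulary, key=interestingness, reverse=True)[:p]
```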
37 Finally, we propose to rely on a modified version of the K-means algorithm called Global K-means (Likas et al. [sent-82, score-0.068]
38 To solve a clustering problem with M clusters, all intermediate problems with 1, 2, ..., M clusters are sequentially solved. [sent-84, score-0.196]
39 The underlying idea is that an optimal solution for a clustering problem with M clusters can be obtained using a series of local searches with the K-means algorithm. [sent-88, score-0.301]
40 At each local search, the M−1 cluster centers are always initially placed at their optimal positions corresponding to the clustering problem with M−1 clusters. [sent-89, score-0.476]
41 The remaining Mth cluster center is initially placed at several positions within the data space. [sent-90, score-0.045]
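The incremental Global K-means strategy can be sketched as follows; `run_kmeans` and `candidate_positions` are assumed helpers (an adapted K-means run returning centers and the objective score, and a heuristic proposing seed positions for the new Mth center).

```python
def global_kmeans(X, M, run_kmeans, candidate_positions):
    """Global K-means sketch: solve all intermediate problems with
    1, 2, ..., M clusters. At each step the first M-1 centers start at
    their optimal positions from the previous problem, while the new
    center is tried at several candidate positions; the best-scoring
    run is kept. X is an (n, d) NumPy array of data points."""
    centers = [X.mean(axis=0)]          # optimal solution for K = 1
    for _ in range(2, M + 1):
        best_centers, best_score = None, float("-inf")
        for pos in candidate_positions(X, centers):
            cand, score = run_kmeans(X, centers + [pos])
            if score > best_score:
                best_centers, best_score = cand, score
        centers = list(best_centers)
    return centers
```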
42 3 Stopping Criterion Once clustering has been processed, selecting the best number of clusters still remains to be decided. [sent-93, score-0.301]
43 So, we propose a procedure based on the definition of a rational function which models the quality criterion QS3s. [sent-96, score-0.112]
44 To better understand the behaviour of QS3s at each step of the adapted GK-means algorithm, we present its values for K = 10 in Figure 1. [sent-97, score-0.039]
45 The underlying idea is that the best number of clusters is given by the β value which maximizes the difference with the average βmean. [sent-102, score-0.035]
46 ∀K, f(K) = α − γ/K^β. [sent-104, score-0.105]
47 As α can theoretically or operationally be defined, and it can easily be proved that γ = α − Q1S3s (the objective value for K = 1), β needs to be defined based on γ or α. [sent-105, score-0.069]
48 Solving f(K) = QS3s(K) for each K then gives β = log(γ/(α − QS3s(K)))/log(K) (Equation 6). Now, the value of α which best approximates the limit of the rational function must be defined. [sent-108, score-0.094]
49 Best results were obtained with the maximum experimental value, which is defined by building the cluster centroid mπk for each Web snippet individually. [sent-111, score-0.324]
50 Finally, the best number of clusters is defined as in Algorithm (1), and each cluster receives its label based on the p words with the greatest interestingness in its centroid mπk. [sent-112, score-0.28]
51 Return K as the best number of partitions. This situation is illustrated in Figure (1), where the red line corresponds to the rational function for βmean and the blue line models the best β value (i.e. β6). [sent-120, score-0.103]
52 In this case, the best number would correspond to β6 and as a consequence, the best number of clusters would be 6. [sent-123, score-0.105]
53 In order to illustrate the soundness of the procedure, we present the different values for β at each K iteration and the differences between consecutive values of β at each iteration in Figure 2. [sent-124, score-0.04]
54 We clearly see that the highest inclination of the curve is between clusters 5 and 6, which also corresponds to the highest difference between two consecutive values of β. [sent-125, score-0.188]
55 Figure 2: Values of β (on the left) and differences between consecutive values of β (on the right). [sent-126, score-0.04]
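A sketch of one reading of this stopping criterion, under the stated assumptions (QS3s(K) < α for all K, and "difference with the average βmean" taken as absolute deviation); the details of the paper's Algorithm (1) may differ.

```python
import math

def best_number_of_clusters(Q, alpha):
    """Stopping criterion sketch (Equations 5-6). Q maps each K >= 1 to
    the objective value Q_S3s reached with K clusters; alpha is the
    approximated limit of f(K) = alpha - gamma / K**beta, and gamma is
    alpha - Q[1]. A beta_K is solved from f(K) = Q[K] for every K > 1,
    and the K whose beta_K deviates most from the average beta_mean is
    returned. Assumes Q[K] < alpha for all K."""
    gamma = alpha - Q[1]
    betas = {k: math.log(gamma / (alpha - q)) / math.log(k)
             for k, q in Q.items() if k > 1}
    beta_mean = sum(betas.values()) / len(betas)
    return max(betas, key=lambda k: abs(betas[k] - beta_mean))
```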
56 Indeed, a successful PRC system must evidence a high level of clustering quality. [sent-129, score-0.047]
57 Ideally, each query subtopic should be represented by a unique cluster containing all the relevant Web pages. [sent-130, score-0.213]
58 As such, this constraint is reformulated as follows: the task of PRC systems is to provide complete topical cluster coverage of a given query, while avoiding excessive redundancy. [Table 1: Fb3 results for SCP and PMI configurations.] [sent-132, score-0.209]
59 So, in order to evaluate our methodology, we propose two different evaluations. [sent-146, score-0.034]
60 First, we want to evidence the quality of the stopping criterion when compared to an exhaustive search over all tunable parameters. [sent-147, score-0.293]
61 Second, we propose a comparative evaluation with existing state-of-the-art algorithms over gold standard datasets and recent clustering evaluation metrics. [sent-148, score-0.29]
62 4.1 Text Processing Before the clustering process takes place, Web snippets are represented as word feature vectors. [sent-150, score-0.341]
63 In particular, it assigns a relevance score to any token present in the set of retrieved Web snippets based on the analysis of left and right token contexts. [sent-153, score-0.179]
64 Then, each Web snippet is represented by the set of its p most relevant tokens in the sense of the W(.) score. [sent-155, score-0.09]
65 Note that within the proposed Web service, multiword units are also identified. [sent-158, score-0.035]
66 They are exclusively composed of relevant individual tokens and their weight is given by the arithmetic mean of their constituents' scores. [sent-159, score-0.067]
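A small sketch of these two representation steps; W is assumed to be the token-relevance mapping returned by the Web service, and the names are illustrative.

```python
def snippet_vector(tokens, W, p):
    """Represent a Web snippet by its p most relevant tokens under the
    relevance score W(.)."""
    return sorted(set(tokens), key=lambda t: W.get(t, 0.0),
                  reverse=True)[:p]

def multiword_weight(unit, W):
    """Weight of a multiword unit: the arithmetic mean of the relevance
    scores of its constituent tokens, as stated above."""
    return sum(W[t] for t in unit) / len(unit)
```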
67 4.2 Intrinsic Evaluation The first set of experiments focuses on understanding the behaviour of our methodology within a greedy search strategy for different tunable parameters defined as a tuple < p, K, S(Wik, Wjl) >. [sent-162, score-0.19]
68 In particular, p is the size of the word feature vectors representing both Web snippets and centroids (p = 2..5), [sent-163, score-0.188]
69 K is the number of clusters to be found (K = 2..10), [sent-165, score-0.105]
70 and S(Wik, Wjl) is the collocation measure integrated in the InfoSimba similarity measure. [sent-167, score-0.14]
71 In these experiments, two association measures which are known to have different behaviours (Pecina and Schlesinger, 2006) are tested. [sent-168, score-0.06]
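The two measures are not named at this point in the extraction, but the results below refer to SCP and PMI; assuming those are the measures meant, a sketch of their standard definitions over word-pair probabilities follows.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of word pair (x, y)."""
    return math.log(p_xy / (p_x * p_y))

def scp(p_xy, p_x, p_y):
    """Symmetric conditional probability of word pair (x, y),
    equal to P(x|y) * P(y|x)."""
    return (p_xy ** 2) / (p_x * p_y)
```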
72 Then, the best < p, K, S(Wik, Wjl) > configurations are compared to our stopping criterion. [sent-171, score-0.101]
73 In order to perform this task, we evaluate performance based on the Fb3 measure defined in (Amigó et al., 2009). [sent-174, score-0.077]
74 The results in (Amigó et al., 2009) indicate that common metrics such as the Fβ-measure are good at assigning higher scores to clusters with high homogeneity, but fail to evaluate cluster completeness. [sent-177, score-0.039]
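An item-level sketch of Fb3 following the usual b-cubed formulation in (Amigó et al., 2009); `system` and `gold` are assumed mappings from each Web snippet to its cluster and gold subtopic, and the paper's exact setup (e.g. overlapping subtopics in MORESQUE) may differ.

```python
def bcubed_f(system, gold):
    """B-cubed F-measure (Fb3): per-item precision is the fraction of
    items sharing the item's system cluster that also share its gold
    class; recall is the symmetric quantity over the gold class. Both
    are averaged over all items and combined harmonically."""
    items = list(system)
    prec = rec = 0.0
    for a in items:
        same_sys = [b for b in items if system[b] == system[a]]
        same_gold = [b for b in items if gold[b] == gold[a]]
        both = sum(1 for b in same_sys if gold[b] == gold[a])
        prec += both / len(same_sys)
        rec += both / len(same_gold)
    prec /= len(items)
    rec /= len(items)
    return 2 * prec * rec / (prec + rec)
```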
75 First results are provided in Table 1 and evidence that the best configurations for different < p, K, S(Wik, Wjl) > tuples are obtained for high values of p, K ranging from 4 to 6 clusters, and PMI steadily improving over SCP. [sent-178, score-0.152]
76 As such, we proposed a new stopping criterion which evidences coherent results as it (1) does not depend on the association measure used (Fb3(SCP) = 0.450), [sent-180, score-0.141]
77 (2) discovers similar numbers of clusters independently of the length of the p-context vector, and (3) increases performance with high values of p. [sent-182, score-0.105]
78 4.3 Comparative Evaluation The second evaluation aims to compare our methodology to current state-of-the-art text-based PRC algorithms. [sent-184, score-0.061]
79 STC: (Zamir and Etzioni, 1998) defined the Suffix Tree Clustering algorithm which is still a difficult standard to beat in the field. [sent-187, score-0.071]
80 In particular, they propose a monothetic clustering technique which merges base clusters with high string overlap. [sent-188, score-0.391]
81 Indeed, instead of using the classical Vector Space Model (VSM) representation, they propose to represent Web snippets as compact tries. [sent-189, score-0.275]
82 LINGO: (Osinski and Weiss, 2005) proposed a polythetic solution called LINGO which takes into account the string representation proposed by (Zamir and Etzioni, 1998). [sent-190, score-0.141]
83 OPTIMSRC: (Carpineto and Romano, 2010) showed that the characteristics of the outputs returned by PRC algorithms suggest the adoption of a meta clustering approach. [sent-194, score-0.261]
84 As such, they introduce a novel criterion to measure the concordance of two partitions of objects into different clusters, based on the information content associated with the series of decisions made by the partitions on single pairs of objects. [sent-195, score-0.329]
85 Then, the meta clustering phase is cast as an optimization problem over the concordance between the clustering combination and the given set of clusterings. [sent-196, score-0.5]
86 With respect to implementation, we used the Carrot2 APIs, which are freely available for STC, LINGO and the classical BIK. [sent-197, score-0.096]
87 They evidence clear improvements of our methodology when compared to state-of-the-art text-based PRC algorithms, over both datasets and all evaluation metrics. [sent-201, score-0.137]
88 More importantly, even when the p-context vector is small (p = 3), the adapted GK-means outperforms all other existing text-based PRC algorithms, which is particularly important as they need to perform in real time. [sent-202, score-0.039]
89 5 Conclusions In this paper, we proposed a new PRC approach which (1) is based on the adaptation of the K-means algorithm to third-order similarity measures and (2) introduces a coherent stopping criterion. [sent-203, score-0.266]
90 Results evidenced clear improvements over the evaluated state-of-the-art text-based approaches for two gold standard datasets. [sent-204, score-0.059]
91 These results are promising and, in future work, we propose to define new knowledge-based third-order similarity measures based on studies in entity linking (Ferragina and Scaiella, 2010). [sent-209, score-0.165]
92 Notice that the authors only report the F1-measure, although different results can be obtained for different Fβ-measures and Fb3, as evidenced in Table 2. [sent-213, score-0.153]
93 A comparison of extrinsic clustering evaluation metrics based on formal constraints. [sent-226, score-0.196]
94 Clustering and diversifying web search results with graph-based word sense induction. [sent-252, score-0.148]
95 A personalized search engine based on web-snippet hierarchical clustering. [sent-267, score-0.053]
96 Tagme: On-thefly annotation of short text fragments (by wikipedia entities). [sent-273, score-0.035]
97 Acceleration of the EM and ECM algorithms using the Aitken δ2 method for log-linear models with partially classified data. [sent-280, score-0.056]
98 An examination of procedures for determining the number of clusters in a data set. [sent-318, score-0.105]
99 Inducing word senses to improve web search result clustering. [sent-324, score-0.148]
100 Using the LocalMaxs algorithm for the extraction of contiguous and non-contiguous multiword lexical units. [sent-354, score-0.069]
wordName wordTfidf (topN-words)
[('prc', 0.394), ('carpineto', 0.366), ('romano', 0.207), ('osinski', 0.197), ('clustering', 0.196), ('wjl', 0.169), ('zamir', 0.169), ('dias', 0.151), ('cluster', 0.148), ('snippets', 0.145), ('ferragina', 0.141), ('polythetic', 0.141), ('scaiella', 0.141), ('wik', 0.13), ('lingo', 0.125), ('optimsrc', 0.113), ('clusters', 0.105), ('stopping', 0.101), ('caen', 0.1), ('stc', 0.1), ('classical', 0.096), ('web', 0.095), ('weiss', 0.089), ('amig', 0.087), ('centers', 0.087), ('machado', 0.085), ('moresque', 0.085), ('centroid', 0.082), ('similarity', 0.071), ('etzioni', 0.066), ('meta', 0.065), ('equation', 0.062), ('topical', 0.061), ('methodology', 0.061), ('navigli', 0.06), ('measures', 0.06), ('evidenced', 0.059), ('rational', 0.059), ('snippet', 0.057), ('aitken', 0.056), ('bisecting', 0.056), ('homogeneity', 0.056), ('infosimba', 0.056), ('interestingness', 0.056), ('iwk', 0.056), ('kmeans', 0.056), ('likasa', 0.056), ('milligan', 0.056), ('monothetic', 0.056), ('normandie', 0.056), ('orl', 0.056), ('xjl', 0.056), ('search', 0.053), ('criterion', 0.053), ('consequence', 0.052), ('unicaen', 0.05), ('crisafulli', 0.05), ('xk', 0.048), ('evidence', 0.047), ('greyc', 0.046), ('initially', 0.045), ('partitions', 0.044), ('centroids', 0.043), ('silva', 0.043), ('moreno', 0.043), ('concordance', 0.043), ('france', 0.041), ('cnrs', 0.041), ('pecina', 0.041), ('measure', 0.04), ('consecutive', 0.04), ('tunable', 0.039), ('adapted', 0.039), ('vsm', 0.038), ('adapts', 0.038), ('kuroda', 0.038), ('ans', 0.038), ('defined', 0.037), ('sigir', 0.036), ('collections', 0.035), ('maximizes', 0.035), ('approximates', 0.035), ('multiword', 0.035), ('fragments', 0.035), ('algorithm', 0.034), ('retrieved', 0.034), ('fr', 0.034), ('mean', 0.034), ('propose', 0.034), ('service', 0.033), ('uni', 0.033), ('relevant', 0.033), ('query', 0.032), ('proved', 0.032), ('ik', 0.032), ('church', 0.032), ('comparative', 0.031), ('proposal', 0.03), ('collocation', 0.029), ('datasets', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures
Author: Jose G. Moreno ; Gael Dias ; Guillaume Cleuziou
Abstract: Post-retrieval clustering is the task of clustering Web search results. Within this context, we propose a new methodology that adapts the classical K-means algorithm to a third-order similarity measure initially developed for NLP tasks. Results obtained with the definition of a new stopping criterion over the ODP-239 and the MORESQUE gold standard datasets evidence that our proposal outperforms all reported text-based approaches.
2 0.11748174 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa
Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
3 0.10961289 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
Author: Manaal Faruqui ; Chris Dyer
Abstract: We present an information theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence as well as cross-lingual evidence from parallel corpora to learn high quality word clusters jointly in any number of languages. The monolingual component of our objective is the average mutual information of clusters of adjacent words in each language, while the bilingual component is the average mutual information of the aligned clusters. To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters.
4 0.10574865 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
Author: Kashyap Popat ; Balamurali A.R ; Pushpak Bhattacharyya ; Gholamreza Haffari
Abstract: Expensive feature engineering based on WordNet senses has been shown to be useful for document-level sentiment classification. A plausible reason for such a performance improvement is the reduction in data sparsity. However, such a reduction could be achieved with less effort through syntagma-based word clustering. In this paper, the problem of data sparsity in sentiment analysis, both monolingual and cross-lingual, is addressed through clustering. Experiments show that cluster-based data sparsity reduction leads to performance better than sense-based classification for sentiment analysis at the document level. A similar idea is applied to Cross Lingual Sentiment Analysis (CLSA), and it is shown that the reduction in data sparsity (after translation or bilingual mapping) produces accuracy higher than Machine Translation-based CLSA and sense-based CLSA.
5 0.084164083 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm
Author: Yukari Ogura ; Ichiro Kobayashi
Abstract: In this paper, we propose a method to raise the accuracy of text classification based on latent topics, reconsidering the techniques necessary for good classification. For example, to decide the important sentences in a document, the sentences with important words are usually regarded as important sentences; in this case, tf.idf is often used to decide important words. We, on the other hand, apply the PageRank algorithm to rank important words in each document. Furthermore, before clustering documents, we refine the target documents by representing them as collections of important sentences. We then classify the documents based on latent information in the documents. As a clustering method, we employ the k-means algorithm and investigate how our proposed method works for good clustering. We conduct experiments with the Reuters-21578 corpus under various conditions of important sentence extraction, using latent and surface information for clustering, and have confirmed that our proposed method provides the better result among the various conditions for clustering.
6 0.081333451 192 acl-2013-Improved Lexical Acquisition through DPP-based Verb Clustering
7 0.071906485 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
8 0.071880169 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context
9 0.067804262 29 acl-2013-A Visual Analytics System for Cluster Exploration
10 0.062613383 97 acl-2013-Cross-lingual Projections between Languages from Different Families
11 0.060792666 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
12 0.059862621 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
13 0.058720399 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity
14 0.05850374 242 acl-2013-Mining Equivalent Relations from Linked Data
15 0.055562221 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?
16 0.052636247 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit
17 0.051087841 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation
18 0.051047973 89 acl-2013-Computerized Analysis of a Verbal Fluency Test
19 0.050830159 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors
20 0.049868483 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD
topicId topicWeight
[(0, 0.144), (1, 0.04), (2, 0.019), (3, -0.064), (4, 0.03), (5, -0.077), (6, -0.013), (7, 0.015), (8, -0.059), (9, -0.051), (10, 0.007), (11, -0.019), (12, -0.001), (13, 0.013), (14, -0.004), (15, 0.017), (16, -0.002), (17, 0.021), (18, 0.003), (19, 0.025), (20, 0.024), (21, 0.025), (22, 0.02), (23, -0.021), (24, 0.02), (25, 0.01), (26, 0.031), (27, 0.061), (28, -0.004), (29, -0.018), (30, 0.013), (31, 0.06), (32, -0.074), (33, -0.148), (34, 0.013), (35, 0.013), (36, -0.031), (37, 0.016), (38, -0.144), (39, 0.072), (40, 0.033), (41, 0.071), (42, 0.006), (43, 0.018), (44, 0.069), (45, -0.076), (46, 0.094), (47, 0.045), (48, -0.018), (49, -0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.91702205 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures
Author: Jose G. Moreno ; Gael Dias ; Guillaume Cleuziou
Abstract: Post-retrieval clustering is the task of clustering Web search results. Within this context, we propose a new methodology that adapts the classical K-means algorithm to a third-order similarity measure initially developed for NLP tasks. Results obtained with the definition of a new stopping criterion over the ODP-239 and the MORESQUE gold standard datasets evidence that our proposal outperforms all reported text-based approaches.
2 0.72652251 89 acl-2013-Computerized Analysis of a Verbal Fluency Test
Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks
Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.
3 0.66655821 29 acl-2013-A Visual Analytics System for Cluster Exploration
Author: Andreas Lamprecht ; Annette Hautli ; Christian Rohrdantz ; Tina Bogel
Abstract: This paper offers a new way of representing the results of automatic clustering algorithms by employing a Visual Analytics system which maps members of a cluster and their distance to each other onto a two-dimensional space. A case study on Urdu complex predicates shows that the system allows for an appropriate investigation of linguistically motivated data. 1 Motivation In recent years, Visual Analytics systems have increasingly been used for the investigation of linguistic phenomena in a number of different areas, starting from literary analysis (Keim and Oelke, 2007) to the cross-linguistic comparison of language features (Mayer et al., 2010a; Mayer et al., 2010b; Rohrdantz et al., 2012a) and lexical semantic change (Rohrdantz et al., 2011; Heylen et al., 2012; Rohrdantz et al., 2012b). Visualization has also found its way into the field of computational linguistics by providing insights into methods such as machine translation (Collins et al., 2007; Albrecht et al., 2009) or discourse parsing (Zhao et al., 2012). One issue in computational linguistics is the interpretability of results coming from machine learning algorithms and the lack of insight they offer on the underlying data. This drawback often prevents theoretical linguists, who work with computational models and need to see patterns on large data sets, from drawing detailed conclusions. The present paper shows that a Visual Analytics system facilitates "analytical reasoning [...] by an interactive visual interface" (Thomas and Cook, 2006) and helps resolve this issue by offering a customizable, in-depth view on the statistically generated result and simultaneously an at-a-glance overview of the overall data set. In particular, we focus on the visual representation of automatically generated clusters, in itself not a novel idea as it has been applied in other fields like the financial sector, biology or geography (Schreck et al., 2009). But as far as the literature is concerned, interactive systems are still less common, particularly in computational linguistics, and they have not been designed for the specific needs of theoretical linguists. This paper offers a method of visually encoding clusters and their internal coherence with an interactive user interface, which allows users to adjust underlying parameters and their views on the data depending on the particular research question. By this, we partly open up the "black box" of machine learning. The linguistic phenomenon under investigation, for which the system has originally been designed, is the varied behavior of nouns in N+V CP complex predicates in Urdu (e.g., memory+do = 'to remember') (Mohanan, 1994; Ahmed and Butt, 2011), where, depending on the lexical semantics of the noun, a set of different light verbs is chosen to form a complex predicate. The aim is an automatic detection of the different groups of nouns, based on their light verb distribution. Butt et al. (2012) present a static visualization for the phenomenon, whereas the present paper proposes an interactive system which alleviates some of the previous issues with respect to noise detection, filtering, data interaction and cluster coherence. For this, we proceed as follows: section 2 explains the proposed Visual Analytics system, followed by the linguistic case study in section 3. Section 4 concludes the paper.
2 The system The system requires a plain text file as input, where each line corresponds to one data object. In our case, each line corresponds to one Urdu noun (data object) and contains its unique ID (the name of the noun) and its bigram frequencies with the four light verbs under investigation, namely kar 'do', ho 'be', hu 'become' and rakH 'put'; an exemplary input file is shown in Figure 1. From a data analysis perspective, we have four-dimensional data objects, where each dimension corresponds to a bigram frequency previously extracted from a corpus. Note that more than four dimensions can be loaded and analyzed, but for the sake of simplicity we focus on the four-dimensional Urdu example for the remainder of this paper. Moreover, it is possible to load files containing absolute bigram frequencies and relative frequencies. When loading absolute frequencies, the program will automatically calculate the relative frequencies as they are the input for the clustering. The absolute frequencies, however, are still available and can be used for further processing (e.g. filtering). Figure 1: preview of appropriate file structures 2.1 Initial opening and processing of a file It is necessary to define a metric distance function between data objects for both clustering and visualization. Thus, each data object is represented through a high-dimensional (in our example four-dimensional) numerical vector and we use the Euclidean distance to calculate the distances between pairs of data objects. The smaller the distance between two data objects, the more similar they are. For visualization, the high-dimensional data is projected onto the two-dimensional space of a computer screen using a principal component analysis (PCA) algorithm (http://workshop.mkobos.com/2011/java-pca-transformation-library/). In the 2D projection, the distances between data objects in the high-dimensional space, i.e. the dissimilarities of the bigram distributions, are preserved as accurately as possible. However, when projecting a high-dimensional data space onto a lower dimension, some distinctions necessarily level out: two data objects may be far apart in the high-dimensional space, but end up closely together in the 2D projection. It is important to bear in mind that the 2D visualization is often quite insightful, but interpretations have to be verified by interactively investigating the data. The initial clusters are calculated (in the high-dimensional data space) using a default k-Means algorithm from the JML library (http://java-ml.sourceforge.net/api/0.1.7/), with k being a user-defined parameter. There is also the option of selecting another clustering algorithm, called the Greedy Variance Minimization (GVM, http://www.tomgibara.com/clustering/fast-spatial/), and an extension to include further algorithms is under development. 2.2 Configuration & Interaction 2.2.1 The main window The main window in Figure 2 consists of three areas, namely the configuration area (a), the visualization area (b) and the description area (c). The visualization area is mainly built with the piccolo2d library (http://www.piccolo2d.org/) and initially shows data objects as colored circles with a variable diameter, where color indicates cluster membership (four clusters in this example). Hovering over a dot displays information on the particular noun, the cluster membership and the light verb distribution in the description area to the right.
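For concreteness, a minimal sketch of the distance computation and 2D projection described here, assuming scikit-learn's PCA in place of the Java library the paper cites; the frequency values are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Each row: relative bigram frequencies of one Urdu noun with the four
# light verbs (kar, ho, hu, rakH); values are made up for illustration.
X = np.array([[0.8, 0.1, 0.1, 0.0],
              [0.1, 0.7, 0.2, 0.0],
              [0.0, 0.1, 0.1, 0.8]])

# Euclidean distance between two data objects (nouns): the smaller the
# distance, the more similar their light-verb distributions.
dist = np.linalg.norm(X[0] - X[1])

# Project the four-dimensional vectors onto 2D for display, preserving
# the high-dimensional distances as accurately as possible.
coords_2d = PCA(n_components=2).fit_transform(X)
```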
By using the mouse wheel, the user can zoom in and out of the visualization. A very important feature for the task at hand is the possibility to select multiple data objects for further processing or for filtering, with a list of selected data objects shown in the description area. By right-clicking on these data objects, the user can assign a unique class (and class color) to them. Different clustering methods can be employed using the options item in the menu bar. Another feature of the system is that the user can fade in the cluster centroids (illustrated by a larger dot in the respective cluster color in Figure 2), where the overall feature distribution of the cluster can be examined in a tooltip hovering over the corresponding centroid. 2.2.2 Visually representing data objects To gain further insight into the data distribution based on the 2D projection, the user can choose between several ways to visualize the individual data objects, all of which are shown in Figure 3. The standard visualization type is shown on the left and consists of a circle which encodes cluster membership via color. Figure 2: Overview of the main window of the system, including the configuration area (a), the visualization area (b) and the description area (c). Large circles are cluster centroids. Figure 3: Different visualizations of data points Alternatively, normal glyphs and star glyphs can be displayed. The middle part of Figure 3 shows the data displayed with normal glyphs. In this representation, each feature value is drawn as a line whose length encodes the value, with the lines arranged clockwise around the center according to their occurrence in the input file. This view has the advantage that overall feature dominance in a cluster can be seen at a glance. The visualization type on the right in Figure 3 is the star glyph extension. Here, the normal glyph line endings are connected, forming a "star". As in the representation with the glyphs, this makes similar data objects easily recognizable and comparable with each other. 2.2.3 Filtering options Our system offers options for filtering data according to different criteria. Filter by means of bigram occurrence By activating the bigram occurrence filtering, it is possible to show only those nouns which occur in bigrams with a certain selected subset of all features (light verbs). This is especially useful when examining possible commonalities. Filter selected words Another opportunity of showing only items of interest is to select and display them separately. The PCA is recalculated for these data objects and the visualization is stretched to the whole area. Filter selected cluster Additionally, the user can visualize a specific cluster of interest. Again, the PCA is recalculated and the visualization stretched to the whole area. The cluster can then be manually fine-tuned and cleaned, for instance by removing wrongly assigned items. 2.2.4 Options to handle overplotting Due to the nature of the data, much overplotting occurs. For example, there are many words which only occur with one light verb. The PCA assigns the same position to these words and, as a consequence, only the top bigram can be viewed in the visualization.
In order to improve visual access to overplotted data objects, several methods that allow for a more differentiated view of the data have been included and are described in the following paragraphs. Change transparency of data objects By modifying the transparency with the given slider, areas with a dense data population can be readily identified, as shown in the following example: Repositioning of data objects To reduce the overplotting in densely populated areas, data objects can be repositioned randomly with a fixed deviation from their initial position. The degree of deviation can be interactively determined by the user employing the corresponding slider: The user has the option to reposition either all data objects or only those that are selected in advance. Frequency filtering If the initial data contains absolute bigram frequencies, the user can filter the visualized words by frequency. For example, many nouns occur only once and therefore have an observed probability of 100% for co-occurring with one of the light verbs. In most cases it is useful to filter such data out. Scaling data objects If the user zooms beyond the maximum zoom factor, the data objects are scaled down. This is especially useful if data objects are only partly covered by many other objects. In this case, they become fully visible, as shown in the following example: 2.3 Alternative views on the data In order to enable a holistic analysis it is often valuable to provide the user with different views on the data. Consequently, we have integrated the option to explore the data with further standard visualization methods. 2.3.1 Correlation matrix The correlation matrix in Figure 4 shows the correlations between features, which are visualized by circles using the following encoding: the size of a circle represents the correlation strength and the color indicates whether the corresponding features are negatively (white) or positively (black) correlated. Figure 4: example of a correlation matrix 2.3.2 Parallel coordinates The parallel coordinates diagram shows the distribution of the bigram frequencies over the different dimensions (Figure 5). Every noun is represented with a line, and shows, when hovered over, a tooltip with the most important information. To filter the visualized words, the user has the option of displaying previously selected data objects, or s/he can restrict the value range for a feature and show only the items which lie within this range. 2.3.3 Scatter plot matrix To further examine the relation between pairs of features, a scatter plot matrix can be used (Figure 6). The individual scatter plots give further insight into the correlation details of pairs of features. Figure 5: Parallel coordinates diagram Figure 6: Example showing a scatter plot matrix. 3 Case study In principle, the Visual Analytics system presented above can be used for any kind of cluster visualization, but the built-in options and add-ons are particularly designed for the type of work that linguists tend to be interested in: on the one hand, the user wants to get a quick overview of the overall patterns in the phenomenon, but at the same time, the system needs to allow for an in-depth data inspection. Both are given in the system: the overall cluster result shown in Figure 2 depicts the coherence of clusters and therefore the overall pattern of the data set. The different glyph visualizations in Figure 3 illustrate the properties of each cluster. Single data points can be inspected in the description area.
The randomization of overplotted data points helps to see concentrated cluster patterns where light verbs behave very similarly in different noun+verb complex predicates. The biggest advantage of the system lies in the ability for interaction: Figure 7 shows an example of the visualization used in Butt et al. (2012), the input being the same text file as shown in Figure 1. In this system, the relative frequency of each noun with each light verb is correlated with color saturation: the more saturated the color to the right of the noun, the higher the relative frequency of the light verb occurring with it. The number of the cluster (here, 3) and the respective nouns (e.g. kAm 'work') are shown to the left. The user does not get information on the coherence of the cluster, nor does the visualization show prototypical cluster patterns. Figure 7: Cluster visualization in Butt et al. (2012) Moreover, the system in Figure 7 only has a limited set of interaction choices, with the consequence that the user is not able to adjust the underlying data set, e.g. by filtering out noise. However, Butt et al. (2012) report that the Urdu data is indeed very noisy and requires a manual cleaning of the data set before the actual clustering. In the system presented here, the user simply marks conspicuous regions in the visualization panel and removes the respective data points from the original data set. Other filtering mechanisms are available as well: for instance, low-frequency items which occur due to data sparsity issues can be removed from the overall data set by adjusting the parameters. A linguistically relevant improvement lies in the display of cluster centroids, in other words the typical noun + light verb distribution of a cluster. This is particularly helpful when the linguist wants to pick out prototypical examples for the cluster in order to stipulate generalizations over the other cluster members. 4 Conclusion In this paper, we present a novel visual analytics system that helps to automatically analyze bigrams extracted from corpora. The main purpose is to enable a more informed and steered cluster analysis than currently possible with standard methods. This includes rich options for interaction, e.g. display configuration or data manipulation. Initially, the approach was motivated by a concrete research problem, but has much wider applicability as any kind of high-dimensional numerical data objects can be loaded and analyzed. However, the system still requires some basic understanding about the algorithms applied for clustering and projection in order to prevent the user from drawing wrong conclusions based on artifacts. Bearing this potential pitfall in mind when performing the analysis, the system enables a much more insightful and informed analysis than standard non-interactive methods. In the future, we aim to conduct user experiments in order to learn more about how the functionality and usability could be further enhanced. Acknowledgments This work was partially funded by the German Research Foundation (DFG) under grant BU 1806/7-1 "Visual Analysis of Language Change and Use Patterns" and the German Federal Ministry of Education and Research (BMBF) under research grant 01461246 "VisArgue". References Tafseer Ahmed and Miriam Butt. 2011. Discovering Semantic Classes for Urdu N-V Complex Predicates. In Proceedings of the International Conference on Computational Semantics (IWCS 2011), pages 305–309. Joshua Albrecht, Rebecca Hwa, and G. Elisabeta Marai. 2009.
The Chinese Room: Visualization and Interaction to Understand and Correct Ambiguous Machine Translation. Comput. Graph. Forum, 28(3):1047–1054. Miriam Butt, Tina Bögel, Annette Hautli, Sebastian Sulger, and Tafseer Ahmed. 2012. Identifying Urdu Complex Predication via Bigram Extraction. In Proceedings of COLING 2012, Technical Papers, pages 409–424, Mumbai, India. Christopher Collins, M. Sheelagh T. Carpendale, and Gerald Penn. 2007. Visualization of Uncertainty in Lattices to Support Decision-Making. In EuroVis 2007, pages 51–58. Eurographics Association. Kris Heylen, Dirk Speelman, and Dirk Geeraerts. 2012. Looking at word meaning. An interactive visualization of Semantic Vector Spaces for Dutch synsets. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 16–24. Daniel A. Keim and Daniela Oelke. 2007. Literature Fingerprinting: A New Method for Visual Literary Analysis. In IEEE VAST 2007, pages 115–122. IEEE. Thomas Mayer, Christian Rohrdantz, Miriam Butt, Frans Plank, and Daniel A. Keim. 2010a. Visualizing Vowel Harmony. Linguistic Issues in Language Technology, 4(2):1–33, December. Thomas Mayer, Christian Rohrdantz, Frans Plank, Peter Bak, Miriam Butt, and Daniel A. Keim. 2010b. Consonant Co-Occurrence in Stems across Languages: Automatic Analysis and Visualization of a Phonotactic Constraint. In Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground, pages 70–78, Uppsala, Sweden, July. Association for Computational Linguistics. Tara Mohanan. 1994. Argument Structure in Hindi. Stanford: CSLI Publications. Christian Rohrdantz, Annette Hautli, Thomas Mayer, Miriam Butt, Frans Plank, and Daniel A. Keim. 2011. Towards Tracking Semantic Change by Visual Analytics. In ACL 2011 (Short Papers), pages 305–310, Portland, Oregon, USA, June. Association for Computational Linguistics. Christian Rohrdantz, Michael Hund, Thomas Mayer, Bernhard Wälchli, and Daniel A. Keim. 2012a. The World's Languages Explorer: Visual Analysis of Language Features in Genealogical and Areal Contexts. Computer Graphics Forum, 31(3):935–944. Christian Rohrdantz, Andreas Niekler, Annette Hautli, Miriam Butt, and Daniel A. Keim. 2012b. Lexical Semantics and Distribution of Suffixes - A Visual Analysis. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 7–15, April. Tobias Schreck, Jürgen Bernard, Tatiana von Landesberger, and Jörn Kohlhammer. 2009. Visual cluster analysis of trajectory data with interactive Kohonen maps. Information Visualization, 8(1):14–29. James J. Thomas and Kristin A. Cook. 2006. A Visual Analytics Agenda. IEEE Computer Graphics and Applications, 26(1):10–13. Jian Zhao, Fanny Chevalier, Christopher Collins, and Ravin Balakrishnan. 2012. Facilitating Discourse Analysis with Interactive Visualization. IEEE Trans. Vis. Comput. Graph., 18(12):2639–2648.
4 0.64080453 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds
Author: Thomas Mayer ; Christian Rohrdantz
Abstract: This paper describes the online tool PhonMatrix, which analyzes a word list with respect to the co-occurrence of sounds in a specified context within a word. The cooccurrence counts from the user-specified context are statistically analyzed according to a number of association measures that can be selected by the user. The statistical values then serve as the input for a matrix visualization where rows and columns represent the relevant sounds under investigation and the matrix cells indicate whether the respective ordered pair of sounds occurs more or less frequently than expected. The usefulness of the tool is demonstrated with three case studies that deal with vowel harmony and similar place avoidance patterns.
5 0.58737141 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
Author: Manaal Faruqui ; Chris Dyer
Abstract: We present an information theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence as well as cross-lingual evidence from parallel corpora to learn high quality word clusters jointly in any number of languages. The monolingual component of our objective is the average mutual information of clusters of adjacent words in each language, while the bilingual component is the average mutual information of the aligned clusters. To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters.
6 0.5655278 192 acl-2013-Improved Lexical Acquisition through DPP-based Verb Clustering
7 0.56318337 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context
8 0.55028576 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
9 0.53683352 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
10 0.52815062 242 acl-2013-Mining Equivalent Relations from Linked Data
11 0.51798344 220 acl-2013-Learning Latent Personas of Film Characters
12 0.51275414 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
13 0.47620189 293 acl-2013-Random Walk Factoid Annotation for Collective Discourse
14 0.46750504 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
15 0.4556756 62 acl-2013-Automatic Term Ambiguity Detection
16 0.44548482 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts
17 0.44246832 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
18 0.43935019 128 acl-2013-Does Korean defeat phonotactic word segmentation?
19 0.43897125 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization
20 0.43314484 140 acl-2013-Evaluating Text Segmentation using Boundary Edit Distance
topicId topicWeight
[(0, 0.051), (6, 0.034), (11, 0.071), (24, 0.046), (26, 0.03), (35, 0.058), (42, 0.228), (48, 0.049), (70, 0.028), (80, 0.223), (88, 0.014), (90, 0.013), (95, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.84220207 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures
Author: Jose G. Moreno ; Gael Dias ; Guillaume Cleuziou
Abstract: Post-retrieval clustering is the task of clustering Web search results. Within this context, we propose a new methodology that adapts the classical K-means algorithm to a third-order similarity measure initially developed for NLP tasks. Results obtained with the definition of a new stopping criterion over the ODP-239 and the MORESQUE gold standard datasets evidence that our proposal outperforms all reported text-based approaches.
2 0.81072265 14 acl-2013-A Novel Classifier Based on Quantum Computation
Author: Ding Liu ; Xiaofang Yang ; Minghu Jiang
Abstract: In this article, we propose a novel classifier based on quantum computation theory. Different from existing methods, we consider the classification as an evolutionary process of a physical system and build the classifier by using the basic quantum mechanics equation. The performance of the experiments on two datasets indicates feasibility and potentiality of the quantum classifier.
3 0.7795037 227 acl-2013-Learning to lemmatise Polish noun phrases
Author: Adam Radziszewski
Abstract: We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem. The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformation of singular words (tokens) that make up each phrase. We perform evaluation, which shows results similar to those obtained earlier by a rule-based system, while our approach allows to separate chunking from lemmatisation.
4 0.76016372 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
Author: Abhijit Mishra ; Pushpak Bhattacharyya ; Michael Carl
Abstract: In this paper we introduce Translation Difficulty Index (TDI), a measure of difficulty in text translation. We first define and quantify translation difficulty in terms of TDI. We realize that any measure of TDI based on direct input by translators is fraught with subjectivity and adhocism. We, rather, rely on cognitive evidences from eye tracking. TDI is measured as the sum of fixation (gaze) and saccade (rapid eye movement) times of the eye. We then establish that TDI is correlated with three properties of the input sentence, viz. length (L), degree of polysemy (DP) and structural complexity (SC). We train a Support Vector Regression (SVR) system to predict TDIs for new sentences using these features as input. The prediction done by our framework is well correlated with the empirical gold standard data, which is a repository of < L, DP, SC > and TDI pairs for a set of sentences. The primary use of our work is a way of “binning” sentences (to be translated) in “easy”, “medium” and “hard” categories as per their predicted TDI. This can decide pricing of any translation task, especially useful in a scenario where parallel corpora for Machine Translation are built through translation crowdsourcing/outsourcing. This can also provide a way of monitoring progress of second language learners.
5 0.75695622 206 acl-2013-Joint Event Extraction via Structured Prediction with Global Features
Author: Qi Li ; Heng Ji ; Liang Huang
Abstract: Traditional approaches to the task of ACE event extraction usually rely on sequential pipelines with multiple stages, which suffer from error propagation since event triggers and arguments are predicted in isolation by independent local classifiers. By contrast, we propose a joint framework based on structured prediction which extracts triggers and arguments together so that the local predictions can be mutually improved. In addition, we propose to incorporate global features which explicitly capture the dependencies of multiple triggers and arguments. Experimental results show that our joint approach with local features outperforms the pipelined baseline, and adding global features further improves the performance significantly. Our approach advances state-of-the-art sentence-level event extraction, and even outperforms previous argument labeling methods which use external knowledge from other sentences and documents.
6 0.75657994 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations
7 0.75304776 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
8 0.75280243 125 acl-2013-Distortion Model Considering Rich Context for Statistical Machine Translation
9 0.74881244 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
10 0.74707228 372 acl-2013-Using CCG categories to improve Hindi dependency parsing
12 0.70271039 91 acl-2013-Connotation Lexicon: A Dash of Sentiment Beneath the Surface Meaning
13 0.68904936 56 acl-2013-Argument Inference from Relevant Event Mentions in Chinese Argument Extraction
14 0.67241043 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
15 0.67177355 166 acl-2013-Generalized Reordering Rules for Improved SMT
16 0.66757023 77 acl-2013-Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?
17 0.65819526 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
18 0.65712816 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning
19 0.65613657 69 acl-2013-Bilingual Lexical Cohesion Trigger Model for Document-Level Machine Translation
20 0.6559853 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation