acl acl2013 acl2013-369 knowledge-graph by maker-knowledge-mining

369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages


Source: pdf

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy of 99% on the unsupervised consonant/vowel prediction task across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and non-nasal consonants, our model yields unsupervised accuracy of 89% across the same set of languages.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. [sent-3, score-0.829]

2 Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. [sent-4, score-0.632]

3 We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. [sent-5, score-0.163]

4 We further show that our methodology can be used to predict more fine-grained phonetic distinctions. [sent-6, score-0.161]

5 1 Introduction Over the past centuries, dozens of lost languages have been deciphered through the painstaking work of scholars, often after decades of slow progress and dead ends. [sent-8, score-0.205]

6 However, several important writing systems and languages remain undeciphered to this day. [sent-9, score-0.32]

7 In this paper, we present a successful solution to one aspect of the decipherment puzzle: automatically identifying basic phonetic properties of letters in an unknown alphabetic writing system. [sent-10, score-0.64]

8 Our key idea is to use knowledge of the phonetic regularities encoded in known language vocabularies to automatically build a universal probabilistic model to successfully decode new languages. [sent-11, score-0.356]

9 We assume that each language has an unobserved set of parameters explaining its observed vocabulary. [sent-13, score-0.203]

10 We further assume that each language-specific set of parameters was itself drawn from an unobserved common prior, shared across a cluster of typologically related languages. [sent-14, score-0.509]

11 In turn, each cluster derives its parameters from a universal prior common to all language groups. [sent-15, score-0.322]

12 This approach allows us to mix together data from languages with various levels of observations and perform joint posterior inference over unobserved variables of interest. [sent-16, score-0.515]

13 Each word is modeled as an emitted sequence of characters, depending on a corresponding Markov sequence of phonetic tags. [sent-18, score-0.202]

14 Since individual letters are highly constrained in their range of phonetic values, we make the assumption of one-tag-per-observation-type (e.g. [sent-19, score-0.238]

15 a single letter is constrained to be always a consonant or always a vowel across all words in a language). [sent-21, score-0.505]
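To make this generative story concrete, here is a minimal sketch of such a type-based character HMM in Python. The tag set, letter inventory, and all parameter values below are hypothetical placeholders; the actual model draws these quantities from cluster-level priors rather than fixing them by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

TAGS = ["C", "V"]                      # consonant / vowel
ALPHABET = list("abcdefghijklmnop")    # hypothetical letter inventory

# One tag per character type: fix a type-to-tag mapping up front.
type_tag = {ch: rng.choice(TAGS) for ch in ALPHABET}

# Hypothetical per-language HMM parameters (the real model draws them
# from cluster-level Dirichlet priors).
start = [0.7, 0.3]                     # P(first tag = C), P(first tag = V)
trans = {"C": [0.3, 0.7], "V": [0.8, 0.2]}

def emit(tag):
    # Emit a character whose type-level tag matches the hidden tag.
    pool = [ch for ch, t in type_tag.items() if t == tag]
    return rng.choice(pool)

def generate_word(length):
    chars = []
    tag = rng.choice(TAGS, p=start)
    for _ in range(length):
        chars.append(emit(tag))
        tag = rng.choice(TAGS, p=trans[tag])
    return "".join(chars)

print([generate_word(5) for _ in range(3)])
```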

16 Going one layer up, we posit that the languagespecific HMM parameters are themselves drawn from informative, non-symmetric distributions representing a typologically coherent language grouping. [sent-22, score-0.296]

17 By applying the model to a mix of languages with observed and unobserved phonetic sequences, the cluster-level distributions can be inferred and help guide prediction for unknown languages and alphabets. [sent-23, score-0.963]

18 We apply this approach to two small decipherment tasks: 1. [sent-24, score-0.148]

19 predicting whether individual characters in an unknown alphabet and language represent vowels or consonants, and 2. [sent-25, score-0.464]

20 predicting whether individual characters in an unknown alphabet and language represent vowels, nasals, or non-nasal consonants. [sent-26, score-0.278]

21 We experiment with a data set consisting of vocabularies of 503 languages from around the world, written in a mix of Latin, Cyrillic, and Greek alphabets. [sent-30, score-0.333]

22 In turn, for each language, we consider it and its alphabet "unobserved": we hide the graphic and phonetic properties of the symbols, while treating the vocabularies of the remaining languages as fully observed, with phonetic tags on each of the letters. [sent-31, score-0.74]

23 As in our model, individual characters are treated as the observed emissions of the hidden states. [sent-37, score-0.147]

24 Their experiments show that the HMM trained with EM successfully clusters Spanish letters into consonants and vowels. [sent-39, score-0.497]

25 Experiments with the second model indicate that it can distinguish sonorous consonants (such as n, m, l, r) from non-sonorous consonants in Spanish. [sent-41, score-0.694]

26 In lieu of a linguistically designed model structure, we choose an empirical approach, allowing posterior inference over hundreds of known languages to guide the model’s decisions for the unknown script and language. [sent-46, score-0.57]

27 In this sense, our model bears some similarity to the decipherment model of Snyder et al. [sent-47, score-0.148]

28 While the aim of the present work is more modest (discovering very basic phonetic properties of letters), it is also more widely applicable, as we do not require detailed analysis of a known related language. [sent-49, score-0.221]

29 In a similar vein, Berg-Kirkpatrick and Klein (2010) develop hierarchically tied grammar priors over languages within the same family, and Bouchard-Côté et al. [sent-53, score-0.302]

30 In our own previous work, we have developed the idea that supervised knowledge of some number of languages can help guide the unsupervised induction of linguistic structure, even in the absence of parallel text (Kim et al. [sent-55, score-0.289]

31 In the latter work we also tackled the problem of unsupervised phonemic prediction for unknown languages by using textual regularities of known languages. [sent-57, score-0.452]

32 However, we assumed that the target language was written in a known (Latin) alphabet, greatly reducing the difficulty of the prediction task. [sent-58, score-0.135]

33 In our present case, we assume no knowledge of any relationship between the writing system of the target language and known languages, other than that they are all alphabetic in nature. [sent-59, score-0.241]

, 2011), and the power of type-based sampling has been demonstrated, even in the absence of explicit model constraints (Liang et al. [sent-64, score-0.176]

35 3 Model Our generative Bayesian model over the observed vocabularies of hundreds of languages is [...]. We note that similar ideas were simultaneously proposed by other researchers (Cohen et al. [sent-66, score-0.394]

36 For example, the cluster Poisson parameter over vowel observation types might be λ = 9 (indicating 9 vowel letters on average for the cluster), while the parameter over consonant observation types might be λ = 20 (indicating 20 consonant letters on average). [sent-68, score-1.264]
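A toy sketch of these cluster-level Poisson parameters: the Gamma hyperparameters and the number of languages below are invented, chosen only so that the prior means roughly match the λ values mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Vague conjugate Gamma priors over the cluster-level Poisson rates;
# shapes/scales are invented so that the prior means are about 10 and 20.
lam_vowel = rng.gamma(shape=2.0, scale=5.0)
lam_consonant = rng.gamma(shape=2.0, scale=10.0)

# For each language in the cluster, draw how many vowel and consonant
# letter types its alphabet contains.
n_vowel_types = rng.poisson(lam_vowel, size=5)
n_consonant_types = rng.poisson(lam_consonant, size=5)
print(lam_vowel, lam_consonant)
print(n_vowel_types, n_consonant_types)
```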

37 These priors will be distinct for each language cluster and serve to characterize its general linguistic and typological properties. [sent-69, score-0.301]

38 Thus, the Dirichlet parameters of a language cluster characterize both the average HMMs of individual languages within the cluster, as well as how much we expect the HMMs to vary from the mean. [sent-98, score-0.473]

39 In the case of emission distributions, we assume symmetric Dirichlet priors, i.e. [sent-99, score-0.236]

40 f(θ_k | β) ∝ ∏_i θ_{k,i}^{β−1}. This assumption is necessary, as we have no way to identify characters across languages in the decipherment scenario, and even the number of consonants and vowels (and thus the multinomial/Dirichlet dimensions) can vary across the languages of a cluster. [sent-104, score-1.299]

41 Thus, the mean of these Dirichlets will always be a uniform emission distribution. [sent-105, score-0.139]

42 The single Dirichlet emission parameter per cluster will specify whether this mean is on a peak (large β) or in a valley (small β). [sent-106, score-0.381]

43 In other words, it will control the expected sparsity of the resulting per-language emission multinomials. [sent-107, score-0.139]
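The following small experiment, assuming nothing beyond numpy's Dirichlet sampler and made-up dimensions, illustrates how the symmetric β controls this sparsity: a small β yields peaked (low-entropy) per-language emission multinomials, while a large β yields near-uniform ones.

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

n_letters = 20  # hypothetical number of consonant letters in a language
for beta in (0.1, 1.0, 10.0):
    draws = rng.dirichlet([beta] * n_letters, size=1000)
    mean_entropy = np.mean([entropy(p) for p in draws])
    print(f"beta={beta:5.1f}  mean entropy={mean_entropy:.2f}  "
          f"(uniform would be {np.log(n_letters):.2f})")
```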

44 In contrast, the transition Dirichlet parameters may be asymmetric, and thus very specific and informative. [sent-109, score-0.147]

45 For example, one cluster may have the property that CCC consonant clusters are exceedingly rare across all its languages. [sent-110, score-0.544]

46 3 Cluster Generation The generation of the cluster parameters (Algorithm 1) defines the highest layer of priors for our model. [sent-114, score-0.404]

47 For the cluster Poisson parameters, we use conjugate Gamma distributions with vague priors. [sent-116, score-0.262]

48 We run the procedure over data from 503 languages, assuming that all languages but one have observed character and tag sequences: w1, w2, . [sent-118, score-0.494]

49 Since each character type w is assumed to have a single tag category, this is equivalent to observing the character token sequence along with a character-type-to-tag mapping tw. [sent-124, score-0.384]
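A tiny illustration of this equivalence with invented data: under the one-tag-per-type constraint, the token-level tag sequence is fully determined by the character tokens plus the type-to-tag dictionary t_w.

```python
# Hypothetical observed words from a "known" language.
words = ["pat", "tapa", "atap"]

# Character-type-to-tag mapping t_w (given for known languages,
# latent for the target language).
type_tag = {"p": "C", "a": "V", "t": "C"}

# The token-level tag sequence follows from the tokens plus t_w.
for w in words:
    print(w, "".join(type_tag[ch] for ch in w))
```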

50 For the target language, we observe only the character token sequence w1, w2, . [sent-125, score-0.132]

51 We assume fixed and known parameter values only at the cluster generation level. [sent-128, score-0.264]

52 Unobserved variables include (i) the cluster parameters α, β, λ, (ii) the cluster assignments z, (iii) the perlanguage HMM parameters θ, for all languages, and (iv) for the target language, the tag tokens t1, t2, . [sent-129, score-0.656]

53 1 Monte Carlo Approximation Our goal in inference is to predict the most likely tag t_{w,ℓ} for each character type w in our target language ℓ according to the posterior: f(t_{w,ℓ} | w, t_{−ℓ}) = ∫ f(t_ℓ, z, α, β | w, t_{−ℓ}) dΘ (1) [sent-134, score-0.297]
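Schematically, this Monte Carlo approximation can be sketched as below; `gibbs_sweep` is a hypothetical callback standing in for the full sampler described in the following equations, and the per-type posterior is approximated by the empirical tag frequencies over the retained samples.

```python
from collections import Counter, defaultdict

def approximate_tag_posterior(gibbs_sweep, state, n_samples=100, burn_in=50):
    """Approximate f(t_{w,l} | w, t_{-l}) by averaging over Gibbs samples.

    `gibbs_sweep(state)` is a hypothetical callback that resamples all
    latent variables in place and returns the current character-type-to-tag
    mapping for the target language.
    """
    tag_counts = defaultdict(Counter)
    for it in range(burn_in + n_samples):
        target_type_tags = gibbs_sweep(state)
        if it >= burn_in:
            for char_type, tag in target_type_tags.items():
                tag_counts[char_type][tag] += 1
    # Most probable tag per character type under the approximate posterior.
    return {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}
```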

54 where Θ = (t−w,ℓ, z, α, β), w are the observed character sequences for all languages, t−ℓ are the character-to-tag mappings for the observed languages, z are the language-to-cluster assignments, and α and β are all the cluster-level transition and emission Dirichlet parameters. [sent-136, score-0.472]

55 Note that we leave out the language-level HMM parameters (θ, φ) as well as the cluster-level Poisson parameters λ from Equation 1 (and thus our sample space), as we can analytically integrate them out in our sampling equations. [sent-138, score-0.311]

56 The first term is the posterior predictive distribution for the Poisson-Gamma compound distribution and is easy to derive. [sent-153, score-0.259]
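As an illustration of this first term, here is a hedged sketch of the Poisson-Gamma posterior predictive (a negative binomial) under a Gamma(shape a, rate b) prior; the prior values and counts are invented, not the paper's.

```python
from math import lgamma, log

def log_poisson_gamma_predictive(n, a, b, counts_other):
    """log p(n | counts_other) for a Poisson rate with a Gamma(a, rate=b) prior.

    counts_other: type counts observed in the cluster's other languages.
    The predictive is negative binomial with a' = a + sum(counts),
    b' = b + len(counts).
    """
    a_post = a + sum(counts_other)
    b_post = b + len(counts_other)
    return (lgamma(a_post + n) - lgamma(a_post) - lgamma(n + 1)
            + a_post * log(b_post / (b_post + 1.0))
            + n * log(1.0 / (b_post + 1.0)))

# Predictive log-probability of a language having 9 vowel types, given a
# vague Gamma(1, 0.1) prior and the other cluster languages' vowel counts.
print(log_poisson_gamma_predictive(9, 1.0, 0.1, [8, 10, 9, 11]))
```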

57 The second term is the tag transition predictive distribution given Dirichlet hyperparameters, yielding a familiar Polya urn scheme form. [sent-154, score-0.337]

58 Removing terms that don’t depend on the tag assignment tℓ,w leaves a product of Polya urn terms over tag pairs (t, t′). [sent-155, score-0.163]

59 Here n(t) and n(t, t′) are, respectively, unigram and bigram tag counts excluding those containing character w. [sent-162, score-0.354]

60 Conversely, n′(t) and n′(t, t′) are, respectively, unigram and bigram tag counts only including those containing character w. [sent-163, score-0.354]
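An illustrative (not the paper's exact) implementation of this Polya urn predictive: the bigrams involving character w are added back one at a time, each contributing a smoothed-count ratio, with counts updated as we go. The count representation and variable names are assumptions.

```python
from collections import Counter
from math import log

def log_transition_predictive(bigrams_with_w, base_counts, alpha, tags):
    """log p(bigrams involving w | remaining bigrams), Dirichlet(alpha) rows.

    `base_counts`: bigram counts n(t, t') excluding character w.
    `bigrams_with_w`: tag bigrams that involve w under a candidate assignment.
    Polya urn: add them back one at a time, updating the counts as we go.
    """
    counts = Counter(base_counts)
    row_totals = Counter()
    for (t, _t2), c in counts.items():
        row_totals[t] += c
    logp = 0.0
    for t, t2 in bigrams_with_w:
        num = alpha[(t, t2)] + counts[(t, t2)]
        den = sum(alpha[(t, t3)] for t3 in tags) + row_totals[t]
        logp += log(num / den)
        counts[(t, t2)] += 1
        row_totals[t] += 1
    return logp

# Hypothetical usage with two tags and a symmetric prior.
tags = ["C", "V"]
alpha = {(a, b): 1.0 for a in tags for b in tags}
base = Counter({("C", "V"): 30, ("V", "C"): 28, ("C", "C"): 5, ("V", "V"): 3})
print(log_transition_predictive([("C", "V"), ("V", "C")], base, alpha, tags))
```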

61 Finally, we tackle the third term, Equation 7, corresponding to the predictive distribution of emission observations given Dirichlet hyperparameters. [sent-165, score-0.161]

62 Again, removing constant terms gives us a ratio of ascending factorials in βk, where n(w) is the unigram count of character w, and n(t′) is the unigram count of tag t′, over all character tokens (including w). [sent-166, score-0.46]
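Here a^{[n]} denotes an ascending factorial, a(a+1)···(a+n−1) = Γ(a+n)/Γ(a). A small sketch of computing such terms stably in log space follows, with invented counts; the surrounding ratio is paraphrased rather than copied from the paper.

```python
from math import lgamma

def log_ascending_factorial(a, n):
    """log of a^{[n]} = a (a+1) ... (a+n-1) = Gamma(a+n) / Gamma(a)."""
    return lgamma(a + n) - lgamma(a)

# Toy values: contrast beta_k raised over character w's count with the
# total Dirichlet mass (n_letters * beta_k) raised over the tag's count.
beta_k, n_letters = 0.5, 20   # hypothetical emission prior and alphabet size
n_w, n_t = 12, 300            # counts of character w and of its tag t'
log_term = (log_ascending_factorial(beta_k, n_w)
            - log_ascending_factorial(n_letters * beta_k, n_t))
print(log_term)
```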

63 Sampling αk,t,t′: To sample the Dirichlet hyperparameter for cluster k and transition t → t′, we need to compute f(αk,t,t′ | t, z) ∝ f(t, z | αk,t,t′) = f(tk | αk,t,t′), where tk are the tag sequences for all languages currently assigned to cluster k. [sent-167, score-0.97]

64 This term is a predictive distribution of the multinomial-Dirichlet compound when the observations are grouped into multiple multinomials all with the same prior. [sent-168, score-0.143]
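A sketch of evaluating this unnormalized density: each language in the cluster contributes a Dirichlet-multinomial marginal over its transition counts out of tag t under the shared row prior α_{k,t,·}, and the unnormalized posterior is their product (any hyperprior on α is omitted here). The counts and names below are invented.

```python
from math import lgamma

def log_dirichlet_multinomial(counts, alpha_row):
    """log Dirichlet-multinomial marginal of one language's count vector."""
    n, a = sum(counts), sum(alpha_row)
    val = lgamma(a) - lgamma(a + n)
    for c, al in zip(counts, alpha_row):
        val += lgamma(al + c) - lgamma(al)
    return val

def log_unnormalized_alpha_posterior(alpha_row, per_language_counts):
    """Unnormalized log-posterior of one transition row alpha_{k,t,.},
    given each cluster language's transition counts out of tag t."""
    return sum(log_dirichlet_multinomial(c, alpha_row)
               for c in per_language_counts)

# Hypothetical counts of t -> {C, V, N} transitions in three languages.
counts = [[40, 55, 5], [38, 60, 2], [45, 50, 5]]
print(log_unnormalized_alpha_posterior([2.0, 3.0, 0.5], counts))
```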

65 This gives us an efficient way to compute unnormalized posterior densities for α. [sent-170, score-0.211]

66 To do so, we turn to slice sampling (Neal, 2003), a simple yet effective auxiliary variable scheme for sampling values from unnormalized but otherwise computable densities. [sent-172, score-0.367]

67 The key idea is to supplement the variable x, distributed according to the unnormalized density p̃(x), with a second variable u whose joint density with x is defined as p(x, u) ∝ I(u < p̃(x)). [sent-173, score-0.167]
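For reference, here is a compact generic univariate slice sampler with stepping-out and shrinkage in the spirit of Neal (2003). This is a textbook sketch, not the paper's implementation, and the test density is an arbitrary Gamma.

```python
import math
import random

def slice_sample(log_density, x0, w=1.0, n_samples=200, max_steps=50):
    """Univariate slice sampler for an unnormalized, computable log density."""
    samples, x = [], x0
    for _ in range(n_samples):
        # Auxiliary variable: u ~ Uniform(0, p(x)); work with log u.
        log_u = log_density(x) + math.log(random.random())
        # Step out an interval [left, right] that contains the slice.
        left = x - w * random.random()
        right = left + w
        steps = 0
        while log_density(left) > log_u and steps < max_steps:
            left -= w
            steps += 1
        steps = 0
        while log_density(right) > log_u and steps < max_steps:
            right += w
            steps += 1
        # Shrink the interval until a proposal lands inside the slice.
        while True:
            x_new = random.uniform(left, right)
            if log_density(x_new) > log_u:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return samples

# Toy check: sample from an unnormalized Gamma(3, 1) density on x > 0.
log_p = lambda x: 2.0 * math.log(x) - x if x > 0 else float("-inf")
draws = slice_sample(log_p, x0=1.0)
print(sum(draws) / len(draws))  # should be roughly 3 (the Gamma mean)
```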

68 Again, we have the predictive distribution of the multinomial-Dirichlet compound with multiple grouped observations. [sent-179, score-0.143]

69 As before, we use slice sampling for obtaining samples. [sent-181, score-0.185]

70 Sampling zℓ Finally, we consider sampling the cluster assignment zℓ for each language ℓ. [sent-182, score-0.378]
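A sketch of this Gibbs step, assuming a hypothetical helper `log_cluster_likelihood(language, k)` that scores the language's tag and character data under cluster k's current hyperparameters, and a uniform prior over clusters (the paper's prior may differ).

```python
import math
import random

def sample_cluster_assignment(language, n_clusters, log_cluster_likelihood):
    """Resample z_l from its unnormalized posterior over clusters.

    `log_cluster_likelihood(language, k)` is a hypothetical helper; a
    uniform prior over clusters is assumed here.
    """
    log_scores = [log_cluster_likelihood(language, k)
                  for k in range(n_clusters)]
    # Normalize in log space and draw from the resulting categorical.
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]
    r, acc = random.random() * sum(weights), 0.0
    for k, wt in enumerate(weights):
        acc += wt
        if r < acc:
            return k
    return n_clusters - 1
```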

71 5 Experiments To test our model, we apply it to a corpus of 503 languages for two decipherment tasks. [sent-184, score-0.353]

72 In both cases, we will assume no knowledge of our target language or its writing system, other than that it is alphabetic in nature. [sent-185, score-0.181]

73 At the same time, we will assume basic phonetic knowledge of the writing systems of the other 502 languages. [sent-186, score-0.226]

74 For our first task, we will predict whether each character type is a consonant or a vowel. [sent-187, score-0.35]

75 In the second task, we further subdivide the consonants into two major categories: the nasal consonants and the non-nasal consonants. [sent-188, score-0.529]

76 Nasal consonants are known to be perceptually very salient and are unique in being high frequency consonants in all known languages. [sent-189, score-0.814]

77 We have identified translations covering 503 distinct languages employing alphabetic writing systems. [sent-199, score-0.386]

78 Most of these languages (476) use variants of the Latin alphabet, a few (26) use Cyrillic, and one uses the Greek alphabet. [sent-200, score-0.205]

79 As Table 1 indicates, the languages cover a very diverse set of families and geographic regions, with Niger-Congo languages being the largest represented family. [sent-201, score-0.467]

80 Since the letter “y” can frequently represent both a consonant and a vowel, we exclude it from our evaluation. [sent-205, score-0.26]

81 On average, the resulting vocabularies contain 2,388 unique words, with 19 consonant characters, 2 nasal characters, and 9 vowels. [sent-206, score-0.424]

82 The simplest version, SYMM, disregards all information from other languages, using simple symmetric hyperparameters on the transition and emission Dirichlet priors (all hyperparameters set to 1). [sent-217, score-0.425]

83 Third panel: results for 27 non-Latin alphabet languages (Cyrillic and Greek). [sent-227, score-0.3]

84 our Gibbs sampling inference method for the type-based HMM, even in the absence of multilingual priors. [sent-229, score-0.278]

85 We next consider a variant of our model, MERGE, that assumes that all languages reside in a single cluster. [sent-230, score-0.205]

86 This allows knowledge from the other languages to affect our tag posteriors in a generic, language-neutral way. [sent-231, score-0.325]

87 By allowing for the division of languages into smaller groupings, we hope to learn more specific parameters tailored for typologically coherent clusters of languages. [sent-233, score-0.477]

88 Variance across languages is quite low: the standard deviations are about 2 percentage points. [sent-238, score-0.254]

89 Figure 4: Inferred Dirichlet transition hyperparameters for bigram CLUST on the three-way classification task with four latent clusters. [sent-244, score-0.189]

90 Examining just the first row, we see that the languages are partially grouped by their preference for the initial tag of words. [sent-251, score-0.325]

91 All clusters favor languages which prefer initial consonants, though this preference is most weakly expressed in cluster 3. [sent-252, score-0.482]

92 In contrast, both clusters 2 and 4 have very dominant tendencies towards consonant-initial languages, but differ in the relative weight given to languages preferring either vowels or nasals initially. [sent-253, score-0.606]

93 Finally, we examine the relationship between the induced clusters and language families in Table 3, for the trigram consonant vs. [sent-254, score-0.386]

94 8 Conclusion In this paper, we presented a successful solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. [sent-259, score-0.482]

95 Adopting a classical Bayesian perspective, we develop a model that performs posterior inference over hundreds of languages, leveraging knowledge of known languages to uncover general linguistic patterns of typologically coherent language clusters. [sent-260, score-0.632]

96 Using this model, we automatically distinguish between consonant and vowel characters with nearly 99% accuracy across 503 languages. [sent-261, score-0.573]

97 Future work will take us in several new directions: first, we would like to move beyond the assumption of an alphabetic writing system so that we can apply our method to undeciphered syllabic scripts such as Linear A. [sent-263, score-0.231]

98 We would also like to extend our methods to achieve finer-grained resolution of phonetic properties beyond nasals, consonants, and vowels. [sent-264, score-0.161]

99 Automated reconstruction of ancient languages using probabilistic models of sound change. [sent-274, score-0.242]

100 Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. [sent-283, score-0.135]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('consonants', 0.347), ('consonant', 0.218), ('languages', 0.205), ('cluster', 0.204), ('vowel', 0.196), ('vowels', 0.186), ('phonetic', 0.161), ('decipherment', 0.148), ('nasals', 0.142), ('hmm', 0.14), ('emission', 0.139), ('character', 0.132), ('sampling', 0.131), ('nasal', 0.125), ('nk', 0.122), ('dirichlet', 0.121), ('tag', 0.12), ('posterior', 0.116), ('alphabetic', 0.116), ('characters', 0.11), ('unobserved', 0.102), ('priors', 0.097), ('alphabet', 0.095), ('typologically', 0.09), ('predictive', 0.088), ('clust', 0.085), ('tw', 0.084), ('transition', 0.083), ('snyder', 0.083), ('vocabularies', 0.081), ('letters', 0.077), ('cyrillic', 0.075), ('dirichlets', 0.075), ('prediction', 0.075), ('unknown', 0.073), ('clusters', 0.073), ('family', 0.072), ('hundreds', 0.071), ('poisson', 0.069), ('geman', 0.069), ('writing', 0.065), ('latin', 0.065), ('parameters', 0.064), ('bayesian', 0.064), ('known', 0.06), ('ascending', 0.059), ('monte', 0.058), ('tk', 0.058), ('density', 0.058), ('distributions', 0.058), ('families', 0.057), ('panel', 0.057), ('benjamin', 0.057), ('factorials', 0.057), ('integrand', 0.057), ('isolates', 0.057), ('nonnasal', 0.057), ('polya', 0.057), ('typebased', 0.057), ('compound', 0.055), ('gibbs', 0.054), ('slice', 0.054), ('universal', 0.054), ('hyperparameters', 0.053), ('bigram', 0.053), ('sample', 0.052), ('unnormalized', 0.051), ('greek', 0.051), ('carlo', 0.051), ('undeciphered', 0.05), ('austronesian', 0.05), ('bible', 0.05), ('unigram', 0.049), ('across', 0.049), ('mix', 0.047), ('kim', 0.047), ('christodoulopoulos', 0.046), ('urn', 0.046), ('inference', 0.045), ('coherent', 0.045), ('absence', 0.045), ('sequences', 0.044), ('densities', 0.044), ('shay', 0.044), ('assignment', 0.043), ('cohen', 0.042), ('letter', 0.042), ('emitted', 0.041), ('plurality', 0.04), ('unsupervised', 0.039), ('observation', 0.039), ('equation', 0.039), ('layer', 0.039), ('knight', 0.038), ('tying', 0.038), ('valley', 0.038), ('trigram', 0.038), ('observed', 0.037), ('ancient', 0.037), ('hmms', 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy of 99% on the unsupervised consonant/vowel prediction task across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and non-nasal consonants, our model yields unsupervised accuracy of 89% across the same set of languages.

2 0.25007063 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds

Author: Thomas Mayer ; Christian Rohrdantz

Abstract: This paper describes the online tool PhonMatrix, which analyzes a word list with respect to the co-occurrence of sounds in a specified context within a word. The cooccurrence counts from the user-specified context are statistically analyzed according to a number of association measures that can be selected by the user. The statistical values then serve as the input for a matrix visualization where rows and columns represent the relevant sounds under investigation and the matrix cells indicate whether the respective ordered pair of sounds occurs more or less frequently than expected. The usefulness of the tool is demonstrated with three case studies that deal with vowel harmony and similar place avoidance patterns.

3 0.16663851 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

Author: Sujith Ravi

Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report for the first time—BLEU score results for a large-scale MT task using only non-parallel data (EMEA corpus).

4 0.14678815 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections

Author: Long Duong ; Paul Cook ; Steven Bird ; Pavel Pecina

Abstract: We present an unsupervised approach to part-of-speech tagging based on projections of tags in a word-aligned bilingual parallel corpus. In contrast to the existing state-of-the-art approach of Das and Petrov, we have developed a substantially simpler method by automatically identifying “good” training sentences from the parallel corpus and applying self-training. In experimental results on eight languages, our method achieves state-of-the-art results.

5 0.13942493 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers

Author: Malte Nuhn ; Hermann Ney

Abstract: In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard. We show that in this case the decipherment problem is equivalent to the quadratic assignment problem (QAP). To the best of our knowledge, this connection between the QAP and the decipherment problem has not been known in the literature before.

6 0.11833544 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

7 0.10632417 220 acl-2013-Learning Latent Personas of Film Characters

8 0.10208998 97 acl-2013-Cross-lingual Projections between Languages from Different Families

9 0.1015051 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

10 0.10106042 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

11 0.09661366 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

12 0.096367024 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

13 0.095856726 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis

14 0.092343606 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

15 0.091027305 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

16 0.08602991 29 acl-2013-A Visual Analytics System for Cluster Exploration

17 0.085511848 108 acl-2013-Decipherment

18 0.085144304 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

19 0.079735167 66 acl-2013-Beam Search for Solving Substitution Ciphers

20 0.079683021 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.203), (1, -0.02), (2, -0.025), (3, 0.0), (4, 0.03), (5, -0.114), (6, 0.002), (7, 0.019), (8, -0.1), (9, -0.07), (10, 0.017), (11, -0.157), (12, -0.025), (13, -0.101), (14, -0.109), (15, -0.279), (16, -0.021), (17, -0.033), (18, 0.044), (19, 0.031), (20, -0.038), (21, 0.009), (22, 0.048), (23, -0.03), (24, 0.004), (25, -0.017), (26, 0.052), (27, -0.022), (28, 0.001), (29, -0.018), (30, 0.069), (31, -0.005), (32, -0.0), (33, -0.067), (34, 0.05), (35, 0.075), (36, -0.062), (37, 0.009), (38, -0.193), (39, 0.101), (40, -0.031), (41, 0.028), (42, -0.041), (43, -0.109), (44, -0.009), (45, -0.156), (46, 0.054), (47, 0.144), (48, 0.068), (49, 0.093)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95041883 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we performs posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsu- pervised accuracy of 89% across the same set of languages.

2 0.80384231 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds

Author: Thomas Mayer ; Christian Rohrdantz

Abstract: This paper describes the online tool PhonMatrix, which analyzes a word list with respect to the co-occurrence of sounds in a specified context within a word. The cooccurrence counts from the user-specified context are statistically analyzed according to a number of association measures that can be selected by the user. The statistical values then serve as the input for a matrix visualization where rows and columns represent the relevant sounds under investigation and the matrix cells indicate whether the respective ordered pair of sounds occurs more or less frequently than expected. The usefulness of the tool is demonstrated with three case studies that deal with vowel harmony and similar place avoidance patterns.

3 0.64189947 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks

Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.

4 0.59053034 220 acl-2013-Learning Latent Personas of Film Characters

Author: David Bamman ; Brendan O'Connor ; Noah A. Smith

Abstract: We present two latent variable models for learning character types, or personas, in film, in which a persona is defined as a set of mixtures over latent lexical classes. These lexical classes capture the stereotypical actions of which a character is the agent and patient, as well as attributes by which they are described. As the first attempt to solve this problem explicitly, we also present a new dataset for the text-driven analysis of film, along with a benchmark testbed to help drive future work in this area.

5 0.58288229 29 acl-2013-A Visual Analytics System for Cluster Exploration

Author: Andreas Lamprecht ; Annette Hautli ; Christian Rohrdantz ; Tina Bogel

Abstract: This paper offers a new way of representing the results of automatic clustering algorithms by employing a Visual Analytics system which maps members of a cluster and their distance to each other onto a twodimensional space. A case study on Urdu complex predicates shows that the system allows for an appropriate investigation of linguistically motivated data. 1 Motivation In recent years, Visual Analytics systems have increasingly been used for the investigation of linguistic phenomena in a number of different areas, starting from literary analysis (Keim and Oelke, 2007) to the cross-linguistic comparison of language features (Mayer et al., 2010a; Mayer et al., 2010b; Rohrdantz et al., 2012a) and lexical semantic change (Rohrdantz et al., 2011; Heylen et al., 2012; Rohrdantz et al., 2012b). Visualization has also found its way into the field of computational linguistics by providing insights into methods such as machine translation (Collins et al., 2007; Albrecht et al., 2009) or discourse parsing (Zhao et al., 2012). One issue in computational linguistics is the interpretability of results coming from machine learning algorithms and the lack of insight they offer on the underlying data. This drawback often prevents theoretical linguists, who work with computational models and need to see patterns on large data sets, from drawing detailed conclusions. The present paper shows that a Visual Analytics system facilitates “analytical reasoning [...] by an interactive visual interface” (Thomas and Cook, 2006) and helps resolving this issue by offering a customizable, in-depth view on the statistically generated result and simultaneously an at-a-glance overview of the overall data set. In particular, we focus on the visual representa- tion of automatically generated clusters, in itself not a novel idea as it has been applied in other fields like the financial sector, biology or geography (Schreck et al., 2009). But as far as the literature is concerned, interactive systems are still less common, particularly in computational linguistics, and they have not been designed for the specific needs of theoretical linguists. This paper offers a method of visually encoding clusters and their internal coherence with an interactive user interface, which allows users to adjust underlying parameters and their views on the data depending on the particular research question. By this, we partly open up the “black box” of machine learning. The linguistic phenomenon under investigation, for which the system has originally been designed, is the varied behavior of nouns in N+V CP complex predicates in Urdu (e.g., memory+do = ‘to remember’) (Mohanan, 1994; Ahmed and Butt, 2011), where, depending on the lexical semantics of the noun, a set of different light verbs is chosen to form a complex predicate. The aim is an automatic detection of the different groups of nouns, based on their light verb distribution. Butt et al. (2012) present a static visualization for the phenomenon, whereas the present paper proposes an interactive system which alleviates some of the previous issues with respect to noise detection, filtering, data interaction and cluster coherence. For this, we proceed as follows: section 2 explains the proposed Visual Analytics system, followed by the linguistic case study in section 3. Section 4 concludes the paper. 
2 The system The system requires a plain text file as input, where each line corresponds to one data object.In our case, each line corresponds to one Urdu noun (data object) and contains its unique ID (the name of the noun) and its bigram frequencies with the 109 Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 109–1 4, four light verbs under investigation, namely kar ‘do’, ho ‘be’, hu ‘become’ and rakH ‘put’ ; an exemplary input file is shown in Figure 1. From a data analysis perspective, we have four- dimensional data objects, where each dimension corresponds to a bigram frequency previously extracted from a corpus. Note that more than four dimensions can be loaded and analyzed, but for the sake of simplicity we focus on the fourdimensional Urdu example for the remainder of this paper. Moreover, it is possible to load files containing absolute bigram frequencies and relative frequencies. When loading absolute frequencies, the program will automatically calculate the relative frequencies as they are the input for the clustering. The absolute frequencies, however, are still available and can be used for further processing (e.g. filtering). Figure 1: preview of appropriate file structures 2.1 Initial opening and processing of a file It is necessary to define a metric distance function between data objects for both clustering and visualization. Thus, each data object is represented through a high dimensional (in our example fourdimensional) numerical vector and we use the Euclidean distance to calculate the distances between pairs of data objects. The smaller the distance between two data objects, the more similar they are. For visualization, the high dimensional data is projected onto the two-dimensional space of a computer screen using a principal component analysis (PCA) algorithm1 . In the 2D projection, the distances between data objects in the highdimensional space, i.e. the dissimilarities of the bigram distributions, are preserved as accurately as possible. However, when projecting a highdimensional data space onto a lower dimension, some distinctions necessarily level out: two data objects may be far apart in the high-dimensional space, but end up closely together in the 2D projection. It is important to bear in mind that the 2D visualization is often quite insightful, but interpre1http://workshop.mkobos.com/201 1/java-pca- transformation-library/ tations have to be verified by interactively investigating the data. The initial clusters are calculated (in the highdimensional data space) using a default k-Means algorithm2 with k being a user-defined parameter. There is also the option of selecting another clustering algorithm, called the Greedy Variance Minimization3 (GVM), and an extension to include further algorithms is under development. 2.2 Configuration & Interaction 2.2.1 The main window The main window in Figure 2 consists of three areas, namely the configuration area (a), the visualization area (b) and the description area (c). The visualization area is mainly built with the piccolo2d library4 and initially shows data objects as colored circles with a variable diameter, where color indicates cluster membership (four clusters in this example). Hovering over a dot displays information on the particular noun, the cluster membership and the light verb distribution in the de- scription area to the right. 
By using the mouse wheel, the user can zoom in and out of the visualization. A very important feature for the task at hand is the possibility of selecting multiple data objects for further processing or filtering, with a list of the selected data objects shown in the description area. By right-clicking on these data objects, the user can assign a unique class (and class color) to them. Different clustering methods can be employed via the options item in the menu bar. Another feature of the system is that the user can fade in the cluster centroids (illustrated by a larger dot in the respective cluster color in Figure 2); the overall feature distribution of a cluster can be examined in a tooltip when hovering over the corresponding centroid.

2.2.2 Visually representing data objects

To gain further insight into the data distribution based on the 2D projection, the user can choose between several ways to visualize the individual data objects, all of which are shown in Figure 3. The standard visualization type is shown on the left and consists of a circle which encodes cluster membership via color.

Figure 2: Overview of the main window of the system, including the configuration area (a), the visualization area (b) and the description area (c). Large circles are cluster centroids.

Figure 3: Different visualizations of data points

Alternatively, normal glyphs and star glyphs can be displayed. The middle part of Figure 3 shows the data displayed with normal glyphs. In this view, the relative frequency of each light verb is mapped onto the length of a line, with the lines arranged clockwise around the center according to their occurrence in the input file. This view has the advantage that overall feature dominance in a cluster can be seen at a glance. The visualization type on the right in Figure 3, the star glyph, is an extension of the normal glyph: here, the line endings are connected, forming a “star”. As in the representation with the normal glyphs, this makes similar data objects easily recognizable and comparable with each other.

2.2.3 Filtering options

Our system offers options for filtering data according to different criteria.

Filter by means of bigram occurrence: By activating the bigram occurrence filtering, it is possible to show only those nouns which occur in bigrams with a selected subset of the features (light verbs). This is especially useful when examining possible commonalities.

Filter selected words: Another way of showing only items of interest is to select and display them separately. The PCA is recalculated for these data objects and the visualization is stretched to the whole area.

Filter selected cluster: Additionally, the user can visualize a specific cluster of interest. Again, the PCA is recalculated and the visualization stretched to the whole area. The cluster can then be manually fine-tuned and cleaned, for instance by removing wrongly assigned items.

2.2.4 Options to handle overplotting

Due to the nature of the data, much overplotting occurs. For example, there are many words which occur with only one light verb. The PCA assigns the same position to these words and, as a consequence, only the topmost of them is visible in the visualization.
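Two of the features described above, the centroid tooltip (the mean feature distribution of a cluster) and the bigram-occurrence filter, can be sketched in a few lines, together with a helper that picks the nouns closest to a centroid, i.e. the kind of prototypical cluster member discussed in the case study below. This is again an illustrative approximation rather than the system's Java implementation; it reuses the hypothetical names, counts, rel and labels variables from the previous sketch, and the reading of the occurrence filter (keep nouns whose non-zero counts fall entirely within the selected light verbs) is an interpretation of the description above.

```python
import numpy as np

LIGHT_VERBS = ["kar", "ho", "hu", "rakH"]   # feature order assumed from Figure 1

def cluster_centroid(rel, labels, cluster_id):
    """Mean light-verb distribution of one cluster (what the centroid tooltip shows)."""
    return rel[labels == cluster_id].mean(axis=0)

def prototypes(names, rel, labels, cluster_id, top_n=5):
    """Nouns closest to the centroid, i.e. the most prototypical cluster members."""
    idx = np.flatnonzero(labels == cluster_id)
    centroid = rel[idx].mean(axis=0)
    dists = np.linalg.norm(rel[idx] - centroid, axis=1)
    return [names[i] for i in idx[np.argsort(dists)][:top_n]]

def filter_by_occurrence(names, counts, selected):
    """Keep only nouns whose non-zero counts are confined to the selected light verbs."""
    sel = {LIGHT_VERBS.index(v) for v in selected}
    keep = []
    for i, row in enumerate(counts):
        present = set(np.nonzero(row)[0].tolist())
        if present and present <= sel:
            keep.append(i)
    return [names[i] for i in keep], counts[keep]

# e.g. nouns occurring with kar and/or ho but with no other light verb:
# sub_names, sub_counts = filter_by_occurrence(names, counts, ["kar", "ho"])
```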
In order to improve visual access to overplotted data objects, several methods that allow for a more differentiated view of the data have been included; they are described in the following paragraphs.

Change transparency of data objects: By modifying the transparency with the corresponding slider, areas with a dense data population can be readily identified.

Repositioning of data objects: To reduce the overplotting in densely populated areas, data objects can be repositioned randomly within a fixed deviation from their initial position. The degree of deviation can be determined interactively by the user with the corresponding slider. The user has the option to reposition either all data objects or only those that were selected in advance.

Frequency filtering: If the initial data contains absolute bigram frequencies, the user can filter the visualized words by frequency. For example, many nouns occur only once and therefore have an observed probability of 100% for co-occurring with one of the light verbs. In most cases it is useful to filter such data out.

Scaling data objects: If the user zooms beyond the maximum zoom factor, the data objects are scaled down. This is especially useful if data objects are only partly covered by many other objects; in this case, they become fully visible.

2.3 Alternative views on the data

In order to enable a holistic analysis, it is often valuable to provide the user with different views on the data. Consequently, we have integrated the option to explore the data with further standard visualization methods.

2.3.1 Correlation matrix

The correlation matrix in Figure 4 shows the correlations between features, which are visualized by circles using the following encoding: the size of a circle represents the correlation strength, and the color indicates whether the corresponding features are negatively (white) or positively (black) correlated.

Figure 4: Example of a correlation matrix

2.3.2 Parallel coordinates

The parallel coordinates diagram shows the distribution of the bigram frequencies over the different dimensions (Figure 5). Every noun is represented by a line and shows, when hovered over, a tooltip with the most important information. To filter the visualized words, the user has the option of displaying previously selected data objects, or s/he can restrict the value range for a feature and show only the items which lie within this range.

2.3.3 Scatter plot matrix

To further examine the relation between pairs of features, a scatter plot matrix can be used (Figure 6). The individual scatter plots give further insight into the correlation details of pairs of features.

Figure 5: Parallel coordinates diagram

Figure 6: Example showing a scatter plot matrix

3 Case study

In principle, the Visual Analytics system presented above can be used for any kind of cluster visualization, but the built-in options and add-ons are particularly designed for the type of work that linguists tend to be interested in: on the one hand, the user wants to get a quick overview of the overall patterns in the phenomenon, but at the same time, the system needs to allow for an in-depth inspection of the data. The system provides both. The overall cluster result shown in Figure 2 depicts the coherence of the clusters and therefore the overall pattern of the data set. The different glyph visualizations in Figure 3 illustrate the properties of each cluster. Single data points can be inspected in the description area.
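The random repositioning and the correlation-matrix view described in section 2 can also be sketched briefly; the jittered positions produced here are what the remarks below on randomized overplotted points refer to. As before, this is a NumPy illustration under stated assumptions (uniform noise for the jitter, a boolean mask for pre-selected points), not the actual Java implementation.

```python
import numpy as np

def jitter(xy, deviation, selected=None, seed=None):
    """Randomly reposition 2D points within +/- deviation of their current
    position; `selected` (a boolean mask) restricts the jitter to points
    chosen in advance, mirroring the slider-controlled option above."""
    rng = np.random.default_rng(seed)
    offset = rng.uniform(-deviation, deviation, size=xy.shape)
    if selected is not None:
        offset[~np.asarray(selected)] = 0.0
    return xy + offset

def feature_correlations(rel):
    """Pairwise correlations between the light-verb columns; the sign drives
    the black/white circle encoding, the magnitude the circle size."""
    return np.corrcoef(rel, rowvar=False)
```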
The randomization of overplotted data points helps to reveal concentrated cluster patterns in which the light verbs behave very similarly across different noun+verb complex predicates. The biggest advantage of the system lies in its interactivity. Figure 7 shows an example of the visualization used in Butt et al. (2012), the input being the same text file as shown in Figure 1. In that system, the relative frequency of each noun with each light verb is encoded by color saturation: the more saturated the color to the right of the noun, the higher the relative frequency of the light verb occurring with it. The number of the cluster (here, 3) and the respective nouns (e.g. kAm ‘work’) are shown to the left. The user does not get information on the coherence of the cluster, nor does the visualization show prototypical cluster patterns.

Figure 7: Cluster visualization in Butt et al. (2012)

Moreover, the system in Figure 7 only has a limited set of interaction choices, with the consequence that the user is not able to adjust the underlying data set, e.g. by filtering out noise. However, Butt et al. (2012) report that the Urdu data is indeed very noisy and requires a manual cleaning of the data set before the actual clustering. In the system presented here, the user simply marks conspicuous regions in the visualization panel and removes the respective data points from the original data set. Further filtering mechanisms are available as well; for example, low-frequency items which occur due to data sparsity issues can be removed from the overall data set by adjusting the corresponding parameters.

A linguistically relevant improvement lies in the display of cluster centroids, in other words the typical noun + light verb distribution of a cluster. This is particularly helpful when the linguist wants to pick out prototypical examples of a cluster in order to stipulate generalizations over the other cluster members.

4 Conclusion

In this paper, we present a novel Visual Analytics system that helps to automatically analyze bigrams extracted from corpora. The main purpose is to enable a more informed and steered cluster analysis than is currently possible with standard methods. This includes rich options for interaction, e.g. display configuration or data manipulation. Initially, the approach was motivated by a concrete research problem, but it has much wider applicability, as any kind of high-dimensional numerical data objects can be loaded and analyzed. However, the system still requires some basic understanding of the algorithms applied for clustering and projection in order to prevent the user from drawing wrong conclusions based on artifacts. If this potential pitfall is borne in mind when performing the analysis, the system enables a much more insightful and informed analysis than standard non-interactive methods. In the future, we aim to conduct user experiments in order to learn more about how the functionality and usability could be further enhanced.

Acknowledgments

This work was partially funded by the German Research Foundation (DFG) under grant BU 1806/7-1 “Visual Analysis of Language Change and Use Patterns” and by the German Federal Ministry of Education and Research (BMBF) under grant 01461246 “VisArgue”.

References

Tafseer Ahmed and Miriam Butt. 2011. Discovering Semantic Classes for Urdu N-V Complex Predicates. In Proceedings of the International Conference on Computational Semantics (IWCS 2011), pages 305–309.
Joshua Albrecht, Rebecca Hwa, and G. Elisabeta Marai. 2009. The Chinese Room: Visualization and Interaction to Understand and Correct Ambiguous Machine Translation. Computer Graphics Forum, 28(3):1047–1054.
Miriam Butt, Tina Bögel, Annette Hautli, Sebastian Sulger, and Tafseer Ahmed. 2012. Identifying Urdu Complex Predication via Bigram Extraction. In Proceedings of COLING 2012, Technical Papers, pages 409–424, Mumbai, India.
Christopher Collins, M. Sheelagh T. Carpendale, and Gerald Penn. 2007. Visualization of Uncertainty in Lattices to Support Decision-Making. In EuroVis 2007, pages 51–58. Eurographics Association.
Kris Heylen, Dirk Speelman, and Dirk Geeraerts. 2012. Looking at Word Meaning: An Interactive Visualization of Semantic Vector Spaces for Dutch Synsets. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 16–24.
Daniel A. Keim and Daniela Oelke. 2007. Literature Fingerprinting: A New Method for Visual Literary Analysis. In IEEE VAST 2007, pages 115–122. IEEE.
Thomas Mayer, Christian Rohrdantz, Miriam Butt, Frans Plank, and Daniel A. Keim. 2010a. Visualizing Vowel Harmony. Linguistic Issues in Language Technology, 4(2):1–33, December.
Thomas Mayer, Christian Rohrdantz, Frans Plank, Peter Bak, Miriam Butt, and Daniel A. Keim. 2010b. Consonant Co-Occurrence in Stems across Languages: Automatic Analysis and Visualization of a Phonotactic Constraint. In Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground, pages 70–78, Uppsala, Sweden, July. Association for Computational Linguistics.
Tara Mohanan. 1994. Argument Structure in Hindi. Stanford: CSLI Publications.
Christian Rohrdantz, Annette Hautli, Thomas Mayer, Miriam Butt, Frans Plank, and Daniel A. Keim. 2011. Towards Tracking Semantic Change by Visual Analytics. In ACL 2011 (Short Papers), pages 305–310, Portland, Oregon, USA, June. Association for Computational Linguistics.
Christian Rohrdantz, Michael Hund, Thomas Mayer, Bernhard Wälchli, and Daniel A. Keim. 2012a. The World’s Languages Explorer: Visual Analysis of Language Features in Genealogical and Areal Contexts. Computer Graphics Forum, 31(3):935–944.
Christian Rohrdantz, Andreas Niekler, Annette Hautli, Miriam Butt, and Daniel A. Keim. 2012b. Lexical Semantics and Distribution of Suffixes: A Visual Analysis. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 7–15, April.
Tobias Schreck, Jürgen Bernard, Tatiana von Landesberger, and Jörn Kohlhammer. 2009. Visual Cluster Analysis of Trajectory Data with Interactive Kohonen Maps. Information Visualization, 8(1):14–29.
James J. Thomas and Kristin A. Cook. 2006. A Visual Analytics Agenda. IEEE Computer Graphics and Applications, 26(1):10–13.
Jian Zhao, Fanny Chevalier, Christopher Collins, and Ravin Balakrishnan. 2012. Facilitating Discourse Analysis with Interactive Visualization. IEEE Transactions on Visualization and Computer Graphics, 18(12):2639–2648.

6 0.56912196 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

7 0.55500102 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors

8 0.51651925 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

9 0.51604384 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

10 0.49927819 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach

11 0.4983789 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures

12 0.49333608 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers

13 0.49274284 370 acl-2013-Unsupervised Transcription of Historical Documents

14 0.47802144 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

15 0.46752071 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

16 0.46675047 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?

17 0.44886252 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering

18 0.43968725 192 acl-2013-Improved Lexical Acquisition through DPP-based Verb Clustering

19 0.43784928 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

20 0.43722442 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.087), (6, 0.037), (11, 0.06), (15, 0.011), (22, 0.185), (24, 0.057), (26, 0.096), (35, 0.057), (42, 0.036), (48, 0.049), (64, 0.012), (70, 0.1), (88, 0.043), (90, 0.017), (95, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85325164 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we performs posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsu- pervised accuracy of 89% across the same set of languages.

2 0.78490269 292 acl-2013-Question Classification Transfer

Author: Anne-Laure Ligozat

Abstract: Question answering systems have been developed for many languages, but most resources were created for English, which can be a problem when developing a system in another language such as French. In particular, for question classification, no labeled question corpus is available for French, so this paper studies the possibility to use existing English corpora and transfer a classification by translating the question and their labels. By translating the training corpus, we obtain results close to a monolingual setting.

3 0.74848562 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

Author: David Kauchak

Abstract: In this paper we examine language modeling for text simplification. Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model. We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each. We evaluate the models intrinsically with perplexity and extrinsically on the lexical simplification task from SemEval 2012. We find that a combined model using both simplified and normal English data achieves a 23% improvement in perplexity and a 24% improvement on the lexical simplification task over a model trained only on simple data. Post-hoc analysis shows that the additional unsimplified data provides better coverage for unseen and rare n-grams.

4 0.70937502 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

Author: Zhiguo Wang ; Chengqing Zong ; Nianwen Xue

Abstract: For the cascaded task of Chinese word segmentation, POS tagging and parsing, the pipeline approach suffers from error propagation while the joint learning approach suffers from inefficient decoding due to the large combined search space. In this paper, we present a novel lattice-based framework in which a Chinese sentence is first segmented into a word lattice, and then a lattice-based POS tagger and a lattice-based parser are used to process the lattice from two different viewpoints: sequential POS tagging and hierarchical tree building. A strategy is designed to exploit the complementary strengths of the tagger and parser, and encourage them to predict agreed structures. Experimental results on Chinese Treebank show that our lattice-based framework significantly improves the accuracy of the three sub-tasks.

5 0.70723546 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification

Author: Matt Post ; Shane Bergsma

Abstract: Syntactic features are useful for many text classification tasks. Among these, tree kernels (Collins and Duffy, 2001) have been perhaps the most robust and effective syntactic tool, appealing for their empirical success, but also because they do not require an answer to the difficult question of which tree features to use for a given task. We compare tree kernels to different explicit sets of tree features on five diverse tasks, and find that explicit features often perform as well as tree kernels on accuracy and always in orders of magnitude less time, and with smaller models. Since explicit features are easy to generate and use (with publicly available tools), we suggest they should always be included as baseline comparisons in tree kernel method evaluations.

6 0.70712417 318 acl-2013-Sentiment Relevance

7 0.7051819 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users

8 0.70359743 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification

9 0.69756049 222 acl-2013-Learning Semantic Textual Similarity with Structural Representations

10 0.69723904 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

11 0.69706696 249 acl-2013-Models of Semantic Representation with Visual Attributes

12 0.69668263 80 acl-2013-Chinese Parsing Exploiting Characters

13 0.69602233 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

14 0.69322181 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction

15 0.69262117 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction

16 0.69190574 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars

17 0.69167072 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

18 0.6903826 333 acl-2013-Summarization Through Submodularity and Dispersion

19 0.68861121 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals

20 0.68856102 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia