acl acl2013 acl2013-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Johann-Mattis List ; Steven Moran
Abstract: Given the increasing interest and development of computational and quantitative methods in historical linguistics, it is important that scholars have a basis for documenting, testing, evaluating, and sharing complex workflows. We present a novel open-source toolkit for quantitative tasks in historical linguistics that offers these features. This toolkit also serves as an interface between existing software packages and frequently used data formats, and it provides implementations of new and existing algorithms within a homogeneous framework. We illustrate the toolkit’s functionality with an exemplary workflow that starts with raw language data and ends with automatically calculated phonetic alignments, cognates and borrowings. We then illustrate evaluation metrics on gold standard datasets that are provided with the toolkit.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Given the increasing interest and development of computational and quantitative methods in historical linguistics, it is important that scholars have a basis for documenting, testing, evaluating, and sharing complex workflows. [sent-3, score-0.303]
2 We present a novel open-source toolkit for quantitative tasks in historical linguistics that offers these features. [sent-4, score-0.376]
3 This toolkit also serves as an interface between existing software packages and frequently used data formats, and it provides implementations of new and existing algorithms within a homogeneous framework. [sent-5, score-0.25]
4 We illustrate the toolkit’s functionality with an exemplary workflow that starts with raw language data and ends with automatically calculated phonetic alignments, cognates and borrowings. [sent-6, score-0.271]
5 We then illustrate evaluation metrics on gold standard datasets that are provided with the toolkit. [sent-7, score-0.025]
6 1 Introduction: Since the turn of the 21st century, there has been an increasing amount of research that applies computational and quantitative approaches to historical-comparative linguistic processes. [sent-8, score-0.173]
7 Among these are: phonetic alignment algorithms (Kondrak, 2000; Prokić et al. [sent-9, score-0.209]
8 , 2009), statistical tests for genealogical relatedness (Kessler, 2001), methods for phylogenetic reconstruction (Holman et al. [sent-10, score-0.147]
9 , 2012), and automatic detection of cognates (Turchin et al. [sent-12, score-0.153]
10 In contrast to traditional approaches to language comparison, quantitative methods are often emphasized as advantageous with regard to objectivity, transparency and replicability of results. [sent-17, score-0.205]
11 Thus in order to replicate a study, researchers have to rebuild workflows from published descriptions and reimplement their approaches and algorithms. [sent-21, score-0.102]
12 These challenges make the replication of results difficult, or even impossible, and they hinder not only the evaluation and comparison of existing algorithms, but also the development of new approaches that build on them. [sent-22, score-0.041]
13 Another problem is that quantitative approaches that have been released as software are largely incompatible with each other and show great differences with regard to their input and output formats, application range and flexibility. [sent-23, score-0.237]
14 Furthermore, the linguistic datasets upon which many analyses and tools are based are only – if at all – available in disparate formats that need manual or semi-automatic re-editing before they can be used as input elsewhere. [sent-25, score-0.191]
15 Scholars who want to analyze a dataset with different approaches often have to laboriously convert it into various input formats, and they have to familiarize themselves with many different kinds of software. [sent-26, score-0.208]
16 Footnote 1: There is the STARLING database program for lexicostatistical and glottochronological analyses (Starostin, 2000). [sent-28, score-0.191]
17 The Rug/L04 software aligns sound sequences and calculates phonetic distances using the Levenshtein distance (Kleiweg, 2009; Levenshtein, 1966). [sent-29, score-0.355]
18 The ASJP-Software also computes the Levenshtein distance (Holman et al. [sent-30, score-0.023]
19 , 2011), but its results are based on previously executed phonetic analyses. [sent-31, score-0.15]
20 The ALINE software carries out pairwise alignment analyses (Kondrak, 2000). [sent-32, score-0.148]
21 There are also software packages from evolutionary biology which have been adapted for linguistic purposes, such as MrBayes (Ronquist and Huelsenbeck, 2003), PHYLIP (Felsenstein, 2005), and SplitsTree (Huson, 1998). [sent-33, score-0.14]
22 For the comparison of different output formats or for the evaluation of competing quantitative approaches, gold standard datasets are desirable. [sent-36, score-0.157]
23 Apart from a large number of different functions for common automatic tasks, LingPy offers specific modules for implementing general workflows that are used in historical linguistics and which partially mimic the basic aspects of the traditional comparative method (Trask, 2000, 64-67). [sent-43, score-0.407]
24 Figure 1 illustrates the interaction between different modules along with the data they produce. [sent-44, score-0.033]
25 In the following subsections, these modules will be introduced in the order of a typical workflow to illustrate the basic capabilities of the LingPy toolkit in more detail. [sent-45, score-0.153]
26 2.1 Input Formats: The basic input format read by LingPy is a tab-delimited text file in which the first line (the header) indicates the values of the columns and all words are listed in the following rows. [sent-47, score-0.202]
27 No specific order of columns or rows is required. [sent-49, score-0.024]
28 [Figure 1: Basic Workflow in LingPy. Diagram: Raw data, Orthographic parsing, Tokenized data, Cognate detection, Phonetic alignment (PA).] ... a representation of the word,3 and (4) TAXON, the name of the language (or dialect) in which the word occurs. [sent-52, score-0.126]
29 Basic output formats are essentially the same, the difference being that the results of calculations are added as separate columns. [sent-53, score-0.185]
30 Table 1 illustrates the basic structure of the input format for a dataset covering 325 concepts translated into 18 Dogon language varieties, taken from the Dogon comparative lexical spreadsheet (Heath et al. [sent-54, score-0.191]
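To make the format concrete, the following minimal sketch writes and reads such a tab-delimited file with Python's standard csv module. The column names (ID, TAXON, CONCEPT, COUNTERPART), the file name, and the forms are illustrative assumptions, not the toolkit's prescribed schema; only the tab-delimited layout with a header row is taken from the description above.

import csv

# Hypothetical sample rows; the column names and word forms are assumed
# for illustration. Only the tab-delimited header-plus-rows layout is
# given in the text above.
sample = (
    "ID\tTAXON\tCONCEPT\tCOUNTERPART\n"
    "1\tToro_Tegu\tfile (tool)\tki:ra\n"
    "2\tMombo\tfile (tool)\tbi:mbye\n"
)

with open("dogon.tsv", "w", encoding="utf-8") as f:
    f.write(sample)

# Read the wordlist back; no specific order of columns is required.
with open("dogon.tsv", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        print(row["TAXON"], row["CONCEPT"], row["COUNTERPART"])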
31 2.2 Parsing and Unicode Handling: Given a dataset in the basic LingPy input format, the first step towards sound-based normalization for automatically identifying cognates and sound changes with quantitative methods is to parse words into tokens. [sent-57, score-0.368]
32 Orthographic tokenization is a non-trivial task, but it is needed to attain interoperability across different orthographies or transcription systems and to enable the comparative analysis of languages. (Footnote 3: By this we mean a textual representation of the word, whether in a document- or language-specific orthography or in some form of broad or narrow transcription, etc.) [sent-58, score-0.052]
33 (Footnote 4: This tokenized dataset and the analyses discussed in this work are available for download from the LingPy website.) [sent-59, score-0.076]
[Table 1 rows (garbled in extraction): numeric IDs and taxa such as Tommo_So paired with counterparts for the concept 'file (tool)'.]
36 Table 1: Basic Input Format of LingPy. [sent-96, score-0.088]
37 LingPy includes a parser that takes as input a dataset and an optional orthography profile, i.e. [sent-97, score-0.078]
38 a description of the Unicode code points, characters, graphemes and orthographic rules that are needed to adequately model a writing system for a language variety as described in a particular document (Moran, 2012, 331). [sent-99, score-0.1]
39 The LingPy parser first normalizes all strings into Unicode Normalization Form D, which decomposes all character sequences and reorders them into one canonical order. [sent-100, score-0.035]
40 Next, if no orthography profile is specified, the parser will use the regular expression \X to match Unicode grapheme clusters, i.e. [sent-102, score-0.114]
41 combining character sequences typified by a base character followed by one or more Combining Diacritical Marks. [sent-104, score-0.07]
42 However, another layer of tokenization is usually required to match linguistic graphemes, or what Unicode calls ‘tailored grapheme clusters’. [sent-105, score-0.029]
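A minimal sketch of the two default steps described above (NFD normalization, then grapheme-cluster matching) is given below. It uses Python's standard unicodedata module and the third-party regex package, which, unlike the built-in re module, supports the \X grapheme-cluster pattern. This illustrates the idea only; it is not LingPy's actual parser and does not handle orthography profiles or tailored grapheme clusters.

import unicodedata
import regex  # third-party package; supports the \X pattern

def tokenize_graphemes(word):
    # Step 1: decompose into Unicode Normalization Form D, so that
    # precomposed characters become base character + combining marks.
    nfd = unicodedata.normalize("NFD", word)
    # Step 2: match Unicode grapheme clusters, i.e. a base character
    # followed by any combining marks.
    return regex.findall(r"\X", nfd)

# An accented vowel stays one token even though NFD splits it into
# two code points.
print(tokenize_graphemes("kì:rà"))  # ['k', 'ì', ':', 'r', 'à']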
43 Table 2 illustrates the different technological and linguistic levels involved in orthographic parsing. [sent-106, score-0.063]
44 2.3 Phonetic Alignments: Although less common in traditional historical linguistics, phonetic alignment plays a crucial role in automatic approaches, with alignment analyses currently being used in many different subfields, such as dialectology (Prokić et al. [sent-211, score-0.492]
45 Furthermore, alignment analyses are very useful for data visualization, since they directly show which sound segments correspond in cognate words. [sent-214, score-0.56]
46 LingPy offers implementations for many different approaches to pairwise and multiple phonetic alignment. [sent-215, score-0.279]
47 Among these, there are standard approaches that are directly taken from evolutionary biology and can be applied to linguistic data with only slight modifications, such as the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) and the Smith-Waterman algorithm (Smith and Waterman, 1981). [sent-216, score-0.142]
48 Furthermore, there are novel approaches that use more complex sequence models in order to meet linguistics-specific requirements, such as the Sound-Class-based phonetic Alignment (SCA) method (List, 2012b). [sent-217, score-0.191]
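As an illustration of the first family of methods, here is a minimal sketch of the Needleman-Wunsch algorithm with unit match/mismatch/gap scores. The scoring values are arbitrary placeholders, as are the example strings; the SCA method replaces such scores with sound-class-based ones, which are not reproduced here.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    # Global alignment via dynamic programming (Needleman and Wunsch, 1970).
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + s, D[i - 1][j] + gap, D[i][j - 1] + gap)
    # Trace back to recover one optimal alignment.
    al_a, al_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + s:
            al_a.append(a[i - 1]); al_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            al_a.append(a[i - 1]); al_b.append("-"); i -= 1
        else:
            al_a.append("-"); al_b.append(b[j - 1]); j -= 1
    return "".join(reversed(al_a)), "".join(reversed(al_b)), D[n][m]

print(needleman_wunsch("woldo", "wald"))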
49 Figure 2 shows a plot of the multiple alignment of the counterparts of the concept “stool” in eight Dogon languages. [sent-218, score-0.083]
50 The color scheme for the sound segments follows the sound class distinction of Dolgopolsky (1964). [sent-219, score-0.18]
51 2.4 Automatic Cognate Detection: The identification of cognates plays an important role in both traditional and quantitative approaches in historical linguistics. [sent-221, score-0.433]
52 Since the traditional approach to cognate detection within the framework of the comparative method is very time-consuming and difficult to evaluate for the non-expert, automatic approaches to cognate detection can play an important role in objectifying phylogenetic reconstructions. [sent-225, score-1.198]
53 Currently, LingPy offers four alternative approaches to cognate detection in multilingual wordlists. [sent-226, score-0.579]
54 The method of Turchin et al. (2010) employs sound classes as proposed by Dolgopolsky (1964) and assigns words that match in their first two consonant classes to the same cognate set. [sent-228, score-0.539]
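A toy sketch of this criterion: map each word to its first two Dolgopolsky-style consonant classes and group words by that key. The class table below is a small assumed excerpt, not Dolgopolsky's full ten-class system, and the sample words are made up.

# A small assumed excerpt of Dolgopolsky-style consonant classes;
# the full system has ten classes and covers far more sounds.
DOLGO = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "k": "K", "g": "K",
    "m": "M", "n": "N",
    "r": "R", "l": "R",
    "s": "S", "z": "S",
}

def class_key(word):
    # The first two consonant classes of the word.
    classes = [DOLGO[c] for c in word if c in DOLGO]
    return tuple(classes[:2])

def class_cognates(words):
    # Words matching in their first two consonant classes are assigned
    # to the same cognate set.
    sets = {}
    for w in words:
        sets.setdefault(class_key(w), []).append(w)
    return list(sets.values())

print(class_cognates(["hand", "hant", "mano", "pedo"]))
# 'hand' and 'hant' share the key ('N', 'T') and end up in one set.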
55 The NED method calculates the normalized edit distance between words and groups them into cognate sets using a flat cluster algorithm. [sent-229, score-0.491]
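The following sketch implements that idea: a plain Levenshtein distance, normalized as in footnote 6 below (division by the length of the smaller sequence), and a greedy single-linkage flat clustering under an assumed threshold of 0.5. LingPy's actual threshold and clustering variant may differ.

def levenshtein(a, b):
    # Standard dynamic-programming edit distance (Levenshtein, 1966).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ned(a, b):
    # Normalized edit distance: edit distance divided by the length
    # of the smaller sequence (following footnote 6 below).
    return levenshtein(a, b) / min(len(a), len(b))

def flat_clusters(words, threshold=0.5):
    # Greedy single-linkage flat clustering: a word joins the first
    # cluster containing a member within the threshold.
    clusters = []
    for w in words:
        for c in clusters:
            if any(ned(w, m) < threshold for m in c):
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters

print(flat_clusters(["hand", "hant", "mano", "mani"]))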
56 As shown, LingPy follows the STARLING approach in displaying cognate judgments by assigning cognate words the same cognate ID (COGID). [sent-232, score-1.297]
57 In Table 4, the words judged to be cognate are shaded in the same color. [sent-233, score-0.42]
58 2.5 Automatic Borrowing Detection: Automatic approaches for borrowing detection are still in their infancy in historical linguistics. [sent-236, score-0.356]
59 LingPy provides a full reimplementation (along with specifically linguistic modifications) of the minimal lateral network (MLN) approach (Nelson-Sathi et al. [sent-237, score-0.037]
60 This approach searches for cognate sets which are not compatible with a given reference tree. (Footnote 6: The normalized edit distance is calculated by dividing the edit distance (Levenshtein, 1966) by the length of the smaller sequence; see Holman et al. (2011).) [sent-239, score-0.516]
[Table 4 rows (garbled in extraction): forms for the concept 'file (tool)' with their cognate IDs, e.g. kí:rà (Toro_Tegu, COGID 68), bìmbú (Tommo_So, COGID 70), bí:mbyé (Mombo, COGID 70).]
63 Incompatible (patchy) cognate sets often point to either borrowings or wrong cognate assessments in the data. [sent-287, score-0.462]
64 The results can be visualized by connecting all taxa of the reference tree for which patchy cognate sets can be inferred with lateral links. [sent-288, score-0.499]
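The following toy sketch captures only the core intuition, under strong simplifying assumptions: a cognate set whose member languages do not form a clade of the reference tree is flagged as potentially patchy. The real MLN method instead infers minimal lateral links and accounts for losses; the tree topology and taxa below are invented for illustration.

def leaves(tree):
    # Leaf names of a tree given as nested tuples of strings.
    if isinstance(tree, str):
        return frozenset([tree])
    out = set()
    for child in tree:
        out |= leaves(child)
    return frozenset(out)

def clades(tree):
    # Leaf-sets of all subtrees of the reference tree.
    acc = [leaves(tree)]
    if not isinstance(tree, str):
        for child in tree:
            acc.extend(clades(child))
    return acc

def looks_patchy(cognate_taxa, tree):
    # Crude stand-in for tree (in)compatibility: a cognate set that is
    # not a clade would need more than one origin (or borrowing).
    return frozenset(cognate_taxa) not in clades(tree)

reference_tree = (("Toro_Tegu", "Tommo_So"), "Mombo")  # assumed topology
print(looks_patchy({"Toro_Tegu", "Mombo"}, reference_tree))     # True
print(looks_patchy({"Toro_Tegu", "Tommo_So"}, reference_tree))  # False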
65 Cognate judgments for this analysis were carried out with the help of LingPy's LexStat method. [sent-290, score-0.037]
66 2.6 Output Formats: The output formats supported by LingPy can be divided into three different classes. [sent-293, score-0.141]
67 The first class consists of text-based formats that can be used for manual correction and inspection by importing the data into spreadsheet programs, or simply editing and reviewing the results in a text editor. [sent-294, score-0.183]
68 The second class consists of specific formats for third-party toolkits, such as PHYLIP, SplitsTree, MrBayes, or STARLING. [sent-295, score-0.141]
69 LingPy currently offers support for PHYLIP's distance calculations (DST format), for tree representation (Newick format), for complex representations of character data (Nexus format), and for the import into STARLING databases (CSV with STARLING markup). [sent-296, score-0.153]
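As one concrete example of this second class, the sketch below writes a square distance matrix in PHYLIP's DST layout: a taxon count on the first line, then one row per taxon with the name padded to ten characters followed by its distances. The taxa are taken from the Dogon examples above, but the distance values are invented for illustration, and this is not LingPy's own writer.

def write_phylip_dst(taxa, matrix, path):
    # PHYLIP square distance-matrix format: number of taxa, then one
    # row per taxon (name padded to ten characters, then distances).
    with open(path, "w", encoding="utf-8") as f:
        f.write(" {0}\n".format(len(taxa)))
        for name, row in zip(taxa, matrix):
            f.write(name[:10].ljust(10))
            f.write(" ".join("{0:.4f}".format(d) for d in row))
            f.write("\n")

taxa = ["Toro_Tegu", "Tommo_So", "Mombo"]
matrix = [[0.0, 0.4, 0.6],   # invented distances, illustration only
          [0.4, 0.0, 0.5],
          [0.6, 0.5, 0.0]]
write_phylip_dst(taxa, matrix, "dogon.dst")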
70 The third class consists of new approaches to the visualization of phonetic alignments, cognate sets, and phylogenetic networks. [sent-297, score-0.708]
71 3 Evaluation: In order to improve the performance of quantitative approaches, it is of crucial importance to test and evaluate them. [sent-299, score-0.132]
72 a gold standard, where the results of the analysis are known in advance. [sent-302, score-0.025]
73 LingPy comes with a module for the evaluation of basic tasks in historical linguistics, such as phonetic alignment and cognate detection. [Figure 3: Borrowing Detection in LingPy] [sent-303, score-0.805]
74 This module offers both common evaluation measures that are used to assess the accuracy of the respective methods and gold standard datasets encoded in the LingPy input format. [sent-304, score-0.076]
75 For all approaches we chose the respective thresholds that tend to yield the best results on all of the gold standards. [sent-309, score-0.066]
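One common way to score cognate detection against such a gold standard is over word pairs: precision and recall are computed on the sets of within-cluster pairs. The sketch below assumes cognate judgments given as word-to-COGID mappings; whether these pair scores are exactly the measures behind Figure 4 is an assumption.

from itertools import combinations

def cognate_pairs(cogids):
    # All unordered pairs of words that share a cognate ID;
    # `cogids` maps a word (or word index) to its COGID.
    pairs = set()
    for a, b in combinations(sorted(cogids), 2):
        if cogids[a] == cogids[b]:
            pairs.add((a, b))
    return pairs

def pair_scores(gold, test):
    # Pairwise precision, recall and F-score of test judgments
    # against the gold standard.
    g, t = cognate_pairs(gold), cognate_pairs(test)
    hits = len(g & t)
    p = hits / len(t) if t else 0.0
    r = hits / len(g) if g else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {"w1": "A", "w2": "A", "w3": "B", "w4": "B"}
test = {"w1": "A", "w2": "A", "w3": "A", "w4": "B"}
print(pair_scores(gold, test))  # (0.333..., 0.5, 0.4)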
76 Footnote 7: Gold standard here means that the cognate judgments were carried out manually by the compilers of the IELex database. [sent-311, score-0.457]
77 However, the generally bad performance of all approaches on this dataset shows that there is a clear need for improving automatic cognate detection approaches, especially in cases of remote relationship, such as Indo-European. [sent-312, score-0.554]
78 [Figure 4: Evaluating Cognate Detection Methods] 4 Conclusion: Quantitative approaches in historical linguistics are still in their infancy, far away from being able to compete with the intuition of trained historical linguists. [sent-313, score-0.325]
79 The toolkit we presented is a first attempt to close the gap between quantitative and traditional methods by providing a homogeneous framework that serves as an interface between existing packages and at the same time provides high-quality implementations of new approaches. [sent-314, score-0.375]
80 Automated reconstruction of ancient languages using probabilistic models of sound change. [sent-322, score-0.14]
81 A new algorithm for the alignment of phonetic sequences. [sent-439, score-0.209]
82 Networks uncover hidden lexical borrowing in Indo-European language evolution. [sent-492, score-0.069]
83 Analyzing genetic connections between languages by matching consonant classes. [sent-545, score-0.029]
wordName wordTfidf (topN-words)
[('lingpy', 0.61), ('cognate', 0.42), ('phonetic', 0.15), ('dogon', 0.147), ('historical', 0.142), ('formats', 0.141), ('oo', 0.135), ('quantitative', 0.132), ('lexstat', 0.126), ('file', 0.109), ('holman', 0.105), ('starling', 0.105), ('unicode', 0.103), ('phylogenetic', 0.097), ('sound', 0.09), ('cognates', 0.086), ('mbu', 0.084), ('phylip', 0.084), ('turchin', 0.084), ('sca', 0.081), ('bouckaert', 0.074), ('bi', 0.071), ('borrowing', 0.069), ('detection', 0.067), ('mbye', 0.063), ('mrbayes', 0.063), ('proki', 0.063), ('taxon', 0.063), ('orthographic', 0.063), ('di', 0.062), ('packages', 0.061), ('workflows', 0.061), ('biology', 0.061), ('alignment', 0.059), ('comparative', 0.054), ('orthography', 0.052), ('offers', 0.051), ('toolkit', 0.051), ('analyses', 0.05), ('reconstruction', 0.05), ('ipa', 0.048), ('ki', 0.046), ('calculations', 0.044), ('molecular', 0.044), ('zu', 0.044), ('borrowings', 0.042), ('cogid', 0.042), ('documenting', 0.042), ('dolgopolsky', 0.042), ('genome', 0.042), ('ielex', 0.042), ('jams', 0.042), ('kir', 0.042), ('patchy', 0.042), ('ronquist', 0.042), ('splitstree', 0.042), ('spreadsheet', 0.042), ('approaches', 0.041), ('evolutionary', 0.04), ('software', 0.039), ('id', 0.038), ('levenshtein', 0.038), ('homogeneous', 0.038), ('graphemes', 0.037), ('heath', 0.037), ('infancy', 0.037), ('lateral', 0.037), ('needleman', 0.037), ('steiner', 0.037), ('judgments', 0.037), ('implementations', 0.037), ('workflow', 0.035), ('format', 0.035), ('character', 0.035), ('basic', 0.034), ('transcription', 0.034), ('profile', 0.033), ('modules', 0.033), ('traditional', 0.032), ('distances', 0.03), ('grapheme', 0.029), ('consonant', 0.029), ('ay', 0.029), ('scholars', 0.029), ('kondrak', 0.028), ('alignments', 0.027), ('bioinformatics', 0.027), ('ned', 0.027), ('dataset', 0.026), ('moran', 0.025), ('incompatible', 0.025), ('gold', 0.025), ('edit', 0.025), ('ra', 0.025), ('gray', 0.025), ('serves', 0.024), ('concept', 0.024), ('columns', 0.024), ('calculates', 0.023), ('distance', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
Author: Johann-Mattis List ; Steven Moran
Abstract: Given the increasing interest and development of computational and quantitative methods in historical linguistics, it is important that scholars have a basis for documenting, testing, evaluating, and sharing complex workflows. We present a novel open-source toolkit for quantitative tasks in historical linguistics that offers these features. This toolkit also serves as an interface between existing software packages and frequently used data formats, and it provides implementations of new and existing algorithms within a homogeneous framework. We illustrate the toolkit’s functionality with an exemplary workflow that starts with raw language data and ends with automatically calculated phonetic alignments, cognates and borrowings. We then illustrate evaluation metrics on gold standard datasets that are provided with the toolkit.
2 0.16170041 154 acl-2013-Extracting bilingual terminologies from comparable corpora
Author: Ahmet Aker ; Monica Paramita ; Rob Gaizauskas
Abstract: In this paper we present a method for extracting bilingual terminologies from comparable corpora. In our approach we treat bilingual term extraction as a classification problem. For classification we use an SVM binary classifier and training data taken from the EUROVOC thesaurus. We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs. The performance of our classifier reaches the 100% precision level for many language pairs. We also perform manual evaluation on bilingual terms extracted from English-German term-tagged comparable corpora. The results of this manual evaluation showed 60-83% of the term pairs generated are exact translations and over 90% exact or partial translations.
3 0.074577138 89 acl-2013-Computerized Analysis of a Verbal Fluency Test
Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks
Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.
4 0.064368092 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and non-nasal consonants, our model yields unsupervised accuracy of 89% across the same set of languages.
5 0.053967219 150 acl-2013-Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications
Author: Georgios Kontonatsios ; Paul Thompson ; Riza Theresa Batista-Navarro ; Claudiu Mihaila ; Ioannis Korkontzelos ; Sophia Ananiadou
Abstract: U-Compare is a UIMA-based workflow construction platform for building natural language processing (NLP) applications from heterogeneous language resources (LRs), without the need for programming skills. U-Compare has been adopted within the context of the METANET Network of Excellence, and over 40 LRs that process 15 European languages have been added to the U-Compare component library. In line with METANET's aims of increasing communication between citizens of different European countries, U-Compare has been extended to facilitate the development of a wider range of applications, including both multilingual and multimodal workflows. The enhancements exploit the UIMA Subject of Analysis (Sofa) mechanism, which allows different facets of the input data to be represented. We demonstrate how our customised extensions to U-Compare allow the construction and testing of NLP applications that transform the input data in different ways, e.g., machine translation, automatic summarisation and text-to-speech.
6 0.049539506 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning
7 0.041873239 65 acl-2013-BRAINSUP: Brainstorming Support for Creative Sentence Generation
9 0.039564069 29 acl-2013-A Visual Analytics System for Cluster Exploration
10 0.039172485 370 acl-2013-Unsupervised Transcription of Historical Documents
11 0.038503088 118 acl-2013-Development and Analysis of NLP Pipelines in Argo
12 0.037358399 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner
13 0.037264921 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
14 0.036779184 234 acl-2013-Linking and Extending an Open Multilingual Wordnet
15 0.035116933 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
16 0.034446768 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation
17 0.033567145 128 acl-2013-Does Korean defeat phonotactic word segmentation?
18 0.032527268 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity
19 0.03180822 240 acl-2013-Microblogs as Parallel Corpora
20 0.031437736 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us
topicId topicWeight
[(0, 0.099), (1, 0.0), (2, 0.013), (3, -0.02), (4, 0.013), (5, -0.045), (6, -0.009), (7, -0.003), (8, 0.015), (9, -0.031), (10, -0.033), (11, -0.026), (12, -0.01), (13, -0.022), (14, -0.024), (15, -0.031), (16, 0.003), (17, -0.003), (18, 0.002), (19, -0.01), (20, -0.049), (21, 0.008), (22, -0.025), (23, 0.01), (24, -0.016), (25, 0.015), (26, -0.037), (27, -0.026), (28, -0.005), (29, -0.016), (30, -0.034), (31, -0.012), (32, -0.05), (33, -0.07), (34, -0.006), (35, 0.035), (36, -0.065), (37, 0.074), (38, -0.068), (39, 0.071), (40, -0.034), (41, 0.069), (42, -0.004), (43, -0.049), (44, -0.05), (45, -0.053), (46, -0.008), (47, -0.033), (48, 0.083), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.89619368 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
Author: Johann-Mattis List ; Steven Moran
Abstract: Given the increasing interest and development of computational and quantitative methods in historical linguistics, it is important that scholars have a basis for documenting, testing, evaluating, and sharing complex workflows. We present a novel open-source toolkit for quantitative tasks in historical linguistics that offers these features. This toolkit also serves as an interface between existing software packages and frequently used data formats, and it provides implementations of new and existing algorithms within a homogeneous framework. We illustrate the toolkit’s functionality with an exemplary workflow that starts with raw language data and ends with automatically calculated phonetic alignments, cognates and borrowings. We then illustrate evaluation metrics on gold standard datasets that are provided with the toolkit.
2 0.70607829 89 acl-2013-Computerized Analysis of a Verbal Fluency Test
Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks
Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.
3 0.64689875 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds
Author: Thomas Mayer ; Christian Rohrdantz
Abstract: This paper describes the online tool PhonMatrix, which analyzes a word list with respect to the co-occurrence of sounds in a specified context within a word. The co-occurrence counts from the user-specified context are statistically analyzed according to a number of association measures that can be selected by the user. The statistical values then serve as the input for a matrix visualization where rows and columns represent the relevant sounds under investigation and the matrix cells indicate whether the respective ordered pair of sounds occurs more or less frequently than expected. The usefulness of the tool is demonstrated with three case studies that deal with vowel harmony and similar place avoidance patterns.
4 0.55627024 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and non-nasal consonants, our model yields unsupervised accuracy of 89% across the same set of languages.
5 0.55481416 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
Author: Tingting Li ; Tiejun Zhao ; Andrew Finch ; Chunyue Zhang
Abstract: Machine Transliteration is an essential task for many NLP applications. However, names and loan words typically originate from various languages, obey different transliteration rules, and therefore may benefit from being modeled independently. Recently, transliteration models based on Bayesian learning have overcome issues with over-fitting allowing for many-to-many alignment in the training of transliteration models. We propose a novel coupled Dirichlet process mixture model (cDPMM) that simultaneously clusters and bilingually aligns transliteration data within a single unified model. The unified model decomposes into two classes of non-parametric Bayesian component models: a Dirichlet process mixture model for clustering, and a set of multinomial Dirichlet process models that perform bilingual alignment independently for each cluster. The experimental results show that our method considerably outperforms conventional alignment models.
6 0.54256052 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison
7 0.53586149 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?
8 0.52943426 29 acl-2013-A Visual Analytics System for Cluster Exploration
10 0.50728673 163 acl-2013-From Natural Language Specifications to Program Input Parsers
11 0.50705093 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures
12 0.50462216 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration
13 0.48515254 220 acl-2013-Learning Latent Personas of Film Characters
14 0.4616524 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts
15 0.46026719 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us
16 0.45087716 118 acl-2013-Development and Analysis of NLP Pipelines in Argo
17 0.43188331 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
18 0.42910954 14 acl-2013-A Novel Classifier Based on Quantum Computation
19 0.42768979 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning
20 0.42469651 390 acl-2013-Word surprisal predicts N400 amplitude during reading
topicId topicWeight
[(0, 0.052), (6, 0.037), (11, 0.031), (24, 0.069), (26, 0.04), (28, 0.021), (31, 0.364), (35, 0.052), (42, 0.039), (48, 0.032), (70, 0.065), (88, 0.021), (90, 0.022), (95, 0.062)]
simIndex simValue paperId paperTitle
1 0.78118217 313 acl-2013-Semantic Parsing with Combinatory Categorial Grammars
Author: Yoav Artzi ; Nicholas FitzGerald ; Luke Zettlemoyer
Abstract: unkown-abstract
same-paper 2 0.78066683 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
Author: Johann-Mattis List ; Steven Moran
Abstract: Given the increasing interest and development of computational and quantitative methods in historical linguistics, it is important that scholars have a basis for documenting, testing, evaluating, and sharing complex workflows. We present a novel open-source toolkit for quantitative tasks in historical linguistics that offers these features. This toolkit also serves as an interface between existing software packages and frequently used data formats, and it provides implementations of new and existing algorithms within a homogeneous framework. We illustrate the toolkit’s functionality with an exemplary workflow that starts with raw language data and ends with automatically calculated phonetic alignments, cognates and borrowings. We then illustrate evaluation metrics on gold standard datasets that are provided with the toolkit.
3 0.64604431 234 acl-2013-Linking and Extending an Open Multilingual Wordnet
Author: Francis Bond ; Ryan Foster
Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.
4 0.63461459 367 acl-2013-Universal Conceptual Cognitive Annotation (UCCA)
Author: Omri Abend ; Ari Rappoport
Abstract: Syntactic structures, by their nature, reflect first and foremost the formal constructions used for expressing meanings. This renders them sensitive to formal variation both within and across languages, and limits their value to semantic applications. We present UCCA, a novel multi-layered framework for semantic representation that aims to accommodate the semantic distinctions expressed through linguistic utterances. We demonstrate UCCA’s portability across domains and languages, and its relative insensitivity to meaning-preserving syntactic variation. We also show that UCCA can be effectively and quickly learned by annotators with no linguistic background, and describe the compilation of a UCCAannotated corpus.
5 0.51758206 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset
Author: Mohamed Aly ; Amir Atiya
Abstract: We introduce LABR, the largest sentiment analysis dataset to-date for the Arabic language. It consists of over 63,000 book reviews, each rated on a scale of 1 to 5 stars. We investigate the properties of the dataset and present its statistics. We explore using the dataset for two tasks: sentiment polarity classification and rating classification. We provide standard splits of the dataset into training and testing, for both polarity and rating classification, in both balanced and unbalanced settings. We run baseline experiments on the dataset to establish a benchmark.
6 0.4555459 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
7 0.40629721 382 acl-2013-Variational Inference for Structured NLP Models
8 0.39344013 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages
9 0.37915349 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction
10 0.37772244 107 acl-2013-Deceptive Answer Prediction with User Preference Graph
11 0.37575674 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users
12 0.37574032 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering
13 0.37509504 249 acl-2013-Models of Semantic Representation with Visual Attributes
14 0.37485552 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
16 0.37420112 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification
17 0.3740254 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals
18 0.37338153 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization
19 0.37296164 267 acl-2013-PARMA: A Predicate Argument Aligner
20 0.37235546 85 acl-2013-Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis