acl acl2012 acl2012-200 knowledge-graph by maker-knowledge-mining

200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts


Source: pdf

Author: Stephen Tyndall

Abstract: This paper presents the problem within Hittite and Ancient Near Eastern studies of fragmented and damaged cuneiform texts, and proposes to use well-known text classification metrics, in combination with some facts about the structure of Hittite-language cuneiform texts, to help classify a number of fragments of clay cuneiform-script tablets into more complete texts. In particular, I propose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. The performance in some cases is improved, and in some cases very much not, suggesting that the variable frequency of occurrence of these ideograms in individual fragments makes considerable difference in the ideal choice for a classification method. Further, complexities of the writing system and the digital availability of Hittite texts complicate the problem.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In particular, I propose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. [sent-3, score-0.158]

2 The performance in some cases is improved, and in some cases very much not, suggesting that the variable frequency of occurrence of these ideograms in individual fragments makes considerable difference in the ideal choice for a classification method. [sent-4, score-0.538]

3 Further, complexities of the writing system and the digital availability of Hittite texts complicate the problem. [sent-5, score-0.143]

4 1 Introduction The Hittite empire, in existence for about 600 years between 1800 and 1200 BCE, left numerous historical, political, and literary documents behind, written in cuneiform in clay tablets. [sent-6, score-0.448]

5 There are a number of common problems that confront Hittite scholars interested in any subdiscipline of Hittitology, be it history, philology, or linguistics. [sent-7, score-0.052]

6 First, the bulk of the cuneiform material is fragmentary. [sent-9, score-0.351]

7 The tablets, discovered in various depots in the Hittite capital and in some provincial centers, normally were of a larger size. [sent-10, score-0.09]

8 When the archives were destroyed, the tablets for the most part broke into many pieces. [sent-11, score-0.125]

9 Therefore, the joining of fragments became an important prerequisite for interpretation (Klengel, 2002). [sent-12, score-0.282]

10 Most Hittite texts are broken, but a number exist in more than one fragmentary copy. [sent-13, score-0.101]

11 Figure 1 shows a photograph, taken from the University of Mainz Konkordanz der hethitischen Texte, of a typical Hittite cuneiform fragment. [sent-14, score-0.351]

12 Complete or partially-complete texts are assembled from collections of fragments based on shape, writing size and style, and sentence similarity. [sent-15, score-0.425]

13 Joins between fragments are not made systematically, but are usually discovered by scholars assembling large numbers of fragments that reference a specific subject, like some joins recently made in Hittite treaty documents in (Beckman, 1997). [sent-16, score-0.8]

14 Such joins and the larger texts created therewith are catalogued according to a CTH (Catalogue des Textes Hittites) number. [sent-18, score-0.273]

15 Each individual text is composed of one or more cuneiform fragments belonging to one or more copies of a single original work. [sent-19, score-0.704]

16 Figure 2 shows a published join in hand-copied cuneiform fragments. [sent-29, score-0.404]

17 In this case, the fragments are not contiguous, and only the text on the two fragments was used to make the join. [sent-30, score-0.586]

18 The task then, for the purposes of this paper, is to connect unknown fragments of Hittite cuneiform tablets with larger texts. [sent-31, score-0.761]

19 I’m viewing this as a text classification task, where larger, CTH-numbered texts are the categories, and small fragments are the bits of text to be assigned to these categories. [sent-32, score-0.502]

20 2 The Corpus of Hittite Hittite cuneiform consists of a mix of syllabic writing for Hittite words and logographic writing, typically Sumerian ideograms, standing in for Hittite words. [sent-33, score-0.437]

21 Most words are written out phonologically using syllabic signs, in structure mostly CV and VC, and a few CVC. [sent-34, score-0.084]

22 Some common words are written with logograms from other Ancient Near Eastern languages, e. [sent-35, score-0.04]

23 Hittite antuḫša- ‘man’ is commonly written with the Sumerian-language logogram transcribed LÚ. [sent-37, score-0.04]

24 Such writings are called Sumerograms or Akkadograms, depending on the language from which the ideogram is taken. [sent-38, score-0.057]

25 The extant corpus of Hittite consists of more than 30,000 clay tablets and fragments excavated at sites in Turkey, Syria, and Egypt (Hoffner and Melchert, 2008, 2-3). [sent-39, score-0.435]

26 Many of these fragments are assigned to one of the 835 texts catalogued in the CTH. [sent-40, score-0.427]

27 3 Prior Work A large number of prior studies on text classification have informed the progress of this study. [sent-41, score-0.059]

28 Categorization of texts into genres is very well studied (Dewdney et al. [sent-42, score-0.101]

29 Measures of similarity among sections of a single document bear a closer relation to this project than the works above. [sent-46, score-0.03]

30 Very little computational work on cuneiform languages or texts exists. [sent-48, score-0.452]

31 The most notable example is a study that examined grapheme distribution as a way to understand Hurrian substratal interference in the orthography of Akkadian-language cuneiform texts written in the Hurrian-speaking town of Nuzi (Smith, 2007). [sent-49, score-0.527]

32 4 The Project Corpus For this project, I use a corpus of neo-Hittite fragment transcriptions available from H. [sent-51, score-0.068]

33 The fragments themselves are included as plain text, with restorations by the transcribers left intact and set off by brackets, in the manner typical of cuneiform transcription. [sent-56, score-0.764]

34 In transcription, signs with phonemic value are written in lower case characters, while ideograms are represented in all caps. [sent-57, score-0.316]

35 Sign boundaries are represented by a hyphen, indicating the next sign is part of the current word, by an equals sign, indicating the next sign is a clitic, or a space, indicating that the next sign is part of a new word. [sent-58, score-0.213]
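
The three boundary markers described above (hyphen, equals sign, space) can be sketched as a small parser. This is only an illustration under assumptions: the function names `split_signs` and `parse_line` are hypothetical and do not come from the paper, and the input is assumed to be a clean transcription line without brackets.

```python
def split_signs(word):
    """Split one transcribed word into its base signs and its clitics.

    '=' separates the base word from each clitic; '-' separates signs
    inside the base word or inside a clitic.
    """
    base, *clitics = word.split("=")
    return base.split("-"), [clitic.split("-") for clitic in clitics]


def parse_line(line):
    """Split a transcription line on spaces (word boundaries), then on signs."""
    return [split_signs(word) for word in line.split()]


# Example using a short stretch from the sample fragment.
for base, clitics in parse_line("nu=kn A-NA KUR"):
    print(base, clitics)
```

Keeping clitics apart from the base word is a design choice for the sketch; a classifier that tokenizes on spaces alone would never need this level of detail.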

36 i s -t ar-ni=sum-m [ i ] ] x nu=kn ki-x [ [ ] KUR URUMi-i z -ri=y [ a [ i -t ar-ni ] =sum-mi e-e s -du [ s [ ] nu=kn A-NA KUR URUMi-i z -ri [ [ A-NA EGI ] R UDmi i -t ar-ni=su [ m-mi s This fragment, KUB XXI25, is very small and broken on both sides. [sent-60, score-0.035]

37 The areas between brackets are sections of the text broken off or effaced by erosion of tablet surface material. [sent-61, score-0.344]

38 Any text present between brackets has been inferred from context and transcriber experience with usual phrasing in Hittite. [sent-62, score-0.204]

39 In the last line, the sign EGIR, a Sumerian ideogram, was partially effaced but still recognizable to the transcriber, and so is split by a bracket. [sent-63, score-0.115]

40 5 Methods For this project, I used both Naive Bayes and Maximum Entropy classifiers as implemented by the MAchine Learning for LanguagE Toolkit, MALLET (McCallum, 2002). [sent-64, score-0.03]
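
To make the classification setup concrete, here is a minimal multinomial Naive Bayes with add-one smoothing, written from scratch. It is not MALLET's implementation and does not reproduce the paper's experiments. The class name `TinyNaiveBayes`, the labels `CTH-A` and `CTH-B`, and the toy token lists are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over token lists (illustrative sketch only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing constant

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training fragments, already tokenized.
train_docs = [["LUGAL", "KUR", "nu"], ["LUGAL", "KUR"],
              ["DINGIR", "SISKUR", "nu"], ["DINGIR", "nu"]]
train_labels = ["CTH-A", "CTH-A", "CTH-B", "CTH-B"]
model = TinyNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["KUR", "LUGAL"]))
```

In this framing each CTH-numbered text is a class and each fragment is a document to be assigned to one of them.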

41 In one, anything in brackets or partially remaining after brackets was removed, leaving only characters actually preserved on the fragment. [sent-66, score-0.352]

42 The other has all bracket characters removed, leaving all actual characters and all characters suggested by the transcribers. [sent-68, score-0.212]

43 By removing the brackets but leaving the suggested characters, I hoped to use the transcribers’ intuitions about Hittite texts to further improve the performance of both classifiers. [sent-70, score-0.297]
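
The two corpus variants described above can be sketched as two small text filters. This is an assumption-laden illustration: the function names are hypothetical, brackets are assumed to be written without internal padding, and the sample line is invented. The paper does not give its preprocessing code.

```python
import re


def preserved_only(line):
    """Keep only tokens untouched by restoration brackets.

    A token is dropped if it lies inside [...] or contains a bracket itself
    (a partially preserved word such as 'EGI]R').
    """
    kept, depth = [], 0
    for token in line.split():
        touched = depth > 0 or "[" in token or "]" in token
        depth = max(depth + token.count("[") - token.count("]"), 0)
        if not touched:
            kept.append(token)
    return " ".join(kept)


def brackets_removed(line):
    """Delete the bracket characters but keep the transcribers' restorations."""
    return " ".join(re.sub(r"[\[\]]", "", line).split())


sample = "nu=kn [A-NA EGI]R UD-mi"  # hypothetical line for illustration
print(preserved_only(sample))
print(brackets_removed(sample))
```

The first function drops partially preserved words entirely, which follows the description "partially remaining after brackets was removed"; a gentler variant could keep the preserved signs of such words instead.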

44 The tokens were defined only by spaces, capturing all words in the corpus. [sent-72, score-0.031]

45 The tokens were defined as a series of capital letters and punctuation marks, capturing only the Sumerian and Akkadian ideograms in the text, i. [sent-74, score-0.277]
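
The two tokenization schemes can be sketched as follows. The exact pattern used in the paper is not given, so the ideogram test here is an assumption: a space-delimited token counts as an ideogram when it has at least one cased character and no lower-case characters. The function names are hypothetical.

```python
def all_tokens(line):
    """Scheme 1: tokens are defined only by spaces."""
    return line.split()


def ideogram_tokens(line):
    """Scheme 2 (assumed rule): keep tokens made of capitals and punctuation.

    str.isupper() is True when a token has at least one cased character
    and none of them is lower case, so 'A-NA' passes and 'nu=kn' does not.
    """
    return [token for token in line.split() if token.isupper()]


sample = "nu=kn A-NA KUR e-es-du"  # hypothetical bracket-free line
print(all_tokens(sample))
print(ideogram_tokens(sample))
```

Because the filter works on whole tokens, a word that mixes an upper-case logogram with lower-case syllabic signs is excluded. Whether the original experiments treated such mixed words the same way is not stated in the source.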

46 The training and tests were all performed using MALLET’s standard algorithms, cross-validated, [Table 1: Results for Plain Corpus] [sent-78, score-0.037]

47 6 Results and Discussion Accuracy values from the classifiers using the Plain corpus, and from the corpus with the Brackets Removed, are presented in Tables 1 and 2, respectively. [sent-86, score-0.03]

48 The measures are raw accuracy, the fraction of the test fragments that the methods categorized correctly. [sent-87, score-0.282]
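
Raw accuracy as defined above is simply the share of test fragments given the correct label. A minimal sketch follows; the function name and the CTH labels in the example are made up for illustration.

```python
def raw_accuracy(predicted, gold):
    """Fraction of test fragments whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical labels for four test fragments.
predicted = ["CTH 1", "CTH 2", "CTH 1", "CTH 3"]
gold = ["CTH 1", "CTH 2", "CTH 2", "CTH 3"]
print(raw_accuracy(predicted, gold))  # 0.75
```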

49 The results for the Plain Corpus show that the Naive Bayes classifier was 55% accurate with all tokens, and 44% accurate with ideograms alone. [sent-88, score-0.303]

50 The Maximum Entropy classifier was 61% accurate with all tokens, and 51% accurate with ideograms only. [sent-89, score-0.303]

51 Both classifiers performed better with the Brackets Removed corpus. [sent-90, score-0.03]

52 The Naive Bayes classifier was accurate 64% of the time with all tokens and 49% of the time with ideograms only. [sent-91, score-0.292]

53 The Maximum Entropy classifier was 67% accurate with all tokens, and 54% accurate with ideograms only. [sent-92, score-0.303]

54 The predicted increase in accuracy using ideograms was not upheld by the above tests. [sent-93, score-0.219]

55 Some early tests suggested occasional excellent results for this tokenization scheme, including a single random 90-10 training/test run that showed a test accuracy of . [sent-95, score-0.06]

56 86, much higher than any larger cross-validated test included above. [sent-96, score-0.032]

57 This suggests, perhaps unsurprisingly, that the accuracy of classification using Sumerograms and Akkadograms is heavily dependent on the structure of the fragments in question. [sent-97, score-0.319]

58 Maximum Entropy classification proved to be slightly better, in every instance, than Naive Bayes classification, a fact that will prove useful in future tests and applications. [sent-98, score-0.074]

59 The fact that removing the brackets and including the transcribers’ additions improved the performance of all classifiers will likewise prove useful, since transcriptions of fragments are typically published with such bracketed additions. [sent-99, score-0.542]

60 It also seems to demonstrate the quality of these additions made by transcribers. [sent-100, score-0.038]

61 Overall, these tests suggest that in general, the ‘use-everything’ approach is better for accurate classification of Hittite tablet fragments with larger CTH texts. [sent-101, score-0.535]

62 However, in some cases, when the fragments in question have a large number of Sumerograms and Akkadograms, using them exclusively may be the right choice. [sent-102, score-0.282]

63 regarding tablet fragments as elements for connection by clustering algorithms, might work well. [sent-106, score-0.387]

64 Given the large number of small fragments now coming to light, this method could speed the process of text assembly considerably. [sent-107, score-0.304]

65 A new set of archives, recently discovered in the Hittite city of are only now beginning to see publication. [sent-108, score-0.031]

66 This site contains more than 3000 new Hittite tablet fragments, with excavations ongoing (Süel, 2002). [sent-109, score-0.139]

67 The jumbled nature of the dig site means that the process of assembling new texts from this site will be one of the major tasks for Hittite scholars in the near future. [sent-110, score-0.307]

68 , editor, Recent developments in Hittite archaeology and history: papers in memory of Hans G. [sent-147, score-0.077]

69 Applied bayesian and classical inference: The case of the federalist papers. [sent-170, score-0.044]

70 , editor, Recent developments in Hittite archaeology and history: papers in memory of Hans G. [sent-194, score-0.077]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hittite', 0.658), ('cuneiform', 0.351), ('fragments', 0.282), ('ideograms', 0.219), ('brackets', 0.138), ('sumerian', 0.132), ('akkadian', 0.11), ('tablet', 0.105), ('texts', 0.101), ('joins', 0.096), ('tablets', 0.096), ('akkadograms', 0.088), ('hoffner', 0.088), ('sumerograms', 0.088), ('naive', 0.073), ('sign', 0.071), ('kub', 0.066), ('kur', 0.066), ('melchert', 0.066), ('nuzi', 0.066), ('transcribers', 0.066), ('plain', 0.065), ('assembling', 0.057), ('clay', 0.057), ('ideogram', 0.057), ('signs', 0.057), ('scholars', 0.052), ('bayes', 0.051), ('copies', 0.049), ('cth', 0.046), ('removed', 0.045), ('mallet', 0.044), ('archaeology', 0.044), ('aslihan', 0.044), ('aus', 0.044), ('boghazk', 0.044), ('catalogued', 0.044), ('dewdney', 0.044), ('dhesi', 0.044), ('effaced', 0.044), ('federalist', 0.044), ('hethport', 0.044), ('horst', 0.044), ('hurrian', 0.044), ('ihope', 0.044), ('klengel', 0.044), ('mosteller', 0.044), ('simrit', 0.044), ('southeastcon', 0.044), ('syllabic', 0.044), ('tomokiyo', 0.044), ('transcriber', 0.044), ('uterbock', 0.044), ('yener', 0.044), ('zburg', 0.044), ('writing', 0.042), ('accurate', 0.042), ('characters', 0.041), ('history', 0.041), ('written', 0.04), ('fragment', 0.04), ('viewing', 0.038), ('photograph', 0.038), ('fragmented', 0.038), ('eastern', 0.038), ('additions', 0.038), ('tests', 0.037), ('classification', 0.037), ('broken', 0.035), ('leaving', 0.035), ('categorization', 0.035), ('interference', 0.035), ('hans', 0.035), ('ancient', 0.035), ('site', 0.034), ('harry', 0.033), ('developments', 0.033), ('larger', 0.032), ('discovered', 0.031), ('tokens', 0.031), ('craig', 0.031), ('bracket', 0.031), ('entropy', 0.03), ('classifiers', 0.03), ('project', 0.03), ('nu', 0.029), ('archives', 0.029), ('near', 0.029), ('transcriptions', 0.028), ('kn', 0.027), ('capital', 0.027), ('join', 0.027), ('classify', 0.026), ('published', 0.026), ('newspaper', 0.025), ('ent', 0.024), ('editor', 0.024), ('suggested', 0.023), ('smith', 0.022), 
('text', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts

Author: Stephen Tyndall

Abstract: This paper presents the problem within Hittite and Ancient Near Eastern studies of fragmented and damaged cuneiform texts, and proposes to use well-known text classification metrics, in combination with some facts about the structure of Hittite-language cuneiform texts, to help classify a number of fragments of clay cuneiform-script tablets into more complete texts. In particular, I propose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. The performance in some cases is improved, and in some cases very much not, suggesting that the variable frequency of occurrence of these ideograms in individual fragments makes considerable difference in the ideal choice for a classification method. Further, complexities of the writing system and the digital availability of Hittite texts complicate the problem.

2 0.11285498 154 acl-2012-Native Language Detection with Tree Substitution Grammars

Author: Benjamin Swanson ; Eugene Charniak

Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.

3 0.038892094 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

Author: Hiroshi Yamaguchi ; Kumiko Tanaka-Ishii

Abstract: The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.

4 0.035723098 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization

Author: Alessandro Moschitti ; Qi Ju ; Richard Johansson

Abstract: In this paper, we encode topic dependencies in hierarchical multi-label Text Categorization (TC) by means of rerankers. We represent reranking hypotheses with several innovative kernels considering both the structure of the hierarchy and the probability of nodes. Additionally, to better investigate the role of category relationships, we consider two interesting cases: (i) traditional schemes in which node-fathers include all the documents of their child-categories; and (ii) more general schemes, in which children can include documents not belonging to their fathers. The extensive experimentation on Reuters Corpus Volume 1 shows that our rerankers inject effective structural semantic dependencies in multi-classifiers and significantly outperform the state-of-the-art.

5 0.030588616 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing

Author: Hiroyuki Shindo ; Yusuke Miyao ; Akinori Fujino ; Masaaki Nagata

Abstract: We propose Symbol-Refined Tree Substitution Grammars (SR-TSGs) for syntactic parsing. An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.

6 0.029717535 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

7 0.027221246 37 acl-2012-Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

8 0.026650742 15 acl-2012-A Meta Learning Approach to Grammatical Error Correction

9 0.024474276 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing

10 0.02405699 31 acl-2012-Authorship Attribution with Author-aware Topic Models

11 0.023254484 92 acl-2012-FLOW: A First-Language-Oriented Writing Assistant System

12 0.023080396 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

13 0.022782398 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

14 0.02245694 26 acl-2012-Applications of GPC Rules and Character Structures in Games for Learning Chinese Characters

15 0.020810928 56 acl-2012-Computational Approaches to Sentence Completion

16 0.020380059 197 acl-2012-Tokenization: Returning to a Long Solved Problem A Survey, Contrastive Experiment, Recommendations, and Toolkit

17 0.019926189 78 acl-2012-Efficient Search for Transformation-based Inference

18 0.01979379 50 acl-2012-Collective Classification for Fine-grained Information Status

19 0.018455338 126 acl-2012-Labeling Documents with Timestamps: Learning from their Time Expressions

20 0.018158918 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.066), (1, 0.019), (2, -0.016), (3, -0.013), (4, -0.021), (5, 0.024), (6, 0.004), (7, 0.037), (8, -0.019), (9, -0.001), (10, -0.04), (11, -0.055), (12, -0.012), (13, 0.034), (14, 0.019), (15, -0.05), (16, 0.011), (17, -0.049), (18, 0.018), (19, 0.028), (20, -0.012), (21, 0.025), (22, 0.023), (23, -0.008), (24, 0.035), (25, 0.01), (26, -0.04), (27, 0.069), (28, 0.02), (29, -0.034), (30, 0.012), (31, 0.001), (32, 0.035), (33, -0.022), (34, 0.05), (35, -0.066), (36, 0.008), (37, 0.041), (38, 0.017), (39, 0.04), (40, 0.012), (41, 0.097), (42, -0.13), (43, 0.037), (44, 0.028), (45, 0.007), (46, 0.095), (47, 0.023), (48, -0.018), (49, -0.07)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.933451 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts

Author: Stephen Tyndall

Abstract: This paper presents the problem within Hittite and Ancient Near Eastern studies of fragmented and damaged cuneiform texts, and proposes to use well-known text classification metrics, in combination with some facts about the structure of Hittite-language cuneiform texts, to help classify a number of fragments of clay cuneiform-script tablets into more complete texts. In particular, I propose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. The performance in some cases is improved, and in some cases very much not, suggesting that the variable frequency of occurrence of these ideograms in individual fragments makes considerable difference in the ideal choice for a classification method. Further, complexities of the writing system and the digital availability of Hittite texts complicate the problem.

2 0.57291359 154 acl-2012-Native Language Detection with Tree Substitution Grammars

Author: Benjamin Swanson ; Eugene Charniak

Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.

3 0.54210287 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

Author: Hiroshi Yamaguchi ; Kumiko Tanaka-Ishii

Abstract: The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.

4 0.49058679 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

Author: Marco Lui ; Timothy Baldwin

Abstract: We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

5 0.42652538 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum

Author: Matt Garley ; Julia Hockenmaier

Abstract: We investigate how novel English-derived words (anglicisms) are used in a German-language Internet hip hop forum, and what factors contribute to their uptake.

6 0.41197512 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

7 0.4026618 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing

8 0.3874923 15 acl-2012-A Meta Learning Approach to Grammatical Error Correction

9 0.38003278 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

10 0.35994458 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese

11 0.35681325 190 acl-2012-Syntactic Stylometry for Deception Detection

12 0.34600037 26 acl-2012-Applications of GPC Rules and Character Structures in Games for Learning Chinese Characters

13 0.33603638 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing

14 0.33350798 37 acl-2012-Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

15 0.31401616 126 acl-2012-Labeling Documents with Timestamps: Learning from their Time Expressions

16 0.30836412 218 acl-2012-You Had Me at Hello: How Phrasing Affects Memorability

17 0.30258012 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content

18 0.29861811 197 acl-2012-Tokenization: Returning to a Long Solved Problem A Survey, Contrastive Experiment, Recommendations, and Toolkit

19 0.29592526 195 acl-2012-The Creation of a Corpus of English Metalanguage

20 0.29590479 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.026), (26, 0.032), (28, 0.039), (30, 0.02), (34, 0.46), (39, 0.029), (74, 0.031), (82, 0.028), (84, 0.025), (85, 0.012), (90, 0.06), (92, 0.073), (94, 0.012), (99, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80484778 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts

Author: Stephen Tyndall

Abstract: This paper presents the problem within Hittite and Ancient Near Eastern studies of fragmented and damaged cuneiform texts, and proposes to use well-known text classification metrics, in combination with some facts about the structure of Hittite-language cuneiform texts, to help classify a number of fragments of clay cuneiform-script tablets into more complete texts. In particular, I propose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. The performance in some cases is improved, and in some cases very much not, suggesting that the variable frequency of occurrence of these ideograms in individual fragments makes considerable difference in the ideal choice for a classification method. Further, complexities of the writing system and the digital availability of Hittite texts complicate the problem.

2 0.75528455 112 acl-2012-Humor as Circuits in Semantic Networks

Author: Igor Labutov ; Hod Lipson

Abstract: This work presents a first step to a general implementation of the Semantic-Script Theory of Humor (SSTH). Of the scarce amount of research in computational humor, no research had focused on humor generation beyond simple puns and punning riddles. We propose an algorithm for mining simple humorous scripts from a semantic network (ConceptNet) by specifically searching for dual scripts that jointly maximize overlap and incongruity metrics in line with Raskin’s Semantic-Script Theory of Humor. Initial results show that a more relaxed constraint of this form is capable of generating humor of deeper semantic content than wordplay riddles. We evaluate the said metrics through a user-assessed quality of the generated two-liners.

3 0.52723885 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

Author: Marco Lui ; Timothy Baldwin

Abstract: We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

4 0.37984854 191 acl-2012-Temporally Anchored Relation Extraction

Author: Guillermo Garrido ; Anselmo Penas ; Bernardo Cabaleiro ; Alvaro Rodrigo

Abstract: Although much work on relation extraction has aimed at obtaining static facts, many of the target relations are actually fluents, as their validity is naturally anchored to a certain time period. This paper proposes a methodological approach to temporally anchored relation extraction. Our proposal performs distant supervised learning to extract a set of relations from a natural language corpus, and anchors each of them to an interval of temporal validity, aggregating evidence from documents supporting the relation. We use a rich graph-based document-level representation to generate novel features for this task. Results show that our implementation for temporal anchoring is able to achieve 69% of the upper bound performance imposed by the relation extraction step. Compared to the state of the art, the overall system achieves the highest precision reported.

5 0.25212446 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

Author: Elif Yamangil ; Stuart Shieber

Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposed Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.

6 0.24767812 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

7 0.24580327 31 acl-2012-Authorship Attribution with Author-aware Topic Models

8 0.24531642 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

9 0.24419856 139 acl-2012-MIX Is Not a Tree-Adjoining Language

10 0.24029057 205 acl-2012-Tweet Recommendation with Graph Co-Ranking

11 0.23946916 154 acl-2012-Native Language Detection with Tree Substitution Grammars

12 0.23887119 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing

13 0.23727576 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation

14 0.23709382 167 acl-2012-QuickView: NLP-based Tweet Search

15 0.23606783 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning

16 0.23582272 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks

17 0.23207608 187 acl-2012-Subgroup Detection in Ideological Discussions

18 0.23198649 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

19 0.2319359 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition

20 0.23176938 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base