
40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers


Source: pdf

Author: Vipul Mittal

Abstract: In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string either encoded as a Unicode string or as a Roman transliterated string and the output is a set of possible splits with weights associated with each of them. We followed two different approaches to segment a Sanskrit text using sandhi rules extracted from a parallel corpus of manually sandhi-split text. While the first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word, the second approach generates all possible segmentations and validates each constituent using a morph analyzer.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. [sent-5, score-0.133]

2 The input for our segmentizer is a Sanskrit string either encoded as a Unicode string or as a Roman transliterated string and the output is a set of possible splits with weights associated with each of them. [sent-6, score-0.422]

3 We followed two different approaches to segment a Sanskrit text using sandhi rules extracted from a parallel corpus of manually sandhi-split text. [sent-7, score-0.9]

4 While the first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word, the second approach generates all possible segmentations and validates each constituent using a morph analyzer. [sent-8, score-0.367]

5 1 Introduction Sanskrit has a rich tradition of oral transmission of texts and this process causes the text to undergo euphonic changes at the word boundaries. [sent-9, score-0.394]

6 In oral transmission, the text is predominantly spoken as a continuous speech. [sent-10, score-0.101]

7 In the written form, because of the dominance of oral transmission, the text is written as a continuous string of letters rather than a sequence of words. [sent-15, score-0.234]

8 Typically when a word w1 is followed by a word w2, some terminal segment of w1 merges with some initial segment of w2 to be replaced by a “smoothed” phonetic interpolation, corresponding to minimizing the energy necessary to reconfigure the vocal organs at the juncture between the words. [sent-17, score-0.098]

9 ... long sequence of phonemes, with the word boundaries having undergone euphonic changes. [sent-18, score-0.228]

10 This makes it difficult to split a continuous string into words and process the text automatically. [sent-19, score-0.208]

11 Sanskrit words are mostly analyzed by building a finite state transducer (Beesley, 1998). [sent-20, score-0.147]

12 In the first approach, this transducer was modified by linking the final states to appropriate intermediate states incorporating the sandhi rules. [sent-21, score-0.798]

13 This approach then allows one to traverse the string from left to right and generate all and only the possible splits that are morphologically valid. [sent-22, score-0.232]

14 The second approach is very closely based on Optimality Theory (Prince and Smolensky, 1993), where we generate all the possible splits for a word and validate each using a morphological analyzer. [sent-23, score-0.172]

15 We use one of the fastest morphological analyzers available viz. [sent-24, score-0.13]

16 The splits that are not validated are pruned out. [sent-26, score-0.081]

17 Based on the number of times the first answer is correct, we achieved an accuracy of around 92% using the second approach, while the first approach performed with around 71% accuracy. [sent-27, score-0.09]

18 2 Issues involved in Sanskrit Processing The segmentizer is an important component of an NLP system. [sent-28, score-0.11]

19 So the problem of segmentation is basically twofold: (1) syllable segmentation followed by (2) word segmentation itself. [sent-33, score-0.194]

20 ... is segmented by predicting the word boundaries, where euphonic changes do not occur across the word boundaries and it is more like mere concatenation of words. [sent-39, score-0.293]

21 However, in Sanskrit, euphonic changes occur across word boundaries, leading to the addition and deletion of some original parts of the combining words. [sent-41, score-0.268]

22 These euphonic changes in Sanskrit introduce non-determinism in the segmentation. [sent-42, score-0.236]

23 In Sanskrit, by contrast, only the compounds involve a certain level of dependency analysis, while sandhi is just the gluing of words together, without the need for the words to be related semantically. [sent-45, score-0.708]

24 For example, consider the following part of a verse: San: nāradam vālmīkirmunipuṅgavam paripapraccha; gloss: to the Narada, Valmiki-the wisest among sages, asked; Eng: Valmiki asked the Narada, the wisest among the sages. [sent-46, score-0.142]

25 The words vālmīkiḥ and munipuṅgavam (wisest among the sages, an adjective of Narada) are not related semantically, but still undergo euphonic change and are glued together as vālmīkirmunipuṅgavam. [sent-48, score-0.275]

26 Here is an example, where a string māturājñāmparipālaya may be decomposed in two different ways after undergoing euphonic changes across word boundaries. [sent-50, score-0.313]

27 • mātuḥ ājñām paripālaya (obey the order of the mother) and, • mā āturājñām paripālaya (do not obey the order of the diseased). [sent-51, score-0.074]

28 There are special cases where the sandhied forms are not necessarily written together. [sent-52, score-0.138]

29 In such cases, the white space that physically marks the boundary of the words logically refers to a single sandhied form. [sent-53, score-0.165]

30 Thus, the white space is deceptive, and if treated as a word boundary, the morphological analyzer fails to recognize the word. [sent-54, score-0.19]

31 In this example, the space between śrutvā and ca represents a proper word boundary, and the word śrutvā is recognized by the morphological analyzer, whereas the space between nārado and vacaḥ does not mark a real word boundary. [sent-57, score-0.303]

32 In unsandhied form, it would be written as, San: śrutvā ca nāradaḥ vacaḥ. [sent-62, score-0.12]

33 gloss: after listening, and, Narada's, speech; Eng: And after listening to Narada's speech. The third factor aggravating Sanskrit segmentation is productive compound formation. [sent-65, score-0.133]

34 Unlike English, where either the components of a compound are written as distinct words or are separated by a hyphen, the components of compounds in Sanskrit are always written together. [sent-66, score-0.166]

35 Moreover, before these components are joined, they undergo euphonic changes. [sent-67, score-0.271]

36 The components of a compound typically do not carry inflection; in other words, they are bound morphemes used only in compounds. [sent-68, score-0.094]

37 Assuming that a sandhi handler for the sandhi involving spaces and a bound morpheme recognizer are available, we discuss the development of a sandhi splitter, or segmentizer, that splits a continuous string of letters into meaningful words. [sent-70, score-2.49]

38 We assume that the sandhi handler handling the sandhi involving spaces is available and it splits the above string as, śrutvā vacaḥ. [sent-74, score-1.57]

39 The sandhi splitter or segmentizer is supposed to split this into śrutvā ca etat triloka-jñaḥ. [sent-78, score-0.945]

40 This presupposes the availability of rules corresponding to euphonic changes and a good coverage morphological analyzer that can also analyze the bound morphemes in compounds. [sent-83, score-0.456]

41 A segmentizer for Sanskrit developed by Huet (2009) decorates the final states of its finite state transducer handling Sanskrit morphology with the possible sandhi rules. [sent-84, score-1.032]

42 However, it is still not clear how one can prioritize various splits with this approach. [sent-85, score-0.122]

43 Further, this system in its current state demands more work before its sandhi splitter can be used as a standalone system allowing different morphological analyzers to be plugged in. [sent-86, score-0.902]

44 With a variety of morphological analyzers being developed by various researchers, at times with complementary abilities, it would be worthwhile to experiment with various morphological analyzers for splitting a sandhied text. [sent-87, score-0.37]

45 Hence, we thought of exploring other alternatives and present two approaches, both of which assume the existence of a good coverage morphological analyzer. [sent-88, score-0.091]

46 3 Scoring Matrix As with any NLP system, the sandhi splitter being no exception, it is always desirable to produce the most likely output when a machine produces multiple outputs. [sent-90, score-0.757]

47 A parallel corpus of Sanskrit text in sandhied and sandhi-split form is being developed as a part of the Consortium project in India. [sent-92, score-0.867]

48 Around 100K words of such a parallel corpus are available, from which around 25,000 parallel strings of unsandhied and corresponding sandhied texts were extracted. [sent-94, score-0.242]

49 The same corpus was also used to extract a total of 2650 sandhi rules, including the cases of mere concatenation, along with the frequency distribution of these sandhi rules. [sent-95, score-1.414]

50 Each sandhi rule is a triple (x, y, z). [sent-96, score-0.726]

51 Here, y is the last letter of the first primitive, z is the first letter of the second primitive, and x is the letter sequence created by euphonic combination. [sent-104, score-0.28]

52 We define the estimated probability of the occurrence of a sandhi rule as follows: let $R_i$ denote the $i$-th rule, with $f_{R_i}$ as its frequency of occurrence in the manually split parallel text. [sent-105, score-0.876]

53 The probability of rule $R_i$ is: $P_{R_i} = \frac{f_{R_i}}{\sum_{j=1}^{n} f_{R_j}}$, where $n$ denotes the total number of sandhi rules found in the corpus. [sent-106, score-0.782]
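
Concretely, the rule probabilities are just relative frequencies. The following Python sketch is ours, not code from the paper; the rule inventory and counts are made up for illustration:

```python
from collections import Counter

# Illustrative frequency counts of sandhi rules (x, y, z), as they would
# be extracted from a manually split parallel corpus (numbers made up):
rule_freq = Counter({
    ("ya", "i", "a"): 310,   # i + a  -> ya
    ("A", "a", "a"): 540,    # a + a  -> A (long a, WX notation)
    ("o", "aH", "v"): 120,   # aH + v -> o v (illustrative encoding)
})

total = sum(rule_freq.values())                            # sum_j f_{R_j}
rule_prob = {r: f / total for r, f in rule_freq.items()}   # P(R_i) = f_{R_i} / total

print(round(rule_prob[("ya", "i", "a")], 3))               # 310/970 ~ 0.32
```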

54 Let a word be split into a candidate $S_j$ with $k$ constituents as $\langle c_1, c_2, \ldots, c_k \rangle$. [sent-107, score-0.165]

55 The constituents $c_1, \ldots, c_k$ are interdependent, since a different rule sequence will result in a different constituent sequence. [sent-116, score-0.103]

56 The weight of the split $S_j$ is defined as: $W_{S_j} = \prod_{x=1}^{k-1} \frac{P_{c_x} + P_{c_{x+1}}}{k} \cdot P_{R_x}$, where $P_{c_x}$ is the probability of occurrence of the word $c_x$ in the corpus. [sent-118, score-0.078]

57 The factor of $k$ was introduced to give more preference to a split with fewer segments over one with more segments. [sent-119, score-0.078]
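
A direct transcription of this weighting scheme, assuming the word and rule probabilities have already been estimated; the function and tables below are our illustration, and the placement of the factor k follows the formula as reconstructed above:

```python
def split_weight(constituents, rules_used, word_prob, rule_prob):
    """W(S_j) = prod_{x=1}^{k-1} ((P(c_x) + P(c_{x+1})) / k) * P(R_x),
    where k is the number of constituents."""
    k = len(constituents)
    w = 1.0
    for x in range(k - 1):
        p_pair = word_prob.get(constituents[x], 0.0) + word_prob.get(constituents[x + 1], 0.0)
        w *= (p_pair / k) * rule_prob.get(rules_used[x], 0.0)
    return w

word_prob = {"xaXi": 0.002, "awra": 0.004}   # illustrative corpus probabilities
rule_prob = {("ya", "i", "a"): 0.32}         # illustrative rule probability
print(split_weight(["xaXi", "awra"], [("ya", "i", "a")], word_prob, rule_prob))
# ((0.002 + 0.004) / 2) * 0.32 = 0.00096
```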

58 A word is traversed from left to right and is segmented by applying the first applicable rule, provided both the constituents are valid morphs. [sent-121, score-0.161]

59 5 Two Approaches We now present the two approaches we explored for sandhi splitting. [sent-124, score-0.679]

60 The first approach builds the FST using the OpenFst (Allauzen et al., 2007) toolkit, incorporating the sandhi rules in the FST itself, and traverses it to find the sandhi splittings. [sent-127, score-1.462]

61 We illustrate the augmentation of a sandhi rule with an example. [sent-128, score-0.726]

62 The initial FST without considering any sandhi rules is shown in Figure 1. [sent-130, score-0.735]

63 One of the sandhi rules states that i+a → ya, which will be represented as a triple (ya, i, a). [sent-142, score-0.792]

64 Applying the sandhi rule, we get: xaXi + awra → xaXyawra. [sent-143, score-0.734]
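
At the string level, applying such a triple is simple surgery at the juncture: drop y from the end of the first word, z from the start of the second, and join with x. The helper below is our illustration, not code from the paper (WX-encoded strings, as in the example above):

```python
def apply_sandhi(w1, w2, rule):
    """Join w1 and w2 using a sandhi rule triple (x, y, z), where y ends
    w1, z starts w2, and x is the result of their euphonic combination."""
    x, y, z = rule
    assert w1.endswith(y) and w2.startswith(z)
    return w1[: len(w1) - len(y)] + x + w2[len(z):]

# i + a -> ya: xaXi + awra -> xaXyawra (dadhi + atra -> dadhyatra)
print(apply_sandhi("xaXi", "awra", ("ya", "i", "a")))   # xaXyawra
```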

65 Here, a transition arc is added depicting the rule, which says that on receiving an input symbol ya at state 3, go to state 5 with an output i+a → ya. [sent-147, score-0.222]

66 Thus, we see that the original transducer gets modified with all possible transitions at the end of a final phoneme, and hence the number of transitions explodes, leading to a complex transducer. [sent-150, score-0.111]

67 The basic outline of the algorithm to split the given string into sub-strings is given as Algorithm 1 (To split a string into sub-strings). 1: Let the FST for morphology be f. [sent-151, score-0.377]

68 2: Add sandhi rules to the final states of f, linking them to the intermediary states, to get f′. [sent-152, score-0.793]

69 3: Traverse f′ to find all possible splits for a word. [sent-153, score-0.081]

70 If a sandhi rule is encountered, split the word and continue with the remaining part. [sent-154, score-0.804]
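
The traversal logic of Algorithm 1 can be emulated without a full transducer by letting a lexicon lookup stand in for reaching a final state of f; the sketch below is ours, with a toy lexicon and a single rule (WX notation as above), and it deliberately ignores the state-level details of the real FST:

```python
def segment(s, lexicon, rules, prefix="", out=None, parts=None):
    """Scan s left to right; whenever the letters just read match the
    result x of a rule (x, y, z) and the restored first word (prefix
    minus x, plus y) is in the lexicon, split and restart on z plus
    the unread remainder, as in step 3 of Algorithm 1."""
    if out is None:
        out, parts = [], []
    if not s:
        if prefix in lexicon:            # string exhausted in a "final state"
            out.append(parts + [prefix])
        return out
    prefix += s[0]
    for (x, y, z) in rules:              # try every sandhi arc at this point
        if prefix.endswith(x):
            w1 = prefix[: len(prefix) - len(x)] + y
            if w1 in lexicon:
                segment(z + s[1:], lexicon, rules, "", out, parts + [w1])
    return segment(s[1:], lexicon, rules, prefix, out, parts)

lexicon = {"xaXi", "awra"}               # toy stand-in for the morphology FST
rules = [("ya", "i", "a")]               # i + a -> ya
print(segment("xaXyawra", lexicon, rules))   # [['xaXi', 'awra']]
```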

71 The pseudo-code of the algorithm used to insert sandhi rules in the FST is given as Algorithm 2 (To insert sandhi rules in the FST). 1: I = input symbol; X = last character of the result of the rule. [sent-156, score-1.47]

72 In such cases, if the input string is not exhausted but the current state is a final state, we go back to the start state with the remaining string as the input. [sent-160, score-0.262]

73 The system was slow, consuming, on average, around 10 seconds per string of 15 letters. [sent-167, score-0.122]

74 With the increase in the number of sandhi rules, though the system's accuracy was better, the system slowed down further. [sent-169, score-0.679]

75 Moreover, this was tested only with the inflectional morphology of nouns. [sent-170, score-0.108]

76 The verb inflection morphology and the derivational morphology were not used at all. [sent-171, score-0.175]

77 2 Approach based on Optimality Theory Our second approach follows Optimality Theory (OT), which proposes that the observed forms of a language are a result of the interaction between conflicting constraints. [sent-175, score-0.097]

78 OT assumes that these components are universal and the grammars differ in the way they rank the universal constraint set, CON. [sent-182, score-0.08]

79 Thus a candidate A is optimal if it performs better than some other candidate B on a higher-ranking constraint, even if A has more violations of a lower-ranked constraint than B. [sent-185, score-0.139]

80 The GEN function produces every possible segmentation by applying the rules wherever applicable. [sent-186, score-0.112]

81 This might contain some insignificant words that will eventually be pruned out using the morphological analyser in the EVAL function, thus leaving the winning candidate. [sent-188, score-0.175]

82 Therefore, the approach followed is very closely based on optimality theory. [sent-189, score-0.123]

83 The morph analyser has no role in the generation of the candidates, only in their validation, thus forming the back-end of the segmentizer. [sent-190, score-0.097]

84 In original OT, the winning candidate need not satisfy all the constraints, but it must outperform all the other candidates on some higher-ranked constraint. [sent-191, score-0.14]

85 In our scenario, however, the winning candidate must satisfy all the constraints, and therefore there can be more than one winning candidate. [sent-192, score-0.141]

86 The constraints applied are: • C1: All the constituents of a split must be valid morphs. [sent-195, score-0.167]

87 • C2: Select the split with the maximum weight, as defined in section 3. [sent-196, score-0.078]

88 The basic outline of the algorithm is: 1: Recursively break a word at every possible position by applying a sandhi rule and generate all possible candidates for the input. [sent-197, score-0.753]

89 2: Pass the constituents of all the candidates through the morph analyzer. [sent-198, score-0.124]

90 3: Declare a candidate valid if all its constituents are recognized by the morphological analyzer. [sent-199, score-0.211]
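
A minimal sketch of this generate-and-validate loop, with a set-membership test standing in for the morphological analyzer and the same toy rule as before; all names and data are ours:

```python
def gen_candidates(s, rules):
    """GEN: yield every decomposition of s obtainable by undoing sandhi
    rules (x, y, z); assumes len(x) > len(z) so the recursion terminates."""
    yield [s]                            # the unsplit word is always a candidate
    for i in range(1, len(s)):
        for (x, y, z) in rules:
            if s[i:].startswith(x):
                left = s[:i] + y         # restore the first constituent
                for rest in gen_candidates(z + s[i + len(x):], rules):
                    yield [left] + rest

def eval_candidates(candidates, is_valid_morph):
    """EVAL (constraint C1): keep candidates whose constituents all pass
    the morph analyzer; C2 would then rank survivors by their weight."""
    return [c for c in candidates if all(is_valid_morph(w) for w in c)]

rules = [("ya", "i", "a")]                   # i + a -> ya
lexicon = {"xaXi", "awra"}                   # toy morph analyzer
print(eval_candidates(gen_candidates("xaXyawra", rules), lexicon.__contains__))
# [['xaXi', 'awra']]  ('xaXyawra' itself is pruned: not a valid morph here)
```

The surviving candidates would then be ranked by the weights of section 3, with the highest-weight split reported first.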

91 1 Results The current morphological analyzer can recognize around 140 million words. [sent-204, score-0.209]

92 Using the 2650 rules and the same test data used for the previous approach, we obtained the following results: • Almost 93% of the time, the highest ranked segmentation is correct. [sent-205, score-0.139]

93 • In almost 98% of the cases, the correct split was among the top 3 possible splits. [sent-206, score-0.078]

94 • The system takes around 0.04 seconds per string of 15 letters on average. [sent-208, score-0.077]

95 6 Conclusion We presented two methods to automatically segment a Sanskrit word into its morphologically valid constituents. [sent-211, score-0.095]

96 Though both the approaches outperformed the baseline system, the approach close to Optimality Theory gives better results both in terms of time consumption and segmentation quality. [sent-212, score-0.123]

97 This sandhi splitter being modular, wherein one can plug in a different morphological analyzer and a different set of sandhi rules, it can also be used for segmentation of other languages. [sent-215, score-1.678]

98 Future Work The major task would be to explore ways to shift rank 2 and rank 3 segmentations more towards rank 1. [sent-216, score-0.109]

99 The sandhi with white spaces also needs to be handled. [sent-218, score-0.73]

100 Building a wide coverage Sanskrit morphological analyzer: A practical approach. [sent-223, score-0.091]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sandhi', 0.679), ('sanskrit', 0.402), ('fst', 0.222), ('euphonic', 0.202), ('alm', 0.11), ('sandhied', 0.11), ('segmentizer', 0.11), ('xaxi', 0.11), ('optimality', 0.097), ('narada', 0.092), ('morphological', 0.091), ('splits', 0.081), ('splitter', 0.078), ('split', 0.078), ('string', 0.077), ('srutv', 0.073), ('analyzer', 0.073), ('morphology', 0.067), ('transducer', 0.061), ('rules', 0.056), ('segmentation', 0.056), ('constituents', 0.056), ('winning', 0.055), ('amba', 0.055), ('arado', 0.055), ('awra', 0.055), ('pcx', 0.055), ('rutv', 0.055), ('wisest', 0.055), ('state', 0.054), ('continuous', 0.053), ('oral', 0.048), ('thai', 0.048), ('traverse', 0.048), ('rule', 0.047), ('around', 0.045), ('transmission', 0.044), ('kulkarni', 0.041), ('prioritize', 0.041), ('inflection', 0.041), ('morph', 0.041), ('undergo', 0.041), ('analyzers', 0.039), ('ya', 0.037), ('alaya', 0.037), ('caitattrilokaj', 0.037), ('gavam', 0.037), ('haruechaiyasak', 0.037), ('kirmunipu', 0.037), ('parip', 0.037), ('recitation', 0.037), ('thang', 0.037), ('unsandhied', 0.037), ('segment', 0.036), ('changes', 0.034), ('valid', 0.033), ('huet', 0.032), ('roman', 0.032), ('sages', 0.032), ('vietnamese', 0.032), ('finite', 0.032), ('boundaries', 0.032), ('phoneme', 0.031), ('candidate', 0.031), ('transition', 0.03), ('compounds', 0.029), ('analyser', 0.029), ('eval', 0.029), ('handler', 0.029), ('kern', 0.029), ('states', 0.029), ('boundary', 0.029), ('components', 0.028), ('segmentations', 0.028), ('written', 0.028), ('allauzen', 0.028), ('openfst', 0.028), ('prince', 0.028), ('rank', 0.027), ('candidates', 0.027), ('ranked', 0.027), ('eng', 0.026), ('listening', 0.026), ('primitive', 0.026), ('undergone', 0.026), ('letter', 0.026), ('ot', 0.026), ('white', 0.026), ('morphologically', 0.026), ('followed', 0.026), ('theory', 0.026), ('compound', 0.025), ('segmented', 0.025), ('spaces', 0.025), ('transitions', 0.025), ('constraint', 0.025), ('agglutinative', 0.025), ('tradition', 0.025), ('verse', 0.025), ('parallel', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers

Author: Vipul Mittal

Abstract: In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string either encoded as a Unicode string or as a Roman transliterated string and the output is a set of possible splits with weights associated with each of them. We followed two different approaches to segment a Sanskrit text using sandhi rules extracted from a parallel corpus of manually sandhi-split text. While the first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word, the second approach generates all possible segmentations and validates each constituent using a morph analyzer.

2 0.068716384 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

Author: Coskun Mermer ; Ahmet Afsin Akin

Abstract: We tackle the previously unaddressed problem of unsupervised determination of the optimal morphological segmentation for statistical machine translation (SMT) and propose a segmentation metric that takes into account both sides of the SMT training corpus. We formulate the objective function as the posterior probability of the training corpus according to a generative segmentation-translation model. We describe how the IBM Model-1 translation likelihood can be computed incrementally between adjacent segmentation states for efficient computation. Submerging the proposed segmentation method in a SMT task from morphologically-rich Turkish to English does not exhibit the expected improvement in translation BLEU scores and confirms the robustness of phrase-based SMT to translation unit combinatorics. A positive outcome of this work is the described modification to the sequential search algorithm of Morfessor (Creutz and Lagus, 2007) that enables arbitrary-fold parallelization of the computation, which unexpectedly improves the translation performance as measured by BLEU.

3 0.060465563 221 acl-2010-Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish

Author: Reyyan Yeniterzi ; Kemal Oflazer

Abstract: We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures. Our approach relies on syntactic analysis on the source side (English) and then encodes a wide variety of local and non-local syntactic structures as complex structural tags which appear as additional factors in the training data. On the target side (Turkish), we only perform morphological analysis and disambiguation but treat the complete complex morphological tag as a factor, instead of separating morphemes. We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. Our maximal set of source and target side transformations, coupled with some additional techniques, provide an 39% relative improvement from a baseline 17.08 to 23.78 BLEU, all averaged over 10 training and test sets. Now that the syntactic analysis on the English side is available, we also experiment with more long distance constituent reordering to bring the English constituent order close to Turkish, but find that these transformations do not provide any additional consistent tangible gains when averaged over the 10 sets.

4 0.044094119 100 acl-2010-Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble

Author: Sebastian Spiegler ; Peter A. Flach

Abstract: This paper demonstrates that the use of ensemble methods and carefully calibrating the decision threshold can significantly improve the performance of machine learning methods for morphological word decomposition. We employ two algorithms which come from a family of generative probabilistic models. The models consider segment boundaries as hidden variables and include probabilities for letter transitions within segments. The advantage of this model family is that it can learn from small datasets and easily generalises to larger datasets. The first algorithm PROMODES, which participated in the Morpho Challenge 2009 (an international competition for unsupervised morphological analysis), employs a lower order model whereas the second algorithm PROMODES-H is a novel development of the first using a higher order model. We present the mathematical description for both algorithms, conduct experiments on the morphologically rich language Zulu and compare characteristics of both algorithms based on the experimental results.

5 0.043388959 97 acl-2010-Efficient Path Counting Transducers for Minimum Bayes-Risk Decoding of Statistical Machine Translation Lattices

Author: Graeme Blackwood ; Adria de Gispert ; William Byrne

Abstract: This paper presents an efficient implementation of linearised lattice minimum Bayes-risk decoding using weighted finite state transducers. We introduce transducers to efficiently count lattice paths containing n-grams and use these to gather the required statistics. We show that these procedures can be implemented exactly through simple transformations of word sequences to sequences of n-grams. This yields a novel implementation of lattice minimum Bayes-risk decoding which is fast and exact even for very large lattices.

6 0.042123705 234 acl-2010-The Use of Formal Language Models in the Typology of the Morphology of Amerindian Languages

7 0.041510064 170 acl-2010-Letter-Phoneme Alignment: An Exploration

8 0.040810134 217 acl-2010-String Extension Learning

9 0.040527668 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

10 0.039151799 154 acl-2010-Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm

11 0.0391283 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages

12 0.037516192 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery

13 0.034121912 95 acl-2010-Efficient Inference through Cascades of Weighted Tree Transducers

14 0.034079399 215 acl-2010-Speech-Driven Access to the Deep Web on Mobile Devices

15 0.033632163 213 acl-2010-Simultaneous Tokenization and Part-Of-Speech Tagging for Arabic without a Morphological Analyzer

16 0.033345062 169 acl-2010-Learning to Translate with Source and Target Syntax

17 0.031036727 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules

18 0.030917441 116 acl-2010-Finding Cognate Groups Using Phylogenies

19 0.030842213 16 acl-2010-A Statistical Model for Lost Language Decipherment

20 0.030765612 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.095), (1, -0.022), (2, 0.001), (3, -0.02), (4, -0.015), (5, -0.041), (6, 0.014), (7, 0.001), (8, 0.031), (9, 0.002), (10, -0.019), (11, 0.036), (12, 0.031), (13, -0.039), (14, -0.033), (15, -0.022), (16, -0.027), (17, 0.048), (18, 0.033), (19, 0.018), (20, -0.032), (21, -0.07), (22, -0.006), (23, -0.046), (24, 0.12), (25, 0.003), (26, -0.077), (27, -0.021), (28, -0.035), (29, 0.012), (30, -0.079), (31, 0.022), (32, -0.034), (33, -0.08), (34, -0.06), (35, -0.042), (36, 0.044), (37, 0.075), (38, -0.052), (39, 0.126), (40, -0.063), (41, 0.007), (42, -0.021), (43, -0.152), (44, 0.061), (45, -0.024), (46, 0.051), (47, -0.028), (48, 0.011), (49, -0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90377033 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers

Author: Vipul Mittal

Abstract: In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string either encoded as a Unicode string or as a Roman transliterated string and the output is a set of possible splits with weights associated with each of them. We followed two different approaches to segment a Sanskrit text using sandhi rules extracted from a parallel corpus of manually sandhi-split text. While the first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word, the second approach generates all possible segmentations and validates each constituent using a morph analyzer.

2 0.66105336 234 acl-2010-The Use of Formal Language Models in the Typology of the Morphology of Amerindian Languages

Author: Andres Osvaldo Porta

Abstract: The aim of this work is to present some preliminary results of an ongoing investigation on the typology of the morphology of the native South American languages from the point of view of formal language theory. With this object, we give two contrasting examples of descriptions of two Aboriginal languages' finite verb form morphology: Argentinean Quechua (quichua santiagueño) and Toba. The description of the morphology of the finite verb forms of Argentinean Quechua uses finite automata and finite transducers. In this case the construction is straightforward using two-level morphology and then describes in a very natural way the Argentinean Quechua morphology using a regular language. On the contrary, the Toba verb morphology, with a system that simultaneously uses prefixes and suffixes, does not have a natural description as a regular language. Toba has a complex system of causative suffixes, whose successive applications determine the use of prefixes belonging to different person-marking prefix sets. We adopt the solution of Creider et al. (1995) to naturally deal with this and other similar morphological processes which involve interactions between prefixes and suffixes, and then we describe the Toba morphology using linear context-free languages.

3 0.63510424 100 acl-2010-Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble

Author: Sebastian Spiegler ; Peter A. Flach

Abstract: This paper demonstrates that the use of ensemble methods and carefully calibrating the decision threshold can significantly improve the performance of machine learning methods for morphological word decomposition. We employ two algorithms which come from a family of generative probabilistic models. The models consider segment boundaries as hidden variables and include probabilities for letter transitions within segments. The advantage of this model family is that it can learn from small datasets and easily generalises to larger datasets. The first algorithm PROMODES, which participated in the Morpho Challenge 2009 (an international competition for unsupervised morphological analysis), employs a lower order model whereas the second algorithm PROMODES-H is a novel development of the first using a higher order model. We present the mathematical description for both algorithms, conduct experiments on the morphologically rich language Zulu and compare characteristics of both algorithms based on the experimental results.

4 0.55758995 16 acl-2010-A Statistical Model for Lost Language Decipherment

Author: Benjamin Snyder ; Regina Barzilay ; Kevin Knight

Abstract: In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and highlevel morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew.

5 0.53735447 221 acl-2010-Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish

Author: Reyyan Yeniterzi ; Kemal Oflazer

Abstract: We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures. Our approach relies on syntactic analysis on the source side (English) and then encodes a wide variety of local and non-local syntactic structures as complex structural tags which appear as additional factors in the training data. On the target side (Turkish), we only perform morphological analysis and disambiguation but treat the complete complex morphological tag as a factor, instead of separating morphemes. We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. Our maximal set of source and target side transformations, coupled with some additional techniques, provide an 39% relative improvement from a baseline 17.08 to 23.78 BLEU, all averaged over 10 training and test sets. Now that the syntactic analysis on the English side is available, we also experiment with more long distance constituent reordering to bring the English constituent order close to Turkish, but find that these transformations do not provide any additional consistent tangible gains when averaged over the 10 sets.

6 0.45066029 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

7 0.44647643 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

8 0.41682926 116 acl-2010-Finding Cognate Groups Using Phylogenies

9 0.41632158 186 acl-2010-Optimal Rank Reduction for Linear Context-Free Rewriting Systems with Fan-Out Two

10 0.40115899 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers

11 0.39221466 154 acl-2010-Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm

12 0.36886826 137 acl-2010-How Spoken Language Corpora Can Refine Current Speech Motor Training Methodologies

13 0.35500175 61 acl-2010-Combining Data and Mathematical Models of Language Change

14 0.35424581 92 acl-2010-Don't 'Have a Clue'? Unsupervised Co-Learning of Downward-Entailing Operators.

15 0.344978 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

16 0.33429 68 acl-2010-Conditional Random Fields for Word Hyphenation

17 0.33423799 119 acl-2010-Fixed Length Word Suffix for Factored Statistical Machine Translation

18 0.3339611 95 acl-2010-Efficient Inference through Cascades of Weighted Tree Transducers

19 0.3267414 67 acl-2010-Computing Weakest Readings

20 0.31840995 222 acl-2010-SystemT: An Algebraic Approach to Declarative Information Extraction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.024), (25, 0.061), (33, 0.017), (35, 0.372), (39, 0.015), (42, 0.012), (44, 0.012), (53, 0.011), (59, 0.072), (73, 0.032), (76, 0.013), (78, 0.03), (80, 0.013), (83, 0.059), (84, 0.032), (98, 0.114)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.76910698 58 acl-2010-Classification of Feedback Expressions in Multimodal Data

Author: Costanza Navarretta ; Patrizia Paggio

Abstract: This paper addresses the issue of how linguistic feedback expressions, prosody and head gestures, i.e. head movements and face expressions, relate to one another in a collection of eight video-recorded Danish map-task dialogues. The study shows that in these data, prosodic features and head gestures significantly improve automatic classification of dialogue act labels for linguistic expressions of feedback.

same-paper 2 0.73047173 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers

Author: Vipul Mittal

Abstract: In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string either encoded as a Unicode string or as a Roman transliterated string and the output is a set of possible splits with weights associated with each of them. We followed two different approaches to segment a Sanskrit text using sandhi rules extracted from a parallel corpus of manually sandhi-split text. While the first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word, the second approach generates all possible segmentations and validates each constituent using a morph analyzer.

3 0.61019069 130 acl-2010-Hard Constraints for Grammatical Function Labelling

Author: Wolfgang Seeker ; Ines Rehbein ; Jonas Kuhn ; Josef Van Genabith

Abstract: For languages with (semi-) free word order (such as German), labelling grammatical functions on top of phrase-structural constituent analyses is crucial for making them interpretable. Unfortunately, most statistical classifiers consider only local information for function labelling and fail to capture important restrictions on the distribution of core argument functions such as subject, object etc., namely that there is at most one subject (etc.) per clause. We augment a statistical classifier with an integer linear program imposing hard linguistic constraints on the solution space output by the classifier, capturing global distributional restrictions. We show that this improves labelling quality, in particular for argument grammatical functions, in an intrinsic evaluation, and, importantly, grammar coverage for treebankbased (Lexical-Functional) grammar acquisition and parsing, in an extrinsic evaluation.

4 0.60024929 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning

Author: Joseph Turian ; Lev-Arie Ratinov ; Yoshua Bengio

Abstract: If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining different word representations. You can download our word features, for off-the-shelf use in existing NLP systems, as well as our code, here: http://metaoptimize.com/projects/wordreprs/

5 0.41108105 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

Author: Mohit Bansal ; Dan Klein

Abstract: We present a simple but accurate parser which exploits both large tree fragments and symbol refinement. We parse with all fragments of the training set, in contrast to much recent work on tree selection in data-oriented parsing and tree-substitution grammar learning. We require only simple, deterministic grammar symbol refinement, in contrast to recent work on latent symbol refinement. Moreover, our parser requires no explicit lexicon machinery, instead parsing input sentences as character streams. Despite its simplicity, our parser achieves accuracies of over 88% F1 on the standard English WSJ task, which is competitive with substantially more complicated state-of-the-art lexicalized and latent-variable parsers. Additional specific contributions center on making implicit all-fragments parsing efficient, including a coarse-to-fine inference scheme and a new graph encoding.

6 0.40623963 71 acl-2010-Convolution Kernel over Packed Parse Forest

7 0.40448695 169 acl-2010-Learning to Translate with Source and Target Syntax

8 0.40443456 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

9 0.4036811 93 acl-2010-Dynamic Programming for Linear-Time Incremental Parsing

10 0.40311688 162 acl-2010-Learning Common Grammar from Multilingual Corpus

11 0.40192765 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

12 0.40138581 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

13 0.40103072 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons

14 0.39888522 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition

15 0.39882141 133 acl-2010-Hierarchical Search for Word Alignment

16 0.39874092 65 acl-2010-Complexity Metrics in an Incremental Right-Corner Parser

17 0.39841706 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

18 0.39833838 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization

19 0.39820969 248 acl-2010-Unsupervised Ontology Induction from Text

20 0.3981663 53 acl-2010-Blocked Inference in Bayesian Tree Substitution Grammars