acl acl2010 acl2010-68 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nikolaos Trogkanis ; Charles Elkan
Abstract: Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TEX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application of conditional random fields. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.
Reference: text
sentIndex sentText sentNum sentScore
1 Finding allowable places in words to insert hyphens is an important practical problem. [sent-2, score-0.227]
2 This method is the TEX hyphenation algorithm of Knuth and Liang. [sent-4, score-0.621]
3 We present here a hyphenation method that is clearly more accurate. [sent-5, score-0.621]
4 We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. [sent-7, score-0.134]
5 Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages. [sent-10, score-0.205]
6 Another way of stating the task is that we want to learn to predict for each letter in a word whether or not it is permissible for the letter to be followed by a hyphen. [sent-13, score-0.11]
7 This means that we tag each letter with either 1, for hyphen allowed following this letter, or 0, for hyphen not allowed after this letter. [sent-14, score-0.551]
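To make the 0/1 labeling concrete, here is a minimal sketch (Python; the helper name and dictionary format are illustrative assumptions, not the paper's code) that converts a hyphen-marked dictionary form into the per-letter tag sequence just described:

```python
def word_to_tags(hyphenated):
    """Convert a dictionary form such as 'hy-phen-ate' into
    (letters, tags), where tags[i] = 1 iff a hyphen is allowed
    after letter i."""
    letters, tags = [], []
    for ch in hyphenated:
        if ch == '-':
            tags[-1] = 1          # hyphen allowed after the previous letter
        else:
            letters.append(ch)
            tags.append(0)        # default: no hyphen after this letter
    return ''.join(letters), tags

# word_to_tags('hy-phen-ate') -> ('hyphenate', [0, 1, 0, 0, 0, 1, 0, 0, 0])
```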
8 The hyphenation task is also called orthographic syllabification (Bartlett et al., 2008). [sent-15, score-0.638]
9 The goal in performing hyphenation is to predict a sequence of 0/1 values as a function of a sequence of input characters. [sent-24, score-0.676]
10 Research on structured learning has been highly successful, with sequence classification as its most important and successful subfield, and with conditional random fields (CRFs) as the most influential approach to learning sequence classifiers. [sent-35, score-0.132]
11 We show that CRFs can achieve extremely good performance on the hyphenation task. [sent-38, score-0.598]
12 2 History of automated hyphenation The earliest software for automatic hyphenation was implemented for RCA 301 computers, and used by the Palm Beach Post-Tribune and Los Angeles Times newspapers in 1962. [sent-39, score-1.272]
13 The Florida system had a dictionary of 30,000 words; words not in the dictionary were hyphenated after the third, fifth, or seventh letter, because the authors observed that this was correct for many words. [sent-41, score-0.211]
14 The earliest hyphenation software for a language other than English may have been a rule-based program for Finnish first used in 1964 (Jarvi, 2009). [sent-43, score-0.703]
15 The first formal description of an algorithm for hyphenation was in a patent application submitted in 1964 (Damerau, 1964). [sent-44, score-0.622]
16 The hyphenation algorithm that is by far the most widely used now is due to Liang (Liang, 1983). [sent-46, score-0.598]
17 Over the years, various machine learning methods have been applied to the hyphenation task. [sent-50, score-0.598]
18 The lowest per-letter test error rate reported is about 2%. [sent-54, score-0.2]
19 The highest accuracy achieved until now for the hyphenation task is that of (Bartlett et al., 2008). [sent-57, score-0.598]
20 (Bartlett et al., 2008) do not address the issue that false positive hyphens are worse mistakes than false negative hyphens, which we address below. [sent-64, score-0.361]
21 Methods inspired by nonstatistical natural language processing research have also been proposed for the hyphenation task, in particular (Bouma, 2003; Tsalidis et al. [sent-69, score-0.598]
22 Moreover, our experimental results below show that the commercial software of (Woestenburg, 2006) allows hyphens incorrectly almost three times more often than TEX. [sent-72, score-0.276]
23 In general, a dictionary-based approach has zero errors for words in the dictionary, but fails to work for words not included in it. [sent-73, score-0.09]
24 Furthermore, for languages such as English, where hyphenation does not systematically follow general rules, such an approach does not yield good results. [sent-75, score-0.598]
25 A pattern-learning approach, like that of TEX, infers patterns from a training list of hyphenated words, and then uses these patterns to hyphenate text. [sent-76, score-0.256]
26 Although useful patterns are learned automatically, both the TEX learning algorithm and the learned patterns must be hand-tuned to perform well (Liang, 1983). [sent-77, score-0.09]
27 Liang’s method is implemented in a program named PATGEN, which takes as input a training set of hyphenated words, and outputs a collection of interacting hyphenation patterns. [sent-78, score-0.75]
28 The precise details of how different versions of TEX and LATEX use these pattern collections to do hyphenation in practice are unclear. [sent-83, score-0.659]
29 At a minimum, current variants of TEX improve hyphenation accuracy by disallowing hyphens in the first and last two or three letters of every word, regardless of what the PATGEN patterns recommend. [sent-84, score-0.89]
30 Despite the success of Liang’s method, incorrect hyphenations remain an issue with TEX and its current variants and competitors. [sent-85, score-0.149]
31 For instance, incorrect hyphenations are common in the Wall Street Journal, which has the highest circulation of any newspaper in the U.S. [sent-86, score-0.149]
32 Specifically, let $\bar{x}$ be a sequence of n letters and let $\bar{y}$ be a corresponding sequence of n tags. [sent-92, score-0.152]
33 Each such function $F_j$ is a sum along the output sequence for $i = 1$ to $i = n$: $F_j(\bar{x}, \bar{y}) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, \bar{x}, i)$, where each function $f_j$ is a 0/1 indicator function that picks out specific values for the neighboring tags $y_{i-1}$ and $y_i$ and a particular substring of $\bar{x}$. [sent-95, score-0.221]
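For reference, these feature sums enter the standard linear-chain CRF distribution (this is the textbook form, with weights $\lambda_j$ and normalizer $Z(\bar{x})$, which the paper is assumed to use):

```latex
p(\bar{y} \mid \bar{x}) = \frac{1}{Z(\bar{x})}
  \exp\Big( \sum_{j} \lambda_j F_j(\bar{x}, \bar{y}) \Big),
\qquad
Z(\bar{x}) = \sum_{\bar{y}'} \exp\Big( \sum_{j} \lambda_j F_j(\bar{x}, \bar{y}') \Big)
```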
34 The software we use as an implementation of conditional random fields is named CRF++ (Kudo, 2007). [sent-98, score-0.109]
35 We define indicator functions fj that depend on substrings of the input word, and on whether or not a hyphen is legal after the current and/or the previous letter. [sent-101, score-0.423]
36 The substrings are of length 2 to 5, covering up to 4 letters to the left and right of the current letter. [sent-102, score-0.113]
37 Then exactly two functions $f_j$ that depend on substrings of length 2 have value 1: $I(y_{i-1} = 1 \text{ and } y_i = 0 \text{ and } x_2x_3 = \text{yp}) = 1$ and $I(y_{i-1} = 1 \text{ and } y_i = 0 \text{ and } x_3x_4 = \text{ph}) = 1$. [sent-107, score-0.248]
38 All other similar functions have value 0: $I(y_{i-1} = 1 \text{ and } y_i = 1 \text{ and } x_2x_3 = \text{yp}) = 0$, $I(y_{i-1} = 1 \text{ and } y_i = 0 \text{ and } x_2x_3 = \text{yq}) = 0$, and so on. [sent-108, score-0.157]
39 There are similar indicator functions for substrings up to length 5. [sent-109, score-0.099]
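A minimal sketch of how the substring contexts behind these indicator functions could be enumerated (Python; illustrative only, since this summary does not reproduce the authors' actual CRF++ feature templates):

```python
def substring_features(word, i, max_len=5):
    """Enumerate the substrings of length 2..max_len of `word` that
    overlap position i, i.e. windows covering up to 4 letters to the
    left and right of the current letter.  Each (offset, substring)
    pair, combined with tag values (y_{i-1}, y_i), defines one 0/1
    indicator function f_j."""
    feats = []
    for length in range(2, max_len + 1):
        for start in range(i - length + 1, i + 1):
            if start >= 0 and start + length <= len(word):
                feats.append((start - i, word[start:start + length]))
    return feats

# substring_features('hyphenate', 2)[:2] -> [(-1, 'yp'), (0, 'ph')]
```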
40 When evaluating the performance of a hyphenation algorithm, one should not just count how many words are hyphenated in exactly the same way as in a reference dictionary. [sent-114, score-0.721]
41 One should also measure separately how many legal hyphens are actually predicted, versus how many predicted hyphens are in fact not legal. [sent-115, score-0.391]
42 For any hyphenation method, a false positive hyphen is a more serious mistake than a false negative hyphen. [sent-117, score-1.167]
43 That is, a false negative is a hyphen allowed by the lexicon that the method fails to identify. [sent-119, score-0.271]
44 The standard Viterbi algorithm for making predictions from a trained CRF is not tuned to minimize false positives. [sent-120, score-0.094]
45 To address this difficulty, we use the forward-backward algorithm (Sha and Pereira, 2003; Culotta and McCallum, 2004) to estimate separately for each position the probability of a hyphen at that position. [sent-121, score-0.248]
46 Then, we only allow a hyphen if this probability is over a high threshold such as 0.9. [sent-122, score-0.3]
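A sketch of this decision rule, assuming the per-position marginals have already been obtained from the forward-backward algorithm (the `marginals` input is a hypothetical representation, not the CRF++ API):

```python
def predict_hyphens(marginals, threshold=0.9):
    """marginals[i] = estimated P(hyphen allowed after letter i | word).
    Allowing a hyphen only where this probability clears a high
    threshold trades a few missed hyphens for far fewer false
    positives."""
    return [1 if p > threshold else 0 for p in marginals]
```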
47 Each hyphenation corresponds to one path through a graph that defines all $2^{k-1}$ hyphenations that are possible for a word of length k. [sent-124, score-0.747]
48 The overall probability of a hyphen at any given location is the sum of the weights of all paths that do have a hyphen at this position, divided by the sum of the weights of all paths. [sent-125, score-0.523]
49 In order to compute the weight of all paths that contain a hyphen at a specific location, weight 0 is assigned to all paths that do not have a hyphen at this location. [sent-127, score-0.5]
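In symbols, writing $w(\bar{y})$ for the (unnormalized) weight of a complete tag path $\bar{y}$, the quantity described here is the standard marginal:

```latex
P(y_i = 1 \mid \bar{x}) =
\frac{\sum_{\bar{y} : y_i = 1} w(\bar{y})}
     {\sum_{\bar{y}} w(\bar{y})}
```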
50 We download all English word forms with legal hyphenation points indicated by hyphens. [sent-131, score-0.643]
51 This fact implies that no algorithm that operates on words in isolation can be a complete solution for the hyphenation task. [sent-139, score-0.621]
52 We exclude the few words that have two or more different hyphenations from the dataset. [sent-140, score-0.172]
53 The Dutch dataset of 293,681 words is created following the same procedure as for the English dataset, except that all entries from CELEX that are compound words containing dashes are discarded instead of being split into parts, since many of these are not in fact Dutch words. [sent-147, score-0.133]
54 In order to measure accuracy, we compute the confusion matrix for each method, and from this we compute error rates. [sent-149, score-0.098]
55 The word-level error rate is the fraction of words on which a method makes at least one mistake. [sent-151, score-0.246]
56 The letter-level error rate is the fraction of letters for which the method predicts incorrectly whether or not a hyphen is legal after this letter. [sent-152, score-0.569]
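A minimal sketch of these two measures as defined above (Python; assumes gold and predicted 0/1 tag sequences for each word, with illustrative function and variable names):

```python
def error_rates(gold, pred):
    """gold, pred: parallel lists of per-word 0/1 tag sequences.
    Returns (word_level, letter_level) error rates as defined
    in the text."""
    word_errors = sum(1 for g, p in zip(gold, pred) if g != p)
    letter_errors = sum(
        sum(1 for gt, pt in zip(g, p) if gt != pt)
        for g, p in zip(gold, pred)
    )
    total_letters = sum(len(g) for g in gold)
    return word_errors / len(gold), letter_errors / total_letters
```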
57 We evaluate this algorithm on our entire English and Dutch datasets using the appropriate language pattern files, and not allowing a hyphen to be placed within the first \lefthyphenmin and last \righthyphenmin letters of each word. [sent-158, score-0.341]
58 The single word with more than two alternative hyphenations is “invalid”, whose three hyphenations are in-va-lid, in-val-id, and in-valid. [sent-159, score-0.298]
59 Interestingly, the Merriam–Webster online dictionary also gives three hyphenations for this word, but not the same ones: in-va-lid, in-val-id, and inval-id. [sent-160, score-0.193]
60 The hyphenation patterns used by TeXHyphenator, which are those currently used by essentially all variants of TEX, may not be optimal for our new English and Dutch datasets. [sent-172, score-0.643]
61 Specifically, we create a pattern file from 90% of the dataset using PATGEN, and then hyphenate the remaining 10% of the dataset using Liang’s algorithm and the learned pattern file. [sent-175, score-0.258]
62 We also disallow hyphens in the first two letters of every word, and in the last three letters for English or the last two for Dutch. [sent-181, score-0.321]
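This boundary constraint amounts to a simple post-processing mask over the predicted tags (a sketch; the parameters mirror TEX's \lefthyphenmin and \righthyphenmin):

```python
def mask_boundaries(tags, left_min=2, right_min=3):
    """Force 'no hyphen' predictions so that at least left_min letters
    precede the first break and at least right_min letters follow the
    last one (right_min=3 for English, 2 for Dutch)."""
    masked = list(tags)
    for i in range(min(left_min - 1, len(masked))):
        masked[i] = 0                      # break too close to the start
    for i in range(max(0, len(masked) - right_min), len(masked)):
        masked[i] = 0                      # break too close to the end
    return masked
```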
63 We also evaluate the TALO commercial software (Woestenburg, 2006). [sent-182, score-0.103]
64 We know of one other commercial hyphenation application, which is named Dashes. [sent-183, score-0.646]
65 Figure 1 shows how the error rate is affected by increasing the CRF probability threshold for each language. [sent-195, score-0.273]
66 Figure 1 shows confidence intervals for the error rates. [sent-196, score-0.098]
67 All differences between rows in Table 2 are significant, with one exception: the serious error rates for PATGEN and TALO are not statistically significantly different. [sent-207, score-0.288]
68 For the English language, the CRF using the Viterbi path has overall error rate of 0. [sent-209, score-0.225]
69 However, the serious error rate for the CRF is less good: 0. [sent-212, score-0.354]
70 This weakness is remedied by predicting that a hyphen is allowable only if it has high probability. [sent-215, score-0.258]
71 The threshold can be set as high as 0.99, and the CRF still has a lower overall error rate than the TEX algorithm. [sent-217, score-0.225]
72 Figure 1: Total letter-level error rate and serious letter-level error rate for different values of threshold for the CRF. [sent-224, score-0.606]
73 Table 4: Performance on the English dataset (10-fold cross-validation, dividing by stem). [sent-256, score-0.117]
74 Table 5: Performance on the Dutch dataset (10-fold cross-validation, dividing by stem). [sent-257, score-0.117]
75 For the English language, TALO yields overall error rate 1. [sent-258, score-0.225]
76 The threshold can be set as high as 0.94, and the CRF still has a lower overall error rate than TALO. [sent-263, score-0.225]
77 Using this threshold, the CRF serious error rate is 0. [sent-264, score-0.354]
78 For the Dutch language, the standard CRF using the Viterbi path has overall error rate 0. [sent-267, score-0.225]
79 A threshold of 0.99 or below yields lower error rates than the TEX algorithm. [sent-274, score-0.134]
80 For the Dutch language, the TALO method has overall error rate 0. [sent-278, score-0.248]
81 The threshold can be set as high as 0.98, and the CRF still gives a lower overall error rate than TALO. [sent-284, score-0.225]
82 006% (206 false positives); in comparison the serious error rate of TALO is 0. [sent-287, score-0.448]
83 For both languages, PATGEN has higher serious letter-level and word-level error rates than TEX using the existing pattern files. [sent-289, score-0.328]
84 Initially, Liang optimized this pattern collection extensively by upweighting the most common words and by iteratively adding exception words found by testing the algorithm against a large dictionary from an unknown publisher (Liang, 1983). [sent-292, score-0.13]
85 One can tune PATGEN to yield either a better overall error rate or a better serious error rate, but not both simultaneously, compared to the TEX algorithm using the existing pattern files for both languages. [sent-293, score-0.415]
86 For the English dataset, if we use Liang’s parameters for PATGEN as reported in (Sojka and Sevecek, 1995), we obtain overall error rate of 6. [sent-294, score-0.225]
87 It is possible that the specific patterns used in TEX implementations today have been tuned by hand to be better than anything the PATGEN software is capable of. [sent-297, score-0.1]
88 7 Additional experiments This section presents empirical results following two experimental designs that are less standard, but that may be more appropriate for the hyphenation task. [sent-298, score-0.598]
89 The 65,828 English words in our dictionary produce 27,100 unique stems, while the 293,681 Dutch words produce 169,693 unique stems. [sent-303, score-0.09]
90 From the most frequent words in this list, if we omit the words not in our dataset of 89,019 hyphenated English words from CELEX, we get 4,000 words. [sent-319, score-0.225]
91 We evaluate the following methods on the 4,000 words: Liang’s method using the American patterns file hyphen.tex, [sent-323, score-0.295]
92 Liang’s method using the patterns derived from PATGEN when trained on the whole English dataset, our CRF trained on the whole English dataset, and the same CRF with a probability threshold of 0.9. [sent-324, score-0.141]
93 In summary, TEX and PATGEN make serious errors on 43 and 113 of the 4000 words, respectively. [sent-327, score-0.154]
94 At a threshold of 0.9, the CRF approach makes zero serious errors on these words. [sent-329, score-0.154]
95 In TEX, hyphenation patterns are stored in a data structure that is a variant of a trie. [sent-337, score-0.643]
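For contrast, here is a compact sketch of how Liang-style patterns are applied at hyphenation time (the standard algorithm; a plain dict stands in for the trie, and the example pattern is illustrative, not taken from the real hyphen.tex):

```python
def liang_hyphenate(word, patterns):
    """patterns maps a letter string to (position, digit) pairs; e.g.
    Liang's pattern 'hy3ph' is stored as patterns['hyph'] = [(2, 3)].
    Odd digits allow a hyphen, even digits forbid one, and the maximum
    digit at each position wins (Liang, 1983)."""
    w = '.' + word.lower() + '.'          # '.' marks the word boundaries
    scores = [0] * (len(w) + 1)           # one score per inter-letter gap
    for start in range(len(w)):
        for end in range(start + 1, len(w) + 1):
            for pos, digit in patterns.get(w[start:end], []):
                scores[start + pos] = max(scores[start + pos], digit)
    # Gap i in w is gap i-1 in the bare word (between word[g-1] and word[g]).
    return [i - 1 for i, s in enumerate(scores)
            if s % 2 == 1 and 1 <= i - 1 < len(word)]

# liang_hyphenate('hyphenate', {'hyph': [(2, 3)]}) -> [2]   ('hy-phenate')
```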
96 The CRF software uses other data structures and optimizations that allow a word to be hyphenated in time that is almost independent of the number of feature functions used. [sent-338, score-0.155]
97 9 Conclusions Finding allowable places in words to insert hyphens is a real-world problem that is still not fully solved in practice. [sent-339, score-0.227]
98 The main contribution of this paper is a hyphenation method that is clearly more accurate than the currently used Knuth/Liang method. [sent-340, score-0.621]
99 We hope that the method proposed here is adopted in practice, since the number of serious errors that it makes is about a sixfold improvement over what is currently in use. [sent-346, score-0.177]
100 A second contribution of this paper is to provide training sets for hyphenation in English and Dutch, so other researchers can, we hope, soon invent even more accurate methods. [sent-347, score-0.598]
wordName wordTfidf (topN-words)
[('hyphenation', 0.598), ('tex', 0.36), ('crf', 0.256), ('patgen', 0.232), ('hyphen', 0.227), ('hyphens', 0.173), ('serious', 0.154), ('hyphenations', 0.149), ('dutch', 0.147), ('talo', 0.133), ('rate', 0.102), ('hyphenated', 0.1), ('error', 0.098), ('liang', 0.096), ('false', 0.094), ('letters', 0.074), ('hyphenate', 0.066), ('woestenburg', 0.066), ('yi', 0.066), ('bartlett', 0.062), ('positives', 0.059), ('dataset', 0.056), ('letter', 0.055), ('software', 0.055), ('fj', 0.052), ('threshold', 0.052), ('english', 0.051), ('celex', 0.05), ('hyphenating', 0.05), ('tugboat', 0.05), ('commercial', 0.048), ('patterns', 0.045), ('legal', 0.045), ('dictionary', 0.044), ('syllabification', 0.04), ('pattern', 0.04), ('substrings', 0.039), ('sequence', 0.039), ('rates', 0.036), ('indicator', 0.035), ('afbreken', 0.033), ('alo', 0.033), ('apostrophes', 0.033), ('breitenlohner', 0.033), ('etdhnreshold', 0.033), ('franklin', 0.033), ('hgo', 0.033), ('kristensen', 0.033), ('tsalidis', 0.033), ('tutelaers', 0.033), ('webster', 0.033), ('den', 0.033), ('allowable', 0.031), ('bosch', 0.031), ('compound', 0.031), ('validation', 0.031), ('cross', 0.03), ('program', 0.029), ('substring', 0.029), ('crfs', 0.029), ('jolla', 0.029), ('pronunciations', 0.029), ('elkan', 0.029), ('ftp', 0.029), ('halevy', 0.029), ('nettalk', 0.029), ('sejnowski', 0.029), ('sojka', 0.029), ('california', 0.029), ('publishing', 0.028), ('conditional', 0.028), ('stem', 0.028), ('american', 0.027), ('informal', 0.027), ('jaap', 0.027), ('fields', 0.026), ('functions', 0.025), ('viterbi', 0.025), ('overall', 0.025), ('rosenberg', 0.025), ('acceptable', 0.025), ('diego', 0.024), ('patent', 0.024), ('culotta', 0.024), ('method', 0.023), ('words', 0.023), ('nocedal', 0.023), ('paths', 0.023), ('sha', 0.022), ('preferable', 0.022), ('bouma', 0.022), ('collections', 0.021), ('probability', 0.021), ('allowed', 0.021), ('forge', 0.021), ('earliest', 0.021), ('dominance', 0.021), ('dat', 0.021), ('porter', 0.021), ('engineering', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 68 acl-2010-Conditional Random Fields for Word Hyphenation
Author: Nikolaos Trogkanis ; Charles Elkan
Abstract: Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TEX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application of conditional random fields. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.
Author: Dong Yang ; Paul Dixon ; Sadaoki Furui
Abstract: This paper presents a joint optimization method of a two-step conditional random field (CRF) model for machine transliteration and a fast decoding algorithm for the proposed method. Our method lies in the category of direct orthographical mapping (DOM) between two languages without using any intermediate phonemic mapping. In the two-step CRF model, the first CRF segments an input word into chunks and the second one converts each chunk into one unit in the target language. In this paper, we propose a method to jointly optimize the two-step CRFs and also a fast algorithm to realize it. Our experiments show that the proposed method outperforms the well-known joint source channel model (JSCM) and our proposed fast algorithm decreases the decoding time significantly. Furthermore, combination of the proposed method and the JSCM gives further improvement, which outperforms state-of-the-art results in terms of top-1 accuracy.
3 0.066974044 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
Author: Jenny Rose Finkel ; Christopher D. Manning
Abstract: One of the main obstacles to producing high quality joint models is the lack of jointly annotated data. Joint modeling of multiple natural language processing tasks outperforms single-task models learned from the same data, but still underperforms compared to single-task models learned on the more abundant quantities of available single-task annotated data. In this paper we present a novel model which makes use of additional single-task annotated data to improve the performance of a joint model. Our model utilizes a hierarchical prior to link the feature weights for shared features in several single-task models and the joint model. Experiments on joint parsing and named entity recognition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data.
4 0.059909884 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans
Author: Fei Huang ; Alexander Yates
Abstract: Most supervised language processing systems show a significant drop-off in performance when they are tested on text that comes from a domain significantly different from the domain of the training data. Semantic role labeling techniques are typically trained on newswire text, and in tests their performance on fiction is as much as 19% worse than their performance on newswire text. We investigate techniques for building open-domain semantic role labeling systems that approach the ideal of a train-once, use-anywhere system. We leverage recently-developed techniques for learning representations of text using latent-variable language models, and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling. In experiments, our novel system reduces error by 16% relative to the previous state of the art on out-of-domain text.
5 0.05834274 170 acl-2010-Letter-Phoneme Alignment: An Exploration
Author: Sittichai Jiampojamarn ; Grzegorz Kondrak
Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.
6 0.051186543 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features
7 0.045149397 197 acl-2010-Practical Very Large Scale CRFs
8 0.043601472 181 acl-2010-On Learning Subtypes of the Part-Whole Relation: Do Not Mix Your Seeds
9 0.043138091 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
10 0.042917233 195 acl-2010-Phylogenetic Grammar Induction
11 0.042711582 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning
12 0.03960745 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries
13 0.039018184 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
14 0.038863491 137 acl-2010-How Spoken Language Corpora Can Refine Current Speech Motor Training Methodologies
15 0.037903115 185 acl-2010-Open Information Extraction Using Wikipedia
16 0.037364192 125 acl-2010-Generating Templates of Entity Summaries with an Entity-Aspect Model and Pattern Mining
17 0.035542 166 acl-2010-Learning Word-Class Lattices for Definition and Hypernym Extraction
18 0.034664504 84 acl-2010-Detecting Errors in Automatically-Parsed Dependency Relations
19 0.034230232 14 acl-2010-A Risk Minimization Framework for Extractive Speech Summarization
20 0.033975564 213 acl-2010-Simultaneous Tokenization and Part-Of-Speech Tagging for Arabic without a Morphological Analyzer
topicId topicWeight
[(0, -0.118), (1, 0.001), (2, -0.018), (3, -0.011), (4, 0.016), (5, -0.024), (6, 0.002), (7, -0.014), (8, 0.05), (9, 0.014), (10, -0.038), (11, 0.031), (12, 0.019), (13, -0.088), (14, -0.062), (15, 0.011), (16, -0.048), (17, 0.095), (18, 0.011), (19, 0.01), (20, -0.0), (21, 0.022), (22, -0.089), (23, -0.061), (24, -0.054), (25, -0.055), (26, 0.022), (27, -0.067), (28, -0.055), (29, 0.033), (30, -0.092), (31, -0.018), (32, 0.024), (33, -0.133), (34, -0.062), (35, -0.016), (36, -0.049), (37, 0.045), (38, -0.083), (39, -0.093), (40, -0.051), (41, 0.096), (42, 0.006), (43, 0.08), (44, 0.05), (45, -0.047), (46, 0.112), (47, 0.09), (48, 0.118), (49, -0.13)]
simIndex simValue paperId paperTitle
same-paper 1 0.89963037 68 acl-2010-Conditional Random Fields for Word Hyphenation
Author: Nikolaos Trogkanis ; Charles Elkan
Abstract: Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TEX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application of conditional random fields. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.
Author: Dong Yang ; Paul Dixon ; Sadaoki Furui
Abstract: This paper presents a joint optimization method of a two-step conditional random field (CRF) model for machine transliteration and a fast decoding algorithm for the proposed method. Our method lies in the category of direct orthographical mapping (DOM) between two languages without using any intermediate phonemic mapping. In the two-step CRF model, the first CRF segments an input word into chunks and the second one converts each chunk into one unit in the target language. In this paper, we propose a method to jointly optimize the two-step CRFs and also a fast algorithm to realize it. Our experiments show that the proposed method outperforms the well-known joint source channel model (JSCM) and our proposed fast algorithm decreases the decoding time significantly. Furthermore, combination of the proposed method and the JSCM gives further improvement, which outperforms state-of-the-art results in terms of top-1 accuracy.
3 0.51084197 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers
Author: Eric Corlett ; Gerald Penn
Abstract: Letter-substitution ciphers encode a document from a known or hypothesized language into an unknown writing system or an unknown encoding of a known writing system. It is a problem that can occur in a number of practical applications, such as in the problem of determining the encodings of electronic documents in which the language is known, but the encoding standard is not. It has also been used in relation to OCR applications. In this paper, we introduce an exact method for deciphering messages using a generalization of the Viterbi algorithm. We test this model on a set of ciphers developed from various web sites, and find that our algorithm has the potential to be a viable, practical method for efficiently solving decipherment problems.
4 0.50362098 197 acl-2010-Practical Very Large Scale CRFs
Author: Thomas Lavergne ; Olivier Cappe ; Francois Yvon
Abstract: Conditional Random Fields (CRFs) are a widely-used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels. Even for the simple linear-chain model, taking structure into account implies a number of parameters and a computational effort that grows quadratically with the cardinality of the label set. In this paper, we address the issue of training very large CRFs, containing up to hundreds output labels and several billion features. Efficiency stems here from the sparsity induced by the use of a ℓ1 penalty term. Based on our own implementation, we compare three recent proposals for implementing this regularization strategy. Our experiments demonstrate that very large CRFs can be trained efficiently and that very large models are able to improve the accuracy, while delivering compact parameter sets.
5 0.50118393 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration
Author: Nadir Durrani ; Hassan Sajjad ; Alexander Fraser ; Helmut Schmid
Abstract: We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context whereas in previous work transliteration is only used for translating OOV (out-of-vocabulary) words. We use transliteration as a tool for disambiguation of Hindi homonyms which can be both translated or transliterated or transliterated differently based on different contexts. We obtain final BLEU scores of 19.35 (conditional prob- ability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.
6 0.50019109 16 acl-2010-A Statistical Model for Lost Language Decipherment
7 0.46237355 111 acl-2010-Extracting Sequences from the Web
9 0.46078253 92 acl-2010-Don't 'Have a Clue'? Unsupervised Co-Learning of Downward-Entailing Operators.
10 0.42613366 185 acl-2010-Open Information Extraction Using Wikipedia
11 0.39439675 159 acl-2010-Learning 5000 Relational Extractors
12 0.380795 116 acl-2010-Finding Cognate Groups Using Phylogenies
13 0.37202561 98 acl-2010-Efficient Staggered Decoding for Sequence Labeling
14 0.37160885 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
15 0.34436336 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web
16 0.34365278 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers
17 0.33467704 61 acl-2010-Combining Data and Mathematical Models of Language Change
18 0.33246329 19 acl-2010-A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation
19 0.3267929 13 acl-2010-A Rational Model of Eye Movement Control in Reading
20 0.31204277 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems
topicId topicWeight
[(7, 0.013), (25, 0.04), (39, 0.011), (42, 0.019), (44, 0.018), (59, 0.076), (73, 0.452), (78, 0.022), (80, 0.014), (83, 0.069), (84, 0.036), (98, 0.09)]
simIndex simValue paperId paperTitle
1 0.91635197 259 acl-2010-WebLicht: Web-Based LRT Services for German
Author: Erhard Hinrichs ; Marie Hinrichs ; Thomas Zastrow
Abstract: This software demonstration presents WebLicht (short for: Web-Based Linguistic Chaining Tool), a web-based service environment for the integration and use of language resources and tools (LRT). WebLicht is being developed as part of the D-SPIN project. WebLicht is implemented as a web application so that there is no need for users to install any software on their own computers or to concern themselves with the technical details involved in building tool chains. The integrated web services are part of a prototypical infrastructure that was developed to facilitate chaining of LRT services. WebLicht allows the integration and use of distributed web services with standardized APIs. The nature of these open and standardized APIs makes it possible to access the web services from nearly any programming language, shell script or workflow engine (UIMA, Gate etc.) Additionally, an application for integration of additional services is available, allowing anyone to contribute his own web service.
2 0.88752949 34 acl-2010-Authorship Attribution Using Probabilistic Context-Free Grammars
Author: Sindhu Raghavan ; Adriana Kovashka ; Raymond Mooney
Abstract: In this paper, we present a novel approach for authorship attribution, the task of identifying the author of a document, using probabilistic context-free grammars. Our approach involves building a probabilistic context-free grammar for each author and using this grammar as a language model for classification. We evaluate the performance of our method on a wide range of datasets to demonstrate its efficacy.
same-paper 3 0.86842883 68 acl-2010-Conditional Random Fields for Word Hyphenation
Author: Nikolaos Trogkanis ; Charles Elkan
Abstract: Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TEX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application of conditional random fields. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.
4 0.86346477 45 acl-2010-Balancing User Effort and Translation Error in Interactive Machine Translation via Confidence Measures
Author: Jesus Gonzalez Rubio ; Daniel Ortiz Martinez ; Francisco Casacuberta
Abstract: This work deals with the application of confidence measures within an interactive-predictive machine translation system in order to reduce human effort. If a small loss in translation quality can be tolerated for the sake of efficiency, user effort can be saved by interactively translating only those initial translations which the confidence measure classifies as incorrect. We apply confidence estimation as a way to achieve a balance between user effort savings and final translation error. Empirical results show that our proposal allows to obtain almost perfect translations while significantly reducing user effort.
5 0.84646106 141 acl-2010-Identifying Text Polarity Using Random Walks
Author: Ahmed Hassan ; Dragomir Radev
Abstract: Automatically identifying the polarity of words is a very important task in Natural Language Processing. It has applications in text classification, text filtering, analysis of product review, analysis of responses to surveys, and mining online discussions. We propose a method for identifying the polarity of words. We apply a Markov random walk model to a large word relatedness graph, producing a polarity estimate for any given word. A key advantage of the model is its ability to accurately and quickly assign a polarity sign and magnitude to any word. The method could be used both in a semi-supervised setting where a training set of labeled words is used, and in an unsupervised setting where a handful of seeds is used to define the two polarity classes. The method is experimentally tested using a manually labeled set of positive and negative words. It outperforms the state of the art methods in the semi-supervised setting. The results in the unsupervised setting is comparable to the best reported values. However, the proposed method is faster and does not need a large corpus.
6 0.81503654 238 acl-2010-Towards Open-Domain Semantic Role Labeling
7 0.78274232 118 acl-2010-Fine-Grained Tree-to-String Translation Rule Extraction
8 0.58466524 121 acl-2010-Generating Entailment Rules from FrameNet
9 0.57999563 230 acl-2010-The Manually Annotated Sub-Corpus: A Community Resource for and by the People
10 0.55427736 134 acl-2010-Hierarchical Sequential Learning for Extracting Opinions and Their Attributes
11 0.54638404 82 acl-2010-Demonstration of a Prototype for a Conversational Companion for Reminiscing about Images
12 0.54213107 251 acl-2010-Using Anaphora Resolution to Improve Opinion Target Identification in Movie Reviews
14 0.53968275 175 acl-2010-Models of Metaphor in NLP
15 0.53568977 204 acl-2010-Recommendation in Internet Forums and Blogs
16 0.5326792 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features
17 0.52569395 209 acl-2010-Sentiment Learning on Product Reviews via Sentiment Ontology Tree
18 0.52301717 85 acl-2010-Detecting Experiences from Weblogs
19 0.52185827 227 acl-2010-The Impact of Interpretation Problems on Tutorial Dialogue
20 0.51882887 158 acl-2010-Latent Variable Models of Selectional Preference