acl acl2013 acl2013-370 knowledge-graph by maker-knowledge-mining

370 acl-2013-Unsupervised Transcription of Historical Documents


Source: pdf

Author: Taylor Berg-Kirkpatrick ; Greg Durrett ; Dan Klein

Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. [sent-3, score-0.543]

2 By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. [sent-4, score-0.324]

3 Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system. [sent-5, score-0.266]

4 1 Introduction Standard techniques for transcribing modern documents do not work well on historical ones. [sent-6, score-0.271]

5 One key challenge is that the fonts used in historical documents are not standard (Shoemaker, 2005). [sent-11, score-0.372]

6 The fonts are not irregular like handwriting: each occurrence of a given character type, e.g. [sent-13, score-0.291]

7 Some differences between fonts are minor, reflecting small variations in font design. [sent-17, score-0.291]

8 To address the general problem of unknown fonts, our model (a) (b) (c) Figure 1: Portions of historical documents with (a) unknown font, (b) uneven baseline, and (c) over-inking. [sent-19, score-0.287]

9 Font shape and character segmentation are tightly coupled, and so they are modeled jointly. [sent-21, score-0.219]

10 A second challenge with historical data is that the early typesetting process was noisy. [sent-22, score-0.403]

11 A third challenge is that the actual inking was also noisy. [sent-26, score-0.227]

12 For example, in Figure 1c some characters are thick from over-inking while others are obscured by ink bleeds. [sent-27, score-0.289]

13 To be robust to such rendering irregularities, our model captures both inking levels and pixel-level noise. [sent-28, score-0.305]

14 Because the model is generative, we can also treat areas that are obscured by larger ink blotches as unobserved, and let the model predict the obscured text based on visual and linguistic context. [sent-29, score-0.48]

15 Figure 2: An example image from a historical document (X) and its transcription (E), “It appeared that the Prisoner was very ...”, illustrating a wandering baseline, a historical font, and over-inking. [sent-34, score-0.507]

16 2 Related Work: Relatively little prior work has built models specifically for transcribing historical documents. [sent-35, score-0.208]

17 For example, some approaches have learned fonts in an unsupervised fashion but require pre-segmentation of the image into character or word regions (Ho and Nagy, 2000; Huang et al. [sent-38, score-0.458]

18 Kae and Learned-Miller (2009) jointly learn the font and image segmentation but do not outperform modern baselines. [sent-40, score-0.244]

19 Work that has directly addressed historical documents has done so using a pipelined approach, and without fully integrating a strong language model (Vamvakas et al. [sent-41, score-0.247]

20 They integrated typesetting models with language models, but did not model noise. [sent-48, score-0.277]

21 However, the symbols are not noisy in decipherment problems and in our problem we face a grid of pixels for which the segmentation into symbols is unknown. [sent-55, score-0.271]

22 3 Model: Most historical documents have unknown fonts, noisy typesetting layouts, and inconsistent ink levels, usually simultaneously. [sent-57, score-0.702]

23 We take a generative modeling approach inspired by the overall structure of the historical printing process. [sent-60, score-0.267]

24 Our model generates images of documents line by line; we present the generative process for the image of a single line. [sent-61, score-0.346]

25 Our primary random variables are E (the text) and X (the pixels in an image of the line). [sent-62, score-0.364]

26 Additionally, we have a random variable T that specifies the layout of the bounding boxes of the glyphs in the image, and a random variable R that specifies aspects of the inking and rendering process. [sent-63, score-0.658]

27 For example, E represents the entire sequence of text, while ei represents the ith character in the sequence. [sent-65, score-0.209]

28 We choose not to bias the model towards longer or shorter character sequences and let the line length m be drawn uniformly at random from the positive integers less than some large constant. When i < 1, let ei denote a line-initial null character. [sent-70, score-0.273]

29 The glyph pixels are generated conditioned on the corresponding character, while the pixels in the left and right padding bounding boxes, XiLPAD and XiRPAD, depend only on the token’s inking level. [sent-79, score-0.41]

30 3.2 Typesetting Model P(T|E): Generally speaking, the process of typesetting produces a line of text by first tiling bounding boxes of various widths and then filling in the boxes with glyphs. [sent-80, score-0.563]

31 As a first step, our model generates the dimensions of character bounding boxes; for each character token index i we generate three bounding box widths: a glyph box width gi, a left padding box width li, and a right padding box width ri, as shown in Figure 3. [sent-82, score-1.72]

32 We let the pixel height of all lines be fixed to h. [sent-83, score-0.257]

33 Let Ti = (li, gi, ri) so that Ti specifies the dimensions of the character box for token index i; T is then the concatenation of all Ti, denoting the full layout. [sent-84, score-0.262]

34 Because the width of a glyph depends on its shape, and because of effects resulting from kerning and the use of ligatures, the components of each Ti are drawn conditioned on the character token ei. [sent-85, score-0.608]

35 This means that, as part of our parameterization of the font, for each character type c we have vectors of multinomial parameters θ_c^LPAD, θ_c^GLYPH, and θ_c^RPAD governing the distribution of the dimensions of character boxes of type c. [sent-86, score-0.453]

36 We can now express the typesetting layout portion of the model as: P(T|E) = ∏_{i=1}^{m} P(Ti | ei) = ∏_{i=1}^{m} [ P(li | ei; θ_ei^LPAD) · P(gi | ei; θ_ei^GLYPH) · P(ri | ei; θ_ei^RPAD) ].
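
To make the factorization above concrete, the following sketch (illustrative only, not the authors' code) draws the per-token box widths (li, gi, ri) from per-character multinomials; the character set, width ranges, and parameter values are invented for the example.

```python
# Illustrative sketch of the typesetting layout model: sample T_i = (l_i, g_i, r_i)
# for each character token, conditioned on the character type e_i.
import numpy as np

rng = np.random.default_rng(0)
chars = list("abc ")            # hypothetical character inventory
max_pad, max_glyph = 3, 8       # hypothetical width ranges (in pixels)

# One multinomial over widths per character type, for left pad, glyph, and right pad.
theta_lpad  = {c: rng.dirichlet(np.ones(max_pad + 1)) for c in chars}
theta_glyph = {c: rng.dirichlet(np.ones(max_glyph + 1)) for c in chars}
theta_rpad  = {c: rng.dirichlet(np.ones(max_pad + 1)) for c in chars}

def sample_layout(text):
    """Draw T_i = (l_i, g_i, r_i) for every character token, conditioned on e_i."""
    layout = []
    for e_i in text:
        l_i = rng.choice(max_pad + 1, p=theta_lpad[e_i])
        g_i = rng.choice(max_glyph + 1, p=theta_glyph[e_i])
        r_i = rng.choice(max_pad + 1, p=theta_rpad[e_i])
        layout.append((l_i, g_i, r_i))
    return layout

print(sample_layout("abc a"))
```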

37 Each character type c in our font has another set of parameters, a matrix φc. [sent-90, score-0.303]

38 These are weights that specify the shape of the character type’s glyph, and are depicted in Figure 3 as part of the font parameters. [sent-91, score-0.408]

39 φc will come into play when we begin generating pixels in Section 3. [sent-92, score-0.227]

40 Inking Model P(R): Before we start filling the character boxes with pixels, we need to specify some properties of the inking and rendering process, including the amount of ink used and vertical variation along the text baseline. [sent-96, score-0.793]

41 Our model does this by generating, for each character token index i, a discrete value di that specifies the overall inking level in the character’s bounding box, and a discrete value vi that specifies the glyph’s vertical offset. [sent-97, score-0.71]

42 These variations in the inking and typesetting process are mostly independent of character type. [sent-98, score-0.612]

43 There is one global set of multinomial parameters governing inking level (θINK), and another governing offset (θVERT); both are depicted on the left-hand side of Figure 3. [sent-100, score-0.443]

44 Let Ri = (di, vi) and let R be the concatenation of all Ri so that we can express the inking model as: P(R) = ∏_{i=1}^{m} P(Ri) = ∏_{i=1}^{m} [ P(di; θ^INK) · P(vi; θ^VERT) ].
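
A minimal sketch of this inking model, assuming a three-level inking discretization and a five-value vertical offset grid (both sizes are assumptions made for the example; the paper does not fix them here):

```python
# Illustrative sketch: the inking level d_i and vertical offset v_i are drawn
# per token from two global multinomials, theta_INK and theta_VERT.
import numpy as np

rng = np.random.default_rng(1)
theta_ink  = np.array([0.2, 0.6, 0.2])             # e.g. light / normal / heavy inking
theta_vert = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # e.g. vertical offsets -2 .. +2 pixels

def sample_inking(num_tokens):
    """Draw R_i = (d_i, v_i) independently for each character token."""
    d = rng.choice(len(theta_ink), size=num_tokens, p=theta_ink)
    v = rng.choice(len(theta_vert), size=num_tokens, p=theta_vert) - 2  # center offsets at 0
    return list(zip(d, v))

print(sample_inking(5))
```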

45 The di and vi variables are suppressed in Figure 3 to reduce clutter but are expressed in Figure 4, which depicts the process of rendering a glyph box. [sent-103, score-0.517]

46 We assume that pixels are binary valued and sample their values independently from Bernoulli distributions. [sent-106, score-0.227]

47 The probability of black (the Bernoulli parameter) depends on the type of pixel generated. [sent-107, score-0.225]

48 All the pixels in a padding box have the same probability of black that depends only on the inking level of the box, di. [sent-108, score-0.653]

49 Since we have already generated this value and the widths li and ri of each padding box, we have enough information to generate left and right padding pixel matrices XiLPAD and XiRPAD. [sent-109, score-0.57]
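
As an illustration of how padding pixels could be generated, the sketch below draws i.i.d. Bernoulli pixels whose probability of black depends only on the inking level di; the mapping from inking level to blackness probability is made up for the example.

```python
# Illustrative sketch: padding pixels are i.i.d. Bernoulli draws whose probability
# of black depends only on the token's inking level d_i.
import numpy as np

rng = np.random.default_rng(2)
ink_to_black_prob = [0.02, 0.05, 0.15]  # hypothetical: light, normal, heavy inking

def sample_padding(width, height, d_i):
    """Sample a binary padding pixel matrix of the given size for inking level d_i."""
    p_black = ink_to_black_prob[d_i]
    return (rng.random((height, width)) < p_black).astype(int)

print(sample_padding(width=3, height=6, d_i=2))
```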

50 Figure 4: We generate the glyph pixel values XiGLYPH for character token i by first sampling a glyph width gi, an inking level di, and a vertical offset vi. [sent-112, score-0.475]

51 Then we interpolate the glyph weights φei and apply the logistic function to produce a matrix of Bernoulli parameters of width gi, inking di, and offset vi. [sent-113, score-0.891]

52 Finally, we sample from each Bernoulli distribution to generate a matrix of pixel values, XiGLYPH. [sent-115, score-0.254]

53 φei has some type-level width w which may differ from the current token-level width gi. [sent-116, score-0.256]

54 Introducing distinct parameters for each possible width would yield a model that can learn completely different glyph shapes for slightly different widths of the same character. [sent-117, score-0.632]

55 Our solution is to horizontally interpolate the weights of the shape parameter matrix φei down to a smaller set of columns matching the token-level choice of glyph width gi. [sent-119, score-0.629]

56 Thus, the type-level matrix φei specifies the canonical shape of the glyph for character ei when it takes its maximum width w; the process by which glyph pixels are generated is depicted in Figure 4. [sent-120, score-0.986]

57 The dependence of glyph pixels on location complicates generation of the glyph pixel matrix XiGLYPH, since the corresponding parameter matrix φei must first be interpolated down to the token-level width gi. [sent-121, score-1.324]

58 If we let [XiGLYPH]jk denote the pixel at the jth row and kth column of the token-level glyph pixel matrix XiGLYPH, then θPIXEL(j, k, gi; φei) denotes the corresponding Bernoulli parameter; Figure 5 depicts the glyph weights.

59 We define a constant interpolation vector µ(gi, k) that is specific to the glyph box width gi and glyph box column k. [sent-129, score-1.047]

60 The glyph pixel Bernoulli parameters are defined as follows: θPIXEL(j, k, gi; φei) = logistic( ∑_{k'=1}^{w} µ(gi, k)_{k'} · [φei]_{j,k'} ), i.e., the logistic transform of row j of the glyph weights, horizontally interpolated to column k of the width-gi glyph.

61 The full pixel generation process is diagrammed in Figure 4, where the dependence of θPIXEL on di and vi is also represented. [sent-137, score-0.32]
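
The sketch below puts these pieces together for a single glyph box: a type-level weight matrix is horizontally interpolated to the token-level width gi, the logistic function converts the interpolated weights into Bernoulli parameters, and binary pixels are sampled. The triangular interpolation scheme used for µ(gi, k) is an assumption for illustration, not necessarily the paper's exact choice, and the inking and vertical offset adjustments are omitted.

```python
# Illustrative sketch of the glyph pixel model (inking d_i and offset v_i omitted).
import numpy as np

rng = np.random.default_rng(3)

def interpolation_vector(w, g_i, k):
    """A hypothetical mu(g_i, k): triangular weights over the w type-level columns,
    centered at the position that column k of a width-g_i glyph maps back to."""
    center = (k + 0.5) * w / g_i - 0.5
    weights = np.maximum(0.0, 1.0 - np.abs(np.arange(w) - center))
    return weights / weights.sum()

def glyph_pixel_probs(phi, g_i):
    """theta_PIXEL(j, k, g_i; phi): logistic of the interpolated glyph weights."""
    h, w = phi.shape
    mu = np.stack([interpolation_vector(w, g_i, k) for k in range(g_i)], axis=1)  # w x g_i
    return 1.0 / (1.0 + np.exp(-(phi @ mu)))  # h x g_i matrix of Bernoulli parameters

phi = rng.normal(size=(10, 8))          # made-up type-level glyph weights, h=10, w=8
probs = glyph_pixel_probs(phi, g_i=6)
pixels = (rng.random(probs.shape) < probs).astype(int)  # sampled X_i^GLYPH
print(pixels)
```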

62 The identities of the characters E, the typesetting layout T, and the inking R will all be unobserved. [sent-141, score-0.553]

63 During the M-step, we update the parameters θ_c^LPAD and θ_c^RPAD using the standard closed-form multinomial updates and use a specialized closed-form update for θ_c^GLYPH that enforces unimodality of the glyph width distribution. [sent-146, score-0.51]

64 The glyph weights, φc, do not have a closed-form update. [sent-147, score-0.343]

65 In the early iterations of EM, our font parameters are still inaccurate, and to prune heavily based on such parameters would rule out correct analyses. [sent-157, score-0.215]

66 We compute the weighted mean and weighted variance of the glyph width expected counts. [sent-161, score-0.471]
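
One plausible way to realize the unimodal width update mentioned above (an assumption made for illustration, since the exact refit is not spelled out here) is to fit a discretized normal with the weighted mean and variance of the expected counts:

```python
# Illustrative sketch of a unimodality-enforcing width update: compute the weighted
# mean and variance of the expected counts, then refit the width multinomial as a
# discretized normal with those moments.
import numpy as np

def unimodal_width_update(expected_counts):
    """expected_counts[w] = expected number of times the glyph took width w."""
    widths = np.arange(len(expected_counts), dtype=float)
    total = expected_counts.sum()
    mean = (widths * expected_counts).sum() / total
    var = ((widths - mean) ** 2 * expected_counts).sum() / total
    var = max(var, 1e-6)                      # guard against zero variance
    log_density = -0.5 * (widths - mean) ** 2 / var
    probs = np.exp(log_density - log_density.max())
    return probs / probs.sum()                # unimodal multinomial over widths

counts = np.array([0.1, 0.5, 3.0, 6.0, 2.0, 4.0, 0.2])  # made-up expected counts
print(unimodal_width_update(counts).round(3))
```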

67 5 Data: We perform experiments on two historical datasets consisting of images of documents printed between 1700 and 1900 in England and Australia. [sent-174, score-0.36]

68 We chose the first document in each of the corresponding years, chose a random page in the document, and extracted an image of the first 30 consecutive lines of text consisting of full sentences. [sent-183, score-0.197]

69 We extracted ten images from this collection in the same way that we extracted images from Old Bailey, but starting from the year 1803. [sent-188, score-0.186]

70 Pre-processing: Many of the images in historical collections are bitonal (binary) as a result of how they were captured on microfilm for storage in the 1980s (Arlitsch and Herbert, 2004). [sent-192, score-0.262]

71 For consistency, we binarized the images in our test sets that were not already binary by thresholding pixel values. [sent-194, score-0.305]
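
A minimal sketch of this binarization step; the fixed threshold of 128 is an assumed value, not one reported in the paper.

```python
# Illustrative sketch: threshold grayscale pixel values to produce a bitonal image.
import numpy as np

def binarize(grayscale, threshold=128):
    """Map a grayscale image (0-255) to binary values, 1 = black ink, 0 = background."""
    return (grayscale < threshold).astype(int)

page = np.array([[250, 240, 30], [20, 200, 15]])  # tiny made-up grayscale patch
print(binarize(page))
```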

72 The line extraction process also identifies pixels that are not located in central text regions, and are part of large connected components of ink, spanning multiple lines. [sent-198, score-0.262]

73 The values of such pixels are treated as unobserved in the model since, more often than not, they are part of ink blotches. [sent-199, score-0.548]

74 It is the OCR system that the National Library of Australia used to recognize the historical documents in Trove (Holley, 2010). [sent-209, score-0.218]

75 Evaluation: We evaluate the output of our system and the baseline systems using two metrics: character error rate (CER) and word error rate (WER). [sent-211, score-0.193]

76 Table 1: We evaluate the predicted transcriptions in terms of both character error rate (CER) and word error rate (WER), and report macro-averages across documents. [sent-229, score-0.251]
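
Both metrics are edit-distance-based error rates; the only difference is whether the unit is a character or a whitespace-separated word. A small illustrative implementation (not the authors' evaluation code):

```python
# Illustrative sketch of CER/WER computation via Levenshtein edit distance.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic program)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def error_rate(reference, hypothesis, unit="word"):
    """WER when unit='word', CER otherwise: edits divided by reference length."""
    ref = reference.split() if unit == "word" else list(reference)
    hyp = hypothesis.split() if unit == "word" else list(hypothesis)
    return edit_distance(ref, hyp) / len(ref)

ref = "it appeared that the prisoner was very"
hyp = "it appeared that the prifoner was vcry"
print(error_rate(ref, hyp, "word"), error_rate(ref, hyp, "char"))
```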

77 For documents that use the long s glyph, we introduce a special character type for the non-word-final s, and initialize its parameters from a mixture of the modern f and | glyphs. [sent-238, score-0.239]

78 This balances the contributions of the language model and the typesetting model to the posterior (Och and Ney, 2004). [sent-242, score-0.306]

79 This slightly improves performance on our development set and can be thought of as placing a prior on the glyph shape parameters. [sent-248, score-0.425]

80 The grayscale glyphs show the Bernoulli pixel distributions learned by our model, while the padding regions are depicted in blue. [sent-251, score-0.571]

81 Learned Typesetting Layout: Figure 7 shows a representation of the typesetting layout learned by our model for portions of several documents from various years. [sent-270, score-1.043]

82 For each portion of a test document, the first line shows the transcription predicted by our model, and the second line shows padding and glyph regions predicted by the model, where the grayscale glyphs represent the learned Bernoulli parameters for each pixel. [sent-272, score-0.912]

83 Figure 7a demonstrates a case where our model has effectively explained both the uneven baseline and over-inked glyphs by using the vertical offsets vi and inking variables di. [sent-274, score-0.522]

84 In Figure 7b the model has used glyph widths gi and vertical offsets to explain the thinning of glyphs and falling baseline that occurred near the binding of the book. [sent-275, score-0.653]

85 In separate experiments on the Old Bailey test set, using the NYT language model, we found that removing the vertical offset variables from the model increased WER by 22, and removing the inking variables increased WER by 16. [sent-276, score-0.436]

86 Figure 9: This Old Bailey document from 1719 has severe ink bleeding from the facing page. [sent-278, score-0.321]

87 We annotated these blotches (in red) and treated the corresponding pixels as unobserved in the model. [sent-279, score-0.363]

88 Here, missing characters and ink blotches confuse the model, which picks something that is reasonable according to the language model, but incorrect. [sent-282, score-0.316]

89 Learned Fonts: It is interesting to look at the fonts learned by our system, and track how historical fonts changed over time. [sent-284, score-0.495]

90 Figure 8 shows several grayscale images representing the Bernoulli pixel probabilities for the most likely width of the glyph for g under various conditions. [sent-285, score-0.816]

91 Unobserved Ink Blotches: As noted earlier, one strength of our generative model is that we can make the values of certain pixels unobserved in the model, and let inference fill them in. [sent-291, score-0.344]

92 This document, a fragment of which is shown in Figure 9, has severe ink bleeding from the facing page. [sent-293, score-0.263]

93 We manually annotated the ink blotches (shown in red), and made them unobserved in the model. [sent-294, score-0.372]

94 The resulting typesetting layout learned by the model is also shown in Figure 9. [sent-295, score-0.387]

95 Running the model with the manually specified unobserved pixels reduced the WER on this document from 58 to 19 when using the NYT language model. [sent-297, score-0.37]
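
A sketch of how unobserved pixels can be handled in the likelihood: the Bernoulli log-probability is summed over observed pixels only, so annotated blotch regions contribute nothing and the text there is inferred from visual and linguistic context. Array shapes and names are invented for the example.

```python
# Illustrative sketch: mask annotated blotch pixels out of the Bernoulli likelihood.
import numpy as np

def masked_log_likelihood(pixels, probs, observed_mask):
    """Sum of Bernoulli log-probabilities over pixels where observed_mask is True."""
    eps = 1e-9
    ll = pixels * np.log(probs + eps) + (1 - pixels) * np.log(1 - probs + eps)
    return ll[observed_mask].sum()

rng = np.random.default_rng(4)
probs = rng.uniform(0.05, 0.95, size=(5, 8))        # model's Bernoulli parameters
pixels = (rng.random((5, 8)) < probs).astype(int)    # binarized image patch
mask = np.ones((5, 8), dtype=bool)
mask[:, 5:] = False                                  # pretend columns 5-7 are an ink blotch
print(masked_log_likelihood(pixels, probs, mask))
```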

96 We found that 56% of errors were accompanied by over-inking, 50% of errors were accompanied by ink blotches, 42% of errors contained punctuation, 21% of errors showed missing ink, and 12% of errors contained text that was italicized in the original image. [sent-302, score-0.269]

97 In cases of extreme ink blotching, or large areas of missing ink, the system usually makes an error. [sent-305, score-0.236]

98 8 Conclusion: We have demonstrated a model, based on the historical typesetting process, that effectively learns font structure in an unsupervised fashion to improve transcription of historical documents into text. [sent-306, score-0.808]

99 The parameters of the learned fonts are interpretable, as are the predicted typesetting layouts. [sent-307, score-0.531]

100 A complete optical character recognition methodology for historical documents. [sent-420, score-0.292]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('glyph', 0.343), ('bailey', 0.294), ('typesetting', 0.248), ('ink', 0.236), ('inking', 0.227), ('pixels', 0.227), ('pixel', 0.225), ('wer', 0.192), ('ocr', 0.164), ('historical', 0.155), ('fonts', 0.154), ('nyt', 0.142), ('character', 0.137), ('font', 0.137), ('tesseract', 0.134), ('trove', 0.134), ('width', 0.128), ('padding', 0.118), ('bernoulli', 0.117), ('old', 0.114), ('image', 0.107), ('shape', 0.082), ('box', 0.081), ('images', 0.08), ('blotches', 0.08), ('printing', 0.08), ('layout', 0.078), ('glyphs', 0.076), ('boxes', 0.075), ('ei', 0.072), ('abbyy', 0.071), ('gi', 0.071), ('vertical', 0.069), ('gclyph', 0.067), ('kae', 0.067), ('ocular', 0.067), ('bounding', 0.065), ('widths', 0.065), ('documents', 0.063), ('printed', 0.062), ('document', 0.058), ('predicted', 0.058), ('unobserved', 0.056), ('holley', 0.053), ('kluzner', 0.053), ('kopec', 0.053), ('lcpad', 0.053), ('obscured', 0.053), ('transcribing', 0.053), ('xiglyph', 0.053), ('depicted', 0.052), ('vi', 0.051), ('offset', 0.051), ('transcription', 0.05), ('rendering', 0.049), ('interpolate', 0.047), ('ri', 0.044), ('di', 0.044), ('specifies', 0.044), ('decipherment', 0.044), ('arlitsch', 0.04), ('finereader', 0.04), ('grayscale', 0.04), ('rcpad', 0.04), ('shoemaker', 0.04), ('uneven', 0.04), ('vert', 0.04), ('parameters', 0.039), ('governing', 0.037), ('pass', 0.036), ('line', 0.035), ('gary', 0.035), ('iy', 0.033), ('italicized', 0.033), ('learned', 0.032), ('lines', 0.032), ('generative', 0.032), ('variables', 0.03), ('ravi', 0.03), ('portions', 0.029), ('matrix', 0.029), ('model', 0.029), ('reduction', 0.028), ('regions', 0.028), ('parameterization', 0.028), ('shapes', 0.028), ('error', 0.028), ('asaf', 0.027), ('bleeding', 0.027), ('hsmm', 0.027), ('kolak', 0.027), ('lgayre', 0.027), ('microfilm', 0.027), ('tzadok', 0.027), ('vamvakas', 0.027), ('xilpad', 0.027), ('logistic', 0.027), ('commercial', 0.027), ('erik', 0.026), ('ten', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 370 acl-2013-Unsupervised Transcription of Historical Documents

Author: Taylor Berg-Kirkpatrick ; Greg Durrett ; Dan Klein

Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system.

2 0.14000982 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information

Author: Ahmed Salama ; Kemal Oflazer ; Susan Hagan

Abstract: We present results from our study, which uses syntactically and semantically motivated information to group segments of sentences into unbreakable units for the purpose of typesetting those sentences in a region of a fixed width, using an otherwise standard dynamic programming line breaking algorithm, to minimize raggedness. In addition to a rule-based baseline segmenter, we use a very modest size text, manually annotated with positions of breaks, to train a maximum entropy classifier, relying on an extensive set of lexical and syntactic features, which can then predict whether or not to break after a certain word position in a sentence. We also use a simple genetic algorithm to search for a subset of the features optimizing F1, to arrive at a set of features that delivers 89.2% Precision, 90.2% Recall (89.7% F1) on a test set, improving the rule-based baseline by about 11 points and the classifier trained on all features by about 1 point in F1.

3 0.086436749 167 acl-2013-Generalizing Image Captions for Image-Text Parallel Corpus

Author: Polina Kuznetsova ; Vicente Ordonez ; Alexander Berg ; Tamara Berg ; Yejin Choi

Abstract: The ever growing amount of web images and their associated texts offers new opportunities for integrative models bridging natural language processing and computer vision. However, the potential benefits of such data are yet to be fully realized due to the complexity and noise in the alignment between image content and text. We address this challenge with contributions in two folds: first, we introduce the new task of image caption generalization, formulated as visually-guided sentence compression, and present an efficient algorithm based on dynamic beam search with dependency-based constraints. Second, we release a new large-scale corpus with 1 million image-caption pairs achieving tighter content alignment between images and text. Evaluation results show the intrinsic quality of the generalized captions and the extrinsic utility of the new imagetext parallel corpus with respect to a concrete application of image caption transfer.

4 0.072254956 380 acl-2013-VSEM: An open library for visual semantics representation

Author: Elia Bruni ; Ulisse Bordignon ; Adam Liska ; Jasper Uijlings ; Irina Sergienya

Abstract: VSEM is an open library for visual semantics. Starting from a collection of tagged images, it is possible to automatically construct an image-based representation of concepts by using off-the-shelf VSEM functionalities. VSEM is entirely written in MATLAB and its object-oriented design allows a large flexibility and reusability. The software is accompanied by a website with supporting documentation and examples.

5 0.068588279 384 acl-2013-Visual Features for Linguists: Basic image analysis techniques for multimodally-curious NLPers

Author: Elia Bruni ; Marco Baroni

Abstract: unkown-abstract

6 0.067953177 249 acl-2013-Models of Semantic Representation with Visual Attributes

7 0.065795913 220 acl-2013-Learning Latent Personas of Film Characters

8 0.065199643 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

9 0.064750381 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

10 0.05843509 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

11 0.055584472 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

12 0.052164499 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain

13 0.051705159 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

14 0.050615065 80 acl-2013-Chinese Parsing Exploiting Characters

15 0.049644686 329 acl-2013-Statistical Machine Translation Improves Question Retrieval in Community Question Answering via Matrix Factorization

16 0.046304528 325 acl-2013-Smoothed marginal distribution constraints for language modeling

17 0.042624768 66 acl-2013-Beam Search for Solving Substitution Ciphers

18 0.041466057 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection

19 0.04025837 240 acl-2013-Microblogs as Parallel Corpora

20 0.039975762 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.118), (1, -0.011), (2, -0.016), (3, -0.007), (4, 0.023), (5, -0.049), (6, 0.043), (7, -0.021), (8, -0.078), (9, 0.039), (10, -0.071), (11, -0.07), (12, 0.003), (13, -0.013), (14, 0.019), (15, -0.06), (16, -0.009), (17, 0.009), (18, -0.028), (19, 0.06), (20, 0.002), (21, 0.026), (22, -0.007), (23, -0.005), (24, 0.002), (25, -0.012), (26, -0.07), (27, -0.025), (28, 0.026), (29, 0.003), (30, 0.016), (31, -0.023), (32, 0.012), (33, 0.004), (34, -0.012), (35, -0.046), (36, -0.031), (37, 0.028), (38, 0.003), (39, 0.03), (40, -0.064), (41, 0.019), (42, -0.006), (43, -0.017), (44, 0.002), (45, 0.007), (46, 0.054), (47, 0.006), (48, -0.016), (49, -0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85168397 370 acl-2013-Unsupervised Transcription of Historical Documents

Author: Taylor Berg-Kirkpatrick ; Greg Durrett ; Dan Klein

Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system.

2 0.62783593 175 acl-2013-Grounded Language Learning from Video Described with Sentences

Author: Haonan Yu ; Jeffrey Mark Siskind

Abstract: We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence between words and concepts in the video is learned in an unsupervised fashion, even when the video depicts simultaneous events described by multiple sentences or when different aspects of a single event are described with multiple sentences. The learned word meanings can be subsequently used to automatically generate description of new video.

3 0.60383415 220 acl-2013-Learning Latent Personas of Film Characters

Author: David Bamman ; Brendan O'Connor ; Noah A. Smith

Abstract: We present two latent variable models for learning character types, or personas, in film, in which a persona is defined as a set of mixtures over latent lexical classes. These lexical classes capture the stereotypical actions of which a character is the agent and patient, as well as attributes by which they are described. As the first attempt to solve this problem explicitly, we also present a new dataset for the text-driven analysis of film, along with a benchmark testbed to help drive future work in this area.

4 0.60081178 167 acl-2013-Generalizing Image Captions for Image-Text Parallel Corpus

Author: Polina Kuznetsova ; Vicente Ordonez ; Alexander Berg ; Tamara Berg ; Yejin Choi

Abstract: The ever growing amount of web images and their associated texts offers new opportunities for integrative models bridging natural language processing and computer vision. However, the potential benefits of such data are yet to be fully realized due to the complexity and noise in the alignment between image content and text. We address this challenge with contributions in two folds: first, we introduce the new task of image caption generalization, formulated as visually-guided sentence compression, and present an efficient algorithm based on dynamic beam search with dependency-based constraints. Second, we release a new large-scale corpus with 1 million image-caption pairs achieving tighter content alignment between images and text. Evaluation results show the intrinsic quality of the generalized captions and the extrinsic utility of the new imagetext parallel corpus with respect to a concrete application of image caption transfer.

5 0.59291905 380 acl-2013-VSEM: An open library for visual semantics representation

Author: Elia Bruni ; Ulisse Bordignon ; Adam Liska ; Jasper Uijlings ; Irina Sergienya

Abstract: VSEM is an open library for visual semantics. Starting from a collection of tagged images, it is possible to automatically construct an image-based representation of concepts by using off-the-shelf VSEM functionalities. VSEM is entirely written in MATLAB and its object-oriented design allows a large flexibility and reusability. The software is accompanied by a website with supporting documentation and examples.

6 0.58887261 249 acl-2013-Models of Semantic Representation with Visual Attributes

7 0.56841105 384 acl-2013-Visual Features for Linguists: Basic image analysis techniques for multimodally-curious NLPers

8 0.5442819 382 acl-2013-Variational Inference for Structured NLP Models

9 0.54206359 321 acl-2013-Sign Language Lexical Recognition With Propositional Dynamic Logic

10 0.51150727 14 acl-2013-A Novel Classifier Based on Quantum Computation

11 0.50473118 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach

12 0.48403716 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?

13 0.47042137 66 acl-2013-Beam Search for Solving Substitution Ciphers

14 0.4671784 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

15 0.46608219 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

16 0.46522933 29 acl-2013-A Visual Analytics System for Cluster Exploration

17 0.46445999 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information

18 0.46265799 54 acl-2013-Are School-of-thought Words Characterizable?

19 0.46182296 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers

20 0.45620948 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.05), (6, 0.035), (11, 0.042), (24, 0.049), (26, 0.042), (35, 0.052), (42, 0.04), (48, 0.034), (53, 0.386), (70, 0.058), (88, 0.026), (90, 0.025), (95, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76141286 370 acl-2013-Unsupervised Transcription of Historical Documents

Author: Taylor Berg-Kirkpatrick ; Greg Durrett ; Dan Klein

Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system.

2 0.75253534 270 acl-2013-ParGramBank: The ParGram Parallel Treebank

Author: Sebastian Sulger ; Miriam Butt ; Tracy Holloway King ; Paul Meurer ; Tibor Laczko ; Gyorgy Rakosi ; Cheikh Bamba Dione ; Helge Dyvik ; Victoria Rosen ; Koenraad De Smedt ; Agnieszka Patejuk ; Ozlem Cetinoglu ; I Wayan Arka ; Meladel Mistica

Abstract: This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. The treebank is based on deep LFG (LexicalFunctional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. The grammars produce output that is maximally parallelized across languages and language families. This output forms the basis of a parallel treebank covering a diverse set of phenomena. The treebank is publicly available via the INESS treebanking environment, which also allows for the alignment of language pairs. We thus present a unique, multilayered parallel treebank that represents more and different types of languages than are avail- able in other treebanks, that represents me ladel .mi st ica@ gmai l com . deep linguistic knowledge and that allows for the alignment of sentences at several levels: dependency structures, constituency structures and POS information.

3 0.41740519 80 acl-2013-Chinese Parsing Exploiting Characters

Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu

Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.

4 0.36376062 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction

Author: Barbara Plank ; Alessandro Moschitti

Abstract: Relation Extraction (RE) is the task of extracting semantic relationships between entities in text. Recent studies on relation extraction are mostly supervised. The clear drawback of supervised methods is the need of training data: labeled data is expensive to obtain, and there is often a mismatch between the training data and the data the system will be applied to. This is the problem of domain adaptation. In this paper, we propose to combine (i) term generalization approaches such as word clustering and latent semantic analysis (LSA) and (ii) structured kernels to improve the adaptability of relation extractors to new text genres/domains. The empirical evaluation on ACE 2005 domains shows that a suitable combination of syntax and lexical generalization is very promising for domain adaptation.

5 0.35849962 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

Author: Muhua Zhu ; Yue Zhang ; Wenliang Chen ; Min Zhang ; Jingbo Zhu

Abstract: Shift-reduce dependency parsers give comparable accuracies to their chartbased counterparts, yet the best shiftreduce constituent parsers still lag behind the state-of-the-art. One important reason is the existence of unary nodes in phrase structure trees, which leads to different numbers of shift-reduce actions between different outputs for the same input. This turns out to have a large empirical impact on the framework of global training and beam search. We propose a simple yet effective extension to the shift-reduce process, which eliminates size differences between action sequences in beam-search. Our parser gives comparable accuracies to the state-of-the-art chart parsers. With linear run-time complexity, our parser is over an order of magnitude faster than the fastest chart parser.

6 0.35807732 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals

7 0.35799384 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering

8 0.35760805 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

9 0.35708863 267 acl-2013-PARMA: A Predicate Argument Aligner

10 0.3569203 333 acl-2013-Summarization Through Submodularity and Dispersion

11 0.35643217 288 acl-2013-Punctuation Prediction with Transition-based Parsing

12 0.35579658 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users

13 0.35569233 329 acl-2013-Statistical Machine Translation Improves Question Retrieval in Community Question Answering via Matrix Factorization

14 0.3556509 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification

15 0.35558581 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

16 0.35541731 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

17 0.35525924 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

18 0.35471439 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

19 0.35444197 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions

20 0.35400033 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study