acl acl2013 acl2013-390 knowledge-graph by maker-knowledge-mining

390 acl-2013-Word surprisal predicts N400 amplitude during reading


Source: pdf

Author: Stefan L. Frank ; Leun J. Otten ; Giulia Galli ; Gabriella Vigliocco

Abstract: We investigated the effect of word surprisal on the EEG signal during sentence reading. On each word of 205 experimental sentences, surprisal was estimated by three types of language model: Markov models, probabilistic phrase-structure grammars, and recurrent neural networks. Four event-related potential components were extracted from the EEG of 24 readers of the same sentences. Surprisal estimates under each model type formed a significant predictor of the amplitude of the N400 component only, with more surprising words resulting in more negative N400s. This effect was mostly due to content words. These findings provide support for surprisal as a generally applicable measure of processing difficulty during language comprehension.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Word surprisal predicts N400 amplitude during reading. Stefan L. [sent-1, score-0.91]

2 2Department of Cognitive, Perceptual and Brain Sciences, University College London; 3Institute of Cognitive Neuroscience, University College London. Abstract: We investigated the effect of word surprisal on the EEG signal during sentence reading. [sent-13, score-0.771]

3 On each word of 205 experimental sentences, surprisal was estimated by three types of language model: Markov models, probabilistic phrase-structure grammars, and recurrent neural networks. [sent-14, score-0.738]

4 Four event-related potential components were extracted from the EEG of 24 readers of the same sentences. [sent-15, score-0.092]

5 Surprisal estimates under each model type formed a significant predictor of the amplitude of the N400 component only, with more surprising words resulting in more negative N400s. [sent-16, score-0.418]

6 These findings provide support for surprisal as a generally applicable measure of processing difficulty during language comprehension. [sent-18, score-0.649]

7 1 Introduction Many studies of human language comprehension measure the brain’s electrical activity during reading. [sent-19, score-0.078]

8 Such electroencephalography (EEG) experiments have revealed that the EEG signal displays systematic variation in response to the appearance of each word. [sent-20, score-0.077]

9 The different components that can be observed in this signal are known as event-related potentials (ERPs). [sent-21, score-0.164]

10 Probably the most reliably observed (and most studied) of these components is a negative-going deflection at centroparietal electrodes that peaks at around 400 ms after word onset and is therefore referred to as the N400 component. [sent-22, score-0.29]

11 It is well known that the N400 increases in amplitude (i.e., becomes more negative) when the word leads to comprehension difficulty. [sent-23, score-0.225] [sent-25, score-0.108]

13 To study the general relation between word predictability and the N400, Dambacher et al. (2006) obtained subjective word-probability estimates (so-called cloze probabilities) by asking participants to predict the upcoming word at each point in a large number of sentences. [sent-26, score-0.079] [sent-27, score-0.246]
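As a concrete illustration of how such cloze probabilities are tabulated, a minimal Python sketch (the responses below are hypothetical, not Dambacher et al.'s data):

```python
from collections import Counter

def cloze_probabilities(guesses):
    """Cloze probability of a word at a sentence position: the
    proportion of participants who produced it as their prediction
    for the upcoming word."""
    counts = Counter(g.lower() for g in guesses)
    return {word: n / len(guesses) for word, n in counts.items()}

# Hypothetical responses from 5 participants asked to continue
# "He spread the warm bread with ..."
guesses = ["butter", "butter", "jam", "butter", "honey"]
print(cloze_probabilities(guesses))  # {'butter': 0.6, 'jam': 0.2, 'honey': 0.2}
```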

15 A different group of subjects read these same sentences while their EEG signal was recorded. [sent-28, score-0.087]

16 Results showed a correlation between N400 amplitude and cloze probability: Less predictable words yielded stronger N400s. [sent-29, score-0.281]

17 We investigated whether similar results can be obtained using more objective, model-based word probabilities. [sent-30, score-0.074]

18 For each word in a collection of English sentences, estimates of its surprisal (i.e., its negative log-transformed conditional probability: −log P(wt | w1, …, wt−1)) were obtained from three types of language model: Markov (i.e., n-gram) models, phrase-structure grammars (PSGs), and recurrent neural networks (RNNs). [sent-31, score-0.728] [sent-33, score-0.116] [sent-38, score-0.125]
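To make the definition concrete, here is a minimal sketch of surprisal under a toy bigram model with add-one smoothing; the paper's actual estimates came from Kneser-Ney n-grams, PSGs, and RNNs, so this only illustrates the quantity being computed:

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts, with a start-of-sentence marker."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def surprisal(word, prev, unigrams, bigrams):
    """Surprisal in nats: -log P(w_t | w_{t-1}), add-one smoothed."""
    v = len(unigrams)  # vocabulary size (including <s>)
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + v)
    return -math.log(p)

sentences = [["the", "dog", "barked"], ["the", "cat", "meowed"]]
uni, bi = train_bigram(sentences)
print(surprisal("dog", "the", uni, bi))     # a seen bigram: lower surprisal
print(surprisal("meowed", "dog", uni, bi))  # an unseen bigram: higher surprisal
```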

21 Next, EEG signals of participants reading the same sentences were recorded. [sent-39, score-0.161]

22 A comparison of word surprisal to different ERP components revealed that, indeed, N400 amplitude was predicted by surprisal values: More surprising words resulted in more negative N400s, at least for content words. [sent-40, score-1.713]

23 , 2012; Frank and Bod, 2011; Frank and Thompson, 2012), providing additional support that these psychological data are indeed explained by the surprisal values and not by some confounding variable. [sent-43, score-0.674]

24 2.1 Corpus data: All models were trained on sentences from the written texts in the British National Corpus (BNC). [sent-45, score-0.066]

25 First, the 10,000 word types with the highest frequency were selected. [sent-46, score-0.03]

26 Next, all sentences were extracted that contained only those words. [sent-49, score-0.04]

27 Each trained model estimated a surprisal value for each word of the 205 sentences (1,931 word tokens) for which eye-tracking data are available in the UCL corpus of reading times (Frank et al. [sent-53, score-0.781]

28 These sentences, which were selected from three unpublished novels, only contained words from the 10,000 high-frequency word list. [sent-55, score-0.03]

29 2.2 Markov models: Markov models were trained with modified Kneser-Ney smoothing (Chen and Goodman, 1999) as implemented in SRILM (Stolcke, 2002). [sent-57, score-0.026]
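The paper computed these estimates with SRILM. As a hedged alternative sketch, NLTK's interpolated Kneser-Ney model (a simpler, single-discount variant of the modified Kneser-Ney smoothing used here) yields the same kind of per-word surprisal:

```python
import math
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

sentences = [["the", "dog", "barked"], ["the", "cat", "meowed"]]
train_data, vocab = padded_everygram_pipeline(3, sentences)

lm = KneserNeyInterpolated(order=3)  # interpolated KN, not SRILM's modified KN
lm.fit(train_data, vocab)

# logscore is base 2; multiply by ln(2) to get surprisal in nats
surprisal = -lm.logscore("dog", ["<s>", "the"]) * math.log(2)
print(surprisal)
```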

30 No unigram model was computed because word frequency was factored out during data analysis (see Section 4.2). [sent-59, score-0.03]

31 2.3 Recurrent neural networks: The RNN model architecture has been thoroughly described elsewhere (Fernandez Monsalve et al. [sent-62, score-0.028]

32 The only difference with previous versions was that the current RNN was trained on a substantially larger data set with more word types. [sent-64, score-0.056]

33 A range of RNN models was obtained by training on nine increasingly large subsets of the BNC data, comprising 2K, 5K, 10K, 20K, 50K, 100K, 200K, 400K, and all 1. [sent-65, score-0.1]
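A sketch of how such a sequence of training sets could be drawn; that the paper's subsets were nested prefixes of one sentence list is our assumption:

```python
SIZES = [2_000, 5_000, 10_000, 20_000, 50_000, 100_000, 200_000, 400_000]

def training_subsets(sentences):
    """Yield nine increasingly large training sets: eight fixed-size
    prefixes plus the full collection."""
    for k in SIZES:
        yield sentences[:k]
    yield sentences

# Hypothetical stand-in for the selected BNC sentences
bnc_sentences = [["word"] * 10 for _ in range(500_000)]
for subset in training_subsets(bnc_sentences):
    print(len(subset))
```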

34 In addition, the network was trained on the full set twice, making a total of ten instantiations of the RNN model. [sent-67, score-0.051]

35 2.4 Phrase-structure grammars: To prepare data for PSG training, the selected BNC sentences were parsed by the Stanford parser (Klein and Manning, 2003). [sent-69, score-0.08]

36 The resulting treebank was divided into nine increasingly large subsets, equal to those used for RNN training. [sent-70, score-0.06]

37 Grammars were induced from these subsets using the algorithm by Roark (2001) with its standard settings. [sent-71, score-0.04]

38 Next, surprisal values on the experimental sentences were generated by Roark’s incremental parser. [sent-72, score-0.663]

39 Since increasing the parser’s beam width has been shown to improve both word-probability estimates and the fit to word-reading times (Frank, 2009), the parser’s ‘base beam threshold’ parameter was reduced to 10^−20. [sent-73, score-0.195]

40 Footnote 1: Because not all experimental sentences could be parsed when the treebank comprised only 2K sentences, 1K sentences were added to the smallest subset. [sent-74, score-0.107]

41 3 EEG data collection: Twenty-four healthy, adult volunteers from the UCL Psychology subject pool took part in the reading study. [sent-75, score-0.062]

42 Their EEG was recorded continuously from 32 channels during the presentation of 5 practice sentences and the 205 experimental items. [sent-76, score-0.069]

43 Participants were asked to minimise blinks, eye movements, and head movements during sentence presentation. [sent-77, score-0.105]

44 Each sentence was preceded by a centrally presented fixation cross. [sent-78, score-0.09]

45 As soon as the participant pressed a key, the cross was replaced by the sentence’s first word, which was then automatically replaced by each subsequent word. [sent-79, score-0.024]

46 Word presentation duration (in milliseconds) equalled 190 + 20k, where k is the number of characters in the word (including any attached punctuation). [sent-80, score-0.086]
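The duration formula is easy to verify directly (the example word is ours):

```python
def presentation_duration_ms(word):
    """Word presentation duration: 190 ms plus 20 ms per character,
    counting any attached punctuation."""
    return 190 + 20 * len(word)

print(presentation_duration_ms("reading,"))  # 8 characters -> 350 ms
```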

47 After the word disappeared, there was a 390 ms interval before the next word appeared. [sent-81, score-0.134]

48 The sentences were presented in random order, one word at a time, always centrally located on the monitor. [sent-82, score-0.134]

49 One hundred and ten of the experimental sentences were followed by a yes/no comprehension question, to ensure that participants tried to understand the sentences. [sent-83, score-0.124]

50 All participants answered at least 80% of the comprehension questions correctly. [sent-84, score-0.137]

51 4.1 ERP components: Four ERP components of interest were identified from the literature on EEG and sentence reading: Early Left Anterior Negativity (ELAN), P200, N400, and a post-N400 positivity (PNP). [sent-86, score-0.158]

52 Table 1 lists the corresponding time windows and approximate electrode sites. [sent-87, score-0.095]

53 For each component, the average electrode potential over the corresponding time window and electrodes was computed. [sent-88, score-0.186]

54 The ELAN component is generally thought of as indicative of difficulty with constructing syntactic phrase structure (Friederici et al. [sent-90, score-0.042]

55 (2006) found effects of word frequency or length (which are strongly correlated and therefore difficult to tease apart) on the P200 amplitude. [Footnote 2: The P600 component (Osterhout and Holcomb, 1992) was not included because the shortest interval between consecutive word onsets was only 600 ms.] [sent-96, score-0.234] [sent-98, score-0.028]

57 Since we factor out these two lexical factors in the analysis, we expect no additional effect of surprisal on P200. [sent-99, score-0.65]

58 If any of the components is sensitive to word surprisal, this is most likely to be the N400 as many studies have already shown that N400 amplitude depends on subjective word predictability (Dambacher et al. [sent-100, score-0.425]

59 Whether an effect will appear on the PNP is more doubtful. [sent-103, score-0.027]

60 Van Petten and Luka (2012) argue that word expectations that are confirmed result in reduced N400 size, whereas expectations that are disconfirmed increase the PNP. [sent-104, score-0.13]

61 However, in a probabilistic setting, expectations are not all-or-nothing so there is no strict distinction between confirmation and disconfirmation. [sent-105, score-0.05]

62 Since the PNP has received relatively little attention, the component may not be such a reliable index of comprehension difficulty as the N400 has proven to be. [sent-107, score-0.12]

63 4.2 Regression analysis: Data were discarded on words attached to a comma, clitics, and sentence-initial and sentence-final words. [sent-109, score-0.027]

64 Moreover, artifacts in the EEG data (mostly due to eye blinks) were identified and removed, leaving 32,010 analysed data points per investigated ERP component. [sent-110, score-0.084]

65 For each data point and ERP component, a baseline potential was determined by averaging over the component’s electrodes in the 100 ms leading up to word onset. [sent-111, score-0.198]
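A minimal sketch of that baseline computation on a hypothetical recording; the channel indices and the 500 Hz sampling rate are assumptions, not taken from the paper:

```python
import numpy as np

def baseline_potential(eeg, channels, onset_sample, fs=500):
    """Mean potential over a component's electrodes in the 100 ms
    window before word onset. `eeg` is a (channels, samples) array;
    the 500 Hz sampling rate is an assumption."""
    window = int(0.100 * fs)  # number of samples in 100 ms
    return eeg[channels, onset_sample - window:onset_sample].mean()

rng = np.random.default_rng(0)
eeg = rng.standard_normal((32, 10_000))  # hypothetical 32-channel recording
n400_sites = [10, 11, 12, 18, 19]        # hypothetical centroparietal channels
print(baseline_potential(eeg, n400_sites, onset_sample=5_000))
```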

66 Also, all significant two-way interactions were included (main effects were removed if they were not significant and did not appear in any interaction). [Footnote 3: For word and sentence position, both linear and squared factors were included in order to capture possible non-linear effects.] [sent-113, score-0.16]

67 Parameters for the correlation between random intercept and slope were also estimated, if they significantly contributed to model fit. [sent-115, score-0.026]

68 When the surprisal estimates by a particular language model are included in the analysis, the regression model’s deviance decreases. [sent-116, score-0.754]

69 The size of this decrease is the χ²-statistic of a likelihood-ratio test for significance of the surprisal effect, and was taken as the measure of the surprisal values’ fit to the ERP data. [sent-117, score-1.312]
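The arithmetic of this likelihood-ratio test, sketched with hypothetical deviance values (the actual analysis used linear mixed-effects regression models):

```python
from scipy.stats import chi2

def lr_test(deviance_baseline, deviance_with_surprisal, df=1):
    """Drop in deviance when surprisal is added to the regression;
    chi-square distributed under the null hypothesis of no effect."""
    stat = deviance_baseline - deviance_with_surprisal
    return stat, chi2.sf(stat, df)

stat, p = lr_test(10_250.0, 10_238.5)  # hypothetical deviances
print(f"chi2 = {stat:.1f}, p = {p:.4f}")  # chi2 = 11.5, p ~ 0.0007
```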

70 Negative values will be used to indicate effects in the negative direction, that is, when higher surprisal results in more negative-going (or less positive-going) ERP deflections. [sent-118, score-0.799]

71 5.1 Surprisal effects: Figure 1 plots the fit of each model’s surprisal estimates to ERP amplitude as a function of the average natural log P(wt | w1, …, wt−1). [sent-120, score-1.134]

72 For the ELAN, P200 and PNP components, there were no significant effects after correcting for multiple comparisons. [sent-124, score-0.078]

73 In contrast, effects on the N400 were highly significant. [sent-125, score-0.078]

74 (i.e., those whose surprisal estimates fit the N400 data best). [sent-129, score-0.764]

75 Clearly, RNN-based surprisal explains variance over and above each of the other two models whereas neither the n-gram nor the PSG model outperforms the RNN. [sent-130, score-0.649]

76 07) amount of variance over and above the combined PSG and n-gram surprisals. [sent-133, score-0.026]

77 Footnote 4: This definition equals what Frank and Bod (2011) call ‘psychological accuracy’ in an analysis of reading times. [sent-135, score-0.094]

78 Footnote 5: This measure, which Frank and Bod (2011) call ‘linguistic accuracy’, equals the negative logarithm of the model’s perplexity. [sent-136, score-0.081]
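That relation is easy to check: the average log-probability per word equals the negative natural log of the perplexity (the probabilities below are hypothetical):

```python
import math

log_probs = [-4.2, -1.3, -6.8, -2.0]  # hypothetical natural log P(w_t | context)
avg_log_p = sum(log_probs) / len(log_probs)
perplexity = math.exp(-avg_log_p)

assert math.isclose(avg_log_p, -math.log(perplexity))
print(avg_log_p, perplexity)
```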

79 Increasing the amount of training data (or the value of n) resulted in higher linguistic accuracy, except for the three PSG models trained on the smallest amounts of data. [sent-137, score-0.094]

80 Figure 1: Fit to surprisal of ERP amplitude (for ELAN, P200, N400, and PNP components) as a function of average log P(wt | w1, …, wt−1). [sent-151, score-0.982]

81 Each plotted point corresponds to predictions by one of the trained models. [sent-155, score-0.026]

82 84, beyond which effects are statistically significant (p < .05). [sent-157, score-0.078]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('surprisal', 0.623), ('erp', 0.338), ('eeg', 0.318), ('amplitude', 0.225), ('pnp', 0.182), ('elan', 0.169), ('rnn', 0.147), ('psg', 0.127), ('wt', 0.126), ('dambacher', 0.095), ('electrodes', 0.095), ('frank', 0.095), ('comprehension', 0.078), ('effects', 0.078), ('estimates', 0.075), ('log', 0.067), ('bod', 0.067), ('fit', 0.066), ('components', 0.065), ('blinks', 0.064), ('centrally', 0.064), ('electrode', 0.064), ('erps', 0.064), ('fernandez', 0.064), ('monsalve', 0.064), ('reading', 0.062), ('bnc', 0.06), ('participants', 0.059), ('recurrent', 0.057), ('cloze', 0.056), ('expectations', 0.05), ('ucl', 0.049), ('predictability', 0.049), ('negative', 0.049), ('signal', 0.047), ('ms', 0.046), ('investigated', 0.044), ('component', 0.042), ('resulted', 0.041), ('grammars', 0.04), ('eye', 0.04), ('subsets', 0.04), ('sentences', 0.04), ('brain', 0.039), ('movements', 0.039), ('roark', 0.036), ('markov', 0.035), ('nine', 0.033), ('equals', 0.032), ('windows', 0.031), ('regression', 0.03), ('word', 0.03), ('london', 0.03), ('revealed', 0.03), ('presentation', 0.029), ('neural', 0.028), ('utsr', 0.028), ('gunter', 0.028), ('eel', 0.028), ('positivity', 0.028), ('neville', 0.028), ('eventrelated', 0.028), ('fpi', 0.028), ('tease', 0.028), ('rnns', 0.028), ('onset', 0.028), ('milliseconds', 0.028), ('interval', 0.028), ('effect', 0.027), ('surprising', 0.027), ('beam', 0.027), ('attached', 0.027), ('potential', 0.027), ('increasingly', 0.027), ('smallest', 0.027), ('psychological', 0.027), ('college', 0.026), ('trained', 0.026), ('subjective', 0.026), ('fixation', 0.026), ('peaks', 0.026), ('minimise', 0.026), ('healthy', 0.026), ('neuroscience', 0.026), ('sei', 0.026), ('gabriella', 0.026), ('disappeared', 0.026), ('intercept', 0.026), ('clitics', 0.026), ('negativity', 0.026), ('ally', 0.026), ('variance', 0.026), ('included', 0.026), ('ten', 0.025), ('confounding', 0.024), ('pressed', 0.024), ('moreno', 0.024), ('novels', 0.024), ('potentials', 0.024), ('thr', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 390 acl-2013-Word surprisal predicts N400 amplitude during reading

Author: Stefan L. Frank ; Leun J. Otten ; Giulia Galli ; Gabriella Vigliocco

Abstract: We investigated the effect of word surprisal on the EEG signal during sentence reading. On each word of 205 experimental sentences, surprisal was estimated by three types of language model: Markov models, probabilistic phrase-structure grammars, and recurrent neural networks. Four event-related potential components were extracted from the EEG of 24 readers of the same sentences. Surprisal estimates under each model type formed a significant predictor of the amplitude of the N400 component only, with more surprising words resulting in more negative N400s. This effect was mostly due to content words. These findings provide support for surprisal as a generally applicable measure of processing difficulty during language comprehension.

2 0.085787557 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation

Author: Minwei Feng ; Jan-Thorsten Peter ; Hermann Ney

Abstract: In this paper, we propose a novel reordering model based on sequence labeling techniques. Our model converts the reordering problem into a sequence labeling problem, i.e. a tagging task. Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average. Results of comparative study with other seven widely used reordering models will also be reported.

3 0.064577587 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us

Author: Antske Fokkens ; Marieke van Erp ; Marten Postma ; Ted Pedersen ; Piek Vossen ; Nuno Freire

Abstract: Repeating experiments is an important instrument in the scientific toolbox to validate previous work and build upon existing work. We present two concrete use cases involving key techniques in the NLP domain for which we show that reproducing results is still difficult. We show that the deviation that can be found in reproduction efforts leads to questions about how our results should be interpreted. Moreover, investigating these deviations provides new insights and a deeper understanding of the examined techniques. We identify five aspects that can influence the outcomes of experiments that are typically not addressed in research papers. Our use cases show that these aspects may change the answer to research questions leading us to conclude that more care should be taken in interpreting our results and more research involving systematic testing of methods is required in our field.

4 0.0570876 275 acl-2013-Parsing with Compositional Vector Grammars

Author: Richard Socher ; John Bauer ; Christopher D. Manning ; Ng Andrew Y.

Abstract: Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and, implemented approximately as an efficient reranker, it is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information such as PP attachments.

5 0.05355281 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation.

6 0.04571062 84 acl-2013-Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling

7 0.037608374 371 acl-2013-Unsupervised joke generation from big data

8 0.036164004 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms

9 0.035411555 64 acl-2013-Automatically Predicting Sentence Translation Difficulty

10 0.034839638 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models

11 0.031991124 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering

12 0.03160179 67 acl-2013-Bi-directional Inter-dependencies of Subjective Expressions and Targets and their Value for a Joint Model

13 0.030963534 135 acl-2013-English-to-Russian MT evaluation campaign

14 0.030952595 339 acl-2013-Temporal Signals Help Label Temporal Relations

15 0.030948155 267 acl-2013-PARMA: A Predicate Argument Aligner

16 0.02980819 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

17 0.029360529 309 acl-2013-Scaling Semi-supervised Naive Bayes with Feature Marginals

18 0.02894905 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

19 0.028160784 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

20 0.02803088 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.087), (1, 0.002), (2, -0.004), (3, -0.004), (4, -0.009), (5, -0.02), (6, 0.021), (7, -0.013), (8, -0.022), (9, 0.011), (10, -0.014), (11, 0.005), (12, -0.023), (13, -0.035), (14, -0.056), (15, 0.009), (16, -0.036), (17, 0.006), (18, 0.008), (19, -0.071), (20, 0.007), (21, -0.018), (22, 0.02), (23, -0.02), (24, 0.053), (25, 0.03), (26, -0.026), (27, -0.069), (28, -0.019), (29, -0.027), (30, -0.03), (31, -0.048), (32, -0.011), (33, -0.022), (34, 0.001), (35, -0.029), (36, 0.012), (37, 0.044), (38, -0.012), (39, 0.029), (40, -0.026), (41, -0.016), (42, -0.065), (43, 0.002), (44, -0.049), (45, 0.026), (46, 0.016), (47, -0.02), (48, 0.025), (49, -0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88149387 390 acl-2013-Word surprisal predicts N400 amplitude during reading

Author: Stefan L. Frank ; Leun J. Otten ; Giulia Galli ; Gabriella Vigliocco

Abstract: We investigated the effect of word surprisal on the EEG signal during sentence reading. On each word of 205 experimental sentences, surprisal was estimated by three types of language model: Markov models, probabilistic phrase-structure grammars, and recurrent neural networks. Four event-related potential components were extracted from the EEG of 24 readers of the same sentences. Surprisal estimates under each model type formed a significant predictor of the amplitude of the N400 component only, with more surprising words resulting in more negative N400s. This effect was mostly due to content words. These findings provide support for surprisal as a generally applicable measure of processing difficulty during language comprehension.

2 0.66418129 84 acl-2013-Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling

Author: Heike Adel ; Ngoc Thang Vu ; Tanja Schultz

Abstract: In this paper, we investigate the application of recurrent neural network language models (RNNLM) and factored language models (FLM) to the task of language modeling for Code-Switching speech. We present a way to integrate part-of-speech tags (POS) and language information (LID) into these models which leads to significant improvements in terms of perplexity. Furthermore, a comparison between RNNLMs and FLMs and a detailed analysis of perplexities on the different backoff levels are performed. Finally, we show that recurrent neural networks and factored language models can be combined using linear interpolation to achieve the best performance. The final combined language model provides 37.8% relative improvement in terms of perplexity on the SEAME development set and a relative improvement of 32.7% on the evaluation set compared to the traditional n-gram language model. Index Terms: multilingual speech processing, code switching, language modeling, recurrent neural networks, factored language models

3 0.62032008 247 acl-2013-Modeling of term-distance and term-occurrence information for improving n-gram language model performance

Author: Tze Yuang Chong ; Rafael E. Banchs ; Eng Siong Chng ; Haizhou Li

Abstract: In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling. We attempt to extract this information from history-contexts of up to ten words in size, and found it complements well the n-gram model, which inherently suffers from data scarcity in learning long history-contexts. Evaluated on the WSJ corpus, bigram and trigram model perplexity were reduced up to 23.5% and 14.0%, respectively. Compared to the distant bigram, we show that word-pairs can be more effectively modeled in terms of both distance and occurrence. 1

4 0.59339541 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language

Author: Tiberiu Boros ; Radu Ion ; Dan Tufis

Abstract: Standard methods for part-of-speech tagging suffer from data sparseness when used on highly inflectional languages (which require large lexical tagset inventories). For this reason, a number of alternative methods have been proposed over the years. One of the most successful methods used for this task, Tiered Tagging (Tufiş, 1999), exploits a reduced set of tags derived by removing several recoverable features from the lexicon morpho-syntactic descriptions. A second phase is aimed at recovering the full set of morpho-syntactic features. In this paper we present an alternative method to Tiered Tagging, based on local optimizations with Neural Networks, and we show how, by properly encoding the input sequence in a general Neural Network architecture, we achieve results similar to the Tiered Tagging methodology, significantly faster and without requiring extensive linguistic knowledge as implied by the previously mentioned method.

5 0.58414268 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information

Author: Ahmed Salama ; Kemal Oflazer ; Susan Hagan

Abstract: We present results from our study, which uses syntactically and semantically motivated information to group segments of sentences into unbreakable units for the purpose of typesetting those sentences in a region of a fixed width, using an otherwise standard dynamic programming line breaking algorithm, to minimize raggedness. In addition to a rule-based baseline segmenter, we use a very modest size text, manually annotated with positions of breaks, to train a maximum entropy classifier, relying on an extensive set of lexical and syntactic features, which can then predict whether or not to break after a certain word position in a sentence. We also use a simple genetic algorithm to search for a subset of the features optimizing F1, to arrive at a set of features that delivers 89.2% Precision, 90.2% Recall (89.7% F1) on a test set, improving the rule-based baseline by about 11 points and the classifier trained on all features by about 1 point in F1.

6 0.57741523 325 acl-2013-Smoothed marginal distribution constraints for language modeling

7 0.57731026 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach

8 0.56912899 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation

9 0.56744283 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.

10 0.567056 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?

11 0.55511922 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

12 0.54705685 63 acl-2013-Automatic detection of deception in child-produced speech using syntactic complexity features

13 0.53681374 14 acl-2013-A Novel Classifier Based on Quantum Computation

14 0.5325008 225 acl-2013-Learning to Order Natural Language Texts

15 0.52546555 349 acl-2013-The mathematics of language learning

16 0.51006192 257 acl-2013-Natural Language Models for Predicting Programming Comments

17 0.50514865 175 acl-2013-Grounded Language Learning from Video Described with Sentences

18 0.50489044 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

19 0.50204122 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

20 0.50194913 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.044), (6, 0.02), (11, 0.061), (24, 0.03), (26, 0.038), (29, 0.015), (35, 0.056), (42, 0.073), (48, 0.033), (70, 0.028), (88, 0.03), (90, 0.446), (95, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90273213 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation

Author: Guillaume Wisniewski

Abstract: This paper tackles the problem of collecting reliable human assessments. We show that knowing multiple scores for each example instead of a single score results in a more reliable estimation of a system quality. To reduce the cost of collecting these multiple ratings, we propose to use matrix completion techniques to predict some scores knowing only scores of other judges and some common ratings. Even if prediction performance is pretty low, decisions made using the predicted score proved to be more reliable than decision based on a single rating of each example.

2 0.88687211 182 acl-2013-High-quality Training Data Selection using Latent Topics for Graph-based Semi-supervised Learning

Author: Akiko Eriguchi ; Ichiro Kobayashi

Abstract: In a multi-class document categorization using graph-based semi-supervised learning (GBSSL), it is essential to construct a proper graph expressing the relation among nodes and to use a reasonable categorization algorithm. Furthermore, it is also important to provide high-quality correct data as training data. In this context, we propose a method to construct a similarity graph by employing both surface information and latent information to express similarity between nodes and a method to select high-quality training data for GBSSL by means of the PageRank algorithm. Experimenting on the Reuters-21578 corpus, we have confirmed that our proposed methods work well for raising the accuracy of a multi-class document categorization.

same-paper 3 0.8786577 390 acl-2013-Word surprisal predicts N400 amplitude during reading

Author: Stefan L. Frank ; Leun J. Otten ; Giulia Galli ; Gabriella Vigliocco

Abstract: We investigated the effect of word surprisal on the EEG signal during sentence reading. On each word of 205 experimental sentences, surprisal was estimated by three types of language model: Markov models, probabilistic phrase-structure grammars, and recurrent neural networks. Four event-related potential components were extracted from the EEG of 24 readers of the same sentences. Surprisal estimates under each model type formed a significant predictor of the amplitude of the N400 component only, with more surprising words resulting in more negative N400s. This effect was mostly due to content words. These findings provide support for surprisal as a generally applicable measure of processing difficulty during language comprehension.

4 0.84966767 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation

Author: Fabienne Braune ; Nina Seemann ; Daniel Quernheim ; Andreas Maletti

Abstract: We present a new translation model integrating the shallow local multi bottom-up tree transducer. We perform a large-scale empirical evaluation of our obtained system, which demonstrates that we significantly beat a realistic tree-to-tree baseline on the WMT 2009 English → German translation task. As an additional contribution we make the developed software and complete tool-chain publicly available for further experimentation.

5 0.83341825 200 acl-2013-Integrating Phrase-based Reordering Features into a Chart-based Decoder for Machine Translation

Author: ThuyLinh Nguyen ; Stephan Vogel

Abstract: Hiero translation models have two limitations compared to phrase-based models: 1) Limited hypothesis space; 2) No lexicalized reordering model. We propose an extension of Hiero called Phrasal-Hiero to address Hiero's second problem. Phrasal-Hiero still has the same hypothesis space as the original Hiero but incorporates a phrase-based distance cost feature and lexicalized reordering features into the chart decoder. The work consists of two parts: 1) for each Hiero translation derivation, find its corresponding discontinuous phrase-based path; 2) extend the chart decoder to incorporate features from the phrase-based path. We achieve significant improvement over both Hiero and phrase-based baselines for Arabic-English, Chinese-English and German-English translation.

6 0.80010134 139 acl-2013-Entity Linking for Tweets

7 0.7713089 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

8 0.53369343 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

9 0.5083847 166 acl-2013-Generalized Reordering Rules for Improved SMT

10 0.50143033 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

11 0.49431431 250 acl-2013-Models of Translation Competitions

12 0.49366292 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

13 0.4864493 314 acl-2013-Semantic Roles for String to Tree Machine Translation

14 0.47274792 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

15 0.46370131 165 acl-2013-General binarization for parsing and translation

16 0.44182807 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

17 0.44177261 15 acl-2013-A Novel Graph-based Compact Representation of Word Alignment

18 0.43871677 4 acl-2013-A Context Free TAG Variant

19 0.43605697 312 acl-2013-Semantic Parsing as Machine Translation

20 0.4351005 126 acl-2013-Diverse Keyword Extraction from Conversations