Syntactic Surprisal Affects Spoken Word Duration in Conversational Contexts

Vera Demberg, Asad B. Sayeed, Philip Gorinski, and Nikolaos Engonopoulos

Abstract

We present results of a novel experiment that investigates speech production in conversational data, linking speech rate to information density. We provide the first evidence for an association between syntactic surprisal and word duration in recorded speech. Using the AMI corpus, which contains transcriptions of focus group meetings with precise word durations, we show that word durations correlate with syntactic surprisal estimated from the incremental Roark parser over and above simpler measures, such as word durations estimated by a state-of-the-art text-to-speech system and word frequencies, and that the syntactic surprisal estimates are better predictors of word durations than a simpler version of surprisal based on trigram probabilities. This result supports the uniform information density (UID) hypothesis and points the way toward more realistic artificial speech generation.
1 Introduction

The uniform information density (UID) hypothesis suggests that speakers try to distribute information uniformly across their utterances (Frank and Jaeger, 2008). Information density can be measured in terms of the surprisal incurred at each word, where surprisal is defined as the negative log-probability of an event. We investigate whether speakers adapt word duration during production to syntactic surprisal, such that words with higher surprisal have longer durations than words with lower surprisal. Using linear mixed-effects modeling, we found that syntactic surprisal as calculated from a top-down incremental PCFG parser accounts for a significant amount of variation in spoken word duration, using an HMM-trained text-to-speech system as a baseline.
1.1 Related work

The use of word-level surprisal as a predictor of processing difficulty is based on the notion that processing difficulty results when a word is encountered that is unexpected given its preceding context. The surprisal of a word w_i can be formalized as the log of the inverse conditional probability of w_i given the preceding words w_1 ... w_{i-1} of the sentence:

    S(w_i) = -log P(w_i | w_1 ... w_{i-1})

If this probability is low, then -log P(w_i | w_1 ... w_{i-1}) is large: the word is unexpected, and surprisal is high.
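As a concrete illustration, the sketch below computes unsmoothed trigram surprisal from raw corpus counts, in the spirit of the n-gram estimates discussed below. The function names and the choice of log base 2 are ours; the papers cited here do not prescribe a particular implementation.

```python
import math
from collections import defaultdict

def train_trigram_counts(sentences):
    """Collect raw trigram and bigram-history counts from tokenized sentences."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in sentences:
        padded = ["<s>", "<s>"] + sent
        for i in range(2, len(padded)):
            bi[(padded[i - 2], padded[i - 1])] += 1
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
    return tri, bi

def trigram_surprisal(tri, bi, w1, w2, w3):
    """Surprisal of w3 given its two-word history: -log2 P(w3 | w1 w2).

    Unsmoothed maximum-likelihood estimate; unseen trigrams get infinite surprisal.
    """
    if tri[(w1, w2, w3)] == 0 or bi[(w1, w2)] == 0:
        return float("inf")
    return -math.log2(tri[(w1, w2, w3)] / bi[(w1, w2)])
```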
Hale (2001) showed that surprisal calculated from a probabilistic Earley parser correctly predicts well-known processing difficulty phenomena (e.g., garden paths), and Levy (2008) further demonstrated the relevance of surprisal to human sentence processing difficulty on a range of syntactic phenomena.
There is existing work correlating information-theoretic measures of linguistic redundancy with the observed duration of speech units. Aylett and Turk (2006) demonstrate that the contextual predictability of a syllable (its n-gram log probability) has an inverse relationship to syllable duration in speech. Levy and Jaeger (2007) show that the reduction of optional that-complementizers in English is related to trigram surprisal; low surprisal predicts a high likelihood of reduction. They use the surprisal of the candidate word itself as well as the surprisals of the word before and after, computing bigram and trigram estimates directly from the corpus without smoothing or backoff.
In work on word length, Piantadosi et al. (2011) calculated 2-, 3-, and 4-gram surprisal values from the Google n-gram dataset for every occurrence of every word in a given language's lexicon, and then took the mean surprisal for each word over all of its occurrences. The 3-gram surprisal values in particular were a better predictor of orthographic length than unigram frequency, providing evidence for the use of information content and contextual predictability as an improvement over a Zipf's Law view of communicative efficiency.
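A sketch of the per-word-type averaging step described above; the input format (an iterable of (word, surprisal) pairs, one per token occurrence) is an assumption of ours.

```python
from collections import defaultdict

def mean_surprisal_per_type(token_surprisals):
    """Average the n-gram surprisal of each word type over all of its occurrences."""
    sums, counts = defaultdict(float), defaultdict(int)
    for word, surprisal in token_surprisals:
        sums[word] += surprisal
        counts[word] += 1
    return {word: sums[word] / counts[word] for word in sums}
```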
Kuperman et al. (2007) analyzed the relationship between linguistic unit predictability and syllable duration in read-aloud speech in Dutch. They found that the more predictable an interfix is given its context (i.e., the surrounding members of the compound), the longer the duration of its pronunciation.
Existing broad-coverage work on syntactic surprisal has largely focused on comprehension phenomena, such as Demberg and Keller (2008) and Roark et al. (2009), but such studies show that frequency effects work in the expected direction at the syntactic level.
In section 2, we describe at a high level the procedure we used to test our hypothesis that parser-derived surprisal values can partly account for utterance-duration variation. Sections 5 and 6 describe how we use linear mixed-effects modeling to find significant correlations between our predictors and the response variable, and we make some concluding remarks in section 7.
After pre-processing, for each word in the corpus, we extract the following predictors: canonical speech durations from the MARY text-to-speech system, logarithmic word frequencies, n-gram surprisal, and surprisal values produced by the Roark (2001a) parser. We then fit linear mixed-effects models with the observed durations as a response variable and the predictors mentioned above, in order to detect whether syntactic surprisal is a significant positive predictor of spoken word durations above and beyond the more basic effects of canonical word duration and word frequency.
Utterances that did not form a complete sentence were filtered out in order to maximize the proportion of complete parses in the surprisal calculation.

As one reviewer notes, one way of estimating word durations would be to calculate the average duration of each word in the corpus.
We instead use word duration estimates from the state-of-the-art open-source text-to-speech system MARY (Schröder et al.). We computed two versions of these estimates: one for words without sentential context and one with sentential context. In other words, we obtain the second version of the MARY word duration estimates by running whole sentences through MARY, segmented by the standard punctuation marks used in the AMI corpus transcriptions. For each version, we obtained phone durations using MARY and calculated the total duration of a word as the sum of the estimated phone durations for that word.
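A minimal sketch of the duration computation just described, assuming the phone-level output has already been parsed into (phone, duration) pairs; MARY's actual output format (XML) differs, and the parsing step is omitted here.

```python
def word_duration(phone_durations):
    """Canonical duration of a word: the sum of its estimated phone durations.

    `phone_durations` is a list of (phone, duration_in_seconds) pairs for one
    word, e.g. [("dh", 0.04), ("ih", 0.06), ("s", 0.08)] -> 0.18.
    """
    return sum(duration for _, duration in phone_durations)
```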
These durations serve as the "canonical" baselines to which the observed durations of the words in the AMI corpus are compared.
3 Word frequency baselines

In order to account for the effects of simple word frequency on utterance duration, we extracted two types of frequency counts.
4 Surprisal models

For predicting the surprisal of utterances in context, two different types of models were used: n-gram probability models, and Roark's (2001a) incremental top-down parser, which is capable of calculating prefix probabilities. We also estimated word frequencies to account for words being spoken more quickly due to their higher frequency, which is independent of structural surprisal.
Since the parser quantifies how expected or unexpected each upcoming word is, we can calculate the syntactic surprisal of each word in a sentence. The syntactic surprisal at word w_i is defined as the difference between the prefix log-probability at word w_{i-1} and the prefix log-probability at word w_i:

    S(w_i) = log P(w_1 ... w_{i-1}) - log P(w_1 ... w_i)
(Figure 1 walks through an example sentence, giving the Roark parser surprisal values word by word.) As the amount of probability mass lost has been shown to be small (Roark, 2001b), the surprisal estimates can be assumed to be a good approximation.
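The per-word computation is then a simple difference of consecutive prefix log-probabilities; a sketch follows (the list-based interface is ours, not the Roark parser's actual output format).

```python
def surprisals_from_prefix_logprobs(prefix_logprobs):
    """Per-word syntactic surprisal from a parser's prefix log-probabilities.

    `prefix_logprobs[i]` is log P(w_1 ... w_{i+1}); the surprisal of word i+1
    is the drop in prefix log-probability incurred by adding that word.
    """
    surprisals = []
    previous = 0.0  # log P(empty prefix) = log 1 = 0
    for logprob in prefix_logprobs:
        surprisals.append(previous - logprob)
        previous = logprob
    return surprisals
```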
The parser was trained on Wall Street Journal sections 2–21 and applied to parse the full sentences of the AMI corpus, collecting the predicted surprisal at each word (see Figure 2 for an example).
Syntactic surprisal can furthermore be decomposed into a structural and a lexical part: sometimes high surprisal is due to a word being incompatible with the high-probability syntactic structures, while at other times high surprisal is simply due to a lexical item being unexpected. It is interesting to evaluate these two aspects of syntactic surprisal separately, and the Roark parser conveniently outputs both estimates. Structural surprisal is estimated from the occurrence counts of the applications of syntactic rules during the parse, discounting the effect of lexical probabilities, while lexical surprisal is calculated from the probabilities of the derivational step from the POS tag to the lexical item.
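Schematically, the decomposition can be written as follows; the notation here is ours, and the exact derivation is given in Roark et al. (2009).

```latex
% Total syntactic surprisal at w_i splits into a structural part, driven by
% the syntactic rule applications, and a lexical part, driven by the
% POS-tag-to-word derivation step:
S(w_i) = S_{\mathrm{struct}}(w_i) + S_{\mathrm{lex}}(w_i),
\qquad
S_{\mathrm{lex}}(w_i) \approx -\log P\bigl(w_i \mid \mathrm{POS}(w_i)\bigr)
```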
5 Linear mixed effects modelling

In order to test whether surprisal estimates correlate with speech durations, we use linear mixed-effects models (LME; Pinheiro and Bates, 2000). We treat speakers as a random factor, which means that our models contain an intercept term for each speaker, representing individual differences in speech rates. In a first step, we fit a baseline model with all predictors related to a word's canonical duration and its frequency, as well as their random slopes, to the observed word durations. We then calculate the residuals of that model, i.e. the part of the observed word durations that cannot be accounted for by canonical word durations or word frequency. For each of our predictors of interest (n-gram surprisal, syntactic surprisal), we then fit another linear mixed-effects model with random slopes to the residuals of the baseline model. This residualization removes any correlation between the predictors of interest, i.e. surprisal, and word frequency or canonical duration.
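A sketch of the two-step residualization in Python with statsmodels (the authors would likely have used an R toolchain such as lme4/nlme; the column names, the file name, and the simplified random-effects structure here are our assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per word token: observed duration plus the predictors described above.
df = pd.read_csv("ami_word_data.csv")  # hypothetical file name

# Step 1: baseline model with canonical duration, log frequency, and their
# interaction, with a per-speaker random intercept (random slopes omitted
# here for brevity).
baseline = smf.mixedlm("duration ~ mary_context * log_freq",
                       df, groups=df["speaker"]).fit()
df["residual"] = baseline.resid

# Step 2: regress a predictor of interest (here, Roark surprisal, with a
# per-speaker random slope) on the residuals of the baseline model.
surprisal_model = smf.mixedlm("residual ~ roark_surprisal",
                              df, groups=df["speaker"],
                              re_formula="~roark_surprisal").fit()
print(surprisal_model.summary())
```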
6 Results

Our baseline model uses speech durations from the AMI corpus as the response variable, and canonical duration estimates from the MARY TTS system and log word frequencies as predictors. We exclude from the analysis all data points with zero duration (effectively, punctuation) or a real duration longer than 2 seconds. Furthermore, we exclude all words which were never seen in Gigaword and any words for which syntactic surprisal could not be estimated.
MARY duration models. As mentioned in earlier sections, we calculated different versions of the MARY-estimated word durations: one model without the sentential context and one model with the sentential context. In our regression analyses we find, as expected, that the model which includes sentential context achieves a much better fit with the actually measured word durations from the AMI corpus (AIC = 32167) than the model without context (AIC = 70917).
Word frequency estimates. We estimated word frequencies from several different resources: from the AMI corpus, to obtain a spoken-domain frequency, and from Gigaword, as a very large resource. We find that both frequency estimates significantly improve model fit over a model that does not contain frequency estimates, and that including both frequency estimates improves model fit with respect to a model that includes just one of the two. Furthermore, including in the regression an interaction of estimated word duration and word frequency also significantly increases model fit. This means that words which are short and frequent have longer durations than would be estimated by adding up their length and frequency effects.
We see a highly significant effect in the expected direction for both the canonical duration estimate and word frequency. The positive coefficient for MARY CONTEXT means that the TTS duration estimates are positively correlated with the measured word durations. Finally, the negative coefficient for the interaction between word durations and frequencies means that the duration estimates for short frequent and long infrequent words are less extreme than otherwise predicted by the main effects of duration and frequency.
Note, though, that the predictors are themselves correlated (for correlations of the main predictors used in these analyses, see Table 1), so there is some collinearity in the model below. More importantly, we remove any collinearity between the baseline predictors and our predictors of interest, i.e. the surprisal estimates from the n-gram models and the parser. Therefore, we run separate regression models for these predictors on the residuals of the baseline model.
Table 3: Linear mixed effects model of observed word durations on the AMI corpus data, with MARY CONTEXT (including the sentential context) and WORD FREQUENCY as predictors, with a random intercept for speaker and random slopes under speaker.
A further model includes 4-gram surprisal trained on Gigaword as a predictor.
N-gram surprisal estimated on newspaper texts from the PTB or Gigaword was a statistically significant positive predictor of spoken word durations beyond simple word frequencies (but PTB n-gram surprisal did not improve fit over models containing Gigaword frequency estimates).

Surprisal. Surprisal effects were found to have a robust, significant positive coefficient, meaning that words with higher surprisal are spoken more slowly/clearly than expected when taking into account only canonical word duration and word frequency.
Table 4: Linear mixed effects model of surprisal (based on the Roark parser) with random intercept for speaker and random slope. The response variable is the residual word durations from the model shown in Table 3.
Surprisal estimated from the Roark parser also remains a significant positive predictor when regressed against the residuals of a baseline model including both 3-gram surprisal from the AMI corpus and 4-gram surprisal from the Gigaword corpus. To make sure that the observed surprisal effect is indeed related to syntax and cannot be explained away as a frequency effect, we also calculated frequency estimates based on the Penn Treebank. The significant positive surprisal effect remains stable even when run on the residuals of a model which includes PTB trigrams and PTB frequencies.
To provide some intuition, we calculate the estimated effect size of Roark surprisal on speech durations. Per Roark surprisal "unit", the model estimates a 7 msec difference. (The estimate per unit of residualized Roark surprisal is slightly different, but it is even less intuitive what that means, hence we calculate with non-residualized surprisal here.) The range of Roark surprisal in our data set is roughly from 0 to 25, with most values between 2 and 15. For a word like "thing", which in one instance in the AMI corpus was estimated with a surprisal of about 2 and in another instance with a surprisal of about 17, the estimated difference in duration between the two instances would thus be 104 msec, which is certainly an audible difference. (The full range of Roark surprisal corresponds to a 174 msec difference, whereas the full range of Gigaword 4-gram surprisal corresponds to 35 msec.)
When analysing the surprisal effect in more detail, we find that both the structural component of surprisal and its lexical component are significant positive predictors of word durations, as is the interaction between them, which has a negative slope.
Table 5: Linear mixed effects model fit to the residuals of the baseline model from Table 3, with random intercept for speaker and random slopes, for the structural and lexical components of surprisal estimated using the Roark parser. A model including both components and their interaction achieves a better model fit (in AIC and BIC scores) than a model with only the full surprisal effect.
To summarize, the positive coefficient of surprisal means that words which carry a lot of information from a structural point of view are spoken more slowly than words that carry less such information. These results thus provide good evidence for our hypothesis that the predictability of syntactic structure affects phonetic realization, and that speakers use speech rate to achieve more uniform information density.
Native vs. non-native speakers. Finally, we also compared effects for our native vs. non-native speakers. Given the slightly higher coefficient and significance for surprisal among native speakers, it might be possible to interpret the findings in the sense that native speakers are more proficient at adapting their speech rate to (syntactic) complexity in order to achieve more uniform information density.

Table 6: Native speakers are possibly slightly better at adapting their speech rate to syntactic surprisal than non-native speakers. The surprisal value is for a model with the residuals of the other predictors as the dependent variable.
7 Conclusions

From an applied perspective, the fact that frequency and syntactic surprisal have a significant effect beyond what an HMM-trained TTS model would predict for individual words is a case for further research into incorporating syntactic models into speech production systems. Our methodology immediately provides a framework for estimating the word-by-word effect on duration for increased naturalness in TTS output. This is relevant to spoken dialogue systems because it appears that synthesized speech requires a greater level of attention from dialogue system users than the same words delivered in natural speech (Delogu et al.). Some of this effect may be attributable to peaks in information density caused by current-generation systems not compensating for areas of high information density through speech rate, lexical choice, and structural choice. Our result points toward an explanation of this phenomenon by demonstrating that the differences between current artificial speech and natural speech can be partially explained through higher-level syntactic features. From a theoretical and neuroanatomical perspective, the finding that a measure of syntactic ambiguity reduction has an effect on the phonological layer of production has additional implications for the organization of the human language production system.
References

Matthew Aylett and Alice Turk. 2006. Language redundancy predicts syllabic duration and the spectral characteristics of vocalic syllable nuclei. Journal of the Acoustical Society of America.

John Kominek and Alan W. Black. 2004. The CMU Arctic speech databases for speech synthesis research. In Proceedings of the 5th ISCA Speech Synthesis Workshop.

Victor Kuperman, Mark Pluymaekers, Mirjam Ernestus, and Harald Baayen. 2007. Morphological predictability and acoustic duration of interfixes in Dutch compounds. Journal of the Acoustical Society of America.
wordName wordTfidf (topN-words)
[('surprisal', 0.792), ('duration', 0.272), ('durations', 0.249), ('ami', 0.194), ('roark', 0.149), ('predictors', 0.106), ('uid', 0.093), ('residuals', 0.083), ('jaeger', 0.08), ('gigaword', 0.079), ('effects', 0.071), ('mary', 0.064), ('tts', 0.062), ('speech', 0.062), ('canonical', 0.06), ('prefix', 0.059), ('estimates', 0.058), ('density', 0.055), ('syllable', 0.053), ('predictability', 0.052), ('speakers', 0.051), ('spoken', 0.049), ('speaker', 0.048), ('syntactic', 0.046), ('earley', 0.045), ('slopes', 0.045), ('predictor', 0.044), ('levy', 0.042), ('fit', 0.042), ('intercept', 0.041), ('kuperman', 0.041), ('sentential', 0.04), ('ptb', 0.04), ('estimated', 0.038), ('frequency', 0.038), ('demberg', 0.036), ('aic', 0.036), ('native', 0.035), ('calculated', 0.034), ('coefficient', 0.033), ('parser', 0.032), ('collinearity', 0.031), ('interfix', 0.031), ('piantadosi', 0.031), ('syllabic', 0.031), ('production', 0.03), ('conversational', 0.03), ('phonological', 0.03), ('frequencies', 0.028), ('wi', 0.027), ('synthesized', 0.027), ('baayen', 0.027), ('bates', 0.027), ('trigram', 0.026), ('uniform', 0.026), ('utterance', 0.025), ('schr', 0.024), ('ngram', 0.024), ('utterances', 0.024), ('synthesis', 0.024), ('transcribed', 0.024), ('regression', 0.023), ('effect', 0.023), ('frank', 0.023), ('cmu', 0.022), ('mixed', 0.022), ('gibson', 0.021), ('keller', 0.021), ('recorded', 0.021), ('arctic', 0.021), ('blizzard', 0.021), ('delogu', 0.021), ('fang', 0.021), ('hts', 0.021), ('interjections', 0.021), ('kominek', 0.021), ('meetings', 0.021), ('oder', 0.021), ('pinheiro', 0.021), ('rosenkrantz', 0.021), ('segalowitz', 0.021), ('wordfrequency', 0.021), ('zen', 0.021), ('analyses', 0.021), ('structural', 0.02), ('response', 0.019), ('word', 0.019), ('incremental', 0.019), ('implications', 0.019), ('unexpected', 0.019), ('cognitive', 0.018), ('evidence', 0.018), ('phone', 0.018), ('swift', 0.018), ('logxp', 0.018), ('hale', 0.018), ('swi', 0.018), ('xml', 0.018), ('slope', 0.018), ('aylett', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999833 122 emnlp-2012-Syntactic Surprisal Affects Spoken Word Duration in Conversational Contexts
Author: Vera Demberg ; Asad Sayeed ; Philip Gorinski ; Nikolaos Engonopoulos
Abstract: We present results of a novel experiment to investigate speech production in conversational data that links speech rate to information density. We provide the first evidence for an association between syntactic surprisal and word duration in recorded speech. Using the AMI corpus which contains transcriptions of focus group meetings with precise word durations, we show that word durations correlate with syntactic surprisal estimated from the incremental Roark parser over and above simpler measures, such as word duration estimated from a state-of-the-art text-to-speech system and word frequencies, and that the syntactic surprisal estimates are better predictors of word durations than a simpler version of surprisal based on trigram probabilities. This result supports the uniform information density (UID) hypothesis and points a way to more realistic artificial speech generation.
2 0.055541039 102 emnlp-2012-Optimising Incremental Dialogue Decisions Using Information Density for Interactive Systems
Author: Nina Dethlefs ; Helen Hastie ; Verena Rieser ; Oliver Lemon
Abstract: Incremental processing allows system designers to address several discourse phenomena that have previously been somewhat neglected in interactive systems, such as backchannels or barge-ins, but that can enhance the responsiveness and naturalness of systems. Unfortunately, prior work has focused largely on deterministic incremental decision making, rendering system behaviour less flexible and adaptive than is desirable. We present a novel approach to incremental decision making that is based on Hierarchical Reinforcement Learning to achieve an interactive optimisation of Information Presentation (IP) strategies, allowing the system to generate and comprehend backchannels and barge-ins, by employing the recent psycholinguistic hypothesis of information density (ID) (Jaeger, 2010). Results in terms of average rewards and a human rating study show that our learnt strategy outperforms several baselines that are | v not sensitive to ID by more than 23%.
3 0.051840808 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
Author: Su-Youn Yoon ; Suma Bhat
Abstract: This study presents a novel method that measures English language learners’ syntactic competence towards improving automated speech scoring systems. In contrast to most previous studies which focus on the length of production units such as the mean length of clauses, we focused on capturing the differences in the distribution of morpho-syntactic features or grammatical expressions across proficiency. We estimated the syntactic competence through the use of corpus-based NLP techniques. Assuming that the range and so- phistication of grammatical expressions can be captured by the distribution of Part-ofSpeech (POS) tags, vector space models of POS tags were constructed. We use a large corpus of English learners’ responses that are classified into four proficiency levels by human raters. Our proposed feature measures the similarity of a given response with the most proficient group and is then estimates the learner’s syntactic competence level. Widely outperforming the state-of-the-art measures of syntactic complexity, our method attained a significant correlation with humanrated scores. The correlation between humanrated scores and features based on manual transcription was 0.43 and the same based on ASR-hypothesis was slightly lower, 0.42. An important advantage of our method is its robustness against speech recognition errors not to mention the simplicity of feature generation that captures a reasonable set of learnerspecific syntactic errors. 600 Measures Suma Bhat Beckman Institute, Urbana, IL 61801 . spbhat 2 @ i l l ino i edu s
4 0.049240023 17 emnlp-2012-An "AI readability" Formula for French as a Foreign Language
Author: Thomas Francois ; Cedrick Fairon
Abstract: This paper present a new readability formula for French as a foreign language (FFL), which relies on 46 textual features representative of the lexical, syntactic, and semantic levels as well as some of the specificities of the FFL context. We report comparisons between several techniques for feature selection and various learning algorithms. Our best model, based on support vector machines (SVM), significantly outperforms previous FFL formulas. We also found that semantic features behave poorly in our case, in contrast with some previous readability studies on English as a first language.
5 0.048413962 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts
Author: Lizhen Qu ; Rainer Gemulla ; Gerhard Weikum
Abstract: We propose the weakly supervised MultiExperts Model (MEM) for analyzing the semantic orientation of opinions expressed in natural language reviews. In contrast to most prior work, MEM predicts both opinion polarity and opinion strength at the level of individual sentences; such fine-grained analysis helps to understand better why users like or dislike the entity under review. A key challenge in this setting is that it is hard to obtain sentence-level training data for both polarity and strength. For this reason, MEM is weakly supervised: It starts with potentially noisy indicators obtained from coarse-grained training data (i.e., document-level ratings), a small set of diverse base predictors, and, if available, small amounts of fine-grained training data. We integrate these noisy indicators into a unified probabilistic framework using ideas from ensemble learning and graph-based semi-supervised learning. Our experiments indicate that MEM outperforms state-of-the-art methods by a significant margin.
6 0.035397027 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories
7 0.035334989 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media
8 0.032004211 9 emnlp-2012-A Sequence Labelling Approach to Quote Attribution
9 0.031412445 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games
10 0.030300979 131 emnlp-2012-Unified Dependency Parsing of Chinese Morphological and Syntactic Structures
11 0.030193163 60 emnlp-2012-Generative Goal-Driven User Simulation for Dialog Management
12 0.029572334 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
13 0.029041206 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
14 0.027808519 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants
15 0.027024625 12 emnlp-2012-A Transition-Based System for Joint Part-of-Speech Tagging and Labeled Non-Projective Dependency Parsing
16 0.027002048 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
17 0.026736051 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
18 0.0259733 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming
19 0.025503654 7 emnlp-2012-A Novel Discriminative Framework for Sentence-Level Discourse Analysis
20 0.025444983 88 emnlp-2012-Minimal Dependency Length in Realization Ranking
topicId topicWeight
[(0, 0.097), (1, -0.018), (2, 0.019), (3, 0.037), (4, -0.001), (5, 0.012), (6, 0.005), (7, -0.022), (8, 0.026), (9, 0.028), (10, 0.023), (11, -0.036), (12, -0.145), (13, 0.011), (14, -0.02), (15, -0.045), (16, -0.007), (17, 0.015), (18, -0.061), (19, -0.012), (20, -0.031), (21, -0.024), (22, -0.023), (23, 0.007), (24, -0.125), (25, 0.114), (26, 0.105), (27, -0.119), (28, 0.066), (29, -0.107), (30, 0.037), (31, -0.079), (32, 0.04), (33, -0.035), (34, -0.184), (35, -0.055), (36, 0.135), (37, 0.133), (38, 0.078), (39, 0.115), (40, 0.061), (41, 0.041), (42, 0.019), (43, -0.09), (44, 0.028), (45, -0.103), (46, 0.12), (47, 0.202), (48, 0.334), (49, -0.158)]
simIndex simValue paperId paperTitle
same-paper 1 0.9507401 122 emnlp-2012-Syntactic Surprisal Affects Spoken Word Duration in Conversational Contexts
Author: Vera Demberg ; Asad Sayeed ; Philip Gorinski ; Nikolaos Engonopoulos
Abstract: We present results of a novel experiment to investigate speech production in conversational data that links speech rate to information density. We provide the first evidence for an association between syntactic surprisal and word duration in recorded speech. Using the AMI corpus which contains transcriptions of focus group meetings with precise word durations, we show that word durations correlate with syntactic surprisal estimated from the incremental Roark parser over and above simpler measures, such as word duration estimated from a state-of-the-art text-to-speech system and word frequencies, and that the syntactic surprisal estimates are better predictors of word durations than a simpler version of surprisal based on trigram probabilities. This result supports the uniform information density (UID) hypothesis and points a way to more realistic artificial speech generation.
2 0.55898166 17 emnlp-2012-An "AI readability" Formula for French as a Foreign Language
Author: Thomas Francois ; Cedrick Fairon
Abstract: This paper present a new readability formula for French as a foreign language (FFL), which relies on 46 textual features representative of the lexical, syntactic, and semantic levels as well as some of the specificities of the FFL context. We report comparisons between several techniques for feature selection and various learning algorithms. Our best model, based on support vector machines (SVM), significantly outperforms previous FFL formulas. We also found that semantic features behave poorly in our case, in contrast with some previous readability studies on English as a first language.
3 0.3997772 9 emnlp-2012-A Sequence Labelling Approach to Quote Attribution
Author: Timothy O'Keefe ; Silvia Pareti ; James R. Curran ; Irena Koprinska ; Matthew Honnibal
Abstract: Quote extraction and attribution is the task of automatically extracting quotes from text and attributing each quote to its correct speaker. The present state-of-the-art system uses gold standard information from previous decisions in its features, which, when removed, results in a large drop in performance. We treat the problem as a sequence labelling task, which allows us to incorporate sequence features without using gold standard information. We present results on two new corpora and an augmented version of a third, achieving a new state-of-the-art for systems using only realistic features.
4 0.34941176 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories
Author: Afra Alishahi ; Grzegorz Chrupala
Abstract: Learning the meaning of words from ambiguous and noisy context is a challenging task for language learners. It has been suggested that children draw on syntactic cues such as lexical categories of words to constrain potential referents of words in a complex scene. Although the acquisition of lexical categories should be interleaved with learning word meanings, it has not previously been modeled in that fashion. In this paper, we investigate the interplay of word learning and category induction by integrating an LDA-based word class learning module with a probabilistic word learning model. Our results show that the incrementally induced word classes significantly improve word learning, and their contribution is comparable to that of manually assigned part of speech categories. 1 Learning the Meaning of Words For young learners of a natural language, mapping each word to its correct meaning is a challenging task. Words are often used as part of an utterance rather than in isolation. The meaning of an utterance must be inferred from among numerous possible interpretations that the (usually complex) surrounding scene offers. In addition, the linguistic and visual context in which words are heard and used is often noisy and highly ambiguous. Particularly, many words in a language are polysemous and have different meanings. Various learning mechanisms have been proposed for word learning. One well-studied mechanism is cross-situational learning, a bottom-up strategy based on statistical co-occurrence of words and referents across situations (Quine 1960, Pinker 1989). 643 Grzegorz Chrupała gchrupala@ l .uni-s aarland .de sv Spoken Language Systems Saarland University, Germany Several experimental studies have shown that adults and children are sensitive to cross-situational evidence and use this information for mapping words to objects, actions and properties (Smith and Yu 2007, Monaghan and Mattock 2009). A number of computational models have been developed based on this principle, demonstrating that cross-situational learning is a powerful and efficient mechanism for learning the correct mappings between words and meanings from noisy input (e.g. Siskind 1996, Yu 2005, Fazly et al. 2010). Another potential source of information that can help the learner to constrain the relevant aspects of a scene is the sentential context of a word. It has been suggested that children draw on syntactic cues provided by the linguistic context in order to guide word learning, a hypothesis known as syntactic bootstrapping (Gleitman 1990). There is substantial evidence that children are sensitive to the structural regularities of language from a very young age, and that they use these structural cues to find the referent of a novel word (e.g. Naigles and Hoff-Ginsberg 1995, Gertner et al. 2006). In particular, young children have robust knowledge of some of the abstract lexical categories such as nouns and verbs (e.g. Gelman and Taylor 1984, Kemp et al. 2005). Recent studies have examined the interplay of cross-situational learning and sentence-level learning mechanisms, showing that adult learners of an artificial language can successfully and simultaneously apply cues and constraints from both sources of information when mapping words to their referents (Gillette et al. 1999, Lidz et al. 2010, Koehne and Crocker 2010; 2011). 
Several computational models have also investigated this interaction by adding manually annotated part-of-speech tags as PLraoncge uadgineg Lse oafr tnhineg 2,0 p1a2g Jeosin 64t C3–o6n5f4e,re Jnecjue Iosnla Enmd,p Kiroicraela, M 1e2t–h1o4ds Ju ilny N 20a1tu2r.a ?lc L2a0n1g2ua Agseso Pcrioactieosnsi fnogr a Cnodm Cpoumtaptiuotna tilo Lnianlg Nuaist uircasl input to word learning algorithms, and suggesting that integration of lexical categories can boost the performance of a cross-situational model (Yu 2006, Alishahi and Fazly 2010). However, none of the existing experimental or computational studies have examined the acquisition of word meanings and lexical categories in parallel. They all make the simplifying assumption that prior to the onset of word learning, the categoriza- tion module has already formed a relatively robust set of lexical categories. This assumption can be justified in the case of adult learners of a second or artificial language. But children’s acquisition of categories is most probably interleaved with the acquisition of word meaning, and these two processes must ultimately be studied simultaneously. In this paper, we investigate concurrent acquisition of word meanings and lexical categories. We use an online version of the LDA algorithm to induce a set of word classes from child-directed speech, and integrate them into an existing probabilistic model of word learning which combines cross-situational evidence with cues from lexical categories. Through a number of simulations of a word learning scenario, we show that our automatically and incrementally induced categories significantly improve the performance ofthe word learning model, and are closely comparable to a set of goldstandard, manually-annotated part of speech tags. 2 A Word Learning Model We want to investigate whether lexical categories (i.e. word classes) that are incrementally induced from child-directed speech can improve the performance of a cross-situational word learning model. For this purpose, we use the model of Alishahi and Fazly (2010). This model uses a probabilistic learning algorithm for combining evidence from word– referent co-occurrence statistics and the meanings associated with a set of pre-defined categories. They use child-directed utterances, manually annotated with a small set of part of speech tags, from the Manchester corpus (Theakston et al. 2001) in the CHILDES database (MacWhinney 1995). Their experimental results show that integrating these goldstandard categories into the algorithm boosts its performance over a pure cross-situational version. 644 The model of Alishahi and Fazly (2010) has the suitable architecture for our goal: it provides an integrated learning mechanism which combines evidence from word-referent co-occurrence with cues from the meaning representation associated with word categories. However, the model has two major shortcomings. First, it assumes that lexical categories are formed and finalized prior to the onset of word learning and that a correct and unique category for a target word can be identified at each point in time, assumptions that are highly unlikely. Second, it does not handle any ambiguity in the meaning of a word. Instead, each word is assumed to have only one correct meaning. Considering the high level of lexical ambiguity in most natural languages, this assumption unreasonably simplifies the word learning problem. 
To investigate the plausibility of integrating word and category learning, we use an online algorithm for automatically and incrementally inducing a set of lexical categories. Moreover, we use each word in its original form instead of lemmatizing them, which implies that categories contain different morphological forms of the same word. By applying these changes, we are able to study the contribution of lexical categories to word learning in a more realistic scenario. Representation of input. The input to the model consists of a sequence of utterances, each paired with a representation of an observed scene. We represent an utterance as a set of words, U = {w} (e.g. {she, went, home, ... }), aonfd w wthored corresponding scene as a set not,f hsoemmaen, .ti.c.} features, S c = {f} (e.g. {ANIMATE, HUMAN, FEMALE, ...}). Word and category meaning. We represent the meaning of a word as a time-dependent probability distribution over all the semantic features, where (f|w) |isw t)h oev probability mofa fnetaictur feea f rbees-, ing associa(tefd|w ww)i tish twhoerd p w aatb tiilmitye t o. fI fne tahteu raeb sfen bceeof any prior knowledge, the model assumes a uniform distribution over all features as the meaning of a novel word. Also, a function (w) gives us the category to which a word w in utterance belongs. At each point in time, a category c contains a set of word tokens. We assign a meaning to each category as a weighted sum of the meaning learned so far for each of its members, or p(t) (f|c) = (1/|c|) Pw∈c (f|w), where |c| is the nu(fm|bc)er o=f (w1o/r|dc |t)oPkenws∈ icn c a(tf t|hwe )c,u wrrheenrte em |co|m iesn tht. p(t) p(t)(·|w) cat(t) U(t) p(t) Learning algorithm. Given an utterance-scene pair (U(t) , S(t) ) received at time t, the model first calculates an alignment score a for each word w ∈ laantde se aanch a sleigmnamnetinct f secaoturere a f ∈ Sea(tc)h . Aw oserdm want ∈ic Ufea- U(t) × taunrde can h b see aligned fteoa a wreo frd ∈ according to the meaning acquired for that word from previous observations (word-based alignment, or aw). Alternatively, distributional clues of the word can be used to determine its category, and the semantic features can be aligned to the word according to the meaning associated to its category (category-based alignment, or ac). We combine these two sources of evidence when estimating an alignment score: a(w|f, U(t), S(t)) = λ(w) +(1 − λ(w)) U(t), S(t)) ac(w|f, U(t), S(t)) aw(w|f, (1) where the word-based and category-based alignment scores are estimated based on the acquired meanings of the word and its category, respectively: aw(w|f,U(t),S(t)) = Xp(t−p1(t)(−f1|)w(f)|wk) wkX∈U(t) ac(w|f,U(t),S(t)) = Xp(t−p1()t(−f1)|(cfat|c(awt)()wk)) wkX∈U(t) The relative contribution ofthe word-based versus the category-based alignment is determined by the weight function λ(w). Cross-situational evidence is a reliable cue for frequent words; on the other hand, the category-based score is most informative when the model encounters a low-frequency word (See Alishahi and Fazly (2010) for a full analysis of the frequency effect). Therefore, we define λ(w) as a function of the frequency of the word n(w) : λ(w) = n(w)/(n(w) + 1) Once an alignment score is calculated for each word w ∈ and each feature f ∈ S(t) , the model rweovirsdes w t ∈he U meanings cohf aeallt utrhee fw ∈ord Ss in and U(t) U(t) 645 their corresponding categories as follows: assoc(t)(w, f) = assoc(t−1)(w, f) + a(w|f, U(t), S(t)) where assoc(t−1) (w, f) is zero if w and f have not co-occurred before. 
These association scores are then used to update the meaning of the words in the current input: × p(t)(f|w) =Xasasossco(tc)( tf),(f wj,) w) fXj X∈F (2) where F is the set of all features seen so far. We use a hsmeroeo Fthe isd thveers sieotn o of fa ltlh fies aftourrmesul sea eton asocc faorm.m Woed uastee noisy or rare input. This process is repeated for all the input pairs, one at a time. Uniform categories. Adding the category-based alignment as a new factor to Eqn. (1) might imply that the role of categories in this model is nothing more than smoothing the cross-situational-based alignment of words and referents. In order to investigate this issue, we use the following alignment formula as an informed baseline in our experiments, where we replace ac(· |f, U(t) , S(t)) with a uniform distribution:1 a(w|f, U(t), S(t)) = λ(w) U(t), S(t)) +(1 − λ(w)) ×|U1(t)| aw(w|f, (3) where aw (w| f, U(t) , S(t) ) and λ(w) are estimated as before. In( our experiments in Section 4, we refer to this baseline as the ‘uniform’ condition. 3 Online induction of word classes with LDA Empirical findings suggest that young children form their knowledge of abstract categories, such as verbs, nouns, and adjectives, gradually (e.g. Gelman and Taylor 1984, Kemp et al. 2005). In addition, several unsupervised computational models have been proposed for inducing categories of words which resemble part-of-speech categories, by 1We thank an anonymous reviewers for suggesting this dition as an informed baseline. con- drawing on distributional properties of their context (see for example Redington et al. 1998, Clark 2000, Mintz 2003, Parisien et al. 2008, Chrupała and Alishahi 2010). However, explicit accounts of how such categories can be integrated in a crosssituational model of word learning have been rare. Here we adopt an online version of the model proposed in Chrupała (201 1), a method of soft word class learning using Latent Dirichlet Allocation. The approach is much more efficient than the commonly used alternative (Brown clustering, (Brown et al. 1992)) while at the same time matching or outperforming it when the word classes are used as automatically learned features for supervised learning of various language understanding tasks. Here we adopt this model as our approach to learning lexical categories. In Section 3.1 we describe the LDA model for word classes; in Section 3.2 we discuss the online Gibbs sampler we use for inference. 3.1 Word class learning with LDA Latent Dirichlet Allocation (LDA) was introduced by Blei et al. (2003) and is most commonly used for modeling the topic structure in document collections. It is a generative, probabilistic hierarchical Bayesian model that induces a set of latent variables, which correspond to the topics. The topics themselves are multinomial distributions over words. The generative structure of the LDA model is the following: φk ∼ Dirichlet(β) , k ∈ [1, K] zθdnd∼ D Ciraitcehgloetr(icαa),l(θd), wnd ∼ dnd ∈∈ [1 [1,D,N]d] Categorical(φznd ) , nd (4) ∈ [1, Nd] Chrupała (201 1) reinterprets the LDA model in terms of word classes as follows: K is the number of classes, D is the number of unique word types, Nd is the number ofcontext features (such as right or left neighbor) associated with word type d, znd is the class ofword type d in the ntdh context, and wnd is the ntdh context feature of word type d. Hyperparameters α and β control the sparseness of the vectors θd and φk. 
646 Wordtype Features HowdoR do you HowL doL youR youL doR Table 1: Matrix of context features 1.8M words (CHILDES) 100M words (BNC) tsbgmrhioav oinekseycrbhaolrb ilnetbhgiesbJcmula nsceinkswMlwaionhalrmgituceahnge Table 2: Most similar word pairs As an example consider the small corpus consisting of the single sentence How do you do. The rows in Table 1 show the features w1 . . . wNd for each word type d if we use each word’s left and right neighbors as features, and subscript words with L and R to indicate left and right. After inference, the θd parameters correspond to word class probability distributions given a word type while the φk correspond to feature distributions given a word class: the model provides a probabilistic representation for word types independently of their context, and also for contexts independently of the word type. Probabilistic, soft word classes are more expressive than hard categories. First, they make it easy and efficient to express shared ambiguities: Chrupała (201 1) gives an example of words used as either first names or surnames, and this shared ambiguity is reflected in the similarity of their word class distributions. Second, with soft word classes it becomes easy to express graded similarity between words: as an example, Table 2 shows a random selection out of the 100 most similar word pairs according to the Jensen-Shannon divergence between their word class distributions, according to a word class model with 25 classes induced from (i) 1.8 million words of the CHILDES corpus or (ii) 100 million word of the BNC corpus. The similarities were measured between each of the 1000 most frequent CHILDES or BNC words. 3.2 Online Gibbs sampling for LDA There have been a number of attempts to develop online inference algorithms for topic modeling with LDA. A simple modification of the standard Gibbs sampler (o-LDA) was proposed by Song et al. (2005) and Banerjee and Basu (2007). Canini et al. (2009) experiment with three sampling algorithms for online topic inference: (i) oLDA, (ii) incremental Gibbs sampler, and (iii) a par- ticle filter. Only o-LDA is truly online in the sense that it does not revisit previously seen documents. The other two, the incremental Gibbs sampler and the particle filter, keep seen documents and periodically resample them. In Canini et al.’s experiments all of the online algorithms perform worse than the standard batch Gibbs sampler on a document clustering task. Hoffman et al. (2010) develop an online version of the variational Bayes (VB) optimization method for inference for topic modeling with LDA. Their method achieves good empirical results compared to batch VB as measured by perplexity on heldout data, especially when used with large minibatch sizes. Online VB for LDA is appropriate when streaming documents: with online VB documents are represented as word count tables. In our scenario where we apply LDA to modeling word classes we need to process context features from sentences arriving in a stream: i.e. we need to sample entries from a table like Table 1 in order of arrival rather than row by row. This means that online VB is not directly applicable to online word-class induction. However it also means that one issue with o-LDA identified by Canini et al. (2009) is ameliorated. When sampling in a topic modeling setting, documents are unique and are never seen again. Thus, the topics associated with old documents get stale and need to be periodically rejuvenated (i.e. resampled). 
This is the reason why the incremental Gibbs sampler and the particle filter algorithms in Canini et al. (2009) need to keep old documents around and cannot run in a true online fashion. Since for word class modeling we stream context features as they arrive, we will continue to see features associated with the seen word types, and will automatically resample their class assignments. In exploratory ex647 periments we have seen that this narrows the performance gap between the o-LDA sampler and the batch collapsed Gibbs sampler. We present our version of the o-LDA sampler in Algorithm 1. For each incoming sentence t we run J passes of sampling, updating the counts tables after each sampling step. We sample the class assignment zti for feature wti according to: P(zt|zt−1,wt,dt) ∝(nztt−,d1Pt+jV=t− α11)n ×ztt−, (w1njtz−t+,1wt β+ β), (5) where stands for the nPumber of times class z co-occurred with word type d up to step t, and similarly ntz,w is the number of times feature w was assigned to class z. Vt is the number of unique features seen up to step t, while α and β are the LDA hyperparameters. There are two differences between the original o-LDA and our version: we do not initialize the algorithm with a batch run over a prefix of the data, and we allow more than one sampling pass per sentence.2 Exploratory experiments have shown that batch initialization is unnecessary, and that multiple passes typically improve the quality of the induced word classes. ntz,d Algorithm 1 Online Gibbs sampler for word class induction with LDA for t = 1 → ∞ do fto =r j = →1 → dJo do fjor = =i = →1 → It do sample zti ∼ P(zti |zti−1 , wti , dti ) increment ntzti,wti and ntzti,dti Figure 1 shows the top 10 words for each of the 10 word classes induced with our online Gibbs sampler from 1.8 million words of CHILDES. Similarly, Figure 2 shows the top 10 words for 5 randomly chosen topics out of 50, learned online from 100 million words of the BNC. The topics are relatively coherent and at these levels of granularity express mostly part of speech and subcategorization frame information. Note that for each word class we show the words most frequently assigned to it while Gibbs sampling. 2Note that we do not allow multiple passes over the stream of sentences. Rather, while processing the current sentence, we allow the words in this sentence to be sampled more than once. CHILDES taIwhsyeatiofheuiwsmh’esorihntea shdoeam,tyihrewa thseliasrenhao wT,othYwaueotlbdietHcsdhaieyuors eIaitdwfmboyerwhat Figure 2: Top 10 words of 5 randomly chosen classes learned from BNC Since we are dealing with soft classes, most wordtypes have non-zero assignment probabilities for many classes. Thus frequently occurring words such as not will typically be listed for several classes. 4 Evaluation 4.1 Experimental setup As training data, we extract utterances from the Manchester corpus (Theakston et al. 2001) in the CHILDES database (MacWhinney 1995), a corpus that contains transcripts of conversations with children between the ages of 1 year, 8 months and 3 years. We use the mother’s speech from transcripts of 12 children (henceforth referred to by children’s names). 
We run word class induction while simultaneously outputting the highest scoring word-class label for each word: for a new sentence, we sample class assignments for each feature (doing J passes), update the counts, and then for each word dti output the highest scoring class label according to argmaxz ntz,dti (where ntz,dti stands for the num- 648 ber of times class z co-occurred with word type dti up to step t). During development we ran the online word class induction module on data for Aran, Becky, Carl and Anne and then started the word learning module for the Anne portion while continuing inducing categories. We then evaluated word learning on Anne. We chose the parameters of the word class induction module based on those development results: α = 10, β = 0.1, K = 10 and J = 20. We used cross-validation for the final evaluation. PFor each of six data files (Anne, Aran, Becky, Carl, Dominic and Gail), we ran word-class induction on the whole corpus with the chosen file last, and then started applying the word-learning algorithm on this last chosen file (while continuing with category induction). We evaluated how well word meanings were learned in those six cases. We follow Alishahi and Fazly (2010) in the construction of the input. We need a semantic representation paired with each utterance. Such a representation is not available from the corpus and has to be P1K=1 constructed. We automatically construct a gold lexicon for all nouns and verbs in this corpus as follows. For each word, we extract all hypernyms for its first sense in the appropriate (verb or noun) hierarchy in WordNet (Fellbaum 1998), and add the first word in the synset of each hypernym to the set of semantic features for the target word. For verbs, we also extract features from VerbNet (Kipper et al. 2006). A small subset of words (pronouns and frequent quantifiers) are also manually added. This lexicon represents the true meaning of each word, and is used in generating the scene representations in the input and in evaluation. For each utterance in the input corpus, we form the union of the feature representations of all its words. Words not found in the lexicon (i.e. for which we could not extract a semantic representation from WordNet and VerbNet) are removed from the utterance (only for the word learning module). In order to simulate the high level of noise that children receive from their environment, we follow Alishahi and Fazly (2010) and pair each utterance with a combination of its own scene representation and the scene representation for the following utter- ance. This decision was based on the intuition that consequent utterances are more likely to be about re- SUctenra:ce{BmCPAOLRoNAImOTMSCmEAU,yTOMELaPB,ItTeJH,EIVUObCEMrNToG,cAE. NTo.C,Al}Ti.BILO},EN. Figure 3: A sample input item to the word learning model lated topics and scenes. This results in a (roughly) 200% ambiguity. In addition, we remove the meaning of one random word from the scene representation of every second utterance in an attempt to simulate cases where the referent of an uttered word is not within the perception field (such as ‘daddy is not home yet’). A sample utterance and its corresponding scene are shown in Figure 3. As mentioned before, many words in our input corpus are polysemous. For such words, we extract different sets of features depending on their manually tagged part of speech and keep them in the lexicon (e.g. the lexicon contains two different entries for set:N and set:V). 
When constructing the scene representation for an utterance which contains an ambiguous word, we choose the correct sense from our lexicon according to the word's part-of-speech tag in the Manchester corpus.

In the experiments reported in the next section, we assess the performance of our model on learning words at each point in time: for each target word, we compare its set of features in the lexicon with the probability distribution over semantic features that the model has learned. We use mean average precision (MAP) to measure how well p^t(·|w) ranks the features of w.

4.2 Learning curves

To understand whether our categories contribute to the learning of word–meaning mappings, we compare the pattern of word learning over time in four conditions. The first condition represents our baseline, in which we do not use category-based alignment in the word learning model, by setting λ(w) = 1 in Eqn. (1). In the second condition we use a set of uniformly distributed categories for alignment, as estimated by Eqn. (3) on page 3 (this condition is introduced to examine whether categories act as more than a simple smoothing factor in the alignment process). In the third condition we use the categories induced by online LDA in the word learning model. The fourth condition represents the performance ceiling, in which we use the pre-defined and manually annotated part-of-speech categories from the Manchester corpus.

Table 3 shows the average and the standard deviation of the final MAP scores across the six datasets, for the four conditions (no categories, uniform categories, LDA categories and gold part-of-speech tags).

Table 3: Final Mean Average Precision scores (average MAP and standard deviation) for the None, Uniform, LDA and POS conditions.

The differences between LDA and None, and between LDA and Uniform, are statistically significant according to the paired t-test (p < 0.01), while the difference between LDA and POS is not (p = 0.16).

Figure 4 shows the learning curves in each condition, averaged over the six splits explained in the previous section. The top panel shows the average learning curve over the minimum number of sentences across the six sub-corpora (8800 sentences). The curves show that our LDA categories significantly improve the performance of the model over both baselines. That means that using these categories can improve word learning compared to not using them and relying on cross-situational evidence alone. Moreover, the LDA-induced categories are not merely acting as a smoothing function the way the 'uniform' categories are. Our results show that they bring relevant information to the task at hand, that is, they improve word learning by using the sentential context. In fact, this improvement is comparable to the improvement achieved by integrating the 'gold-standard' POS categories.

The middle and bottom panels of Figure 4 zoom in on shorter time spans (5000 and 1000 sentences, respectively). These diagrams suggest that the pattern of improvement over the baseline is relatively constant, even at very early stages of learning. In fact, once the model receives enough input data, cross-situational evidence becomes stronger (since fewer words in the input are encountered for the first time) and the contribution of the categories becomes less significant.

Figure 4: Mean average precision for all observed words at each point in time in four conditions: with gold POS categories, with LDA categories, with uniform categories, and without using categories. Each panel displays a different time span.
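Since every comparison in this section rests on MAP, a short sketch of the metric may help. This is the standard mean-average-precision computation under our reading of the setup; the data structures and names are illustrative.

```python
def average_precision(ranked, gold):
    """Average precision of a ranked feature list against the gold set."""
    hits, ap = 0, 0.0
    for rank, feat in enumerate(ranked, start=1):
        if feat in gold:
            hits += 1
            ap += hits / rank      # precision at each relevant rank
    return ap / len(gold) if gold else 0.0

def mean_average_precision(learned, gold_lexicon):
    """learned: word -> {feature: probability}, the model's p^t(.|w);
    gold_lexicon: word -> set of gold features from WordNet/VerbNet."""
    scores = []
    for word, gold in gold_lexicon.items():
        dist = learned.get(word, {})
        ranked = sorted(dist, key=dist.get, reverse=True)
        scores.append(average_precision(ranked, gold))
    return sum(scores) / len(scores) if scores else 0.0
```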
4.3 Class granularity

In Figure 5 we show the influence of the number of word classes used on the performance in word learning. It is evident that in the range between 5 and 20 classes the performance of the word learning module is quite stable and insensitive to the exact class granularity. Even with only 5 classes the model can still roughly distinguish noun-like words from verb-like words from pronoun-like words, and this helps it learn the meaning elements derived from the higher levels of the WordNet hierarchy. Notwithstanding that, we would ideally like to avoid having to pre-specify the number of classes for the word class induction module: we thus plan to investigate non-parametric models such as the Hierarchical Dirichlet Process for this purpose.

Figure 5: Mean average precision for all observed words at each point in time in four conditions: using online LDA categories of varying numbers (20, 10 and 5), and without using categories.

5 Related Work

This paper investigates the interplay between two language learning tasks which have so far been studied in isolation: the acquisition of lexical categories from distributional clues, and learning the mapping between words and meanings. Previous models have shown that lexical categories can be learned from unannotated text, mainly drawing on distributional properties of words (e.g. Redington et al. 1998, Clark 2000, Mintz 2003, Parisien et al. 2008, Chrupała and Alishahi 2010). Independently, several computational models have exploited cross-situational evidence in learning the correct mappings between words and meanings, using rule-based inference (Siskind 1996), neural networks (Li et al. 2004, Regier 2005), hierarchical Bayesian models (Frank et al. 2007) and probabilistic alignment inspired by machine translation models (Yu 2005, Fazly et al. 2010).

There are only a few existing computational models that explore the role of syntax in word learning. Maurits et al. (2009) investigate the joint acquisition of word meaning and word order using a batch model. This model is tested on an artificial language with a simple first-order predicate representation of meaning, and limited built-in possibilities for word order. The model of Niyogi (2002) simulates the mutual bootstrapping effects of syntactic and semantic knowledge in verb learning, that is, the use of syntax to aid in inducing the semantics of a verb, and the use of semantics to narrow down the possible syntactic frames in which a verb can participate. However, this model relies on manually assigned priors for the associations between syntactic and semantic features, and is tested on a toy language with a very limited vocabulary and a constrained syntax.

Yu (2006) integrates automatically induced syntactic word categories into his model of cross-situational word learning, showing that they can improve the model's performance. Yu's model also processes input utterances in a batch mode, and its evaluation is limited to situations in which only a coarse distinction between referring words (words that could potentially refer to objects in a scene, e.g. concrete nouns) and non-referring words (words that cannot possibly refer to objects, e.g. function words) is sufficient. It is thus not clear whether information about finer-grained categories (e.g. verbs and nouns) can indeed help word learning in a more naturalistic incremental setting.
On the other hand, the model of Alishahi and Fazly (2010) integrates manually annotated part-of-speech tags into an incremental word learning algorithm, and shows that these tags boost the overall word learning performance, especially for infrequent words.

In a different line of research, a number of models have been proposed which study the acquisition of the link between syntax and semantics within the Combinatory Categorial Grammar (CCG) framework (Briscoe 1997, Villavicencio 2002, Buttery 2006, Kwiatkowski et al. 2012). These approaches set the parameters of a semantic parser on a corpus of utterances paired with a logical form as their meaning. These models bring in extensive and detailed prior assumptions about the nature of the syntactic representation (i.e. atomic categories such as S and NP, and built-in rules which govern their combination), as well as about the representation of meaning via the formalism of the lambda calculus. This is fundamentally different from the approach taken in this paper, which in comparison assumes only very simple syntactic and semantic representations. We view word and category learning as stand-alone cognitive tasks with independent representations (word meanings as probabilistic collections of properties or features as opposed to single symbols; categories as sets of word tokens with similar context distributions), and we do not bring in any prior knowledge of specific atomic categories.

6 Conclusion

In this paper, we show the plausibility of using automatically and incrementally induced categories while learning word meanings. Our results suggest that the sentential context that a word appears in across its different uses can be used as a complementary source of guidance for mapping it to its featural meaning representation. In Section 4 we show that the improvement achieved by our categories is comparable to that gained by integrating gold POS categories. This result is very encouraging, since manually assigned POS tags are typically believed to set the upper bound on the usefulness of category information. We believe that automatically induced categories have the potential to do even better: Chrupała and Alishahi (2010) have shown that categories induced from usage data in an unsupervised fashion can be used more effectively than POS categories in a number of tasks. In our experiments here on the development data we observed some improvements over POS categories. This advantage can result from the fact that our categories are more fine-grained (if also more noisy) than POS categories, which sometimes yields more accurate predictions.

One important characteristic of the category induction algorithm we have used in this paper is that it provides a soft categorization scheme, where each word is associated with a probability distribution over all categories. In future work, we plan to exploit this feature: when estimating the category-based alignment, we can interpolate the predictions of the multiple categories to which a word belongs, weighted by its probabilities of membership in each category.

Acknowledgements

Grzegorz Chrupała was funded by the German Federal Ministry of Education and Research (BMBF) under grant number 01IC10S01O as part of the Software-Cluster project EMERGENT (www.software-cluster.org).

References

Alishahi, A. and Fazly, A. (2010). Integrating Syntactic Knowledge into a Model of Cross-situational Word Learning. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society.

Banerjee, A.
and Basu, S. (2007). Topic models over text streams: A study of batch and online unsupervised learning. In SIAM Data Mining.

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

Briscoe, T. (1997). Co-evolution of language and of the language acquisition device. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 418–427. Association for Computational Linguistics.

Brown, P. F., Mercer, R. L., Della Pietra, V. J., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Buttery, P. (2006). Computational models for first language acquisition. Computer Laboratory, University of Cambridge, Tech. Rep. UCAM-CL-TR-675.

Canini, K., Shi, L., and Griffiths, T. (2009). Online inference of topics with latent Dirichlet allocation. In Proceedings of the International Conference on Artificial Intelligence and Statistics.

Chrupała, G. (2011). Efficient induction of probabilistic word classes with LDA. In International Joint Conference on Natural Language Processing.

Chrupała, G. and Alishahi, A. (2010). Online Entropy-based Model of Lexical Category Acquisition. In CoNLL 2010.

Clark, A. (2000). Inducing syntactic categories by context distribution clustering. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, pages 91–94. Association for Computational Linguistics, Morristown, NJ, USA.

Fazly, A., Alishahi, A., and Stevenson, S. (2010). A Probabilistic Computational Model of Cross-Situational Word Learning. Cognitive Science, 34(6):1017–1063.

Fellbaum, C., editor (1998). WordNet, An Electronic Lexical Database. MIT Press.

Frank, M. C., Goodman, N. D., and Tenenbaum, J. B. (2007). A Bayesian framework for cross-situational word-learning. In Advances in Neural Information Processing Systems, volume 20.

Gelman, S. and Taylor, M. (1984). How two-year-old children interpret proper and common names for unfamiliar objects. Child Development, pages 1535–1540.

Gertner, Y., Fisher, C., and Eisengart, J. (2006). Learning words and rules: Abstract knowledge of word order in early sentence comprehension. Psychological Science, 17(8):684–691.

Gillette, J., Gleitman, H., Gleitman, L., and Lederer, A. (1999). Human simulations of vocabulary learning. Cognition, 73(2):135–176.

Gleitman, L. (1990). The structural sources of verb meanings. Language Acquisition, 1:135–176.

Hoffman, M., Blei, D., and Bach, F. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems.

Kemp, N., Lieven, E., and Tomasello, M. (2005). Young Children's Knowledge of the "Determiner" and "Adjective" Categories. Journal of Speech, Language and Hearing Research, 48(3):592–609.

Kipper, K., Korhonen, A., Ryant, N., and Palmer, M. (2006). Extensive classifications of English verbs. In Proceedings of the 12th EURALEX International Congress.

Koehne, J. and Crocker, M. W. (2010). Sentence processing mechanisms influence cross-situational word learning. In Proceedings of the Annual Conference of the Cognitive Science Society.

Koehne, J. and Crocker, M. W. (2011). The interplay of multiple mechanisms in word learning. In Proceedings of the Annual Conference of the Cognitive Science Society.

Kwiatkowski, T., Goldwater, S., Zettlemoyer, L., and Steedman, M. (2012).
A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.

Li, P., Farkas, I., and MacWhinney, B. (2004). Early lexical development in a self-organizing neural network. Neural Networks, 17:1345–1362.

Lidz, J., Bunger, A., Leddon, E., Baier, R., and Waxman, S. R. (2010). When one cue is better than two: lexical vs. syntactic cues to verb learning. Unpublished manuscript.

MacWhinney, B. (1995). The CHILDES Project: Tools for Analyzing Talk. Hillsdale, NJ: Lawrence Erlbaum Associates, second edition.

Maurits, L., Perfors, A. F., and Navarro, D. J. (2009). Joint acquisition of word order and word reference. In Proceedings of the 31st Annual Conference of the Cognitive Science Society.

Mintz, T. (2003). Frequent frames as a cue for grammatical categories in child directed speech. Cognition, 90(1):91–117.

Monaghan, P. and Mattock, K. (2009). Cross-situational language learning: The effects of grammatical categories as constraints on referential labeling. In Proceedings of the 31st Annual Conference of the Cognitive Science Society.

Naigles, L. and Hoff-Ginsberg, E. (1995). Input to Verb Learning: Evidence for the Plausibility of Syntactic Bootstrapping. Developmental Psychology, 31(5):827–837.

Niyogi, S. (2002). Bayesian learning at the syntax-semantics interface. In Proceedings of the 24th Annual Conference of the Cognitive Science Society, pages 697–702.

Parisien, C., Fazly, A., and Stevenson, S. (2008). An incremental Bayesian model for learning syntactic categories. In Proceedings of the Twelfth Conference on Computational Natural Language Learning.

Pinker, S. (1989). Learnability and Cognition: The Acquisition of Argument Structure. Cambridge, MA: MIT Press.

Quine, W. (1960). Word and Object. Cambridge University Press, Cambridge, MA.

Redington, M., Crater, N., and Finch, S. (1998). Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science: A Multidisciplinary Journal, 22(4):425–469.

Regier, T. (2005). The emergence of words: Attentional learning in form and meaning. Cognitive Science, 29:819–865.

Siskind, J. M. (1996). A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61:39–91.

Smith, L. and Yu, C. (2007). Infants rapidly learn words from noisy data via cross-situational statistics. In Proceedings of the 29th Annual Conference of the Cognitive Science Society.

Song, X., Lin, C., Tseng, B., and Sun, M. (2005). Modeling and predicting personal information dissemination behavior. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 479–488. ACM.

Theakston, A. L., Lieven, E. V., Pine, J. M., and Rowland, C. F. (2001). The role of performance limitations in the acquisition of verb-argument structure: An alternative account. Journal of Child Language, 28:127–152.

Villavicencio, A. (2002). The acquisition of a unification-based generalised categorial grammar. In Proceedings of the Third CLUK Colloquium, pages 59–66.

Yu, C. (2005). The emergence of links between lexical acquisition and object categorization: A computational study. Connection Science, 17(3–4):381–397.

Yu, C. (2006). Learning syntax–semantics mappings to bootstrap word learning. In Proceedings of the 28th Annual Conference of the Cognitive Science Society.