emnlp emnlp2012 emnlp2012-87 knowledge-graph by maker-knowledge-mining

87 emnlp-2012-Lyrics, Music, and Emotions


Source: pdf

Author: Rada Mihalcea ; Carlo Strapparava

Abstract: In this paper, we explore the classification of emotions in songs, using the music and the lyrics representation of the songs. We introduce a novel corpus of music and lyrics, consisting of 100 songs annotated for emotions. We show that textual and musical features can both be successfully used for emotion recognition in songs. Moreover, through comparative experiments, we show that the joint use of lyrics and music brings significant improvements over each of the individual textual and musical classifiers, with error rate reductions of up to 31%.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore
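
Per the column header above (sentIndex, sentText, sentNum, sentScore), each sentence below is ranked by a tf-idf score. A minimal sketch of such a scorer follows; the tokenization, the length normalization, and the function name are assumptions, not the actual summarization pipeline.

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences):
    # Treat each sentence as a "document": tf within the sentence,
    # idf across the whole set of sentences.
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        total = sum(tf[w] * math.log(n / df[w]) for w in tf)
        scores.append(total / max(len(toks), 1))  # normalize by length
    return scores

sents = ["we explore the classification of emotions in songs",
         "we introduce a novel corpus of music and lyrics"]
print(tfidf_sentence_scores(sents))
```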

1 In this paper, we explore the classification of emotions in songs, using the music and the lyrics representation of the songs. [sent-3, score-1.113]

2 We introduce a novel corpus of music and lyrics, consisting of 100 songs annotated for emotions. [sent-4, score-0.663]

3 We show that textual and musical features can both be successfully used for emotion recognition in songs. [sent-5, score-0.916]

4 Moreover, through comparative experiments, we show that the joint use of lyrics and music brings significant improvements over each of the individual textual and musical classifiers, with error rate reductions of up to 31%. [sent-6, score-1.205]

5 1 Introduction Language and music are peculiar characteristics of human beings. [sent-7, score-0.411]

6 The capability of producing and enjoying language and music appears in every human society, regardless of the richness of its culture (Nettl, 2000). [sent-8, score-0.411]

7 Importantly, language and music complement each other in many different ways. [sent-9, score-0.411]

8 For instance, looking at music and language in terms of features, we can observe that music organizes pitch and rhythm in ways that language does not, and it lacks the specificity of language in terms of semantic meaning. [sent-10, score-0.886]

9 On the other hand, language is built from categories that are absent in music (e.g. [sent-11, score-0.431]

10 , nouns and verbs), whereas music seems to have a deeper power over our emotions than does ordinary speech. [sent-13, score-0.776]

11 Interest in this interplay dates back to the time of the earliest written records of music, which are encountered in musical settings for poetry. [sent-15, score-0.824]

12 Despite this interest, and despite the long history of the interaction between music and lyrics, there is little work that explicitly focuses on the connection between music and lyrics. [sent-16, score-0.822]

13 In this paper, we focus on the connection between the musical and linguistic representations in popular songs, and their role in the expression of affect. [sent-17, score-0.429]

14 We introduce a novel corpus of lyrics and music, annotated for emotions at line level, and explore the automatic recognition of emotions using both textual and musical features. [sent-18, score-1.586]

15 Through comparative experiments, we show that emotion recognition can be performed using either textual or musical features, and that the joint use of lyrics and music can improve significantly over classifiers that use only one dimension at a time. [sent-19, score-1.632]

16 2 Related Work The literature on music analysis is noticeably large, and there are several studies concerning the music’s power over emotions (Juslin and Sloboda, 2001), thinking (Rauscher et al. [sent-21, score-0.776]

17 In particular, there has been significant research in music and psychology focusing on the idea of a parallel between affective cues in music and speech (Sundberg, 1982; Scherer, 1995). [sent-23, score-0.881]

18 For instance, (Scherer, 2004) investigated the types of emotions that can be induced by music, their mechanisms, and how they can be empirically measured. [sent-24, score-0.365]

19 (Juslin and Laukka, 2003) conducted a comprehensive review of vocal expressions and music performance, finding substantial overlap in the cues used to convey basic emotions in speech and music. [sent-27, score-0.796]

20 The work most closely related to ours is the combination of audio and lyrics for emotion classification in songs, as thoroughly surveyed in (Kim et al., 2010). [sent-28, score-0.764]

21 Several approaches used tags from last.fm to collect large datasets of songs annotated for emotions (Laurier et al. [sent-32, score-0.617]

22 None of the previous methods considered the fine-grained classification of emotions at line level, as we do, and none of them considered the six Ekman emotions used in our work. [sent-35, score-0.847]

23 Other related work consists of the development of tools for music access, filtering, classification, and retrieval, focusing primarily on music in digital formats such as MIDI. [sent-36, score-0.822]

24 For instance, the task of music retrieval and music recommendation has received a lot of attention from both the arts and the computer science communities (see for instance (Orio, 2006) for an introduction to this task). [sent-37, score-0.856]

25 , 2000), who described an analysis of predominant up-down motion types within music, through extraction of the kinematic variables of music velocity and acceleration from MIDI data streams. [sent-40, score-0.454]

26 (Wang et al., 2004) automatically aligned acoustic musical signals with their corresponding textual lyrics. [sent-45, score-0.535]

27 A reliable system to identify the MIDI track containing the melody is very relevant for music information retrieval. (A melody can be defined as a “cantabile” sequence of notes, usually the sequence that a listener can remember after hearing a song.) [sent-47, score-0.491]

28 Finally, in natural language processing, there are a few studies that mainly exploited the lyrics component of the songs, while generally ignoring the musical component. [sent-54, score-0.711]

29 (Xia et al., 2008) addressed the task of sentiment classification in lyrics, recognizing positive and negative moods in a large dataset of Chinese pop songs, while (Yang and Lee, 2009) approached the problem of emotion identification in lyrics, classifying songs from allmusic.com. [sent-58, score-0.729]

30 3 A Corpus of Music and Lyrics Annotated for Emotions To enable our exploration of emotions in songs, we compiled a corpus of 100 popular songs. [sent-60, score-0.617]

31 Popular songs exert considerable power on people, both at the individual and the group level, mainly because of the message and emotions they convey. [sent-63, score-0.617]

32 Songs are able to embody deep feelings, usually through a combined effect of both music and lyrics. [sent-65, score-0.434]

33 The MIDI files, which were purchased from the provider, also contain lyrics that are synchronized with the notes. [sent-68, score-0.298]

34 In these MIDI files, the melody channel is unequivocally decided by the provider, making it easier to extract the music and the corresponding lyrics. [sent-69, score-0.468]

35 MIDI is an industry-standard protocol that enables electronic musical instruments, computers, and other electronic equipment to communicate and synchronize with each other. [sent-71, score-0.466]

36 Unlike analog devices, MIDI does not transmit an audio signal: it sends event messages about musical notation, pitch, and intensity, control signals for parameters such as volume, vibrato, and panning, and cues and clock signals to set the tempo. [sent-72, score-0.505]
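
As a concrete illustration of these event messages, the sketch below reads note and synchronized-lyric events from a MIDI file with the third-party mido library; the paper does not name its MIDI toolkit, and the file name is hypothetical.

```python
import mido  # third-party MIDI library, assumed available

mid = mido.MidiFile("song.mid")  # hypothetical file
for i, track in enumerate(mid.tracks):
    for msg in track:
        if msg.is_meta and msg.type == "lyrics":
            # Lyrics synchronized with the notes, as described above.
            print(f"track {i} lyric: {msg.text!r}")
        elif msg.type == "note_on" and msg.velocity > 0:
            # Pitch (0-127, middle C = 60) and delta time in ticks.
            print(f"track {i} note: {msg.note} dt={msg.time}")
```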

37 As an electronic protocol, it is notable for its widespread adoption throughout the music industry. [sent-73, score-0.426]

38 At the line level, we represent the raising, which is the musical interval (in half-steps) between the first note in the line and the most important note. [sent-81, score-0.493]
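
A minimal sketch of this raising feature: the half-step interval between a line's first note and its most important note. The paper's criterion for "most important" is truncated in this summary, so the longest-held note is assumed below.

```python
def raising(notes):
    """notes: list of (midi_pitch, duration) pairs for one lyric line."""
    first_pitch = notes[0][0]
    # Assumption: the most important note is the one held longest.
    key_pitch = max(notes, key=lambda n: n[1])[0]
    return key_pitch - first_pitch  # positive = rising, negative = falling

line = [(60, 1.0), (62, 0.5), (67, 2.0), (65, 1.0)]  # C4 D4 G4 F4
print(raising(line))  # 7 half-steps up (C4 -> G4)
```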

39 In order to explore the classification of emotions in songs, we needed a gold standard consisting of manual emotion annotations of the songs. [sent-88, score-0.83]

40 Following previous work on emotion annotation of text (Alm et al. [sent-89, score-0.383]

41 , 2005; Strapparava and Mihalcea, 2007), to annotate the emotions in songs we use the six basic emotions proposed by (Ekman, 1993): ANGER, DISGUST, FEAR, JOY, SADNESS, SURPRISE. [sent-90, score-1.02]

42 The annotators were instructed to: (1) Score the emotions from the writer's perspective, not their own perspective; (2) Read and interpret each line in context; i.e. [sent-98, score-0.429]

43 , they were asked to read and understand the entire song before producing any annotations; (3) Produce the six emotion annotations independently of each other, accounting for the fact that a line could contain none, one, or multiple emotions. [sent-100, score-0.69]

44 First, in each song we inserted a “checkpoint” at a random position in the song: a fake line that reads “Please enter 7 for each of the six emotions.” [sent-105, score-0.45]

45 Second, for each remaining annotator, we calculated the Pearson correlation between her emotion scores and the average emotion scores of all the other annotators. [sent-107, score-0.788]
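
The sketch below illustrates this leave-one-out reliability check: each annotator's score vector is correlated (Pearson) with the average of the remaining annotators. The threshold value is an assumption, since this summary does not quote the paper's cutoff.

```python
import numpy as np

def reliable_annotators(scores, threshold=0.5):
    """scores: array of shape (annotators, items); returns indices to keep."""
    keep = []
    for a in range(scores.shape[0]):
        others = np.delete(scores, a, axis=0).mean(axis=0)
        r = np.corrcoef(scores[a], others)[0, 1]
        if r >= threshold:  # NaN (e.g., a constant annotator) also drops out
            keep.append(a)
    return keep

scores = np.array([[0., 3., 7., 2.], [1., 3., 6., 2.], [7., 7., 7., 7.]])
print(reliable_annotators(scores))  # the constant "enter 7" annotator drops out
```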

46 The final annotations are produced by averaging the emotion scores produced by the reliable annotators. [sent-112, score-0.431]

47 Figure 3 shows an example of the emotion scores produced for two lines. [sent-113, score-0.383]

48 For each of the six emotions, Table 2 shows the number of lines that had that emotion present (i.e. [sent-116, score-0.459]

49 , the score of the emotion was different from 0), as well as the average score for that emotion over all 4,976 lines in the corpus. [sent-118, score-0.804]

50 Perhaps not surprisingly, the emotions that are dominant in the corpus are JOY and SADNESS, which are the emotions that are often invoked by people as the reason behind a song. [sent-119, score-0.748]

51 Note that the emotions do not exclude each other: i.e. [sent-120, score-0.365]

52 , a line that is labeled as containing JOY may also contain a certain amount of SADNESS, which is the reason for the high percentage of songs containing both JOY and SADNESS. [sent-122, score-0.292]

53 The emotional load for the overlapping emotions is, however, very different. [sent-123, score-0.395]

54 Table 2: Emotions in the corpus of 100 songs: number of lines including a certain emotion, and average emotion score computed over all the 4,976 lines. [sent-131, score-0.421]

55 4 Experiments and Evaluations Through our experiments, we seek to establish the extent to which we can automatically determine the emotional load of each line in a song, for each of the six emotion dimensions. [sent-134, score-0.491]

56 We use two main classes of features: textual features, which build upon the textual representation of the lyrics; and musical features, which rely on the musical notation associated with the songs. [sent-135, score-1.008]

57 The first set of experiments is intended to determine the usefulness of the textual features for emotion classification. [sent-137, score-0.481]

58 The second set specifically focuses on the musical features. [sent-138, score-0.413]

59 Finally, the last set of experiments makes joint use of textual and musical features. [sent-139, score-0.496]
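
The joint setting can be read as early fusion; a tiny sketch under that assumption (this summary does not spell out the exact combination scheme) is:

```python
def joint_features(textual_vec, musical_vec):
    # Early-fusion assumption: concatenate both modalities so a single
    # classifier sees textual and musical evidence at once.
    return list(textual_vec) + list(musical_vec)

print(joint_features([1, 0, 2], [7, -2]))  # [1, 0, 2, 7, -2]
```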

60 4.1 Textual Features First, we attempt to identify the emotions in a line by relying exclusively on the features that can be derived from the lyrics of the song. [sent-143, score-0.718]

61 We decided to focus on those features that were successfully used in the past for emotion classification (Strapparava and Mihalcea, 2008). [sent-144, score-0.437]

62 We use a bag-of-words representation of the lyrics to derive unigram counts, which are then used as input features. [sent-147, score-0.317]

63 First, we build a vocabulary consisting of all the words, including stopwords, occurring in the lyrics of the training set. [sent-148, score-0.298]
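
A minimal sketch of this unigram representation, assuming whitespace tokenization; the sample lyric lines are illustrative, not from the corpus.

```python
from collections import Counter

train_lines = ["all my troubles seemed so far away",
               "all you need is love"]
# Vocabulary over the training lyrics, stopwords included.
vocab = sorted({w for line in train_lines for w in line.split()})

def unigram_vector(line):
    counts = Counter(line.split())
    return [counts.get(w, 0) for w in vocab]

print(vocab)
print(unigram_vector("all my love"))
```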

64 WordNet Affect (WA) uses several resources for affective information, including the emotion classification of Ortony (Ortony et al. [sent-165, score-0.481]

65 From WA, we extract the words corresponding to the six basic emotions used in our experiments. [sent-167, score-0.403]

66 In a second set of experiments, we explore the role played by the musical features. [sent-188, score-0.413]

67 While the musical notation of a song offers several characteristics that could potentially be useful for our classification experiments, we focus on the notes and the key. [sent-189, score-0.654]

68 A note is a sign used in the musical notation associated with a song, to represent the relative duration and pitch of a sound. [sent-193, score-0.506]

69 In traditional music theory, the notes are represented using the first seven letters of the alphabet (C-D-E-F-G-A-B), although other notations can also be used. [sent-194, score-0.455]

70 For instance, a song in “the key of C minor” means that the song is harmonically centered on the note C, and it makes use of the minor scale whose first note is C. [sent-205, score-0.441]
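
To make the key description concrete, the sketch below builds the natural minor scale on a given tonic from its half-step pattern; enharmonic spelling (D# vs. Eb) is simplified to sharps.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MINOR_STEPS = [0, 2, 3, 5, 7, 8, 10]  # natural minor scale, in half-steps

def minor_scale(tonic="C"):
    root = NOTE_NAMES.index(tonic)
    return [NOTE_NAMES[(root + step) % 12] for step in MINOR_STEPS]

print(minor_scale("C"))  # ['C', 'D', 'D#', 'F', 'G', 'G#', 'A#'] (C minor)
```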

71 Table 4: Evaluations using musical features: notes, key, and all the musical features. [sent-213, score-0.413]

72 To explore the usefulness of the joint lyrics and music representation, we also run a set of experiments that use all the textual and musical features. [sent-216, score-1.205]

73 Table 5: Evaluations using both textual and musical features. [sent-240, score-0.496]

74 5 Discussion One clear conclusion can be drawn from these experiments: the textual and musical features are both useful for the classification of emotions in songs, and, more importantly, their joint use leads to the highest classification results. [sent-241, score-0.954]

75 The joint classifier brings an error rate reduction of up to 31% with respect to the classifier that uses only musical features. [sent-244, score-0.413]
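
The relative error rate reduction quoted here is (e_base − e_joint) / e_base; a small sketch with made-up error rates:

```python
def error_rate_reduction(e_base, e_joint):
    # Relative reduction of the joint classifier's error over a baseline.
    return (e_base - e_joint) / e_base

print(f"{error_rate_reduction(0.29, 0.20):.1%}")  # 31.0% relative reduction
```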

76 This supports the idea that lyrics and music represent orthogonal dimensions for the classification of emotions in songs. [sent-245, score-1.113]

77 Among the six emotions considered, the largest improvements are observed for JOY, SADNESS, and ANGER. [sent-246, score-0.403]

78 Nonetheless, the addition of the musical features brings clear improvements, as shown in the last column from the same table. [sent-250, score-0.428]

79 Thus, to generate the coarse binary annotations, if the score of an emotion is below 3, we record it as “negative” (i.e. [sent-273, score-0.416]

80 , the emotion is absent), whereas if the score is equal to or above 3, we record it as “positive” (i.e., the emotion is present). [sent-275, score-0.399]
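
A one-line sketch of this binarization rule, with the threshold of 3 taken directly from the text above:

```python
def binarize(score, threshold=3):
    # Scores below 3 -> emotion absent; 3 or above -> emotion present.
    return "positive" if score >= threshold else "negative"

for s in [0, 2.5, 3, 6.7]:
    print(s, binarize(s))
```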

81 Table 7 shows the results obtained for each of the six emotions, and for the three major settings that we considered: textual features only, musical features only, and a classifier that jointly uses the textual and the musical features. [sent-280, score-1.076]

82 As seen from the table, on average, the joint use of textual and musical features is also beneficial for this binary coarser-grained classification. [sent-283, score-0.511]

83 Table 8: Results obtained in previous work on emotion classification. [sent-287, score-0.399]

84 The largest gains are obtained for those emotions that are dominant in the corpus, i.e., JOY and SADNESS. [sent-288, score-0.383]

85 The improvement obtained with the classifiers is much smaller for the other emotions (or even absent). [sent-291, score-0.403]

86 There is no previous research that has considered the joint use of lyrics and music representations for emotion classification at line level, and thus we cannot draw a direct comparison with other work on emotion classification in songs. [sent-295, score-1.434]

87 Nonetheless, as a point of reference, we consider the previous work done on emotion classification of texts. [sent-296, score-0.422]

88 Table 8 shows the results obtained in previous work for the recognition of emotions in a corpus consisting of 1,000 news headlines (Strapparava and Mihalcea, 2007) annotated for the same six emotions. [sent-297, score-0.441]

89 Specifically, the table shows the best overall correlation results obtained by the three emotion recognition systems in the SEMEVAL task on Affective Text (Strapparava and Mihalcea, 2007): (Chaumartin, 2007; Kozareva et al., 2007; Katz et al., 2007). [sent-298, score-0.443]

90 Except for one emotion (FEAR), the correlation figures we obtain are significantly higher than those reported in previous work. [sent-302, score-0.405]

91 6 Conclusions Popular songs express universally understood meanings and embody experiences and feelings shared by many, usually through a combined effect of both music and lyrics. [sent-305, score-0.706]

92 In this paper, we introduced a novel corpus of music and lyrics, annotated for emotions at line level, and we used this corpus to explore the automatic recognition of emotions in songs. [sent-306, score-0.792]

93 Through experiments carried out on the dataset of 100 songs, we showed that emotion recognition can be performed using either textual or musical features, and that the joint use of lyrics and music can improve significantly over classifiers that use only one dimension at a time. [sent-307, score-1.632]

94 Acknowledgments The authors are grateful to Rajitha Schellenberg for her help with collecting the emotion annotations. [sent-309, score-0.383]

95 Emotions from text: Machine learning for text-based emotion prediction. [sent-318, score-0.383]

96 Communication of emotion in vocal expression and music performance: Different channels, same code? [sent-363, score-0.83]

97 Music emotion recognition: A state of the art review. [sent-396, score-0.383]

98 An ethnomusicologist contemplates universals in musical sound and musical culture. [sent-423, score-0.826]

99 LyricAlly: Automatic synchronization of acoustic musical signals and textual lyrics. [sent-541, score-0.535]

100 Lyric-based song sentiment classification with sentiment vector space model. [sent-550, score-0.259]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('musical', 0.413), ('music', 0.411), ('emotion', 0.383), ('emotions', 0.365), ('lyrics', 0.298), ('midi', 0.264), ('songs', 0.252), ('song', 0.186), ('strapparava', 0.098), ('textual', 0.083), ('joy', 0.08), ('sadness', 0.079), ('affective', 0.059), ('melody', 0.057), ('juslin', 0.046), ('pitch', 0.046), ('notes', 0.044), ('audio', 0.044), ('annotations', 0.043), ('evaluations', 0.04), ('line', 0.04), ('classification', 0.039), ('six', 0.038), ('ablation', 0.038), ('lines', 0.038), ('mihalcea', 0.037), ('tracks', 0.036), ('pennebaker', 0.036), ('ortony', 0.034), ('provider', 0.034), ('files', 0.034), ('minor', 0.032), ('duration', 0.031), ('multimedia', 0.03), ('anger', 0.03), ('emotional', 0.03), ('liwc', 0.03), ('pearson', 0.028), ('fear', 0.027), ('signals', 0.024), ('annotators', 0.024), ('yang', 0.024), ('reliable', 0.023), ('beatles', 0.023), ('cataltepe', 0.023), ('ekman', 0.023), ('embody', 0.023), ('harmony', 0.023), ('karageorghis', 0.023), ('kinematic', 0.023), ('laurier', 0.023), ('mahedero', 0.023), ('protocol', 0.023), ('rauscher', 0.023), ('rizo', 0.023), ('scherer', 0.023), ('sloboda', 0.023), ('sport', 0.023), ('velusamy', 0.023), ('recognition', 0.022), ('correlation', 0.022), ('classifiers', 0.022), ('rada', 0.02), ('surprise', 0.02), ('absent', 0.02), ('moods', 0.02), ('tenfold', 0.02), ('motion', 0.02), ('harmonically', 0.02), ('somehow', 0.02), ('disgust', 0.02), ('feelings', 0.02), ('mood', 0.02), ('vocal', 0.02), ('retrieval', 0.019), ('unigram', 0.019), ('dominant', 0.018), ('organizes', 0.018), ('inquiry', 0.018), ('headline', 0.018), ('alm', 0.018), ('pop', 0.018), ('wa', 0.018), ('key', 0.017), ('sentiment', 0.017), ('coarse', 0.017), ('night', 0.016), ('crowdsourcing', 0.016), ('katz', 0.016), ('voice', 0.016), ('obtained', 0.016), ('record', 0.016), ('notation', 0.016), ('expression', 0.016), ('groups', 0.016), ('acoustic', 0.015), ('recommendation', 0.015), ('recording', 0.015), ('features', 0.015), ('electronic', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 87 emnlp-2012-Lyrics, Music, and Emotions

Author: Rada Mihalcea ; Carlo Strapparava

Abstract: In this paper, we explore the classification of emotions in songs, using the music and the lyrics representation of the songs. We introduce a novel corpus of music and lyrics, consisting of 100 songs annotated for emotions. We show that textual and musical features can both be successfully used for emotion recognition in songs. Moreover, through comparative experiments, we show that the joint use of lyrics and music brings significant improvements over each of the individual textual and musical classifiers, with error rate reductions of up to 31%.

2 0.14656518 83 emnlp-2012-Lexical Differences in Autobiographical Narratives from Schizophrenic Patients and Healthy Controls

Author: Kai Hong ; Christian G. Kohler ; Mary E. March ; Amber A. Parker ; Ani Nenkova

Abstract: We present a system for automatic identification of schizophrenic patients and healthy controls based on narratives the subjects recounted about emotional experiences in their own life. The focus of the study is to identify the lexical features that distinguish the two populations. We report the results of feature selection experiments that demonstrate that the classifier can achieve accuracy on patient level prediction as high as 76.9% with only a small set of features. We provide an in-depth discussion of the lexical features that distinguish the two groups and the unexpected relationship between emotion types of the narratives and the accuracy of patient status prediction.

3 0.03461767 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

Author: Thomas Lin ; Mausam ; Oren Etzioni

Abstract: Entity linking systems link noun-phrase mentions in text to their corresponding Wikipedia articles. However, NLP applications would gain from the ability to detect and type all entities mentioned in text, including the long tail of entities not prominent enough to have their own Wikipedia articles. In this paper we show that once the Wikipedia entities mentioned in a corpus of textual assertions are linked, this can further enable the detection and fine-grained typing of the unlinkable entities. Our proposed method for detecting unlinkable entities achieves 24% greater accuracy than a Named Entity Recognition baseline, and our method for fine-grained typing is able to propagate over 1,000 types from linked Wikipedia entities to unlinkable entities. Detection and typing of unlinkable entities can increase yield for NLP applications such as typed question answering.

4 0.030861326 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types

Author: Ndapandula Nakashole ; Gerhard Weikum ; Fabian Suchanek

Abstract: This paper presents PATTY: a large resource for textual patterns that denote binary relations between entities. The patterns are semantically typed and organized into a subsumption taxonomy. The PATTY system is based on efficient algorithms for frequent itemset mining and can process Web-scale corpora. It harnesses the rich type system and entity population of large knowledge bases. The PATTY taxonomy comprises 350,569 pattern synsets. Random-sampling-based evaluation shows a pattern accuracy of 84.7%. PATTY has 8,162 subsumptions, with a random-sampling-based precision of 75%. The PATTY resource is freely available for interactive access and download.

5 0.026182234 51 emnlp-2012-Extracting Opinion Expressions with semi-Markov Conditional Random Fields

Author: Bishan Yang ; Claire Cardie

Abstract: Extracting opinion expressions from text is usually formulated as a token-level sequence labeling task tackled using Conditional Random Fields (CRFs). CRFs, however, do not readily model potentially useful segment-level information like syntactic constituent structure. Thus, we propose a semi-CRF-based approach to the task that can perform sequence labeling at the segment level. We extend the original semi-CRF model (Sarawagi and Cohen, 2004) to allow the modeling of arbitrarily long expressions while accounting for their likely syntactic structure when modeling segment boundaries. We evaluate performance on two opinion extraction tasks, and, in contrast to previous sequence labeling approaches to the task, explore the usefulness of segment- level syntactic parse features. Experimental results demonstrate that our approach outperforms state-of-the-art methods for both opinion expression tasks.

6 0.025565777 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories

7 0.023731064 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

8 0.022092393 126 emnlp-2012-Training Factored PCFGs with Expectation Propagation

9 0.021863788 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

10 0.021409186 97 emnlp-2012-Natural Language Questions for the Web of Data

11 0.020201758 34 emnlp-2012-Do Neighbours Help? An Exploration of Graph-based Algorithms for Cross-domain Sentiment Classification

12 0.019756999 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering

13 0.019620029 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets

14 0.01954235 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures

15 0.019272374 17 emnlp-2012-An "AI readability" Formula for French as a Foreign Language

16 0.018753549 116 emnlp-2012-Semantic Compositionality through Recursive Matrix-Vector Spaces

17 0.018417882 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

18 0.017801814 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

19 0.017519485 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes

20 0.016935499 15 emnlp-2012-Active Learning for Imbalanced Sentiment Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.074), (1, 0.023), (2, 0.002), (3, 0.037), (4, 0.01), (5, 0.009), (6, 0.014), (7, 0.009), (8, 0.004), (9, -0.009), (10, -0.003), (11, -0.014), (12, -0.082), (13, 0.015), (14, -0.003), (15, -0.043), (16, -0.019), (17, -0.052), (18, -0.029), (19, 0.022), (20, -0.054), (21, -0.004), (22, 0.069), (23, 0.042), (24, -0.051), (25, 0.125), (26, 0.308), (27, -0.154), (28, -0.477), (29, 0.119), (30, -0.005), (31, 0.146), (32, -0.165), (33, -0.189), (34, 0.094), (35, 0.031), (36, 0.143), (37, -0.002), (38, 0.13), (39, -0.108), (40, -0.069), (41, 0.009), (42, 0.049), (43, -0.061), (44, -0.092), (45, 0.061), (46, -0.01), (47, -0.035), (48, -0.02), (49, 0.141)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97003257 87 emnlp-2012-Lyrics, Music, and Emotions

Author: Rada Mihalcea ; Carlo Strapparava

Abstract: In this paper, we explore the classification of emotions in songs, using the music and the lyrics representation of the songs. We introduce a novel corpus of music and lyrics, consisting of 100 songs annotated for emotions. We show that textual and musical features can both be successfully used for emotion recognition in songs. Moreover, through comparative experiments, we show that the joint use of lyrics and music brings significant improvements over each of the individual textual and musical classifiers, with error rate reductions of up to 31%.

2 0.82652849 83 emnlp-2012-Lexical Differences in Autobiographical Narratives from Schizophrenic Patients and Healthy Controls

Author: Kai Hong ; Christian G. Kohler ; Mary E. March ; Amber A. Parker ; Ani Nenkova

Abstract: We present a system for automatic identification of schizophrenic patients and healthy controls based on narratives the subjects recounted about emotional experiences in their own life. The focus of the study is to identify the lexical features that distinguish the two populations. We report the results of feature selection experiments that demonstrate that the classifier can achieve accuracy on patient level prediction as high as 76.9% with only a small set of features. We provide an in-depth discussion of the lexical features that distinguish the two groups and the unexpected relationship between emotion types of the narratives and the accuracy of patient status prediction.

3 0.15555325 61 emnlp-2012-Grounded Models of Semantic Representation

Author: Carina Silberer ; Mirella Lapata

Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.

4 0.15315352 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types

Author: Ndapandula Nakashole ; Gerhard Weikum ; Fabian Suchanek

Abstract: This paper presents PATTY: a large resource for textual patterns that denote binary relations between entities. The patterns are semantically typed and organized into a subsumption taxonomy. The PATTY system is based on efficient algorithms for frequent itemset mining and can process Web-scale corpora. It harnesses the rich type system and entity population of large knowledge bases. The PATTY taxonomy comprises 350,569 pattern synsets. Random-sampling-based evaluation shows a pattern accuracy of 84.7%. PATTY has 8,162 subsumptions, with a random-sampling-based precision of 75%. The PATTY resource is freely available for interactive access and download.

5 0.15064025 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories

Author: Afra Alishahi ; Grzegorz Chrupala

Abstract: Learning the meaning of words from ambiguous and noisy context is a challenging task for language learners. It has been suggested that children draw on syntactic cues such as lexical categories of words to constrain potential referents of words in a complex scene. Although the acquisition of lexical categories should be interleaved with learning word meanings, it has not previously been modeled in that fashion. In this paper, we investigate the interplay of word learning and category induction by integrating an LDA-based word class learning module with a probabilistic word learning model. Our results show that the incrementally induced word classes significantly improve word learning, and their contribution is comparable to that of manually assigned part of speech categories. 1 Learning the Meaning of Words For young learners of a natural language, mapping each word to its correct meaning is a challenging task. Words are often used as part of an utterance rather than in isolation. The meaning of an utterance must be inferred from among numerous possible interpretations that the (usually complex) surrounding scene offers. In addition, the linguistic and visual context in which words are heard and used is often noisy and highly ambiguous. Particularly, many words in a language are polysemous and have different meanings. Various learning mechanisms have been proposed for word learning. One well-studied mechanism is cross-situational learning, a bottom-up strategy based on statistical co-occurrence of words and referents across situations (Quine 1960, Pinker 1989). 643 Grzegorz Chrupała gchrupala@ l .uni-s aarland .de sv Spoken Language Systems Saarland University, Germany Several experimental studies have shown that adults and children are sensitive to cross-situational evidence and use this information for mapping words to objects, actions and properties (Smith and Yu 2007, Monaghan and Mattock 2009). A number of computational models have been developed based on this principle, demonstrating that cross-situational learning is a powerful and efficient mechanism for learning the correct mappings between words and meanings from noisy input (e.g. Siskind 1996, Yu 2005, Fazly et al. 2010). Another potential source of information that can help the learner to constrain the relevant aspects of a scene is the sentential context of a word. It has been suggested that children draw on syntactic cues provided by the linguistic context in order to guide word learning, a hypothesis known as syntactic bootstrapping (Gleitman 1990). There is substantial evidence that children are sensitive to the structural regularities of language from a very young age, and that they use these structural cues to find the referent of a novel word (e.g. Naigles and Hoff-Ginsberg 1995, Gertner et al. 2006). In particular, young children have robust knowledge of some of the abstract lexical categories such as nouns and verbs (e.g. Gelman and Taylor 1984, Kemp et al. 2005). Recent studies have examined the interplay of cross-situational learning and sentence-level learning mechanisms, showing that adult learners of an artificial language can successfully and simultaneously apply cues and constraints from both sources of information when mapping words to their referents (Gillette et al. 1999, Lidz et al. 2010, Koehne and Crocker 2010; 2011). 
Several computational models have also investigated this interaction by adding manually annotated part-of-speech tags as PLraoncge uadgineg Lse oafr tnhineg 2,0 p1a2g Jeosin 64t C3–o6n5f4e,re Jnecjue Iosnla Enmd,p Kiroicraela, M 1e2t–h1o4ds Ju ilny N 20a1tu2r.a ?lc L2a0n1g2ua Agseso Pcrioactieosnsi fnogr a Cnodm Cpoumtaptiuotna tilo Lnianlg Nuaist uircasl input to word learning algorithms, and suggesting that integration of lexical categories can boost the performance of a cross-situational model (Yu 2006, Alishahi and Fazly 2010). However, none of the existing experimental or computational studies have examined the acquisition of word meanings and lexical categories in parallel. They all make the simplifying assumption that prior to the onset of word learning, the categoriza- tion module has already formed a relatively robust set of lexical categories. This assumption can be justified in the case of adult learners of a second or artificial language. But children’s acquisition of categories is most probably interleaved with the acquisition of word meaning, and these two processes must ultimately be studied simultaneously. In this paper, we investigate concurrent acquisition of word meanings and lexical categories. We use an online version of the LDA algorithm to induce a set of word classes from child-directed speech, and integrate them into an existing probabilistic model of word learning which combines cross-situational evidence with cues from lexical categories. Through a number of simulations of a word learning scenario, we show that our automatically and incrementally induced categories significantly improve the performance ofthe word learning model, and are closely comparable to a set of goldstandard, manually-annotated part of speech tags. 2 A Word Learning Model We want to investigate whether lexical categories (i.e. word classes) that are incrementally induced from child-directed speech can improve the performance of a cross-situational word learning model. For this purpose, we use the model of Alishahi and Fazly (2010). This model uses a probabilistic learning algorithm for combining evidence from word– referent co-occurrence statistics and the meanings associated with a set of pre-defined categories. They use child-directed utterances, manually annotated with a small set of part of speech tags, from the Manchester corpus (Theakston et al. 2001) in the CHILDES database (MacWhinney 1995). Their experimental results show that integrating these goldstandard categories into the algorithm boosts its performance over a pure cross-situational version. 644 The model of Alishahi and Fazly (2010) has the suitable architecture for our goal: it provides an integrated learning mechanism which combines evidence from word-referent co-occurrence with cues from the meaning representation associated with word categories. However, the model has two major shortcomings. First, it assumes that lexical categories are formed and finalized prior to the onset of word learning and that a correct and unique category for a target word can be identified at each point in time, assumptions that are highly unlikely. Second, it does not handle any ambiguity in the meaning of a word. Instead, each word is assumed to have only one correct meaning. Considering the high level of lexical ambiguity in most natural languages, this assumption unreasonably simplifies the word learning problem. 
To investigate the plausibility of integrating word and category learning, we use an online algorithm for automatically and incrementally inducing a set of lexical categories. Moreover, we use each word in its original form instead of lemmatizing them, which implies that categories contain different morphological forms of the same word. By applying these changes, we are able to study the contribution of lexical categories to word learning in a more realistic scenario. Representation of input. The input to the model consists of a sequence of utterances, each paired with a representation of an observed scene. We represent an utterance as a set of words, U = {w} (e.g. {she, went, home, ... }), aonfd w wthored corresponding scene as a set not,f hsoemmaen, .ti.c.} features, S c = {f} (e.g. {ANIMATE, HUMAN, FEMALE, ...}). Word and category meaning. We represent the meaning of a word as a time-dependent probability distribution over all the semantic features, where (f|w) |isw t)h oev probability mofa fnetaictur feea f rbees-, ing associa(tefd|w ww)i tish twhoerd p w aatb tiilmitye t o. fI fne tahteu raeb sfen bceeof any prior knowledge, the model assumes a uniform distribution over all features as the meaning of a novel word. Also, a function (w) gives us the category to which a word w in utterance belongs. At each point in time, a category c contains a set of word tokens. We assign a meaning to each category as a weighted sum of the meaning learned so far for each of its members, or p(t) (f|c) = (1/|c|) Pw∈c (f|w), where |c| is the nu(fm|bc)er o=f (w1o/r|dc |t)oPkenws∈ icn c a(tf t|hwe )c,u wrrheenrte em |co|m iesn tht. p(t) p(t)(·|w) cat(t) U(t) p(t) Learning algorithm. Given an utterance-scene pair (U(t) , S(t) ) received at time t, the model first calculates an alignment score a for each word w ∈ laantde se aanch a sleigmnamnetinct f secaoturere a f ∈ Sea(tc)h . Aw oserdm want ∈ic Ufea- U(t) × taunrde can h b see aligned fteoa a wreo frd ∈ according to the meaning acquired for that word from previous observations (word-based alignment, or aw). Alternatively, distributional clues of the word can be used to determine its category, and the semantic features can be aligned to the word according to the meaning associated to its category (category-based alignment, or ac). We combine these two sources of evidence when estimating an alignment score: a(w|f, U(t), S(t)) = λ(w) +(1 − λ(w)) U(t), S(t)) ac(w|f, U(t), S(t)) aw(w|f, (1) where the word-based and category-based alignment scores are estimated based on the acquired meanings of the word and its category, respectively: aw(w|f,U(t),S(t)) = Xp(t−p1(t)(−f1|)w(f)|wk) wkX∈U(t) ac(w|f,U(t),S(t)) = Xp(t−p1()t(−f1)|(cfat|c(awt)()wk)) wkX∈U(t) The relative contribution ofthe word-based versus the category-based alignment is determined by the weight function λ(w). Cross-situational evidence is a reliable cue for frequent words; on the other hand, the category-based score is most informative when the model encounters a low-frequency word (See Alishahi and Fazly (2010) for a full analysis of the frequency effect). Therefore, we define λ(w) as a function of the frequency of the word n(w) : λ(w) = n(w)/(n(w) + 1) Once an alignment score is calculated for each word w ∈ and each feature f ∈ S(t) , the model rweovirsdes w t ∈he U meanings cohf aeallt utrhee fw ∈ord Ss in and U(t) U(t) 645 their corresponding categories as follows: assoc(t)(w, f) = assoc(t−1)(w, f) + a(w|f, U(t), S(t)) where assoc(t−1) (w, f) is zero if w and f have not co-occurred before. 
These association scores are then used to update the meaning of the words in the current input: × p(t)(f|w) =Xasasossco(tc)( tf),(f wj,) w) fXj X∈F (2) where F is the set of all features seen so far. We use a hsmeroeo Fthe isd thveers sieotn o of fa ltlh fies aftourrmesul sea eton asocc faorm.m Woed uastee noisy or rare input. This process is repeated for all the input pairs, one at a time. Uniform categories. Adding the category-based alignment as a new factor to Eqn. (1) might imply that the role of categories in this model is nothing more than smoothing the cross-situational-based alignment of words and referents. In order to investigate this issue, we use the following alignment formula as an informed baseline in our experiments, where we replace ac(· |f, U(t) , S(t)) with a uniform distribution:1 a(w|f, U(t), S(t)) = λ(w) U(t), S(t)) +(1 − λ(w)) ×|U1(t)| aw(w|f, (3) where aw (w| f, U(t) , S(t) ) and λ(w) are estimated as before. In( our experiments in Section 4, we refer to this baseline as the ‘uniform’ condition. 3 Online induction of word classes with LDA Empirical findings suggest that young children form their knowledge of abstract categories, such as verbs, nouns, and adjectives, gradually (e.g. Gelman and Taylor 1984, Kemp et al. 2005). In addition, several unsupervised computational models have been proposed for inducing categories of words which resemble part-of-speech categories, by 1We thank an anonymous reviewers for suggesting this dition as an informed baseline. con- drawing on distributional properties of their context (see for example Redington et al. 1998, Clark 2000, Mintz 2003, Parisien et al. 2008, Chrupała and Alishahi 2010). However, explicit accounts of how such categories can be integrated in a crosssituational model of word learning have been rare. Here we adopt an online version of the model proposed in Chrupała (201 1), a method of soft word class learning using Latent Dirichlet Allocation. The approach is much more efficient than the commonly used alternative (Brown clustering, (Brown et al. 1992)) while at the same time matching or outperforming it when the word classes are used as automatically learned features for supervised learning of various language understanding tasks. Here we adopt this model as our approach to learning lexical categories. In Section 3.1 we describe the LDA model for word classes; in Section 3.2 we discuss the online Gibbs sampler we use for inference. 3.1 Word class learning with LDA Latent Dirichlet Allocation (LDA) was introduced by Blei et al. (2003) and is most commonly used for modeling the topic structure in document collections. It is a generative, probabilistic hierarchical Bayesian model that induces a set of latent variables, which correspond to the topics. The topics themselves are multinomial distributions over words. The generative structure of the LDA model is the following: φk ∼ Dirichlet(β) , k ∈ [1, K] zθdnd∼ D Ciraitcehgloetr(icαa),l(θd), wnd ∼ dnd ∈∈ [1 [1,D,N]d] Categorical(φznd ) , nd (4) ∈ [1, Nd] Chrupała (201 1) reinterprets the LDA model in terms of word classes as follows: K is the number of classes, D is the number of unique word types, Nd is the number ofcontext features (such as right or left neighbor) associated with word type d, znd is the class ofword type d in the ntdh context, and wnd is the ntdh context feature of word type d. Hyperparameters α and β control the sparseness of the vectors θd and φk. 
646 Wordtype Features HowdoR do you HowL doL youR youL doR Table 1: Matrix of context features 1.8M words (CHILDES) 100M words (BNC) tsbgmrhioav oinekseycrbhaolrb ilnetbhgiesbJcmula nsceinkswMlwaionhalrmgituceahnge Table 2: Most similar word pairs As an example consider the small corpus consisting of the single sentence How do you do. The rows in Table 1 show the features w1 . . . wNd for each word type d if we use each word’s left and right neighbors as features, and subscript words with L and R to indicate left and right. After inference, the θd parameters correspond to word class probability distributions given a word type while the φk correspond to feature distributions given a word class: the model provides a probabilistic representation for word types independently of their context, and also for contexts independently of the word type. Probabilistic, soft word classes are more expressive than hard categories. First, they make it easy and efficient to express shared ambiguities: Chrupała (201 1) gives an example of words used as either first names or surnames, and this shared ambiguity is reflected in the similarity of their word class distributions. Second, with soft word classes it becomes easy to express graded similarity between words: as an example, Table 2 shows a random selection out of the 100 most similar word pairs according to the Jensen-Shannon divergence between their word class distributions, according to a word class model with 25 classes induced from (i) 1.8 million words of the CHILDES corpus or (ii) 100 million word of the BNC corpus. The similarities were measured between each of the 1000 most frequent CHILDES or BNC words. 3.2 Online Gibbs sampling for LDA There have been a number of attempts to develop online inference algorithms for topic modeling with LDA. A simple modification of the standard Gibbs sampler (o-LDA) was proposed by Song et al. (2005) and Banerjee and Basu (2007). Canini et al. (2009) experiment with three sampling algorithms for online topic inference: (i) oLDA, (ii) incremental Gibbs sampler, and (iii) a par- ticle filter. Only o-LDA is truly online in the sense that it does not revisit previously seen documents. The other two, the incremental Gibbs sampler and the particle filter, keep seen documents and periodically resample them. In Canini et al.’s experiments all of the online algorithms perform worse than the standard batch Gibbs sampler on a document clustering task. Hoffman et al. (2010) develop an online version of the variational Bayes (VB) optimization method for inference for topic modeling with LDA. Their method achieves good empirical results compared to batch VB as measured by perplexity on heldout data, especially when used with large minibatch sizes. Online VB for LDA is appropriate when streaming documents: with online VB documents are represented as word count tables. In our scenario where we apply LDA to modeling word classes we need to process context features from sentences arriving in a stream: i.e. we need to sample entries from a table like Table 1 in order of arrival rather than row by row. This means that online VB is not directly applicable to online word-class induction. However it also means that one issue with o-LDA identified by Canini et al. (2009) is ameliorated. When sampling in a topic modeling setting, documents are unique and are never seen again. Thus, the topics associated with old documents get stale and need to be periodically rejuvenated (i.e. resampled). 
This is the reason why the incremental Gibbs sampler and the particle filter algorithms in Canini et al. (2009) need to keep old documents around and cannot run in a true online fashion. Since for word class modeling we stream context features as they arrive, we will continue to see features associated with the seen word types, and will automatically resample their class assignments. In exploratory ex647 periments we have seen that this narrows the performance gap between the o-LDA sampler and the batch collapsed Gibbs sampler. We present our version of the o-LDA sampler in Algorithm 1. For each incoming sentence t we run J passes of sampling, updating the counts tables after each sampling step. We sample the class assignment zti for feature wti according to: P(zt|zt−1,wt,dt) ∝(nztt−,d1Pt+jV=t− α11)n ×ztt−, (w1njtz−t+,1wt β+ β), (5) where stands for the nPumber of times class z co-occurred with word type d up to step t, and similarly ntz,w is the number of times feature w was assigned to class z. Vt is the number of unique features seen up to step t, while α and β are the LDA hyperparameters. There are two differences between the original o-LDA and our version: we do not initialize the algorithm with a batch run over a prefix of the data, and we allow more than one sampling pass per sentence.2 Exploratory experiments have shown that batch initialization is unnecessary, and that multiple passes typically improve the quality of the induced word classes. ntz,d Algorithm 1 Online Gibbs sampler for word class induction with LDA for t = 1 → ∞ do fto =r j = →1 → dJo do fjor = =i = →1 → It do sample zti ∼ P(zti |zti−1 , wti , dti ) increment ntzti,wti and ntzti,dti Figure 1 shows the top 10 words for each of the 10 word classes induced with our online Gibbs sampler from 1.8 million words of CHILDES. Similarly, Figure 2 shows the top 10 words for 5 randomly chosen topics out of 50, learned online from 100 million words of the BNC. The topics are relatively coherent and at these levels of granularity express mostly part of speech and subcategorization frame information. Note that for each word class we show the words most frequently assigned to it while Gibbs sampling. 2Note that we do not allow multiple passes over the stream of sentences. Rather, while processing the current sentence, we allow the words in this sentence to be sampled more than once. CHILDES taIwhsyeatiofheuiwsmh’esorihntea shdoeam,tyihrewa thseliasrenhao wT,othYwaueotlbdietHcsdhaieyuors eIaitdwfmboyerwhat Figure 2: Top 10 words of 5 randomly chosen classes learned from BNC Since we are dealing with soft classes, most wordtypes have non-zero assignment probabilities for many classes. Thus frequently occurring words such as not will typically be listed for several classes. 4 Evaluation 4.1 Experimental setup As training data, we extract utterances from the Manchester corpus (Theakston et al. 2001) in the CHILDES database (MacWhinney 1995), a corpus that contains transcripts of conversations with children between the ages of 1 year, 8 months and 3 years. We use the mother’s speech from transcripts of 12 children (henceforth referred to by children’s names). 
We run word class induction while simultaneously outputting the highest scoring word-class label for each word: for a new sentence, we sample class assignments for each feature (doing J passes), update the counts, and then for each word dti output the highest scoring class label according to argmaxz ntz,dti (where ntz,dti stands for the num- 648 ber of times class z co-occurred with word type dti up to step t). During development we ran the online word class induction module on data for Aran, Becky, Carl and Anne and then started the word learning module for the Anne portion while continuing inducing categories. We then evaluated word learning on Anne. We chose the parameters of the word class induction module based on those development results: α = 10, β = 0.1, K = 10 and J = 20. We used cross-validation for the final evaluation. PFor each of six data files (Anne, Aran, Becky, Carl, Dominic and Gail), we ran word-class induction on the whole corpus with the chosen file last, and then started applying the word-learning algorithm on this last chosen file (while continuing with category induction). We evaluated how well word meanings were learned in those six cases. We follow Alishahi and Fazly (2010) in the construction of the input. We need a semantic representation paired with each utterance. Such a representation is not available from the corpus and has to be P1K=1 constructed. We automatically construct a gold lexicon for all nouns and verbs in this corpus as follows. For each word, we extract all hypernyms for its first sense in the appropriate (verb or noun) hierarchy in WordNet (Fellbaum 1998), and add the first word in the synset of each hypernym to the set of semantic features for the target word. For verbs, we also extract features from VerbNet (Kipper et al. 2006). A small subset of words (pronouns and frequent quantifiers) are also manually added. This lexicon represents the true meaning of each word, and is used in generating the scene representations in the input and in evaluation. For each utterance in the input corpus, we form the union of the feature representations of all its words. Words not found in the lexicon (i.e. for which we could not extract a semantic representation from WordNet and VerbNet) are removed from the utterance (only for the word learning module). In order to simulate the high level of noise that children receive from their environment, we follow Alishahi and Fazly (2010) and pair each utterance with a combination of its own scene representation and the scene representation for the following utter- ance. This decision was based on the intuition that consequent utterances are more likely to be about re- SUctenra:ce{BmCPAOLRoNAImOTMSCmEAU,yTOMELaPB,ItTeJH,EIVUObCEMrNToG,cAE. NTo.C,Al}Ti.BILO},EN. Figure 3: A sample input item to the word learning model lated topics and scenes. This results in a (roughly) 200% ambiguity. In addition, we remove the meaning of one random word from the scene representation of every second utterance in an attempt to simulate cases where the referent of an uttered word is not within the perception field (such as ‘daddy is not home yet’). A sample utterance and its corresponding scene are shown in Figure 3. As mentioned before, many words in our input corpus are polysemous. For such words, we extract different sets of features depending on their manually tagged part of speech and keep them in the lexicon (e.g. the lexicon contains two different entries for set:N and set:V). 
When constructing a scene representation for an utterance which contains an ambiguous word, we choose the correct sense from our lexicon according to the word’s part of speech tag in Manchester corpus. In the experiments reported in the next section, we assess the performance of our model on learning words at each point in time: for each target word, we compare its set of features in the lexicon with its probability distribution over the semantic features that the model has learned. We use mean average precision (MAP) to measure how well (· |w) ranks the features of w. p(t) 4.2 Learning curves To understand whether our categories contribute to learning of word–meaning mappings, we compare the pattern of word learning over time in four conditions. The first condition represents our baseline, in which we do not use category-based alignment in the word learning model by setting λ(w) = 1 in Eqn. (1). In the second condition we use a set of uniformly distributed categories for alignment, as estimated by Eqn. (3) on page 3 (this condition is introduced to examine whether categories act as more than a simple smoothing factor in the align649 UCNoantinefego rmryAvg.0 M0. 6 A3236PStd.0 . D.0 e3 v2 . TabPlLeDO3AS:FinalMeanAv0 e.6 r5a7g29ePrecis0 io.n0 32s09cores ment process.) In the third condition we use the categories induced by online LDA in the word learning model. The fourth condition represents the performance ceiling, in which we use the pre-defined and manually annotated part of speech categories from the Manchester corpus. Table 3 shows the average and the standard deviation of the final MAP scores across the six datasets, for the four conditions (no categories, uniform categories, LDA categories and gold part-of-speech tags). The differences between LDA and None, and between LDA and Uniform are statistically signif- icant according to the paired t test (p < 0.01), while the difference between LDA and POS is not (p = 0.16). Figure 4 shows the learning curves in each condition, averaged over the six splits explained in the previous section. The top panel shows the average learning curve over the minimum number of sentences across the six sub-corpora (8800 sentences). The curves show that our LDA categories significantly improve the performance of the model over both baselines. That means that using these categories can improve word learning compared to not using them and relying on cross-situational evidence alone. Moreover, LDA-induced categories are not merely acting as a smoothing function the way the ‘uniform’ categories are. Our results show that they are bringing relevant information to the task at hand, that is, improving word learning by using the sentential context. In fact, this improvement is comparable to the improvement achieved by integrating the ‘gold-standard’ POS categories. The middle and bottom panels of Figure 4 zoom in on shorter time spans (5000 and 1000 sentences, respectively). These diagrams suggest that the pat- tern of improvement over baseline is relatively constant, even at very early stages of learning. In fact, once the model receives enough input data, crosssituational evidence becomes stronger (since fewer words in the input are encountered for the first time) and the contribution of the categories becomes less significant. 4.3 Class granularity In Figure 5 we show the influence of the number of word classes used on the performance in word learning. 
4.3 Class granularity

In Figure 5 we show the influence of the number of word classes on word learning performance. It is evident that in the range between 5 and 20 classes the performance of the word learning module is quite stable and insensitive to the exact class granularity. Even with only 5 classes, the model can still roughly distinguish noun-like, verb-like and pronoun-like words, and this helps it learn the meaning elements derived from the higher levels of the WordNet hierarchy. Ideally, however, we would like to avoid having to pre-specify the number of classes for the word class induction module; we thus plan to investigate non-parametric models such as the Hierarchical Dirichlet Process for this purpose.

Figure 4: Mean average precision for all observed words at each point in time in four conditions: with gold POS categories, with LDA categories, with uniform categories, and without using categories. Each panel displays a different time span.

Figure 5: Mean average precision for all observed words at each point in time in four conditions: using online LDA categories of varying granularity (20, 10 and 5 classes), and without using categories.

5 Related Work

This paper investigates the interplay between two language learning tasks which have so far been studied in isolation: the acquisition of lexical categories from distributional clues, and learning the mapping between words and meanings. Previous models have shown that lexical categories can be learned from unannotated text, mainly drawing on the distributional properties of words (e.g. Redington et al. 1998, Clark 2000, Mintz 2003, Parisien et al. 2008, Chrupała and Alishahi 2010). Independently, several computational models have exploited cross-situational evidence in learning the correct mappings between words and meanings, using rule-based inference (Siskind 1996), neural networks (Li et al. 2004, Regier 2005), hierarchical Bayesian models (Frank et al. 2007) and probabilistic alignment inspired by machine translation models (Yu 2005, Fazly et al. 2010).

There are only a few existing computational models that explore the role of syntax in word learning. Maurits et al. (2009) investigate the joint acquisition of word meaning and word order using a batch model. This model is tested on an artificial language with a simple first-order predicate representation of meaning, and limited built-in possibilities for word order. The model of Niyogi (2002) simulates the mutual bootstrapping effects of syntactic and semantic knowledge in verb learning, that is, the use of syntax to aid in inducing the semantics of a verb, and the use of semantics to narrow down the possible syntactic frames in which a verb can participate. However, this model relies on manually assigned priors for the associations between syntactic and semantic features, and is tested on a toy language with a very limited vocabulary and a constrained syntax.

Yu (2006) integrates automatically induced syntactic word categories into his model of cross-situational word learning, showing that they can improve the model's performance. Yu's model also processes input utterances in batch mode, and its evaluation is limited to situations in which only a coarse distinction between referring words (words that could potentially refer to objects in a scene, e.g. concrete nouns) and non-referring words (words that cannot possibly refer to objects, e.g. function words) is sufficient. It is thus not clear whether information about finer-grained categories (e.g. verbs and nouns) can indeed help word learning in a more naturalistic incremental setting.
On the other hand, the model of Alishahi and Fazly (2010) integrates manually annotated part-of-speech tags into an incremental word learning algorithm, and shows that these tags boost the overall word learning performance, especially for infrequent words.

In a different line of research, a number of models have been proposed which study the acquisition of the link between syntax and semantics within the Combinatory Categorial Grammar (CCG) framework (Briscoe 1997, Villavicencio 2002, Buttery 2006, Kwiatkowski et al. 2012). These approaches set the parameters of a semantic parser on a corpus of utterances paired with logical forms representing their meanings. Such models bring in extensive and detailed prior assumptions about the nature of the syntactic representation (i.e. atomic categories such as S and NP, and built-in rules which govern their combination), as well as about the representation of meaning via the formalism of lambda calculus. This is fundamentally different from the approach taken in this paper, which assumes only very simple syntactic and semantic representations. We view word and category learning as stand-alone cognitive tasks with independent representations (word meanings as probabilistic collections of properties or features rather than single symbols; categories as sets of word tokens with similar context distributions), and we do not bring in any prior knowledge of specific atomic categories.

6 Conclusion

In this paper, we show the plausibility of using automatically and incrementally induced categories while learning word meanings. Our results suggest that the sentential context in which a word appears across its different uses can serve as a complementary source of guidance for mapping it to its featural meaning representation. In Section 4 we show that the improvement achieved by our categories is comparable to that gained by integrating gold POS categories. This result is very encouraging, since manually assigned POS tags are typically believed to set the upper bound on the usefulness of category information. We believe that automatically induced categories have the potential to do even better: Chrupała and Alishahi (2010) have shown that categories induced from usage data in an unsupervised fashion can be used more effectively than POS categories in a number of tasks. In our experiments on the development data, we observed some improvements over POS categories. This advantage may result from the fact that our categories are more fine-grained (if also noisier) than POS categories, which sometimes yields more accurate predictions.

One important characteristic of the category induction algorithm we have used in this paper is that it provides a soft categorization scheme, where each word is associated with a probability distribution over all categories. In the future, we plan to exploit this feature: when estimating the category-based alignment, we can interpolate the predictions of the multiple categories to which a word belongs, weighted by its membership probability in each category, as sketched below.
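The following minimal Python sketch illustrates this proposed interpolation, under the assumption that the induction module exposes a membership distribution p(z | w) and some per-class alignment score; both p_class and class_score are hypothetical names for illustration, not part of the current implementation:

    def soft_category_alignment(word, feature, p_class, class_score):
        # Instead of committing to the single best class for `word`,
        # mix the category-based alignment scores of all classes z,
        # weighted by the word's membership probability p(z | word).
        return sum(p * class_score(z, feature)
                   for z, p in p_class[word].items())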
Acknowledgements

Grzegorz Chrupała was funded by the German Federal Ministry of Education and Research (BMBF) under grant number 01IC10S01O as part of the Software-Cluster project EMERGENT (www.software-cluster.org).

References

Alishahi, A. and Fazly, A. (2010). Integrating syntactic knowledge into a model of cross-situational word learning. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society.

Banerjee, A. and Basu, S. (2007). Topic models over text streams: A study of batch and online unsupervised learning. In SIAM Data Mining.

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

Briscoe, T. (1997). Co-evolution of language and of the language acquisition device. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 418–427. Association for Computational Linguistics.

Brown, P. F., Mercer, R. L., Della Pietra, V. J., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Buttery, P. (2006). Computational models for first language acquisition. Computer Laboratory, University of Cambridge, Tech. Rep. UCAM-CL-TR-675.

Canini, K., Shi, L., and Griffiths, T. (2009). Online inference of topics with latent Dirichlet allocation. In Proceedings of the International Conference on Artificial Intelligence and Statistics.

Chrupała, G. (2011). Efficient induction of probabilistic word classes with LDA. In International Joint Conference on Natural Language Processing.

Chrupała, G. and Alishahi, A. (2010). Online entropy-based model of lexical category acquisition. In CoNLL 2010.

Clark, A. (2000). Inducing syntactic categories by context distribution clustering. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, pages 91–94. Association for Computational Linguistics, Morristown, NJ, USA.

Fazly, A., Alishahi, A., and Stevenson, S. (2010). A probabilistic computational model of cross-situational word learning. Cognitive Science, 34(6):1017–1063.

Fellbaum, C., editor (1998). WordNet, An Electronic Lexical Database. MIT Press.

Frank, M. C., Goodman, N. D., and Tenenbaum, J. B. (2007). A Bayesian framework for cross-situational word-learning. In Advances in Neural Information Processing Systems, volume 20.

Gelman, S. and Taylor, M. (1984). How two-year-old children interpret proper and common names for unfamiliar objects. Child Development, pages 1535–1540.

Gertner, Y., Fisher, C., and Eisengart, J. (2006). Learning words and rules: Abstract knowledge of word order in early sentence comprehension. Psychological Science, 17(8):684–691.

Gillette, J., Gleitman, H., Gleitman, L., and Lederer, A. (1999). Human simulations of vocabulary learning. Cognition, 73(2):135–176.

Gleitman, L. (1990). The structural sources of verb meanings. Language Acquisition, 1:135–176.

Hoffman, M., Blei, D., and Bach, F. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems.

Kemp, N., Lieven, E., and Tomasello, M. (2005). Young children's knowledge of the "determiner" and "adjective" categories. Journal of Speech, Language and Hearing Research, 48(3):592–609.

Kipper, K., Korhonen, A., Ryant, N., and Palmer, M. (2006). Extensive classifications of English verbs. In Proceedings of the 12th EURALEX International Congress.

Koehne, J. and Crocker, M. W. (2010). Sentence processing mechanisms influence cross-situational word learning. In Proceedings of the Annual Conference of the Cognitive Science Society.

Koehne, J. and Crocker, M. W. (2011). The interplay of multiple mechanisms in word learning. In Proceedings of the Annual Conference of the Cognitive Science Society.

Kwiatkowski, T., Goldwater, S., Zettlemoyer, L., and Steedman, M. (2012).
A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.

Li, P., Farkas, I., and MacWhinney, B. (2004). Early lexical development in a self-organizing neural network. Neural Networks, 17:1345–1362.

Lidz, J., Bunger, A., Leddon, E., Baier, R., and Waxman, S. R. (2010). When one cue is better than two: lexical vs. syntactic cues to verb learning. Unpublished manuscript.

MacWhinney, B. (1995). The CHILDES Project: Tools for Analyzing Talk. Hillsdale, NJ: Lawrence Erlbaum Associates, second edition.

Maurits, L., Perfors, A. F., and Navarro, D. J. (2009). Joint acquisition of word order and word reference. In Proceedings of the 31st Annual Conference of the Cognitive Science Society.

Mintz, T. (2003). Frequent frames as a cue for grammatical categories in child directed speech. Cognition, 90(1):91–117.

Monaghan, P. and Mattock, K. (2009). Cross-situational language learning: The effects of grammatical categories as constraints on referential labeling. In Proceedings of the 31st Annual Conference of the Cognitive Science Society.

Naigles, L. and Hoff-Ginsberg, E. (1995). Input to verb learning: Evidence for the plausibility of syntactic bootstrapping. Developmental Psychology, 31(5):827–837.

Niyogi, S. (2002). Bayesian learning at the syntax–semantics interface. In Proceedings of the 24th Annual Conference of the Cognitive Science Society, pages 697–702.

Parisien, C., Fazly, A., and Stevenson, S. (2008). An incremental Bayesian model for learning syntactic categories. In Proceedings of the Twelfth Conference on Computational Natural Language Learning.

Pinker, S. (1989). Learnability and Cognition: The Acquisition of Argument Structure. Cambridge, MA: MIT Press.

Quine, W. (1960). Word and Object. Cambridge University Press, Cambridge, MA.

Redington, M., Crater, N., and Finch, S. (1998). Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science: A Multidisciplinary Journal, 22(4):425–469.

Regier, T. (2005). The emergence of words: Attentional learning in form and meaning. Cognitive Science, 29:819–865.

Siskind, J. M. (1996). A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61:39–91.

Smith, L. and Yu, C. (2007). Infants rapidly learn words from noisy data via cross-situational statistics. In Proceedings of the 29th Annual Conference of the Cognitive Science Society.

Song, X., Lin, C., Tseng, B., and Sun, M. (2005). Modeling and predicting personal information dissemination behavior. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 479–488. ACM.

Theakston, A. L., Lieven, E. V., Pine, J. M., and Rowland, C. F. (2001). The role of performance limitations in the acquisition of verb-argument structure: An alternative account. Journal of Child Language, 28:127–152.

Villavicencio, A. (2002). The acquisition of a unification-based generalised categorial grammar. In Proceedings of the Third CLUK Colloquium, pages 59–66.

Yu, C. (2005). The emergence of links between lexical acquisition and object categorization: A computational study. Connection Science, 17(3–4):381–397.

Yu, C. (2006). Learning syntax–semantics mappings to bootstrap word learning. In Proceedings of the 28th Annual Conference of the Cognitive Science Society.

6 0.1203552 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

7 0.11574066 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

8 0.11543231 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets

9 0.11305001 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage

10 0.10943068 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation

11 0.10447571 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification

12 0.098361738 126 emnlp-2012-Training Factored PCFGs with Expectation Propagation

13 0.096622005 62 emnlp-2012-Identifying Constant and Unique Relations by using Time-Series Text

14 0.093251824 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures

15 0.089439414 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context

16 0.086464442 139 emnlp-2012-Word Salad: Relating Food Prices and Descriptions

17 0.084318973 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis

18 0.082578018 41 emnlp-2012-Entity based QA Retrieval

19 0.082198821 15 emnlp-2012-Active Learning for Imbalanced Sentiment Classification

20 0.080473006 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.018), (16, 0.016), (25, 0.011), (34, 0.049), (60, 0.087), (63, 0.056), (64, 0.014), (65, 0.015), (70, 0.023), (74, 0.038), (76, 0.046), (79, 0.424), (80, 0.012), (86, 0.024), (95, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.77796978 85 emnlp-2012-Local and Global Context for Supervised and Unsupervised Metonymy Resolution

Author: Vivi Nastase ; Alex Judea ; Katja Markert ; Michael Strube

Abstract: Computational approaches to metonymy resolution have focused almost exclusively on the local context, especially the constraints placed on a potentially metonymic word by its grammatical collocates. We expand such approaches by taking into account the larger context. Our algorithm is tested on the data from the metonymy resolution task (Task 8) at SemEval 2007. The results show that incorporation of the global context can improve over the use of the local context alone, depending on the types of metonymies addressed. As a second contribution, we move towards unsupervised resolution of metonymies, made feasible by considering ontological relations as possible readings. We show that such an unsupervised approach delivers promising results: it beats the supervised most frequent sense baseline and performs close to a supervised approach using only standard lexico-syntactic features.

same-paper 2 0.76655513 87 emnlp-2012-Lyrics, Music, and Emotions

Author: Rada Mihalcea ; Carlo Strapparava

Abstract: In this paper, we explore the classification of emotions in songs, using the music and the lyrics representation of the songs. We introduce a novel corpus of music and lyrics, consisting of 100 songs annotated for emotions. We show that textual and musical features can both be successfully used for emotion recognition in songs. Moreover, through comparative experiments, we show that the joint use of lyrics and music brings significant improvements over each of the individual textual and musical classifiers, with error rate reductions of up to 31%.

3 0.63525057 35 emnlp-2012-Document-Wide Decoding for Phrase-Based Statistical Machine Translation

Author: Christian Hardmeier ; Joakim Nivre ; Jorg Tiedemann

Abstract: Independence between sentences is an assumption deeply entrenched in the models and algorithms used for statistical machine translation (SMT), particularly in the popular dynamic programming beam search decoding algorithm. This restriction is an obstacle to research on more sophisticated discourse-level models for SMT. We propose a stochastic local search decoding method for phrase-based SMT, which permits free document-wide dependencies in the models. We explore the stability and the search parameters of this method and demonstrate that it can be successfully used to optimise a document-level semantic language model.

4 0.33686 95 emnlp-2012-N-gram-based Tense Models for Statistical Machine Translation

Author: Zhengxian Gong ; Min Zhang ; Chew Lim Tan ; Guodong Zhou

Abstract: Tense is a small element of a sentence; however, incorrect tense can produce odd grammar and lead to misunderstanding. Recently, tense has drawn attention in many natural language processing applications. However, most current Statistical Machine Translation (SMT) systems depend mainly on the translation model and the language model; they do not consider or make full use of tense information. In this paper, we propose n-gram-based tense models for SMT and successfully integrate them into a state-of-the-art phrase-based SMT system via two additional features. Experimental results on the NIST Chinese-English translation task show that our proposed tense models are very effective, improving performance by 0.62 BLEU points over a strong baseline.
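The abstract above describes scoring tense sequences with an n-gram model; below is a minimal sketch of such a model under stated assumptions. The add-alpha smoothing, the tense label set, and the interface are illustrative choices, and the paper's actual two feature definitions are not reproduced here.

```python
import math
from collections import defaultdict

def train_tense_ngram_model(tense_sequences, n=3, alpha=0.1,
                            labels=("past", "present", "future", "none")):
    """Train an n-gram model over per-sentence tense label sequences with
    add-alpha smoothing, and return a log-probability scoring function."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in tense_sequences:
        padded = ["<s>"] * (n - 1) + list(seq)
        for i in range(n - 1, len(padded)):
            counts[tuple(padded[i - n + 1:i])][padded[i]] += 1.0

    def logprob(seq):
        # Score a candidate tense sequence; usable as one decoder feature.
        padded = ["<s>"] * (n - 1) + list(seq)
        lp = 0.0
        for i in range(n - 1, len(padded)):
            c = counts[tuple(padded[i - n + 1:i])]
            total = sum(c.values()) + alpha * len(labels)
            lp += math.log((c[padded[i]] + alpha) / total)
        return lp

    return logprob

# Usage: train_tense_ngram_model([["past", "past", "present"]])(["past", "present"])
```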

5 0.32891992 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation

Author: Nan Duan ; Mu Li ; Ming Zhou

Abstract: A forced derivation tree (FDT) of a sentence pair {f, e} denotes a derivation tree that can translate f into its accurate target translation e. In this paper, we present an approach that leverages structured knowledge contained in FDTs to train component models for statistical machine translation (SMT) systems. We first describe how to generate different FDTs for each sentence pair in the training corpus, and then present how to infer the optimal FDTs based on their derivation and alignment qualities. As the first step in this line of research, we verify the effectiveness of our approach in a BTG-based phrasal system and propose four FDT-based component models. Experiments are carried out on large-scale English-to-Japanese and Chinese-to-English translation tasks, and significant improvements are reported on both translation quality and alignment quality.
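The abstract above says the optimal FDTs are inferred from their derivation and alignment qualities; one plausible reading is a simple weighted combination of the two signals, sketched below. The data structure, scoring fields, and weights are all hypothetical stand-ins, not the paper's actual formulation.

```python
from dataclasses import dataclass

@dataclass
class FDT:
    tree_id: str
    derivation_score: float  # how plausible the derivation itself is
    alignment_score: float   # how well it agrees with the word alignment

def best_fdt(candidates, deriv_weight=0.5, align_weight=0.5):
    """Select the forced derivation tree with the best weighted combination
    of derivation quality and alignment quality (illustrative only)."""
    return max(candidates,
               key=lambda t: deriv_weight * t.derivation_score
                             + align_weight * t.alignment_score)

# Usage: best_fdt([FDT("t1", 0.8, 0.6), FDT("t2", 0.7, 0.9)]).tree_id  # -> "t2"
```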

6 0.32617876 118 emnlp-2012-Source Language Adaptation for Resource-Poor Machine Translation

7 0.31808484 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation

8 0.31124759 83 emnlp-2012-Lexical Differences in Autobiographical Narratives from Schizophrenic Patients and Healthy Controls

9 0.3055875 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

10 0.30536169 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction

11 0.30389589 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

12 0.3034018 11 emnlp-2012-A Systematic Comparison of Phrase Table Pruning Techniques

13 0.3008852 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

14 0.29717061 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

15 0.29459834 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?

16 0.29433748 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules

17 0.29431891 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP

18 0.29412696 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

19 0.29381272 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media

20 0.29379392 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis