
124 emnlp-2013-Leveraging Lexical Cohesion and Disruption for Topic Segmentation


Source: pdf

Author: Anca-Roxana Simon ; Guillaume Gravier ; Pascale Sebillot

Abstract: Topic segmentation classically relies on one of two criteria, either finding areas with coherent vocabulary use or detecting discontinuities. In this paper, we propose a segmentation criterion combining both lexical cohesion and disruption, enabling a trade-off between the two. We provide the mathematical formulation of the criterion and an efficient graph based decoding algorithm for topic segmentation. Experimental results on standard textual data sets and on a more challenging corpus of automatically transcribed broadcast news shows demonstrate the benefit of such a combination. Gains were observed in all conditions, with segments of either regular or varying length and abrupt or smooth topic shifts. Long segments benefit more than short segments. However the algorithm has proven robust on automatic transcripts with short segments and limited vocabulary reoccurrences.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Leveraging lexical cohesion and disruption for topic segmentation Anca Şimon (Université de Rennes 1, IRISA & INRIA Rennes), Guillaume Gravier (CNRS, IRISA & INRIA Rennes), Pascale Sébillot (INSA de Rennes, IRISA & INRIA Rennes) anca-roxana . [sent-1, score-1.212]

2 Abstract Topic segmentation classically relies on one of two criteria, either finding areas with coherent vocabulary use or detecting discontinuities. [sent-7, score-0.34]

3 In this paper, we propose a segmentation criterion combining both lexical cohesion and disruption, enabling a trade-off between the two. [sent-8, score-0.569]

4 Gains were observed in all conditions, with segments of either regular or varying length and abrupt or smooth topic shifts. [sent-11, score-0.561]

5 However the algorithm has proven robust on automatic transcripts with short segments and limited vocabulary reoccurrences. [sent-13, score-0.488]

6 1 Introduction Topic segmentation consists in making evident the semantic structure of a document: algorithms developed for this task aim at automatically detecting frontiers which define topically coherent segments in a text. [sent-14, score-0.692]

7 Various methods for topic segmentation of textual data are described in the literature, e. [sent-15, score-0.321]

8 , identifying segments with a consistent use of vocabulary, either based on words or on semantic relations between words. [sent-20, score-0.288]

9 This general principle of lexical cohesion is further exploited for topic segmentation with two radically different strategies. [sent-22, score-0.707]

10 On the one hand, a measure of the lexical cohesion can be used to determine coherent segments (Reynar, 1994; Moens and Busser, 2001; Utiyama and Isahara, 2001). [sent-23, score-0.696]

11 On the other hand, shifts in the use of vocabulary can be searched for to directly identify the segment frontiers by measuring the lexical disruption (Hearst, 1997). [sent-24, score-0.968]

12 Techniques based on the first strategy yield more accurate segmentation results, but face a problem of over-segmentation which can, up to now, only be solved by providing prior information regarding the distribution of segment length or the expected number of segments. [sent-25, score-0.457]

13 In this paper, we propose a segmentation criterion combining both cohesion and disruption along with the corresponding algorithm for topic segmentation. [sent-26, score-1.146]

14 Moreover, the combination of these two strategies enables regularizing the number of segments found without resorting to prior knowledge. [sent-28, score-0.288]

15 This piece of work uses the algorithm of Utiyama and Isahara (2001) as a starting point, a versatile and high-performing topic segmentation algorithm cast in a statistical framework. [sent-29, score-0.365]

16 Among the benefits of this algorithm are its independence from any particular domain and its ability to cope with thematic segments. [sent-30, score-0.339]

17 Moreover, the algorithm has proven to be up to the state of the art in several studies, with no need for a priori information about the number of segments (contrary to the algorithms of Malioutov and Barzilay (2006) and Eisenstein and Barzilay (2008), which can attain higher segmentation accuracy). [sent-33, score-0.541]

18 To account both for cohesion and disruption, we extend the formalism of Isahara and Utiyama using a Markovian assumption between segments in place of the independence assumption of the original algorithm. [sent-35, score-0.575]

19 Keeping their probabilistic measure of lexical cohesion unchanged, the Markovian assumption makes it possible to introduce the disruption between two consecutive segments. [sent-36, score-0.639]

20 We propose an extended graph-based decoding strategy, which is both optimal and efficient, exploiting the notion of generalized segment models, or semi-hidden Markov models. [sent-37, score-0.228]

21 Existing work on topic segmentation is presented in Section 2, emphasizing the motivations of the model we propose. [sent-44, score-0.321]

22 To skirt the issue of defining a topic, they suggest focusing on topic-shift markers and identifying topic changes, which is what most current topic segmentation methods do. [sent-51, score-0.464]

23 The most popular ones rely either on the lexical distribution information to measure lexical cohesion (i. [sent-53, score-0.419]

24 The key point with lexical cohesion is that a significant change in the use of vocabulary is considered to be a sign of topic shift. [sent-57, score-0.493]

25 This general idea translates into two families of methods: local ones, targeting a local detection of lexical disruptions, and global ones, relying on a measure of lexical cohesion to globally find segments exhibiting coherence in their lexical distribution. [sent-58, score-0.855]

26 , 2009) seek to maximize the lexical cohesion of each segment resulting from the segmentation, globally over the text. [sent-64, score-0.288]

27 A typical and state-of-the-art algorithm is that of Utiyama and Isahara (2001), whose principle is to search globally for the best path in a graph representing all possible segmentations, where edges are weighted according to lexical cohesion measured in a probabilistic way. [sent-66, score-0.486]

28 When the lengths of the respective topic segments in a text (or between two texts) are very different from one another, local methods are challenged. [sent-67, score-0.393]

29 Finding an appropriate window size and extracting boundaries become critical with segments of varying length, in particular when short segments are present. [sent-68, score-0.61]

30 The lack of a global vision also makes it difficult to normalize properly the similarities between blocks and to deal with statistics on segment length. [sent-70, score-0.233]

31 Short segments are therefore very likely to be coherent, which calls for regularization introduced as priors on segment length. [sent-72, score-0.631]

32 These considerations naturally lead to the idea of methods combining lexical cohesion and disruption to make the best of both worlds. [sent-73, score-0.891]

33 However, this method assumes that the number of segments to find is known beforehand, which makes it impractical for real-world usage. [sent-77, score-0.288]

34 3 Combining lexical cohesion and disruption We extend the graph-based formalism of Utiyama and Isahara to jointly account for lexical cohesion and disruption in a global approach. [sent-78, score-1.782]

35 However, graph-based probabilistic topic segmentation has proven very accurate and versatile, relying on very minimal prior knowledge of the texts to segment. [sent-80, score-0.358]

36 We briefly recall the principle of probabilistic graph-based segmentation before detailing a Markovian extension to account for disruption. [sent-84, score-0.249]

37 3.1 Probabilistic graph-based segmentation The idea of the probabilistic graph-based segmentation algorithm is to find the segmentation into the most coherent segments, constrained by a prior distribution on segment length. [sent-86, score-1.279]

38 This problem is cast into finding the most probable segmentation of a sequence of t basic units (i. [sent-87, score-0.216]

39 , searching for segments that are homogeneous, the probability increasing when words are repeated and decreasing consistently when they are different. [sent-100, score-0.288]

40 The prior distribution on segment length is given by a simple model, $P[S_1^m] = n^{-m}$, where n is the total number of words, exhibiting a large value for a small number of segments and conversely. [sent-101, score-0.566]
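
A short worked consequence of this prior, consistent with the definitions above (the scaling factor α appears in the arc weights below):

```latex
% Log-prior of a segmentation S_1^m of n words into m segments:
%   ln P[S_1^m] = ln(n^{-m}) = -m ln(n),
% i.e., every additional segment costs ln(n) in log-probability.
% Scaled by alpha, this is exactly the -alpha ln(n) term charged per arc.
\[
  \ln P[S_1^m] \;=\; -\,m\,\ln(n)
  \qquad\Longrightarrow\qquad
  \text{cost per segment} \;=\; \alpha \ln(n).
\]
```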

41 , we have a node between each pair of utterances), the arc between nodes i and j representing a segment containing utterances u_{i+1} to u_j. [sent-106, score-0.359]

42 The corresponding arc weight is the generalized probability of the words within segment $S_{i \to j}$, according to $v(i,j) = \sum_{k=i+1}^{j} \ln(P[u_k \mid S_{i \to j}]) - \alpha \ln(n)$, where the probability is given as in Eq. [sent-107, score-0.232]

43 The factor α is introduced to control the trade-off between the segments length and the lexical cohesion. [sent-109, score-0.399]
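
To make the arc weight concrete, here is a minimal sketch assuming a Laplace-smoothed unigram cohesion model in the spirit of Utiyama and Isahara (2001); the smoothing choice, the `vocab_size` parameter, and all names are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import Counter

def arc_weight(utterances, i, j, alpha, n_total, vocab_size):
    """Weight of the arc for segment S_{i->j}, i.e. utterances u_{i+1}..u_j:
    sum_k ln P[u_k | S_{i->j}] - alpha * ln(n_total).
    Assumes (illustratively) a Laplace-smoothed unigram model estimated on
    the segment itself, so repeated words raise the score.
    """
    segment = utterances[i:j]            # 0-based slice covering u_{i+1}..u_j
    words = [w for utt in segment for w in utt]
    counts = Counter(words)
    n_seg = len(words)
    log_prob = 0.0
    for w in words:
        # P[w | S_{i->j}] with add-one smoothing over the segment's words
        log_prob += math.log((counts[w] + 1) / (n_seg + vocab_size))
    # prior term: each segment is charged alpha * ln(n_total)
    return log_prob - alpha * math.log(n_total)
```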

44 Eq. 2 derives from the assumption that each segment Si is independent from the others, which makes it impossible to consider disruption between two consecutive segments. [sent-112, score-0.769]
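
Eq. 2 itself is not reproduced in this extract; from the surrounding text it is presumably the independence factorization:

```latex
% Presumed Eq. 2: each segment is scored independently of the others.
\[
  P[W \mid S_1^m] \;=\; \prod_{i=1}^{m} P[W_i \mid S_i].
\]
```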

45 To do so, the weight of an arc corresponding to a segment Si should take into account how different this segment is from Si−1. [sent-113, score-0.428]

46 Eq. 2 is reformulated as $P[W \mid S_1^m] = P[W \mid S_1] \prod_{i=2}^{m} P[W \mid S_i, S_{i-1}]$, where the notion of disruption can be embedded in the term $P[W \mid S_i, S_{i-1}]$, which explicitly mentions both segments. [sent-116, score-0.538]

47 In this study, we define the score of a segment Si given Si−1 as $\ln P[W \mid S_i, S_{i-1}] = \ln P[W_i \mid S_i] - \lambda \Delta(W_i, W_{i-1})$ (4), where Wi designates the set of utterances in Si and the rightmost part reflects the disruption between the content of Si and of Si−1. [sent-119, score-0.91]

48 Eq. 4 clearly combines the measure of lexical cohesion with a measure of the disruption between consecutive segments: ∆(Wi, Wi−1) > 0 measures the coherence between Si and Si−1, the subtraction thus accounting for disruption by penalizing consecutive coherent segments. [sent-121, score-1.554]

49 The underlying assumption is that the bigger ∆(Wi, Wi−1), the weaker the disruption between the two segments. [sent-122, score-0.538]

50 Parameter λ controls the respective contributions of cohesion and disruption. [sent-123, score-0.287]

51 We initially adopted a probabilistic measure of disruption based on cross probabilities, i. [sent-124, score-0.538]
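
The sentence above is truncated in this extract, but one plausible instantiation of such a cross-probability measure is sketched below: Δ scores each segment's words under the other segment's unigram model, so it is positive and larger when the two segments cohere (hence weaker disruption). Every name and the exact form are assumptions for illustration only.

```python
import math
from collections import Counter

def unigram_logprob(words, vocab_size):
    """Laplace-smoothed unigram log-probability function for a bag of words."""
    counts = Counter(words)
    n = len(words)
    return lambda w: math.log((counts[w] + 1) / (n + vocab_size))

def delta(words_cur, words_prev, vocab_size):
    """Cross-probability coherence between adjacent segments: the geometric
    mean of each segment's word probabilities under the other's model.
    Always > 0, and larger when the segments share vocabulary."""
    lp_prev = unigram_logprob(words_prev, vocab_size)
    lp_cur = unigram_logprob(words_cur, vocab_size)
    avg = (sum(lp_prev(w) for w in words_cur) / max(len(words_cur), 1)
           + sum(lp_cur(w) for w in words_prev) / max(len(words_prev), 1)) / 2
    return math.exp(avg)

def segment_score(cohesion_cur, words_cur, words_prev, lam, vocab_size):
    """ln P[W_i|S_i] - lambda * Delta(W_i, W_{i-1}), as in Eq. 4."""
    return cohesion_cur - lam * delta(words_cur, words_prev, vocab_size)
```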

52 Given the quantities defined above, the algorithm boils down to finding the best scoring segmentation as given by $\hat{S} = \arg\max_{S_1^m} \sum_{i=1}^{m} \ln(P[W_i \mid S_i]) - \lambda \sum_{i=2}^{m} \Delta(W_i, W_{i-1}) - \alpha\, m \ln(n)$ (7). [sent-135, score-0.216]

53 Turning Eq. 7 into an efficient algorithm is not straightforward, since all possible combinations of adjacent segments need to be considered. [sent-138, score-0.328]

54 In other words, only paths ending at a given point with a last segment of the same length, yet with different predecessors, should be recombined so that disruption can be considered properly in subsequent steps of the algorithm. [sent-140, score-0.583]

55 We employ a strategy inspired by the decoding of segment models, or semi-hidden Markov models with explicit duration models (Ostendorf et al. [sent-142, score-0.228]

56 The set V is defined as $V = \{n_{ij} \mid 0 \le i, j \le N\}$, where $n_{ij}$ represents a boundary after utterance $u_i$ reached by a segment of $j$ utterances, and $N = t+1$. [sent-148, score-0.491]

57 For example, the node n42 is positioned after u4 and all incoming segments contain the two utterances u3 and u4. [sent-151, score-0.45]

58 Thus, an edge $e_{ip,jl}$ represents a segment of length $l$ containing utterances $u_{i+1}$ to $u_j$, denoted $S_{i \to j}$. [sent-153, score-0.335]

59 In Fig. 1, $e_{01,33}$ represents a segment of length 3 from $n_{01}$ to $n_{33}$, covering utterances $u_1$ to $u_3$. [sent-155, score-0.335]

60 To avoid explosion of the lattice, a maximum segment length Lmax is defined. [sent-156, score-0.241]

61 The property of this lattice, where, by construction, all edges out of a node have the same segment as a predecessor, makes it possible to weight each edge in the lattice according to Eq. [sent-158, score-0.298]

62 Consider a node nij for which all incoming edges encompass utterances ui−j to ui. [sent-160, score-0.229]

63 Knowing the outgoing arc (i.e., the edge length), one can therefore easily determine the lexical cohesion as defined by the generalized probability of Eq. [sent-163, score-0.353]

64 3, and the disruption with respect to the previous segment as defined by Eq. 4. [sent-164, score-0.766]

65 Algorithm 1 Maximum probability segmentation, Step 0. [sent-166, score-0.216]

66 Assign best score to each node:
for i = 0 → t do
  for j = Lmin → Lmax do
    for k = Lmin → Lmax do
      /* extend path ending after ui with a segment of length j
         with an arc of length k */
      q[i+k] = max( q[i+k],
                    q[i] + Cohesion(u_{i+1} → u_{i+k})
                         − λ ∆(u_{i−j+1} → u_i ; u_{i+1} → u_{i+k}) )
    end for
  end for
end for
Step 2. [sent-168, score-0.444]
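
Reading the recursion back into code, a compact sketch of the dynamic program follows; `cohesion` and `delta` stand in for the scoring functions of Eq. 3 and Eq. 4 (any concrete scoring functions with these signatures can be plugged in), and the state indexing (end position, last-segment length) mirrors the lattice nodes $n_{ij}$. This is a reconstruction from the pseudocode above, not the authors' code.

```python
import math

def best_segmentation(t, cohesion, delta, lam, alpha, n_words,
                      l_min=1, l_max=10):
    """Maximize sum_i ln P[W_i|S_i] - lam * sum_{i>=2} Delta(W_i, W_{i-1})
    - alpha * m * ln(n_words) over segmentations of t utterances (Eq. 7).
    cohesion(a, b): lexical cohesion of utterances a+1..b   (Eq. 3 style)
    delta(a, b, c): coherence of segments a+1..b and b+1..c (Eq. 4 style)
    States are lattice nodes (end position, last-segment length), so paths
    are only recombined when their last segments have the same length.
    """
    NEG = float("-inf")
    prior = alpha * math.log(n_words)   # per-segment length penalty
    q = {}      # (end, last_len) -> best path score
    back = {}   # backpointers for boundary recovery
    # first segments start at position 0 and incur no disruption term
    for k in range(l_min, min(l_max, t) + 1):
        q[(k, k)] = cohesion(0, k) - prior
        back[(k, k)] = None
    # extend a path ending after u_i (last segment length j) by an arc of length k
    for i in range(1, t):
        for j in range(l_min, l_max + 1):
            if (i, j) not in q:
                continue
            for k in range(l_min, min(l_max, t - i) + 1):
                score = (q[(i, j)] + cohesion(i, i + k) - prior
                         - lam * delta(i - j, i, i + k))
                if score > q.get((i + k, k), NEG):
                    q[(i + k, k)] = score
                    back[(i + k, k)] = (i, j)
    # best state covering all t utterances, then backtrack the boundaries
    end = max((s for s in q if s[0] == t), key=lambda s: q[s])
    boundaries, state = [], end
    while state is not None:
        boundaries.append(state[0])
        state = back[state]
    return sorted(boundaries), q[end]
```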

67 4.1 Corpora The artificial data set of Choi (2000) is widely used in the literature and enables comparison of a new segmentation method with existing ones. [sent-175, score-0.216]

68 Hence, Choi’s corpus is well suited to test the ability of our model to deal with variable segment lengths, the 3–11 condition being the most difficult. [sent-180, score-0.288]

69 The data set has a total of 1,136 segments with an average of 5 segments per document and an average of 28 sentences per segment. [sent-186, score-0.576]

70 The reference segmentation was established by associating a topic with each report, i. [sent-193, score-0.36]

71 On the one hand, segments are short, with a reduced number of repetitions, synonyms being frequently employed. [sent-197, score-0.288]

72 On the other hand, transcripts significantly differ from written texts: no punctuation marks or capital letters; no sentence structure, but rather utterances which are only loosely syntactically motivated; and transcription errors, which may further accentuate the lack of word repetitions. [sent-199, score-0.222]

73 4.2 Results Performance is measured by comparing hypothesized frontiers with reference ones. [sent-204, score-0.241]

74 Recall refers to the proportion of reference frontiers correctly detected; Precision corresponds to the ratio of hypothesized frontiers that belong to the reference segmentation; F1-measure combines recall and precision in a single value. [sent-207, score-0.413]

75 These evaluation measures were selected because recall and precision are not sensitive to variations of segment length contrary to the Pk measure (Beeferman et al. [sent-208, score-0.241]
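
A minimal sketch of these boundary-level measures with a tolerance window; the greedy one-to-one matching is an assumption, since the extract does not spell out the matching procedure:

```python
def boundary_prf(hyp, ref, tolerance=0):
    """Precision/recall/F1 of hypothesized vs. reference frontiers.
    hyp, ref: sorted lists of boundary positions (e.g., utterance indices).
    A hypothesis matches at most one unused reference within `tolerance`."""
    used, matched = set(), 0
    for h in hyp:
        for r in ref:
            if r not in used and abs(h - r) <= tolerance:
                used.add(r)
                matched += 1
                break
    precision = matched / len(hyp) if hyp else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. boundary_prf([3, 11, 20], [4, 10, 21], tolerance=1) -> (1.0, 1.0, 1.0)
```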

76 In Eq. 7, the parameter α, which controls the contribution of the prior model with respect to the lexical cohesion and disruption, allows for different trade-offs between precision and recall. [sent-212, score-0.385]

77 … when using lexical cohesion and disruption, and the corresponding 95 % confidence intervals for the F1-measure. [sent-224, score-0.353]

78 a specific variation in the size of the thematic segments forming the documents (e. [sent-227, score-0.386]

79 Figure 2 shows that, whatever the segment length, results globally improve as more importance is given to the disruption (variable λ). [sent-234, score-0.871]

80 Moreover, the variation in F1-measure diminishes when disruption is considered, thus indicating that the influence of the prior model diminishes. [sent-235, score-0.585]

81 In each graphic, the leftmost boxplot (UI) corresponds to results obtained using lexical cohesion alone (the baseline), while the λ value is the importance given to the lexical disruption in our approach. [sent-246, score-1.026]

82 Results are provided for the same range of variation of factor α, allowing a tolerance of 1 sentence between the hypothesized and reference frontiers. [sent-247, score-0.254]

83 A qualitative analysis of the segmentations obtained confirmed that employing disruption helps eliminate wrong hypotheses and shift hypothesized frontiers closer to the reference ones (explaining the higher gain at tolerance 0 for the 9–11 data set). [sent-249, score-1.004]

84 Our model is globally stable with respect to segment length, with relatively similar gains for the 3–11 and 6–8 data sets, in which the average number of words (distinct or repeated) is close. [sent-253, score-0.305]

85 Results are reported in Table 3 for the baseline and the method combining cohesion and disruption. [sent-262, score-0.287]

86 The medical textbook corpus was previously used for topic segmentation by Eisenstein and Barzilay (2008) with their algorithm BayesSeg. [sent-264, score-0.361]

87 When considering the F1-measure value for which the number of hypothesized frontiers is the closest to the number of reference boundaries, improvement is of resp. [sent-278, score-0.241]

88 These results show that our model combining lexical cohesion and disruption is also able to deal with topic segmentation of corpora from a homogeneous domain, with smooth topic changes and segments of regular size. [sent-284, score-1.65]

89 While BayesSeg has only one free parameter (as opposed to two in our case), the number of segments is assumed to be provided as prior knowledge. [sent-286, score-0.288]

90 TV news transcripts corpus Figure 3 provides results, in terms of F1-measure variation, for TV news transcripts obtained with the two ASR systems. [sent-292, score-0.33]

91 Results are confirmed in Table 4, which presents the gain in F1-measure of our model together with the 95 % confidence interval, where F1-measure values correspond to those of segmentations whose number of hypothesized frontiers is closest to the reference. [sent-295, score-0.328]

92 The first two lines show that the gain is smaller for IRENE transcripts, which have a higher WER and thus fewer words available to discriminate between segments belonging to different topics. [sent-296, score-0.448]

93 5 Conclusions We have proposed a method to combine lexical cohesion and disruption for topic segmentation. [sent-298, score-0.996]

94 Experimental results on various data sets with various characteristics demonstrate the impact of taking into account disruption in addition to lexical cohesion. [sent-299, score-0.604]

95 We observed gains both on data sets with segments of regular length and on data sets exhibiting segments of highly varying length within a document. [sent-300, score-0.737]

96 Table 4: Gain in F1-measure for the TV news corpus, automatic and manual transcripts, when using lexical cohesion and disruption, and the corresponding 95 % confidence intervals. [sent-325, score-0.518]

97 However the segmentation algorithm has proven to be robust on automatic transcripts with short segments and limited vocabulary reoccurrences. [sent-328, score-0.704]

98 Further work can be considered to improve segmentation of documents characterized by short segments and few word repetitions, such as using semantic relations or vectorization techniques to better exploit implicit relations not captured by lexical reoccurrence. [sent-330, score-0.57]

99 Enhancing lexical cohesion measure with confidence measures, semantic relations and language model interpolation for multimedia spoken content topic segmentation. [sent-379, score-0.494]

100 Un modèle segmental probabiliste combinant cohésion lexicale et rupture lexicale pour la segmentation thématique. [sent-441, score-0.284]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('disruption', 0.538), ('segments', 0.288), ('cohesion', 0.287), ('segmentation', 0.216), ('segment', 0.196), ('si', 0.175), ('choi', 0.136), ('frontiers', 0.133), ('utiyama', 0.129), ('transcripts', 0.128), ('isahara', 0.124), ('topic', 0.105), ('lmax', 0.101), ('lmin', 0.101), ('tolerance', 0.099), ('utterances', 0.094), ('ui', 0.089), ('rennes', 0.084), ('wji', 0.08), ('gravier', 0.073), ('wi', 0.07), ('lattice', 0.069), ('hypothesized', 0.069), ('bayesseg', 0.067), ('limsi', 0.067), ('nij', 0.067), ('lexical', 0.066), ('tv', 0.064), ('vi', 0.059), ('malioutov', 0.059), ('guillaume', 0.056), ('segmentations', 0.055), ('hearst', 0.055), ('coherent', 0.055), ('thematic', 0.051), ('claveau', 0.05), ('ferret', 0.05), ('graphic', 0.05), ('irene', 0.05), ('irisa', 0.05), ('lef', 0.05), ('misra', 0.05), ('repetitions', 0.05), ('grosz', 0.05), ('markovian', 0.047), ('variation', 0.047), ('length', 0.045), ('smooth', 0.045), ('globally', 0.045), ('pascale', 0.045), ('barzilay', 0.044), ('abrupt', 0.044), ('versatile', 0.044), ('reynar', 0.044), ('ln', 0.041), ('moens', 0.04), ('textbook', 0.04), ('inria', 0.04), ('adjacent', 0.04), ('confirmed', 0.039), ('reference', 0.039), ('markers', 0.038), ('exhibiting', 0.037), ('transcribed', 0.037), ('proven', 0.037), ('blocks', 0.037), ('news', 0.037), ('arc', 0.036), ('spoken', 0.036), ('wer', 0.035), ('broadcast', 0.035), ('incoming', 0.035), ('sole', 0.035), ('vocabulary', 0.035), ('consecutive', 0.035), ('varying', 0.034), ('eisenstein', 0.034), ('anca', 0.034), ('beeferman', 0.034), ('boxplot', 0.034), ('boxplots', 0.034), ('brigitte', 0.034), ('busser', 0.034), ('classically', 0.034), ('delakis', 0.034), ('guinaudeau', 0.034), ('hernandez', 0.034), ('huet', 0.034), ('lexicale', 0.034), ('niekrasz', 0.034), ('pevzner', 0.034), ('reoccurrences', 0.034), ('texttiling', 0.034), ('windowdiff', 0.034), ('node', 0.033), ('principle', 0.033), ('decoding', 0.032), ('gain', 0.032), ('discourse', 0.032), ('respect', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 124 emnlp-2013-Leveraging Lexical Cohesion and Disruption for Topic Segmentation

Author: Anca-Roxana Simon ; Guillaume Gravier ; Pascale Sebillot

Abstract: Topic segmentation classically relies on one of two criteria, either finding areas with coherent vocabulary use or detecting discontinuities. In this paper, we propose a segmentation criterion combining both lexical cohesion and disruption, enabling a trade-off between the two. We provide the mathematical formulation of the criterion and an efficient graph based decoding algorithm for topic segmentation. Experimental results on standard textual data sets and on a more challenging corpus of automatically transcribed broadcast news shows demonstrate the benefit of such a combination. Gains were observed in all conditions, with segments of either regular or varying length and abrupt or smooth topic shifts. Long segments benefit more than short segments. However the algorithm has proven robust on automatic transcripts with short segments and limited vocabulary reoccurrences.

2 0.25267991 125 emnlp-2013-Lexical Chain Based Cohesion Models for Document-Level Statistical Machine Translation

Author: Deyi Xiong ; Yang Ding ; Min Zhang ; Chew Lim Tan

Abstract: Lexical chains provide a representation of the lexical cohesion structure of a text. In this paper, we propose two lexical chain based cohesion models to incorporate lexical cohesion into document-level statistical machine translation: 1) a count cohesion model that rewards a hypothesis whenever a chain word occurs in the hypothesis, and 2) a probability cohesion model that further takes chain word translation probabilities into account. We compute lexical chains for each source document to be translated and generate target lexical chains based on the computed source chains via maximum entropy classifiers. We then use the generated target chains to provide constraints for word selection in document-level machine translation through the two proposed lexical chain based cohesion models. We verify the effectiveness of the two models using a hierarchical phrase-based translation system. Experiments on large-scale training data show that they can substantially improve translation quality in terms of BLEU and that the probability cohesion model outperforms previous models based on lexical cohesion devices.

3 0.12856838 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability

Author: Micha Elsner ; Sharon Goldwater ; Naomi Feldman ; Frank Wood

Abstract: We present a cognitive model of early lexical acquisition which jointly performs word segmentation and learns an explicit model of phonetic variation. We define the model as a Bayesian noisy channel; we sample segmentations and word forms simultaneously from the posterior, using beam sampling to control the size of the search space. Compared to a pipelined approach in which segmentation is performed first, our model is qualitatively more similar to human learners. On data with variable pronunciations, the pipelined approach learns to treat syllables or morphemes as words. In contrast, our joint model, like infant learners, tends to learn multiword collocations. We also conduct analyses of the phonetic variations that the model learns to accept and its patterns of word recognition errors, and relate these to developmental evidence.

4 0.10334747 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu

Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-the-art performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to maximum-likelihood method, to speed up the training process and make the learning algorithm easier to be implemented.

5 0.0832 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification

Author: Jun-Ping Ng ; Min-Yen Kan ; Ziheng Lin ; Wei Feng ; Bin Chen ; Jian Su ; Chew Lim Tan

Abstract: In this paper we classify the temporal relations between pairs of events on an article-wide basis. This is in contrast to much of the existing literature which focuses on just event pairs which are found within the same or adjacent sentences. To achieve this, we leverage on discourse analysis as we believe that it provides more useful semantic information than typical lexico-syntactic features. We propose the use of several discourse analysis frameworks, including 1) Rhetorical Structure Theory (RST), 2) PDTB-styled discourse relations, and 3) topical text segmentation. We explain how features derived from these frameworks can be effectively used with support vector machines (SVM) paired with convolution kernels. Experiments show that our proposal is effective in improving on the state-of-the-art significantly by as much as 16% in terms of F1, even if we only adopt less-than-perfect automatic discourse analyzers and parsers. Making use of more accurate discourse analysis can further boost gains to 35%.

6 0.07967253 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

7 0.072775312 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication

8 0.066642299 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

9 0.064339146 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models

10 0.063621625 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM

11 0.062446669 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training

12 0.059699744 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

13 0.058879457 151 emnlp-2013-Paraphrasing 4 Microblog Normalization

14 0.054681998 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction

15 0.051722247 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech

16 0.051082093 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English

17 0.048772909 19 emnlp-2013-Adaptor Grammars for Learning Non-Concatenative Morphology

18 0.046898711 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

19 0.046883501 58 emnlp-2013-Dependency Language Models for Sentence Completion

20 0.045336347 111 emnlp-2013-Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.173), (1, 0.004), (2, -0.048), (3, 0.012), (4, -0.088), (5, -0.043), (6, 0.07), (7, 0.123), (8, -0.121), (9, 0.04), (10, -0.017), (11, -0.057), (12, 0.018), (13, 0.074), (14, 0.012), (15, 0.136), (16, 0.078), (17, -0.047), (18, 0.061), (19, 0.177), (20, -0.041), (21, -0.067), (22, -0.151), (23, 0.03), (24, -0.044), (25, -0.238), (26, 0.024), (27, -0.127), (28, -0.084), (29, -0.037), (30, 0.251), (31, 0.075), (32, 0.225), (33, 0.003), (34, -0.266), (35, 0.137), (36, 0.109), (37, 0.199), (38, 0.047), (39, -0.006), (40, -0.039), (41, 0.006), (42, -0.019), (43, -0.07), (44, 0.029), (45, -0.059), (46, 0.053), (47, -0.08), (48, 0.119), (49, -0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95467484 124 emnlp-2013-Leveraging Lexical Cohesion and Disruption for Topic Segmentation

Author: Anca-Roxana Simon ; Guillaume Gravier ; Pascale Sebillot

Abstract: Topic segmentation classically relies on one of two criteria, either finding areas with coherent vocabulary use or detecting discontinuities. In this paper, we propose a segmentation criterion combining both lexical cohesion and disruption, enabling a trade-off between the two. We provide the mathematical formulation of the criterion and an efficient graph based decoding algorithm for topic segmentation. Experimental results on standard textual data sets and on a more challenging corpus of automatically transcribed broadcast news shows demonstrate the benefit of such a combination. Gains were observed in all conditions, with segments of either regular or varying length and abrupt or smooth topic shifts. Long segments benefit more than short segments. However the algorithm has proven robust on automatic transcripts with short segments and limited vocabulary reoccurrences.

2 0.86274821 125 emnlp-2013-Lexical Chain Based Cohesion Models for Document-Level Statistical Machine Translation

Author: Deyi Xiong ; Yang Ding ; Min Zhang ; Chew Lim Tan

Abstract: Lexical chains provide a representation of the lexical cohesion structure of a text. In this paper, we propose two lexical chain based cohesion models to incorporate lexical cohesion into document-level statistical machine translation: 1) a count cohesion model that rewards a hypothesis whenever a chain word occurs in the hypothesis, and 2) a probability cohesion model that further takes chain word translation probabilities into account. We compute lexical chains for each source document to be translated and generate target lexical chains based on the computed source chains via maximum entropy classifiers. We then use the generated target chains to provide constraints for word selection in document-level machine translation through the two proposed lexical chain based cohesion models. We verify the effectiveness of the two models using a hierarchical phrase-based translation system. Experiments on large-scale training data show that they can substantially improve translation quality in terms of BLEU and that the probability cohesion model outperforms previous models based on lexical cohesion devices.

3 0.44217333 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability

Author: Micha Elsner ; Sharon Goldwater ; Naomi Feldman ; Frank Wood

Abstract: We present a cognitive model of early lexical acquisition which jointly performs word segmentation and learns an explicit model of phonetic variation. We define the model as a Bayesian noisy channel; we sample segmentations and word forms simultaneously from the posterior, using beam sampling to control the size of the search space. Compared to a pipelined approach in which segmentation is performed first, our model is qualitatively more similar to human learners. On data with variable pronunciations, the pipelined approach learns to treat syllables or morphemes as words. In contrast, our joint model, like infant learners, tends to learn multiword collocations. We also conduct analyses of the phonetic variations that the model learns to accept and its patterns of word recognition errors, and relate these to developmental evidence.

4 0.32882139 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English

Author: Rebecca Dridan

Abstract: A precise syntacto-semantic analysis of English requires a large detailed lexicon with the possibility of treating multiple tokens as a single meaning-bearing unit, a word-with-spaces. However parsing with such a lexicon, as included in the English Resource Grammar, can be very slow. We show that we can apply supertagging techniques over an ambiguous token lattice without resorting to previously used heuristics, a process we call ubertagging. Our model achieves an ubertagging accuracy that can lead to a four to eight fold speed up while improving parser accuracy. 1 Introduction and Motivation Over the last decade or so, supertagging has become a standard method for increasing parser efficiency for heavily lexicalised grammar formalisms such as LTAG (Bangalore and Joshi, 1999), CCG (Clark and Curran, 2007) and HPSG (Matsuzaki et al., 2007). In each of these systems, fine-grained lexical categories, known as supertags, are used to prune the parser search space prior to full syntactic parsing, leading to faster parsing at the risk of removing necessary lexical items. Various methods are used to configure the degree of pruning in order to balance this trade-off. The English Resource Grammar (ERG; Flickinger (2000)) is a large hand-written HPSGbased grammar of English that produces finegrained syntacto-semantic analyses. Given the high level of lexical ambiguity in its lexicon, parsing with the ERG should therefore also benefit from supertagging, but while various attempts have shown possibilities (Blunsom, 2007; Dridan et al., 2008; Dridan, 2009), supertagging is still not a standard element in the ERG parsing pipeline. 1201 There are two main reasons for this. The first is that the ERG lexicon does not assign simple atomic categories to words, but instead builds complex structured signs from information about lemmas and lexical rules, and hence the shape and integration of the supertags is not straightforward. Bangalore and Joshi (2010) define a supertag as a primitive structure that contains all the information about a lexical item, including argument structure, and where the arguments should be found. Within the ERG, that information is not all contained in the lexicon, but comes from different places. The choice, therefore, of what information may be predicted prior to parsing and how it should be integrated into parsing is an open question. The second reason that supertagging is not standard with ERG processing is one that is rarely considered when processing English, namely ambiguous segmentation. In most mainstream English parsing, the segmentation of parser input into tokens that will become the leaves of the parse tree is considered a fixed, unambiguous process. While recent work (Dridan and Oepen, 2012) has shown that producing even these tokens is not a solved problem, the issue we focus on here is the ambiguous mapping from these tokens to meaning-bearing units that we might call words. Within the ERG lexicon are many multi-token lexical entries that are sometimes referred to as words-with-spaces. These multi-token entries are added to the lexicon where the grammarian finds that the semantics of a fixed expression is non-compositional and has the distributional properties of other single word entries. Some examples include an adverb-like all of a sudden, a prepositionlike for example and an adjective-like over and done with. 
Each of these entries create an segmentation ambiguity between treating the whole expression as a single unit, or allowing analyses comprising enProce Sdeiantgtlse o,f W thaesh 2i0n1gt3o nC,o UnSfeAre,n 1c8e- o2n1 E Omctpoibriecra 2l0 M13et.h ?oc d2s0 i1n3 N Aastusorcaila Ltiaon g fuoarg Ceo Pmrpoucetastsi on ga,l p Laignegsu 1is2t0ic1s–1212, tries triggered by the individual tokens. Previous supertagging research using the ERG has either used the gold standard tokenisation, hence making the task artificially easier, or else tagged the individual tokens, using various heuristics to apply multi-token tags to single tokens. Neither approach has been wholly satisfactory. In this work we avoid the heuristic approaches and learn a sequential classification model that can simultaneously determine the most likely segmentation and supertag sequences, a process we dub ubertagging. We also experiment with more fine- grained tag sets than have been previously used, and find that it is possible to achieve a level of ubertagging accuracy that can improve both parser speed and accuracy for a precise semantic parser. 2 Previous Work As stated above, supertagging has become a standard tool for particular parsing paradigms, but the definitions of a supertag, the methods used to learn them, and the way they are used in parsing varies across formalisms. The original supertags were 300 LTAG elementary trees, predicted using a fairly simple trigram tagger that provided a configurable number of tags per token, since the tagger was not accurate enough to make assigning a single tree viable parser input (Bangalore and Joshi, 1999). The C&C; CCG parser uses a more complex Maximum Entropy tagger to assign tags from a set of 425 CCG lexical categories (Clark and Curran, 2007). They also found it necessary to supply more than one tag per token, and hence assign all tags that have a probability within a percentage β of the most likely tag for each token. Their standard parser configuration uses a very restrictive β value initially, relax- ing it when no parse can be found. Matsuzaki et al. (2007) use a supertagger similar to the C&C; tagger alongside a CFG filter to improve the speed of their HPSG parser, feeding sequences of single tags to the parser until a parse is possible. As in the ERG, category and inflectional information are separate in the automatically-extracted ENJU grammar: their supertag set consists of 1361 tags constructed by combining lexical categories and lexical rules. Figure 1 shows examples of supertags from these three tag sets, all describing the simple transitive use of lends. 1202 S NP0↓ VP VNP1↓ lends (a) LTAG (S[dcl]\NP)/NP (b) CCG [NP.nom NP.acc]-singular3rd verb rule (c) ENJU HPSG Figure 1: Examples of supertags from LTAG, CCG and ENJU HPSG, for the word lends. The ALPINO system for parsing Dutch is the closest in spirit to our ERG parsing setup, since it also uses a hand-written HPSG-based grammar, including multi-token entries in its lexicon. Prins and van Noord (2003) use a trigram HMM tagger to calculate the likelihood of up to 2392 supertags, and discard those that are not within τ of the most likely tag. For their multi-token entries, they assign a constructed category to each token, so that instead of assigning prepos it ion to the expression met betrekking tot (“with respect to”), they use ( 1 prepo s it ion ) , ( 2 prepo s it i ) , on ( 3 prepos it ion ) . Without these constructed categories, they would only have 1365 supertags. 
Most previous supertagging attempts with the ERG have used the grammar’s lexical types, which describe the coarse-grained part of speech, and the subcategorisation of a word, but not the inflection. Hence both lends and lent have a possible lexical type v np*pp* t o le, which indicates a verb, with optional noun phrase and prepositional phrase arguments, where the preposition has the form to. , , , The number of lexical types changes as the grammar grows, and is currently just over 1000. Dridan (2009) and Fares (2013) experimented with other tag types, but both found lexical types to be the optimal balance between predictability and efficiency. Both used a multi-tagging approach dubbed selective tagging to integrate the supertags into the parser. This involved only applying the supertag filter when the tag probability is above a configurable threshold, and not pruning otherwise. For multi-token entries, both Blunsom (2007) and adve rb adve rb adve rb adve rb ditt o ditt o 1 adve rb 2 adve rb 3 adve rb all in all , , , Figure 2: Options for tagging parts of the multitoken adverb all in all separately. Dridan (2009) assigned separate tags to each token, with Blunsom (2007) assigning a special ditto tag all but the initial token of a multi-token entry, while Dridan (2009) just assigned the same tag to each token (leading to example in the expression for example receiving p np i le, a preposition-type cate- gory). Both of these solutions (demonstrated in Figure 2), as well as that of Prins and van Noord (2003), in some ways defeat one of the purposes of treating these expressions as fixed units. The grammarian, by assigning the same category to, for example, all of a sudden and suddenly, is declaring that these two expressions have the same distributional properties, the properties that a sequential classifier is trying to exploit. Separating the tokens loses that information, and introduces extra noise into the sequence model. Ytrestøl (2012) and Fares (2013) treat the multientry tokens as single expressions for tagging, but with no ambiguity. Ytrestøl (2012) manages this by using gold standard tokenisation, which is, as he states, the standard practice for statistical parsing, but is an artificially simplified setup. Fares (2013) is the only work we know about that has tried to predict the final segmentation that the ERG produces. We compare segmentation accuracy between our joint model and his stand-alone tokeniser in Section 6. Looking at other instances of joint segmentation and tagging leads to work in non-whitespace separated languages such as Chinese (Zhang and Clark, 2010) and Japanese (Kudo et al., 2004). While at a high level, this work is solving the same problem, the shape of the problems are quite different from a data point of view. Regular joint morphological analysis and segmentation has much greater ambiguity in terms of possible segmentations but, in most cases, less ambiguity in terms of labelling than our situation. This also holds for other lemmatisation and morphological research, such as Toutanova and Cherry (2009). While we drew inspiration from this 1203 a j - i le v nge Foreign r-t r dl r v prp ol r v pst ol r v - unacc le v np*l-epndpin*gto le increased w period pl av - s r -vp-po le as well. p vp i le w period pl as av - dg-v le r well. Figure 3: A selection from the 70 lexitems instantiated for Foreign lending increased as well. 
related area, as well as from the speech recognition field, differences in the relative frequency of observations and labels, as well as in segmentation ambiguity mean that conclusions found in these areas did not always hold true in our problem space. 3 The Parser The parsing environment we work with is the PET parser (Callmeier, 2000), a unification-based chart parser that has been engineered for efficiency with precision grammars, and incorporates subsumptionbased ambiguity packing (Oepen and Carroll, 2000) and statistical model driven selective unpacking (Zhang et al., 2007). Parsing in PET is divided in two stages. The first stage, lexical parsing, covers everything from tokenising the raw input string to populating the base of the parse chart with the appropriate lexical items, ready for the second syntactic parsing stage. In this work, we embed our ubertagging model between the two stages. By this point, the input has been segmented into what we call internal t okens, which broadly means — — splitting at whitespace and hyphens, and making ’s a separate token. These tokens are subject to a morphological analysis component which proposes possible inflectional and derivational rules based on word form, and then are used in retrieving possible lexical entries from the lexicon. The results of applying the appropriate lexical rules, plus affixation rules triggered by punctuation, to the lexical entries form a lexical item object, that for this work we dub a lexitem. Figure 3 shows some examples of lexitems instantiated after the lexical parsing stage when analysing Foreign lending increased as well. The pre-terminal labels on these subtrees are the lexical types that have previously been used as supertags for the ERG. For uninflected words, with no punctuation affixed, the lexical type is the only element in the lexitem, other than the word form (e.g. Foreign, as). In this example, we also see lexitems with inflectional rules (v prp ol r, v pst ol r), derivational rules (v nger-t r dl r) and punctuation affixation rules (w period pl r). These lexitems are put in to a chart, forming a lexical lattice, and it is over this lattice that we apply our ubertagging model, removing unlikely lexitems before they are seen by the syntactic parsing stage. 4 The Data The primary data sets we use in these experiments are from the 1.0 version of DeepBank (Flickinger et al., 2012), an HPSG annotation of the Wall Street Journal text used for the Penn Treebank (PTB; Marcus et al. (1993)). The current version has gold standard annotations for approximately 85% of the first 22 sections. We follow the recommendations of the DeepBank developers in using Sections 00–19 for training, Section 20 (WSJ20) for development and Section 21 (WSJ21) as test data. In addition, we use two further sources of training data: the training portions of the LinGO Redwoods Treebank (Oepen et al., 2004), a steadily growing collection of gold standard HPSG annotations in a variety of domains; and the Wall Street Journal section of the North American News Corpus (NANC), which has been parsed, but not manually annotated. This builds on observations by Prins and van Noord (2003), Dridan (2009) and Ytrestøl (2012) that even uncorrected parser output makes very good train- ing data for a supertagger, since the constraints in the parser lead to viable, if not entirely correct sequences. This allows us to use much larger training sets than would be possible if we required manually annotated data. 
In final testing, we also include two further data sets to observe how domain affects the contribution of the ubertagging. These are both from the test portion of the Redwoods Treebank: CatB, an essay about open-source software;1 and WeScience13, 1http : / / catb .org/ esr /writ ings / 1204 text from Wikipedia articles about Natural Language Processing from the WeScience project (Ytrestøl et al., 2009). Table 1 summarises the vital statistics of the data we use. With the focus on multi-token lexitems, it is instructive to see just how frequent they are. In terms of type frequency, almost 10% of the approximately 38500 lexical entries in the current ERG lexicon have more than one token in their canonical form.2 However, while this is a significant percentage of the lexicon, they do not account for the same percentage of tokens during parsing. An analysis of WSJ00:19 shows that approximately one third of the sentences had at least one multi-token lexitem in the unpruned lexical lattice, and in just under half of those, the gold standard analysis included a multi-word entry. That gives the multi-token lexitems the awkward property of being rare enough to be difficult for a statistical classifier to accurately detect (just under 1% of the leaves of gold parse trees contain multiple tokens), but too frequent to ignore. In addition, since these multi-token expressions have often been distinguished because they are non-compositional, failing to detect the multi-word usage can lead to a disproportionately adverse effect on the semantic analysis of the text. 5 Ubertagging Model Our ubertagging model is very similar to a standard trigram Hidden Markov Model (HMM), except that the states are not all of the same length. Our states are based on the lexitems in the lexical lattice produced by the lexical parsing stage of PET, and as such, can be partially overlapping. We formalise this be defining each state by its start position, end po- sition, and tag. This turns out to make our model equivalent to a type of Hidden semi-Markov Model called a segmental HMM in Murphy (2002). In a segmental HMM, the states are segments with a tag (t) and a length in frames (l). In our setup, the frames are the ERG internal tokens and the segments are the lexitems, which are the potential candidates cathedral-baz aar / by Eric S. Raymond 2While the parser has mechanisms for handling words unknown to the lexicon, with the current grammar these mechanisms will never propose a multi-token lexitem, and so only the multi-token entries explicitly in the lexicon will be recognised as such. Lexitems Data Set Source Use Gold? Trees All M-T WSJ00:19DeepBank 1.0 §00–19trainyes337836614516309 Redwoods RDeeedwpBooandks 1Tr.0ee §b0a0n–k1 train yes 39478 432873 6568 NANC LDC2008T15 train no 2185323 42376523 399936 WSJ20DeepBank 1.0 §20devyes172134063312 WSJ21DDeeeeppBBaannkk 11..00 §§2210testyes141427515253 WeScience13 RDeeedwpBooandks T1.r0ee §b2a1nk test yes 802 11844 153 CatB Redwoods Treebank test yes 608 11653 115 Table 1: Test, development and training data used in these experiments. The final two columns show the total number of lexitems used for training (All), as well as how many of those were multi-token lexitems (M-T). to become leaves of the parse tree. As indicated above, the majority of segments (over 99%) will be one frame long, but segments of up to four frames are regularly seen in the training data. 
A standard trigram HMM has a transition proba- bility matrix A, where the elements Aijk represent the probability P(k|ij), and an emission probability tmhaetr pirxo bBa bwilhitoys eP (elke|mije),nt asn Bjo r eemcoisrdsi othne p probabilities P(o|j). Given these matrices and a vector of obstieersve Pd( frames, vOen, th thee posterior probabilities or fo fe oacbhstate at frame v are calculated as:3 P(qv= qy|O) =αv(Pqy()Oβv)(qy) (1) where αv(qy) is the forward probability at frame v, given a current state qy (i.e. the probability of the observation up to v, given the state): = qy) Xαv(qxqy) αv (qy) ≡ P(O0:v |qv = αv(qxqy) (2) (3) Xqx = Bqyov Xαv−1(qwqx)Aqwqxqy (4) Xqw βv (qy) is the backwards probability at frame v, given a current state qy (the probability of the observation 3Since we will require per-state probabilities for integration the parser, we focus on the calculation of posterior probabilities, rather than determing the single best path. to 1205 from v, given the state): βv(qy) ≡ P(Ov+1:V|qv = Xβv(qxqy) = qy) (5) (6) Xqx βv(qxqy) = Xβv+1(qyqz)AqxqyqzBqzov+1 (7) Xqz and the probability of the full observation sequence is equal to the forward probability at the end of the sequence, or the backwards probability at the start of the sequence: P(O) = αV(hEi) = β0(hSi) (8) In implementation, our model varies only in what we consider the previous or next states. While v still indexes frames, qv now indicates a state that ends with frame v, and we look forwards and backwards to adjacent states, not frames, formally designated in terms of l, the length of the state. Hence, we modify equation (4): αv(qxqy) = BqyOv−l+1:v Xαv−l(qwqx)Aqwqxqy Xqw (9) where v−l indexes the frame before the current state starts, va−ndl nhedencxee we are summing over arelln st tsattaetes that lead directly to our current state. An equivalent modification to equation (7) gives: βv(qxqy) = X Xβv+l(qyqz)AqxqyqzBqzOv+1:v+l ∈XQqznXl(qz) (10) LTTyYpPeEv np-pp*to leExample#1T0a2g8s INFL v np-pp * t o le :v pas odl r FULL v np-pp*to le :v pas odlr :w period plr 3626 21866 wv pe praiso oddl prlr l v np-pp*to le recommended. Figure 4: Possible tag types and their tag set size, with examples derived from the lexitem on the right. where Qn is the set of states that start at v + 1(i.e., the states immediately following the current state), and l(qz) is the length of state qz. We construct the transition and emission probability matrices using relative frequencies directly observed from the training data, where we make the simplifying assumption that P(qk |qiqj) ≡ P(t(qk) |t(qi)t(qk)). Which is to say, w|qhile lex≡items w)|itt(hq the same tag, but different length will trigger distinct states with distinct emission probabilities, they will have the same transition probabilities, given the same proceeding tag.4 Even with our large training set, some tag trigrams are rare or unseen. To smooth these probabilities, we use deleted interpolation to calculate a weighted sum of the trigram, bigram and unigram probabilities, since it has been successfully used in effective PoS taggers like the TnT tagger (Brants, 2000). Future work will look more closely at the effects of different smoothing methods. 6 Intrinsic Ubertag Evaluation In order to develop and tune the ubertagging model, we first looked at segmentation and tagging performance in isolation over the development set. 
We looked at three tag granularities: lexical types (LTYPE) which have previously been shown to be the optimal granularity for supertagging with the ERG, inflected types (INFL) which encompass inflectional and derivational rules applied to the lexical type, and the full lexical item (FULL), which also includes affixation rules used for punctuation handling. Examples of each tag type are shown in Figure 4, along with the number of tags of each type seen in the training data. 4Since the multi-token lexical entries are defined because they have the same properties as the single token variants, there is no reason to think the length of a state should influence the tag sequence probability. 1206 Tag Type Segmentation F1 Sent. Tagging F1 Sent. FULL99.5594.4893.9242.13 INFL LTYPE 99.45 99.40 93.55 93.03 93.74 93.27 41.49 38.12 Table 2: Segmentation and tagging performance of the best path found for each model, measured per segment in terms of F1, and also as complete sentence accuracy. Single sequence results Table 2 shows the results when considering the best path through the lattice. In terms of segmentation, our sentence accuracy is comparable to that of the stand-alone segmentation performance reported by Fares et al. (2013) over similar data.5 In that work, the authors used a binary CRF classifier to label points between objects they called micro-tokens as either SPLIT or NOSPLIT. The CRF classifier used a less informed input (since it was external to the parser), but a much more complex model, to produce a best single path sentence accuracy of 94.06%. Encouragingly, this level of segmentation performance was shown in later work to produce a viable parser input (Fares, 2013). Switching to the tagging results, we see that the F1 numbers are quite good for tag sets of this size.6 The best tag accuracy seen for ERG LTYPE-style tags was 95.55 in Ytrestøl (2012), using gold standard segmentation on a different data set. Dridan (2009) experimented with a tag granularity similar to our INFL (letype+morph) and saw a tag accuracy of 91.51, but with much less training data. From other formalisms, Kummerfeld et al. (2010) 5Fares et al. (2013) used a different section of an earlier version of DeepBank, but with the same style of annotation. 6We need to measure F1 rather than tag accuracy here, since the number of tokens tagged will vary according to the segmentation. report a single tag accuracy of 95.91, with the smaller CCG supertag set. Despite the promising tag F1 numbers however, the sentence level accuracy still indicates a performance level unacceptable for parser input. Comparing between tag types, we see that, possibly surprisingly, the more fine-grained tags are more accurately assigned, although the differences are small. While instinctively a larger tag set should present a more difficult problem, we find that this is mitigated both by the sparse lexical lattice provided by the parser, and by the extra constraints provided by the more informative tags. Multi-tagging results The multi-tagging methods from previous supertagging work becomes more complicated when dealing with ambiguous tokenisation. Where, in other setups, one can compare tag probabilities for all tags for a particular token, that no longer holds directly when tokens can partially overlap. Since ultimately, the parser uses lexitems which encompass segmentation and tagging information, we decided to use a simple integration method, where we remove any lexitem which our model assigns a probability below a certain threshold (ρ). 
The effect of the different tag granularities is now mediated by the relationship between the states in the ubertagging lattice and the lexitems in the parser's lattice: for the FULL model this is a one-to-one relationship, but states from the models that use coarser-grained tags may affect multiple lexitems. To illustrate this point, Figure 5 shows some lexitems for the token "forecast,", where there are multiple possible analyses for the comma. A FULL tag of v_cp_le : v_pst_olr : w_comma_plr will select only lexitem (b), whereas an INFL tag v_cp_le : v_pst_olr will select (b) and (c), and the LTYPE tag v_cp_le picks out (a), (b) and (c). On the other hand, where there is no ambiguity in inflection or affixation, an LTYPE tag such as n_-_mc_le may relate to only a single lexitem ((f) in this case).

Figure 5: Some of the lexitems triggered by "forecast," in "Despite the gloomy forecast, profits were up." (Lexitems (a)-(c) build on v_cp_le, (d) and (e) on v_np_le, and (f) on n_-_mc_le, with differing inflectional and comma-affixation rules.)

Since we are using an absolute, rather than relative, threshold, the number needs to be tuned for each model [7], and comparisons between models can only be made based on the effects (accuracy or pruning power) of the threshold. Table 3 shows how a selection of threshold values affects the accuracy and pruning impact of our different disambiguation models, where accuracy is measured in terms of the percentage of gold lexitems retained. The pruning effect is given both as the percentage of lexitems retained after pruning and as the average number of lexitems per initial token [8].

[7] A tag set size of 1028 will lead to higher probabilities in general than a tag set size of 21866.
[8] The average number of lexitems per token for the unrestricted parser is 8.03, although the actual assignment is far from uniform, with up to 70 lexitems per token seen for the most ambiguous tokens.

Tag Type | ρ | Acc. | Lexitems Kept (%) | Ave. Lexitems
FULL | 0.00001 | 99.71 | 41.6 | 3.34
FULL | 0.0001 | 99.44 | 33.1 | 2.66
FULL | 0.001 | 98.92 | 25.5 | 2.05
FULL | 0.01 | 97.75 | 19.4 | 1.56
INFL | 0.0001 | 99.67 | 37.9 | 3.04
INFL | 0.001 | 99.25 | 29.0 | 2.33
INFL | 0.01 | 98.21 | 21.6 | 1.73
INFL | 0.02 | 97.68 | 19.7 | 1.58
LTYPE | 0.0002 | 99.75 | 66.3 | 5.33
LTYPE | 0.002 | 99.43 | 55.0 | 4.42
LTYPE | 0.02 | 98.41 | 43.5 | 3.50
LTYPE | 0.05 | 97.54 | 39.4 | 3.17

Table 3: Accuracy and ambiguity after pruning lexitems in WSJ20, at a selection of thresholds ρ for each model. Accuracy is measured as the percentage of gold lexitems remaining after pruning, while ambiguity is presented both as a percentage of lexitems kept and as the average number of lexitems per initial token still remaining.

Comparison between the different models can be made more easily by examining Figure 6. Here we see clearly that the LTYPE model provides much less pruning at any given level of lexitem accuracy, while the performance of the other two models is almost indistinguishable.

Figure 6: Accuracy over gold lexitems versus average lexitems per initial token over the development set, for each of the different ubertagging models.

Analysis The current state-of-the-art POS tagging accuracy (using the 45 tags of the PTB) is approximately 97.5%. The most restrictive ρ value we report for each model was selected to demonstrate that level of accuracy, which we can see would lead to pruning over 80% of lexitems when using FULL tags, for an average of 1.56 tags per token. While this level of accuracy has been sufficient for statistical treebank parsing, previous work (Dridan, 2009) has shown that tag accuracy cannot directly predict parser performance, since errors of different types can have very different effects. This is hard to quantify without parsing, but we made a qualitative analysis of the lexitems that were being incorrectly pruned. For all models, the most difficult lexitems to get correct were proper nouns, particularly those that are also used as common nouns (e.g. Bank, Airline, Report). While capitalisation provides a clue here, it is not always deterministic, particularly since the treebank incorporates detailed decisions regarding the distinction between a name and a capitalised common noun that require real-world knowledge and are not necessarily always consistent. Almost two thirds of the errors made by the FULL and INFL models are related to these decisions, but only about 40% for the LTYPE model. The other errors are predominantly over noun- and verb-type lexitems, as the open classes, the only difference between models being that the FULL model seems marginally better at classifying verbs.
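To make the tag-to-lexitem selection around Figure 5 concrete, the sketch below models a lexitem's identity as an (ltype, inflection, affixation) triple; the triple encoding, the helper names, and the example data are assumptions of this illustration, not the grammar's internal API.

```python
from collections import namedtuple

# Illustrative encoding of a lexitem identity: lexical type, inflectional
# rules, and affixation rules; an empty string means "no rule applied".
Lexitem = namedtuple("Lexitem", "ltype infl affix")

def licensed(lexitems, tag, granularity):
    """Return the lexitems that a surviving tag keeps alive in the lattice:
    a coarser tag licenses every lexitem whose finer description extends it."""
    def view(item):
        if granularity == "LTYPE":
            return (item.ltype,)
        if granularity == "INFL":
            return (item.ltype, item.infl)
        return (item.ltype, item.infl, item.affix)   # FULL
    return [item for item in lexitems if view(item) == tag]

# Hypothetical analyses in the spirit of Figure 5: the LTYPE tag keeps both
# v_cp_le readings, while the FULL tag pins down exactly one of them.
items = [Lexitem("v_cp_le", "v_pst_olr", "w_comma_plr"),
         Lexitem("v_cp_le", "v_pst_olr", "w_comma-nf_plr"),
         Lexitem("n_-_mc_le", "", "w_comma_plr")]
assert len(licensed(items, ("v_cp_le",), "LTYPE")) == 2
assert len(licensed(items, ("v_cp_le", "v_pst_olr", "w_comma_plr"), "FULL")) == 1
```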
The next section describes the end-to-end setup and results when parsing the development set.

7 Parsing

With encouraging ubertagging results, we now take the next step and evaluate the effect on end-to-end parsing. Apart from the issue of different error types having unpredictable effects, two other factors make the isolated ubertagging results only an approximate indication of parsing performance.

The first confounding factor is the statistical parse disambiguation model. To show the effect of ubertagging in a realistic configuration, we evaluate only the first analysis that the parser returns. That means that when the unrestricted parser does not rank the gold analysis first, errors made by our model may not be visible, because we would never see the gold analysis in any case. On the other hand, it is possible to improve parser accuracy by pruning incorrect lexitems that appeared in a top-ranked, non-gold analysis.

The second new factor that parser integration brings to the picture is the effect of resource limitations. For reasons of tractability, PET is run with per-sentence time and memory limits. For treebank creation these limits are quite high (up to four minutes), but for these experiments we set the timeout to a more practical 60 seconds and the memory limit to 2048Mb. Without lexical pruning, this leads to approximately 3% of sentences not receiving an analysis. Since the main aim of ubertagging is to increase efficiency, we would expect to regain at least some of these unanalysed sentences, even when a lexitem needed for the gold analysis has been removed.
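A per-sentence wall-clock limit of this kind can be imposed from a driver script; a minimal sketch, where the parser command line is a placeholder rather than PET's actual interface:

```python
import subprocess

def parse_one(sentence, cmd, limit=60):
    """Run an external parser on one sentence under a wall-clock limit.

    `cmd` is a placeholder argument list for the parser binary; a sentence
    that exceeds the limit simply receives no analysis, as with the roughly
    3% of unparsed baseline sentences. A memory cap could be enforced
    analogously, e.g. via resource.setrlimit in a preexec_fn on Unix.
    """
    try:
        result = subprocess.run(cmd, input=sentence, text=True,
                                capture_output=True, timeout=limit)
        return result.stdout or None
    except subprocess.TimeoutExpired:
        return None
```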
Tag Type | ρ | Lexitem F1 | Bracket F1 | Time
No Pruning | - | 94.06 | 88.58 | 6.58
FULL | 0.00001 | 95.62 | 89.84 | 3.99
FULL | 0.0001 | 95.95 | 90.09 | 2.69
FULL | 0.001 | 95.81 | 89.88 | 1.34
FULL | 0.01 | 94.19 | 88.29 | 0.64
INFL | 0.0001 | 96.10 | 90.37 | 3.45
INFL | 0.001 | 96.14 | 90.33 | 1.78
INFL | 0.01 | 95.07 | 89.27 | 0.84
INFL | 0.02 | 94.32 | 88.49 | 0.64
LTYPE | 0.0002 | 95.37 | 89.63 | 4.73
LTYPE | 0.002 | 96.03 | 90.20 | 2.89
LTYPE | 0.02 | 95.04 | 89.04 | 1.23
LTYPE | 0.05 | 93.36 | 87.26 | 0.88

Table 4: Lexitem and bracket F1 over WSJ20, with average per-sentence parsing time in seconds.

Table 4 shows the parsing results at the same threshold values used in Table 3. Accuracy is calculated in terms of F1 both over lexitems and over PARSEVAL-style labelled brackets (Black et al., 1991), while efficiency is represented by average parsing time per sentence. We can see here that an ubertagging F1 below 98 (cf. Table 3) leads to a drop in parser accuracy, but that an ubertagging performance of between 98 and 99 can improve parser F1 while also achieving speed increases of up to 8-fold. From the table we confirm that, contrary to earlier pipeline supertagging configurations, tags of a finer granularity than LTYPE can deliver better performance, both in terms of accuracy and of efficiency.

Again, comparing graphically in Figure 7 gives a clearer picture. Here we have graphed labelled bracket F1 against parsing time for the full range of threshold values explored, with the unpruned parsing results indicated by a cross. From this figure, we see that the INFL model, despite being marginally less accurate when measured in isolation, leads to slightly more accurate parse results than the FULL model at all levels of efficiency. Looking at the same graph for different samples of the development set (not shown) shows some variance in which threshold value gives the best F1, but the relative differences and the basic curve shape remain the same.

Figure 7: Labelled bracket F1 versus parsing time per sentence over the development set, for each of the different ubertagging models. The cross indicates unpruned performance, while the circle pinpoints the configuration we chose for the final test runs.

From these different views, using the guideline of maximum efficiency without harming accuracy, we selected our final configuration: the INFL model with a threshold value of 0.001 (marked with a circle in Figure 7). On the development set, this configuration leads to a 1.75 point improvement in F1 in 27% of the parsing time.

8 Final Results

Table 5 shows the results obtained when parsing with the configuration selected on the development set, over our three test sets. The first, WSJ21, is from the same domain as the development set. Here we see that the effect over the WSJ21 set fairly closely mirrors that on the development set, with an F1 increase of 1.81 in 29% of the parsing time. The Wikipedia domain of our WeScience13 test set, while very different from the newswire domain of the development set, could still be considered in-domain for the parsing and ubertagging models, since there is Wikipedia data in the training sets. With an average sentence length of 15.18 (compared to 18.86 in WSJ21), the baseline parsing time is faster than for WSJ21, and the speed-up is not quite as large, but still welcome, at 36% of the baseline time. The increase in accuracy is likewise smaller (due to fewer issues with resource exhaustion in the baseline), but as our primary goal is not to harm accuracy, the results are pleasing.

Data Set | Baseline F1 | Baseline Time | Pruned F1 | Pruned Time
WSJ21 | 88.12 | 6.06 | 89.93 | 1.77
WeScience13 | 86.25 | 4.09 | 87.14 | 1.48
CatB | 86.31 | 5.00 | 87.11 | 1.78

Table 5: Parsing accuracy in terms of labelled bracket F1 and average time per sentence when parsing the test sets, without pruning, and then with lexical pruning using the INFL model with a threshold of 0.001.
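For reference, the PARSEVAL-style labelled bracket F1 used in Tables 4 and 5 can be computed per sentence as in the sketch below, with brackets encoded as (label, start, end) tuples; this is a generic reimplementation, not the evaluation code used in the paper.

```python
from collections import Counter

def bracket_f1(gold, pred):
    """Labelled bracket F1 for one sentence; brackets are (label, start, end)
    tuples, counted with multiplicity via a multiset intersection."""
    g, p = Counter(gold), Counter(pred)
    match = sum((g & p).values())
    if not match:
        return 0.0
    precision = match / sum(p.values())
    recall = match / sum(g.values())
    return 2 * precision * recall / (precision + recall)
```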
The CatB test set is the standard out-of-domain test for the parser, and is also out of domain for the ubertagging model. The average sentence length is not much below that of WSJ21, at 18.61, but the baseline parsing speed is still noticeably faster, which appears to reflect greater structural ambiguity in the newswire text. We still achieve a reduction in parsing time to 35% of the baseline, again with a small improvement in accuracy. The across-the-board performance improvement on all our test sets suggests that, while tuning the pruning threshold could help, it is a robust parameter that can provide good performance across a variety of domains. This means that we finally have a robust supertagging setup for use with the ERG that does not require heuristic shortcuts and can be reliably applied in general parsing.

9 Conclusions and Outlook

In this work we have demonstrated a lexical disambiguation process, dubbed ubertagging, that can assign fine-grained supertags over an ambiguous token lattice, a setup previously unexplored for English. It is the first completely integrated supertagging setup for use with the English Resource Grammar; it avoids the previously necessary heuristics for dealing with ambiguous tokenisation and can be robustly configured for improved performance without loss of accuracy. Indeed, by learning a joint segmentation and supertagging model, we have been able to achieve usefully high tagging accuracies for very fine-grained tags, which leads to potential parser speed-ups of between 4- and 8-fold. Analysis of the tagging errors still being made has suggested some possibly avoidable inconsistencies in the grammar and treebank, which have been fed back to the developers, hopefully leading to even better results in the future.

In future work, we will investigate more advanced smoothing methods to try to boost the ubertagging accuracy. We also intend to explore more fully the domain adaptation potential of the lexical model that has been seen in other parsing setups (see Rimell and Clark (2008) for example), as well as examine the limits on the effects of more training data. Finally, we would like to explore just how much the statistical properties of our data dictate the success of the model, by looking at related problems such as the morphological analysis of unsegmented languages like Japanese.

Acknowledgements

I am grateful to my colleagues from the Oslo Language Technology Group and the DELPH-IN consortium for many discussions on the issues involved in this work, and particularly to Stephan Oepen, who inspired the initial lattice tagging idea. Thanks also to three anonymous reviewers for their very constructive feedback, which improved the final version. Large-scale experimentation and engineering is made possible through access to the TITAN high-performance computing facilities at the University of Oslo, and I am grateful to the Scientific Computing staff at UiO, as well as to the Norwegian Metacenter for Computational Science and the Norwegian tax payer.

References

Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: an approach to almost parsing. Computational Linguistics, 25(2):237–265.

Srinivas Bangalore and Aravind Joshi, editors. 2010. Supertagging: Using Complex Lexical Descriptions in Natural Language Processing. The MIT Press, Cambridge, US.

Ezra Black, Steve Abney, Dan Flickinger, Claudia Gdaniec, Ralph Grishman, Phil Harrison, Don Hindle, Robert Ingria, Fred Jelinek, Judith Klavans, Mark Liberman, Mitch Marcus, S. Roukos, Beatrice Santorini, and Tomek Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars.
In Proceedings of the Workshop on Speech and Natural Language, pages 306–311, Pacific Grove, USA.

Philip Blunsom. 2007. Structured Classification for Multilingual Natural Language Processing. Ph.D. thesis, Department of Computer Science and Software Engineering, University of Melbourne.

Thorsten Brants. 2000. TnT: a statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP-2000), pages 224–231, Seattle, USA.

Ulrich Callmeier. 2000. PET. A platform for experimentation with efficient HPSG processing techniques. Natural Language Engineering, 6(1):99–108, March.

Stephen Clark and James R. Curran. 2007. Formalism-independent parser evaluation with CCG and DepBank. In Proceedings of the 45th Meeting of the Association for Computational Linguistics, pages 248–255, Prague, Czech Republic.

Rebecca Dridan and Stephan Oepen. 2012. Tokenization. Returning to a long solved problem. A survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Meeting of the Association for Computational Linguistics, pages 378–382, Jeju, Republic of Korea, July.

Rebecca Dridan, Valia Kordoni, and Jeremy Nicholson. 2008. Enhancing performance of lexicalised grammars. Pages 613–621.

Rebecca Dridan. 2009. Using lexical statistics to improve HPSG parsing. Ph.D. thesis, Department of Computational Linguistics, Saarland University.

Murhaf Fares, Stephan Oepen, and Yi Zhang. 2013. Machine learning for high-quality tokenization. Replicating variable tokenization schemes. In Computational Linguistics and Intelligent Text Processing, pages 231–244. Springer.

Murhaf Fares. 2013. ERG tokenization and lexical categorization: a sequence labeling approach. Master's thesis, Department of Informatics, University of Oslo.

Dan Flickinger, Yi Zhang, and Valia Kordoni. 2012. DeepBank. A dynamically annotated treebank of the Wall Street Journal. In Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, pages 85–96, Lisbon, Portugal. Edições Colibri.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.

Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 230–237.

Jonathan K. Kummerfeld, Jessika Roesner, Tim Dawborn, James Haggerty, James R. Curran, and Stephen Clark. 2010. Faster parsing by supertagger adaptation. In Proceedings of the 48th Meeting of the Association for Computational Linguistics, pages 345–355, Uppsala, Sweden.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. 2007. Efficient HPSG parsing with supertagging and CFG-filtering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 1671–1676, Hyderabad, India.

Kevin P. Murphy. 2002. Hidden semi-Markov models (HSMMs).

Stephan Oepen and John Carroll. 2000. Ambiguity packing in constraint-based parsing. Practical results. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, pages 162–169, Seattle, WA, USA.

Stephan Oepen, Daniel Flickinger, Kristina Toutanova, and Christopher D. Manning. 2004. LinGO Redwoods.
A rich and dynamic treebank for HPSG. Research on Language and Computation, 2(4):575–596.

Robbert Prins and Gertjan van Noord. 2003. Reinforcing parser preferences through tagging. Traitement Automatique des Langues, 44(3):121–139.

Laura Rimell and Stephen Clark. 2008. Adapting a lexicalized-grammar parser to contrasting domains. Pages 475–484.

Kristina Toutanova and Colin Cherry. 2009. A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the 47th Meeting of the Association for Computational Linguistics, pages 486–494, Singapore.

Gisle Ytrestøl. 2012. Transition-based Parsing for Large-scale Head-Driven Phrase Structure Grammars. Ph.D. thesis, Department of Informatics, University of Oslo.

Gisle Ytrestøl, Stephan Oepen, and Dan Flickinger. 2009. Extracting and annotating Wikipedia sub-domains. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories, pages 185–197, Groningen, The Netherlands.

Yue Zhang and Stephen Clark. 2010. A fast decoder for joint word segmentation and POS-tagging using a single discriminative model. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 843–852, Cambridge, MA, USA.

Yi Zhang, Stephan Oepen, and John Carroll. 2007. Efficiency in unification-based n-best parsing. In Proceedings of the 10th International Conference on Parsing Technologies, pages 48–59, Prague, Czech Republic, July.

5 0.30312148 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models

Author: Hiroshi Noji ; Daichi Mochihashi ; Yusuke Miyao

Abstract: One of the language phenomena that n-gram language models fail to capture is the topic information of a given situation. We advance the previous study of the Bayesian topic language model by Wallach (2006) in two directions: one, investigating new priors to alleviate the sparseness problem caused by dividing all n-grams into exclusive topics, and two, developing a novel Gibbs sampler that enables moving multiple n-grams across different documents to another topic. Our blocked sampler can efficiently search for higher probability space even with higher order n-grams. In terms of modeling assumptions, we found it effective to assign a topic to only some parts of a document.

6 0.29773954 182 emnlp-2013-The Topology of Semantic Knowledge

7 0.27756757 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

8 0.27616411 123 emnlp-2013-Learning to Rank Lexical Substitutions

9 0.27379805 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication

10 0.25397155 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training

11 0.24597253 138 emnlp-2013-Naive Bayes Word Sense Induction

12 0.23490216 115 emnlp-2013-Joint Learning of Phonetic Units and Word Pronunciations for ASR

13 0.23319088 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

14 0.2327328 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

15 0.23056145 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM

16 0.2249576 26 emnlp-2013-Assembling the Kazakh Language Corpus

17 0.22349037 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

18 0.22298218 191 emnlp-2013-Understanding and Quantifying Creativity in Lexical Composition

19 0.22092493 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification

20 0.21599248 58 emnlp-2013-Dependency Language Models for Sentence Completion


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.04), (9, 0.026), (18, 0.029), (22, 0.056), (30, 0.076), (47, 0.021), (50, 0.031), (51, 0.133), (61, 0.363), (66, 0.027), (71, 0.026), (75, 0.023), (77, 0.028), (96, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.77813447 115 emnlp-2013-Joint Learning of Phonetic Units and Word Pronunciations for ASR

Author: Chia-ying Lee ; Yu Zhang ; James Glass

Abstract: The creation of a pronunciation lexicon remains the most inefficient process in developing an Automatic Speech Recognizer (ASR). In this paper, we propose an unsupervised alternative requiring no language-specific knowledge to the conventional manual approach for creating pronunciation dictionaries. We present a hierarchical Bayesian model, which jointly discovers the phonetic inventory and the Letter-to-Sound (L2S) mapping rules in a language using only transcribed data. When tested on a corpus of spontaneous queries, the results demonstrate the superiority of the proposed joint learning scheme over its sequential counterpart, in which the latent phonetic inventory and L2S mappings are learned separately. Furthermore, the recognizers built with the automatically induced lexicon consistently outperform grapheme-based recognizers and even approach the performance of recognition systems trained using conventional supervised procedures.

same-paper 2 0.70879024 124 emnlp-2013-Leveraging Lexical Cohesion and Disruption for Topic Segmentation

Author: Anca-Roxana Simon ; Guillaume Gravier ; Pascale Sebillot

Abstract: Topic segmentation classically relies on one of two criteria, either finding areas with coherent vocabulary use or detecting discontinuities. In this paper, we propose a segmentation criterion combining both lexical cohesion and disruption, enabling a trade-off between the two. We provide the mathematical formulation of the criterion and an efficient graph based decoding algorithm for topic segmentation. Experimental results on standard textual data sets and on a more challenging corpus of automatically transcribed broadcast news shows demonstrate the benefit of such a combination. Gains were observed in all conditions, with segments of either regular or varying length and abrupt or smooth topic shifts. Long segments benefit more than short segments. However the algorithm has proven robust on automatic transcripts with short segments and limited vocabulary reoccurrences.

3 0.59563929 150 emnlp-2013-Pair Language Models for Deriving Alternative Pronunciations and Spellings from Pronunciation Dictionaries

Author: Russell Beckley ; Brian Roark

Abstract: Pronunciation dictionaries provide a readily available parallel corpus for learning to transduce between character strings and phoneme strings or vice versa. Translation models can be used to derive character-level paraphrases on either side of this transduction, allowing for the automatic derivation of alternative pronunciations or spellings. We examine finite-state and SMT-based methods for these related tasks, and demonstrate that the tasks have different characteristics: finding alternative spellings is harder than finding alternative pronunciations, and benefits from round-trip algorithms when the other does not. We also show that we can increase accuracy by modeling syllable stress.

4 0.4366087 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability

Author: Micha Elsner ; Sharon Goldwater ; Naomi Feldman ; Frank Wood

Abstract: We present a cognitive model of early lexical acquisition which jointly performs word segmentation and learns an explicit model of phonetic variation. We define the model as a Bayesian noisy channel; we sample segmentations and word forms simultaneously from the posterior, using beam sampling to control the size of the search space. Compared to a pipelined approach in which segmentation is performed first, our model is qualitatively more similar to human learners. On data with variable pronunciations, the pipelined approach learns to treat syllables or morphemes as words. In contrast, our joint model, like infant learners, tends to learn multiword collocations. We also conduct analyses of the phonetic variations that the model learns to accept and its patterns of word recognition errors, and relate these to developmental evidence.

5 0.43427229 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu

Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-the-art performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to the maximum-likelihood method, to speed up the training process and make the learning algorithm easier to implement.

6 0.42803857 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

7 0.42788416 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation

8 0.42782798 140 emnlp-2013-Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

9 0.42712015 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

10 0.42698094 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

11 0.42589068 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation

12 0.425432 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction

13 0.4244554 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

14 0.42402214 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

15 0.42309165 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

16 0.42281577 143 emnlp-2013-Open Domain Targeted Sentiment

17 0.42270663 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

18 0.42254683 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

19 0.42246395 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training

20 0.42197496 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction