
246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure


Source: pdf

Author: Minwoo Jeong ; Ivan Titov

Abstract: Documents often have inherently parallel structure: they may consist of a text and commentaries, or an abstract and a body, or parts presenting alternative views on the same problem. Revealing relations between the parts by jointly segmenting and predicting links between the segments would help to visualize such documents and construct friendlier user interfaces. To address this problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. We apply our method to the “English as a second language” podcast dataset where each episode is composed of two parallel parts: a story and an explanatory lecture. The predicted topical links uncover hidden relations between the stories and the lectures. In this domain, our method achieves competitive results, rivaling those of a previously proposed supervised technique.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Documents often have inherently parallel structure: they may consist of a text and commentaries, or an abstract and a body, or parts presenting alternative views on the same problem. [sent-4, score-0.283]

2 Revealing relations between the parts by jointly segmenting and predicting links between the segments would help to visualize such documents and construct friendlier user interfaces. [sent-5, score-0.333]

3 To address this problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. [sent-6, score-0.478]

4 We apply our method to the “English as a second language” podcast dataset where each episode is composed of two parallel parts: a story and an explanatory lecture. [sent-7, score-0.717]

5 The predicted topical links uncover hidden re- lations between the stories and the lectures. [sent-8, score-0.184]

6 1 Introduction: Many documents consist of parts exhibiting a high degree of parallelism, e.g. [sent-10, score-0.17]

7 abstract and body of academic publications, summaries and detailed news stories, etc. [sent-12, score-0.044]

8 Web 2.0 technologies: many texts on the web are now accompanied by comments and discussions. [sent-14, score-0.032]

9 Segmentation of these parallel parts into coherent fragments and discovery of hidden relations between them would facilitate the development of better user interfaces and improve the performance of summarization and information retrieval systems. [sent-15, score-0.447]

10 Discourse segmentation of documents composed of parallel parts is a novel and challenging problem, as previous research has mostly focused on the linear segmentation of isolated texts. [sent-16, score-0.891]

11 The most straightforward approach would be to use a pipeline strategy, where an existing segmentation algorithm finds discourse boundaries of each part independently, and then the segments are aligned. [sent-19, score-0.783]

12 Or, conversely, a sentence-alignment stage can be followed by a segmentation stage. [sent-20, score-0.261]

13 However, as we will see in our experiments, these strategies may result in poor segmentation and alignment quality. [sent-21, score-0.359]

14 To address this problem, we construct a nonparametric Bayesian model for joint segmentation and alignment of parallel parts. [sent-22, score-0.544]

15 In comparison with the discussed pipeline approaches, our method has two important advantages: (1) it leverages the lexical cohesion phenomenon (Halliday and Hasan, 1976) in modeling the parallel parts of documents, and (2) it ensures that the effective number of segments can grow adaptively. [sent-23, score-0.767]

16 Lexical cohesion is the idea that topically coherent segments display compact lexical distributions (Hearst, 1994; Utiyama and Isahara, 2001; Eisenstein and Barzilay, 2008). [sent-24, score-0.412]

17 We hypothesize that not only isolated fragments but also each group of linked fragments displays a compact and consistent lexical distribution, and our generative model leverages this inter-part cohesion assumption. [sent-25, score-0.507]
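
As a toy illustration of this inter-part cohesion assumption (not code from the paper; the smoothing constant, vocabulary size, and word lists below are arbitrary), a segment's lexical compactness can be scored by the log-likelihood of its words under its own smoothed unigram distribution; pooling two topically linked fragments then scores higher than pooling unrelated ones:

```python
from collections import Counter
import math

def segment_loglik(sentences, gamma=0.1, vocab_size=10000):
    """Log-likelihood of a segment's words under its own Dirichlet-smoothed
    unigram distribution; compact (lexically repetitive) segments score higher."""
    counts = Counter(w for sent in sentences for w in sent)
    total = sum(counts.values())
    return sum(c * math.log((c + gamma) / (total + gamma * vocab_size))
               for c in counts.values())

# Pooling two topically linked fragments (one from the story, one from the
# lecture) gives a more compact joint distribution than pooling two
# unrelated fragments, so the linked pair scores higher.
story_frag   = ["check", "register", "money", "record"]
lecture_frag = ["check", "register", "record", "money", "spending"]
other_frag   = ["weather", "podcast", "episode", "today", "rain"]

linked   = segment_loglik([story_frag, lecture_frag])
unlinked = segment_loglik([story_frag, other_frag])
print(linked > unlinked)  # True: shared vocabulary -> higher likelihood
```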

18 In this paper, we consider the dataset of the “English as a second language” (ESL) podcast, where each episode consists of two parallel parts: a story (an example monologue or dialogue) and an explanatory lecture discussing the meaning and usage of English expressions appearing in the story. [sent-26, score-0.65]

19 Fig. 1 presents an example episode, consisting of two parallel parts, and their hidden topical relations. [sent-28, score-0.21]

20 From the figure we may conclude that there is a tendency of word repetition between each pair of aligned segments, illustrating our hypothesis of compactness of their joint distribution. [sent-29, score-0.094]

21–26 [Figure 1 residue: OCR-garbled excerpts of the parallel story and lecture transcripts, interleaved with the proceedings footer ("© 2010 Association for Computational Linguistics, ACL 2010 Conference Short Papers, pages 151–155"). The only cleanly recoverable fragments are "This podcast is all about business vocabulary related to accounting." and "... means having enough money to run your business - to pay your bills."]

27 The task is to divide the lecture transcript into discourse units and to align each unit to the related segment of the story. [sent-65, score-0.631]

28 Predicting these structures for the ESL podcast could be the first step in the development of an e-learning system and a podcast search engine for ESL learners. [sent-66, score-0.506]

29 2 Related Work: Discourse segmentation has been an active area of research (Hearst, 1994; Utiyama and Isahara, 2001; Galley et al., 2003). [sent-67, score-0.261]

30 Our work extends the Bayesian segmentation model (Eisenstein and Barzilay, 2008) for isolated texts to the problem of segmenting parallel parts of documents. [sent-69, score-0.606]

31 The task of aligning each sentence of an abstract to one or more sentences of the body has been studied in the context of summarization (Marcu, 1999; Jing, 2002; Daum e´ and Marcu, 2004). [sent-70, score-0.121]

32 Our work is different in that we do not try to extract the most relevant sentence but rather aim to find coherent fragments with maximally overlapping lexical distributions. [sent-71, score-0.158]

33 The work of Daumé and Marcu (2006) is also related, but it focuses on sentence extraction rather than on joint segmentation. [sent-74, score-0.043]

34 We are aware of only one previous work on joint segmentation and alignment of multiple texts (Sun et al., 2007). [sent-75, score-0.423]

35 However, their approach is based on similarity functions rather than on modeling lexical cohesion in a generative framework. [sent-76, score-0.144]

36 Our application, the analysis of the ESL podcast, was previously studied in (Noh et al., 2010). [sent-77, score-0.033]

37 They proposed a supervised method which is driven by pairwise classification decisions. [sent-79, score-0.044]

38 The main drawback of their approach is that it neglects the discourse structure and the lexical cohesion phenomenon. [sent-80, score-0.281]

39 3 Model: In this section we describe our model for discourse segmentation of documents with inherently parallel structure. [sent-81, score-0.706]

40 We start by clarifying our assumptions about their structure. [sent-82, score-0.054]

41 We assume that a document x consists of K parallel parts, that is, x = {x^(k)}_{k=1:K}, and each part of the document consists of segments, x^(k) = {s_i^(k)}_{i=1:I}. Note that the effective number of fragments I is unknown. [sent-83, score-0.266]

42 Each segment s_i^(k) is either specific to its part (drawn from a part-specific language model φ_i^(k)) or corresponds to the entire document (drawn from a document-level language model φ_i^(doc)). [sent-84, score-0.2]

43 For example, the first and the second sentences of the lecture transcript in Fig. 1 [sent-85, score-0.418]

44 The document-level language models define topical links between segments in different parts of the document, whereas the part-specific language models define the linear segmentation of the remaining unaligned text. [sent-87, score-0.717]

45 Each document-level language model corresponds to a set of aligned segments, with at most one segment per part. [sent-88, score-0.288]

46 Similarly, each part-specific language model corresponds to a single segment of the corresponding part. [sent-89, score-0.237]

47 Note that all the documents are modeled independently, as we aim not to discover collection-level topics [sent-90, score-0.079]

48 (as, e.g., in LDA (Blei et al., 2003)), but to perform joint discourse segmentation and alignment. [sent-94, score-0.043]

49 Unlike (Eisenstein and Barzilay, 2008), we cannot make the assumption that the number of segments is known a priori, as the effective number of part-specific segments can vary significantly from document to document, depending on its size and structure. [sent-95, score-0.586]

50 To tackle this problem, we use Dirichlet processes (DP) (Ferguson, 1973) to define priors on the number of segments. [sent-96, score-0.066]

51 We incorporate them in our model in a similar way as is done for Latent Dirichlet Allocation (LDA) by Yu et al. [sent-97, score-0.037]

52 Unlike standard LDA, the topic proportions are chosen not from a Dirichlet prior but from the marginal distribution GEM(α) defined by the stick-breaking construction (Sethuraman, 1994), where α is the concentration parameter of the underlying DP distribution. [sent-99, score-0.149]

53 The formal definition of our model is as follows. Draw the document-level topic proportions β^(doc) ∼ GEM(α^(doc)). [sent-101, score-0.186]

54 Choose the document-level language models φ_i^(doc) ∼ Dir(γ^(doc)) for i ∈ {1, 2, ...}. [sent-102, score-0.037]

55 For each part k ∈ {1, ..., K}: draw the part-specific topic proportions β^(k) ∼ GEM(α^(k)) and choose the part-specific language models φ_i^(k) ∼ Dir(γ^(k)). [sent-106, score-0.149]

56 For each sentence n of part k, draw its type t_n^(k). If t_n^(k) = Doc: draw the topic z_n^(k) ∼ β^(doc) and generate the words x_n^(k) ∼ Mult(φ^(doc)_{z_n^(k)}); otherwise: draw the topic z_n^(k) ∼ β^(k) and generate the words x_n^(k) ∼ Mult(φ^(k)_{z_n^(k)}). [sent-118, score-0.308]
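
This generative story is compact enough to simulate directly. The sketch below is an illustration only, not the authors' code: it truncates GEM(α) at a fixed number of sticks and renormalizes, replaces the prior over the sentence type t_n^(k) (not specified in this extract) with a fair coin, and ignores the contiguity constraint enforced at inference time; all sizes and hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 50, 10  # vocabulary size; truncation level for GEM stick-breaking

def gem(alpha, T):
    """Truncated stick-breaking construction of GEM(alpha) (Sethuraman, 1994)."""
    sticks = rng.beta(1.0, alpha, size=T)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    weights = sticks * remaining
    return weights / weights.sum()  # renormalize the truncated weights

def sample_document(K=2, n_sents=5, sent_len=8,
                    alpha_doc=1.0, alpha_part=1.0, gamma=0.5):
    beta_doc = gem(alpha_doc, T)                    # document-level proportions
    phi_doc = rng.dirichlet([gamma] * V, size=T)    # document-level language models
    doc = []
    for _ in range(K):                              # each parallel part
        beta_k = gem(alpha_part, T)                 # part-specific proportions
        phi_k = rng.dirichlet([gamma] * V, size=T)  # part-specific language models
        part = []
        for _ in range(n_sents):
            if rng.random() < 0.5:                  # type t = Doc (coin is an assumption)
                z = rng.choice(T, p=beta_doc)
                words = rng.choice(V, size=sent_len, p=phi_doc[z])
            else:                                   # type t = Part
                z = rng.choice(T, p=beta_k)
                words = rng.choice(V, size=sent_len, p=phi_k[z])
            part.append(words.tolist())
        doc.append(part)
    return doc

print(len(sample_document()))  # 2 parallel parts, each a list of sentences
```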

57 The priors γ^(doc), γ^(k), α^(doc) and α^(k) can be estimated at learning time using non-informative hyperpriors (as we do in our experiments), or set manually to indicate preferences of segmentation granularity. [sent-119, score-0.399]

58 At inference time, we enforce each latent topic z_n^(k) to be assigned to a contiguous span of text, assuming that coherent topics do not recur across the document (Halliday and Hasan, 1976). [sent-120, score-0.224]

59 In fact, this constraint can be integrated into the model definition, but it would significantly complicate the model description. [sent-122, score-0.074]

60 At each iteration of the MH algorithm, a new potential alignment-segmentation pair (z′, t′) is drawn from a proposal distribution Q(z′, t′ | z, t), where (z, t) is the current state. Figure 2 illustrates the three types of moves: (a) shift, (b) split and (c) merge. [sent-124, score-0.098]

61 In order to implement the MH algorithm for our model, we need to define the set of potential moves (i.e. [sent-129, score-0.191]

62 admissible changes from (z, t) to (z′, t′)), and the proposal distribution Q over these moves. [sent-131, score-0.053]

63 If the actual number of segments is known and only a linear discourse structure is acceptable, then a single move, shift of the segment border (Fig. 2(a)), would be sufficient. [sent-132, score-0.65]

64 In our case, however, a more complex set of moves is required. [sent-134, score-0.191]

65 We make two assumptions which are motivated by the problem considered in Section 5: we assume that (1) we are given the number of document-level segments and also that (2) the aligned segments appear in the same order in each part of the document. [sent-135, score-0.641]

66 With these assumptions in mind, we introduce two additional moves (Fig. 2(b) and (c)): [sent-136, score-0.245]

67 Split move: select a segment, and split it at one of the spanned sentences; if the segment was a document-level segment then one of the two fragments becomes the same document-level segment. [sent-137, score-0.575]

68 Merge move: select a pair of adjacent segments where at least one of the segments is part-specific, and merge them; if one of them was a document-level segment then the new segment has the same document-level topic. [sent-138, score-0.701]

69 All the moves are selected with uniform probability, and the distance c for the shift move is drawn from a proposal distribution proportional to c^(−1), up to a maximum distance c_max. [sent-139, score-0.382]
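
To make the move set concrete, here is a minimal sketch of the three proposals, assuming a representation the extract does not specify: each part's state is a list of segments with (start, end) sentence spans, a topic id, and a level flag ("doc" for document-level, "part" for part-specific). The c-dependent shift proposal is simplified, `fresh_topic` is a hypothetical argument for a newly allocated topic id, and the MH acceptance step is left out, so this is illustrative rather than the authors' implementation.

```python
import random

# One segment: {"start": s, "end": e, "topic": id, "level": "doc" | "part"}.
# A part's segmentation is a list of such dicts, contiguous and ordered.

def shift(segs, c_max=5):
    """Move the border between two adjacent segments by a distance c drawn
    with probability decaying in c (a simplification of the paper's proposal)."""
    if len(segs) < 2:
        return segs
    i = random.randrange(len(segs) - 1)
    c = random.choices(range(1, c_max + 1),
                       weights=[1.0 / d for d in range(1, c_max + 1)])[0]
    border = segs[i]["end"] + c * random.choice([-1, 1])
    if segs[i]["start"] < border < segs[i + 1]["end"]:  # both stay nonempty
        segs[i]["end"] = segs[i + 1]["start"] = border
    return segs

def split(segs, fresh_topic):
    """Split a segment at one of its spanned sentences; a document-level
    segment passes its topic to exactly one of the two fragments."""
    cand = [i for i, s in enumerate(segs) if s["end"] - s["start"] > 1]
    if not cand:
        return segs
    i = random.choice(cand)
    s = segs[i]
    point = random.randrange(s["start"] + 1, s["end"])
    left = {"start": s["start"], "end": point, "topic": s["topic"], "level": s["level"]}
    right = {"start": point, "end": s["end"], "topic": fresh_topic, "level": "part"}
    if s["level"] == "doc" and random.random() < 0.5:  # doc-level topic goes right
        left["topic"], left["level"] = fresh_topic, "part"
        right["topic"], right["level"] = s["topic"], "doc"
    return segs[:i] + [left, right] + segs[i + 1:]

def merge(segs):
    """Merge two adjacent segments, at least one of which is part-specific;
    a document-level member donates its topic (and level) to the result."""
    cand = [i for i in range(len(segs) - 1)
            if "part" in (segs[i]["level"], segs[i + 1]["level"])]
    if not cand:
        return segs
    i = random.choice(cand)
    a, b = segs[i], segs[i + 1]
    keep = a if a["level"] == "doc" else b
    merged = {"start": a["start"], "end": b["end"],
              "topic": keep["topic"], "level": keep["level"]}
    return segs[:i] + [merged] + segs[i + 1:]
```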

70 Although the above two assumptions are not crucial (a simple modification to the set of moves would support both introduction and deletion of document-level fragments), this modification was not necessary for our experiments. [sent-141, score-0.054]

71 5.1 Dataset and setup. Dataset: We apply our model to the ESL podcast dataset (Noh et al., 2010). [sent-143, score-0.336]

72 The dataset consists of 200 episodes, with an average of 17 sentences per story and 80 sentences per lecture transcript. [sent-144, score-0.322]

73 The gold standard alignments assign each fragment of the story to a segment of the lecture transcript. [sent-145, score-0.555]

74 We can induce segmentations at different levels of granularity on both the story and the lecture side. [sent-146, score-0.322]

75 However, given that the segmentation of the story was obtained by an automatic sentence splitter, there is no reason to attempt to reproduce this segmentation. [sent-147, score-0.397]

76 Instead, we follow Noh et al. (2010) and restrict our model to alignment structures which agree with the given segmentation of the story. [sent-149, score-0.396]

77 Evaluation metrics: To measure the quality of segmentation of the lecture transcript, we use two standard metrics, Pk (Beeferman et al., 1999) [sent-151, score-0.488]

78 and WindowDiff (WD) (Pevzner and Hearst, 2002); both metrics disregard the alignment links (i.e., they evaluate only the placement of segment boundaries). [sent-152, score-0.18]

79 Consequently, we also use the macro-averaged F1 score on pairs of aligned span, which measures both the segmentation and alignment quality. [sent-155, score-0.41]
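
For readers unfamiliar with the two boundary metrics, the sketch below follows their standard definitions (Beeferman et al., 1999; Pevzner and Hearst, 2002). Representing segmentations as lists of segment lengths, and setting the probe window k to half the mean reference segment length, are conventions assumed here, not details taken from this extract.

```python
def to_labels(seg_lengths):
    """Segment lengths -> per-sentence segment ids, e.g. [2, 3] -> [0,0,1,1,1]."""
    labels = []
    for seg_id, length in enumerate(seg_lengths):
        labels += [seg_id] * length
    return labels

def pk(ref_lengths, hyp_lengths, k=None):
    """Pk: fraction of probe pairs (i, i+k) on which reference and hypothesis
    disagree about whether the two positions lie in the same segment."""
    ref, hyp = to_labels(ref_lengths), to_labels(hyp_lengths)  # equal totals assumed
    n = len(ref)
    if k is None:  # conventional choice: half the mean reference segment length
        k = max(1, round(n / len(ref_lengths) / 2))
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(n - k))
    return errors / (n - k)

def windowdiff(ref_lengths, hyp_lengths, k=None):
    """WindowDiff: fraction of windows in which the number of boundaries
    differs between reference and hypothesis."""
    ref, hyp = to_labels(ref_lengths), to_labels(hyp_lengths)
    n = len(ref)
    if k is None:
        k = max(1, round(n / len(ref_lengths) / 2))
    def nb(labels, i):  # number of boundaries inside the window starting at i
        return sum(labels[j] != labels[j + 1] for j in range(i, i + k))
    errors = sum(nb(ref, i) != nb(hyp, i) for i in range(n - k))
    return errors / (n - k)

print(pk([5, 5], [4, 6]), windowdiff([5, 5], [4, 6]))  # small, nonzero penalties
```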

80 For the first baseline, we consider pairwise sentence alignment (SentAlign) based on unigram and bigram overlap. [sent-157, score-0.142]

81 The second baseline is a pipeline approach (Pipeline), where we first segment the lecture transcript with BayesSeg (Eisenstein and Barzilay, 2008) and then use pairwise alignment to link the resulting segments to the segments of the story. [sent-158, score-1.119]
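
The extract does not give the exact scoring used by SentAlign or by the Pipeline's alignment step, so the sketch below fills that in with an assumed Dice-style overlap of unigram and bigram sets; it is a stand-in for the baseline's idea, not the authors' implementation.

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(a, b):
    """Dice-style set overlap of two token lists' unigrams and bigrams."""
    score = 0.0
    for n in (1, 2):
        x, y = ngrams(a, n), ngrams(b, n)
        if x and y:
            score += 2 * len(x & y) / (len(x) + len(y))
    return score

def align(story_segments, lecture_segments):
    """Greedy pairwise alignment: each story segment is linked to the
    lecture segment with maximal unigram+bigram overlap."""
    return [max(range(len(lecture_segments)),
                key=lambda j: overlap(s, lecture_segments[j]))
            for s in story_segments]

story = [["i", "have", "a", "business", "job"],
         ["i", "keep", "the", "books"]]
lecture = [["a", "business", "job", "means", "working", "in", "business"],
           ["to", "keep", "the", "books", "means", "keeping", "records"]]
print(align(story, lecture))  # [0, 1]
```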

82 Our model: We evaluate our joint model of segmentation and alignment both with and without the split/merge moves. [sent-159, score-0.476]

83 For the model without these moves, we set the desired number of segments in the lecture to be equal to the actual number of segments in the story, I. [sent-160, score-0.895]

84 In this setting, the moves can only adjust positions of the segment borders. [sent-161, score-0.433]

85 For the model with the split/merge moves, we start with the same number of segments I, but it can be increased or decreased during inference. [sent-162, score-0.305]

86 Table 1: Results on the ESL podcast dataset. [sent-166, score-0.253]

87 We also perform L-BFGS optimization to automatically adjust the non-informative hyperpriors after every 1,000 iterations of sampling. [sent-170, score-0.114]

88 ‘Uniform’ denotes the minimal baseline which uniformly draws a random set of I spans for each lecture, and then aligns them to the segments of the story preserving the linear order. [sent-173, score-0.404]
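
A minimal sketch of this Uniform baseline, under the assumption (not stated in the extract) that the I random spans tile the lecture transcript and are matched to the I story segments in order:

```python
import random

def uniform_baseline(n_lecture_sents, I, seed=None):
    """Draw I-1 random boundaries, yielding I ordered spans of the lecture;
    span i is then aligned to story segment i (order-preserving)."""
    rnd = random.Random(seed)
    cuts = sorted(rnd.sample(range(1, n_lecture_sents), I - 1))
    bounds = [0] + cuts + [n_lecture_sents]
    # Each triple is (span start, span end, aligned story segment index).
    return [(bounds[i], bounds[i + 1], i) for i in range(I)]

print(uniform_baseline(80, 17, seed=1))  # 17 spans over an 80-sentence lecture
```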

89 We also consider two variants of the pipeline approach: segmenting the lecture into I and 2I + 1 segments, respectively. [sent-174, score-0.353]

90 The significant improvement over the pipeline results demonstrates the benefits of joint modeling for the considered problem. [sent-178, score-0.16]

91 Moreover, additional benefits are obtained by using the DP priors and the split/merge moves (the last line in Table 1). [sent-179, score-0.257]

92 Finally, our model significantly outperforms the previously proposed supervised model (Noh et al., 2010). [sent-180, score-0.074]

93 This observation confirms that lexical cohesion modeling is crucial for successful discourse analysis. [sent-184, score-0.281]

94 6 Conclusions: We studied the problem of joint discourse segmentation and alignment of documents with inherently parallel structure, and achieved favorable results on the ESL podcast dataset, outperforming the cascaded baselines. [sent-185, score-1.142]

95 Accurate prediction of these hidden relations would open interesting possibilities. (Footnote 3: The use of the DP priors and the split/merge moves on the first stage of the pipeline did not result in any improvement in accuracy.) [sent-186, score-0.423]

96 One example is an application which, given a user-selected fragment of the abstract, produces a summary from the aligned segment of the document body. [sent-188, score-0.334]

97 The automatic construction of large-scale corpora for summarization research. [sent-243, score-0.044]

98 Script-description pair extraction from text documents of English as second language podcast. [sent-247, score-0.079]

99 Topic segmentation with shared topic detection and alignment of multiple documents. [sent-260, score-0.444]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('segments', 0.268), ('segmentation', 0.261), ('podcast', 0.253), ('noh', 0.216), ('segment', 0.2), ('moves', 0.191), ('doc', 0.19), ('lecture', 0.186), ('eisenstein', 0.18), ('esl', 0.18), ('cohesion', 0.144), ('discourse', 0.137), ('story', 0.136), ('episode', 0.126), ('barzilay', 0.119), ('pipeline', 0.117), ('fragments', 0.111), ('transcript', 0.108), ('parallel', 0.105), ('alignment', 0.098), ('parts', 0.091), ('inherently', 0.087), ('nk', 0.087), ('dirichlet', 0.087), ('halliday', 0.087), ('topic', 0.085), ('documents', 0.079), ('dp', 0.076), ('utiyama', 0.073), ('mh', 0.073), ('beeferman', 0.072), ('friendlier', 0.072), ('hyperpriors', 0.072), ('hyungjong', 0.072), ('idoc', 0.072), ('malioutov', 0.072), ('pevzner', 0.072), ('daum', 0.071), ('hearst', 0.07), ('draw', 0.069), ('zn', 0.068), ('priors', 0.066), ('proportions', 0.064), ('gem', 0.063), ('uo', 0.063), ('isolated', 0.062), ('bayesian', 0.06), ('mult', 0.058), ('mmci', 0.058), ('topical', 0.056), ('minwoo', 0.054), ('assumptions', 0.054), ('proposal', 0.053), ('explanatory', 0.051), ('hasan', 0.051), ('hongyan', 0.051), ('isahara', 0.051), ('yu', 0.051), ('aligned', 0.051), ('document', 0.05), ('segmenting', 0.05), ('jeong', 0.049), ('hidden', 0.049), ('move', 0.048), ('coherent', 0.047), ('yo', 0.047), ('dataset', 0.046), ('shift', 0.045), ('drawn', 0.045), ('summarization', 0.044), ('dir', 0.044), ('body', 0.044), ('pairwise', 0.044), ('joint', 0.043), ('adjust', 0.042), ('leverages', 0.042), ('latent', 0.042), ('marcu', 0.042), ('metrics', 0.041), ('links', 0.041), ('blei', 0.039), ('lda', 0.039), ('stories', 0.038), ('marti', 0.038), ('model', 0.037), ('business', 0.036), ('galley', 0.034), ('hal', 0.034), ('regina', 0.034), ('fragment', 0.033), ('studied', 0.033), ('merge', 0.033), ('texts', 0.032), ('ruqaiya', 0.032), ('nae', 0.032), ('ofonly', 0.032), ('afn', 0.032), ('igor', 0.032), ('sjo', 0.032), ('documentlevel', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

Author: Minwoo Jeong ; Ivan Titov

Abstract: Documents often have inherently parallel structure: they may consist of a text and commentaries, or an abstract and a body, or parts presenting alternative views on the same problem. Revealing relations between the parts by jointly segmenting and predicting links between the segments would help to visualize such documents and construct friendlier user interfaces. To address this problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. We apply our method to the “English as a second language” podcast dataset where each episode is composed of two parallel parts: a story and an explanatory lecture. The predicted topical links uncover hidden relations between the stories and the lectures. In this domain, our method achieves competitive results, rivaling those of a previously proposed supervised technique.

2 0.1435917 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

Author: Coskun Mermer ; Ahmet Afsin Akin

Abstract: We tackle the previously unaddressed problem of unsupervised determination of the optimal morphological segmentation for statistical machine translation (SMT) and propose a segmentation metric that takes into account both sides of the SMT training corpus. We formulate the objective function as the posterior probability of the training corpus according to a generative segmentation-translation model. We describe how the IBM Model-1 translation likelihood can be computed incrementally between adjacent segmentation states for efficient computation. Submerging the proposed segmentation method in a SMT task from morphologically-rich Turkish to English does not exhibit the expected improvement in translation BLEU scores and confirms the robustness of phrase-based SMT to translation unit combinatorics. A positive outcome of this work is the described modification to the sequential search algorithm of Morfessor (Creutz and Lagus, 2007) that enables arbitrary-fold parallelization of the computation, which unexpectedly improves the translation performance as measured by BLEU.

3 0.11481187 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

Author: Ivan Titov ; Mikhail Kozhevnikov

Abstract: We argue that groups of unannotated texts with overlapping and non-contradictory semantics represent a valuable source of information for learning semantic representations. A simple and efficient inference method recursively induces joint semantic representations for each group and discovers correspondence between lexical entries and latent semantic concepts. We consider the generative semantics-text correspondence model (Liang et al., 2009) and demonstrate that exploiting the noncontradiction relation between texts leads to substantial improvements over natural baselines on a problem of analyzing human-written weather forecasts.

4 0.10954595 86 acl-2010-Discourse Structure: Theory, Practice and Use

Author: Bonnie Webber ; Markus Egg ; Valia Kordoni

Abstract: unknown-abstract

5 0.1022153 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

Author: Vamshi Ambati ; Stephan Vogel ; Jaime Carbonell

Abstract: Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling we propose multiple query strategies for the alignment link selection task. Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semisupervised word aligner.

6 0.09891171 158 acl-2010-Latent Variable Models of Selectional Preference

7 0.095779963 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

8 0.094949335 151 acl-2010-Intelligent Selection of Language Model Training Data

9 0.092637718 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization

10 0.089939684 106 acl-2010-Event-Based Hyperspace Analogue to Language for Query Expansion

11 0.088154912 46 acl-2010-Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression

12 0.087099545 14 acl-2010-A Risk Minimization Framework for Extractive Speech Summarization

13 0.086407788 133 acl-2010-Hierarchical Search for Word Alignment

14 0.085211553 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries

15 0.083916083 38 acl-2010-Automatic Evaluation of Linguistic Quality in Multi-Document Summarization

16 0.08236447 79 acl-2010-Cross-Lingual Latent Topic Extraction

17 0.081470281 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

18 0.079058506 10 acl-2010-A Latent Dirichlet Allocation Method for Selectional Preferences

19 0.078047007 262 acl-2010-Word Alignment with Synonym Regularization

20 0.077939659 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.203), (1, -0.023), (2, -0.062), (3, -0.03), (4, -0.001), (5, 0.026), (6, -0.049), (7, -0.086), (8, 0.094), (9, -0.122), (10, -0.045), (11, -0.081), (12, 0.064), (13, 0.043), (14, -0.019), (15, -0.035), (16, -0.034), (17, 0.0), (18, 0.065), (19, -0.112), (20, 0.029), (21, -0.141), (22, -0.043), (23, 0.064), (24, -0.043), (25, -0.039), (26, 0.041), (27, 0.069), (28, -0.063), (29, 0.036), (30, 0.055), (31, 0.044), (32, -0.11), (33, -0.088), (34, -0.003), (35, -0.019), (36, 0.078), (37, 0.002), (38, -0.138), (39, 0.17), (40, -0.106), (41, 0.009), (42, -0.145), (43, -0.159), (44, 0.075), (45, 0.053), (46, -0.036), (47, 0.066), (48, 0.03), (49, -0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95712548 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

Author: Minwoo Jeong ; Ivan Titov

Abstract: Documents often have inherently parallel structure: they may consist of a text and commentaries, or an abstract and a body, or parts presenting alternative views on the same problem. Revealing relations between the parts by jointly segmenting and predicting links between the segments would help to visualize such documents and construct friendlier user interfaces. To address this problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. We apply our method to the “English as a second language” podcast dataset where each episode is composed of two parallel parts: a story and an explanatory lecture. The predicted topical links uncover hidden relations between the stories and the lectures. In this domain, our method achieves competitive results, rivaling those of a previously proposed supervised technique.

2 0.61993909 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

Author: Ivan Titov ; Mikhail Kozhevnikov

Abstract: We argue that groups of unannotated texts with overlapping and non-contradictory semantics represent a valuable source of information for learning semantic representations. A simple and efficient inference method recursively induces joint semantic representations for each group and discovers correspondence between lexical entries and latent semantic concepts. We consider the generative semantics-text correspondence model (Liang et al., 2009) and demonstrate that exploiting the noncontradiction relation between texts leads to substantial improvements over natural baselines on a problem of analyzing human-written weather forecasts.

3 0.51264268 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

Author: Coskun Mermer ; Ahmet Afsin Akin

Abstract: We tackle the previously unaddressed problem of unsupervised determination of the optimal morphological segmentation for statistical machine translation (SMT) and propose a segmentation metric that takes into account both sides of the SMT training corpus. We formulate the objective function as the posterior probability of the training corpus according to a generative segmentation-translation model. We describe how the IBM Model-1 translation likelihood can be computed incrementally between adjacent segmentation states for efficient computation. Submerging the proposed segmentation method in a SMT task from morphologically-rich Turkish to English does not exhibit the expected improvement in translation BLEU scores and confirms the robustness of phrase-based SMT to translation unit combinatorics. A positive outcome of this work is the described modification to the sequential search algorithm of Morfessor (Creutz and Lagus, 2007) that enables arbitrary-fold parallelization of the computation, which unexpectedly improves the translation performance as measured by BLEU.

4 0.49358615 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers

Author: Vipul Mittal

Abstract: In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string either encoded as a Unicode string or as a Roman transliterated string and the output is a set of possible splits with weights associated with each of them. We followed two different approaches to segment a Sanskrit text using sandhi rules extracted from a parallel corpus of manually sandhi-split text. While the first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word, the second approach generates all possible segmentations and validates each constituent using a morph analyzer.

5 0.49286461 100 acl-2010-Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble

Author: Sebastian Spiegler ; Peter A. Flach

Abstract: This paper demonstrates that the use of ensemble methods and carefully calibrating the decision threshold can significantly improve the performance of machine learning methods for morphological word decomposition. We employ two algorithms which come from a family of generative probabilistic models. The models consider segment boundaries as hidden variables and include probabilities for letter transitions within segments. The advantage of this model family is that it can learn from small datasets and easily generalises to larger datasets. The first algorithm PROMODES, which participated in the Morpho Challenge 2009 (an international competition for unsupervised morphological analysis) employs a lower order model whereas the second algorithm PROMODES-H is a novel development of the first using a higher order model. We present the mathematical description for both algorithms, conduct experiments on the morphologically rich language Zulu and compare characteristics of both algorithms based on the experimental results.

6 0.47510502 86 acl-2010-Discourse Structure: Theory, Practice and Use

7 0.4742589 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields

8 0.47305757 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

9 0.46630883 256 acl-2010-Vocabulary Choice as an Indicator of Perspective

10 0.44055045 196 acl-2010-Plot Induction and Evolutionary Search for Story Generation

11 0.43792927 34 acl-2010-Authorship Attribution Using Probabilistic Context-Free Grammars

12 0.4363468 81 acl-2010-Decision Detection Using Hierarchical Graphical Models

13 0.41206026 79 acl-2010-Cross-Lingual Latent Topic Extraction

14 0.40835246 262 acl-2010-Word Alignment with Synonym Regularization

15 0.39539459 194 acl-2010-Phrase-Based Statistical Language Generation Using Graphical Models and Active Learning

16 0.3908098 106 acl-2010-Event-Based Hyperspace Analogue to Language for Query Expansion

17 0.38068208 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization

18 0.36722216 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

19 0.36427417 16 acl-2010-A Statistical Model for Lost Language Decipherment

20 0.361891 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.064), (28, 0.329), (42, 0.011), (59, 0.115), (73, 0.046), (76, 0.015), (78, 0.035), (83, 0.101), (84, 0.027), (98, 0.158)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89630747 86 acl-2010-Discourse Structure: Theory, Practice and Use

Author: Bonnie Webber ; Markus Egg ; Valia Kordoni

Abstract: unknown-abstract

same-paper 2 0.80470777 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

Author: Minwoo Jeong ; Ivan Titov

Abstract: Documents often have inherently parallel structure: they may consist of a text and commentaries, or an abstract and a body, or parts presenting alternative views on the same problem. Revealing relations between the parts by jointly segmenting and predicting links between the segments would help to visualize such documents and construct friendlier user interfaces. To address this problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. We apply our method to the “English as a second language” podcast dataset where each episode is composed of two parallel parts: a story and an explanatory lecture. The predicted topical links uncover hidden relations between the stories and the lectures. In this domain, our method achieves competitive results, rivaling those of a previously proposed supervised technique.

3 0.78675973 49 acl-2010-Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates

Author: Matthew Gerber ; Joyce Chai

Abstract: Despite its substantial coverage, NomBank does not account for all within-sentence arguments and ignores extrasentential arguments altogether. These arguments, which we call implicit, are important to semantic processing, and their recovery could potentially benefit many NLP applications. We present a study of implicit arguments for a select group of frequent nominal predicates. We show that implicit arguments are pervasive for these predicates, adding 65% to the coverage of NomBank. We demonstrate the feasibility of recovering implicit arguments with a supervised classification model. Our results and analyses provide a baseline for future work on this emerging task.

4 0.68861437 163 acl-2010-Learning Lexicalized Reordering Models from Reordering Graphs

Author: Jinsong Su ; Yang Liu ; Yajuan Lv ; Haitao Mi ; Qun Liu

Abstract: Lexicalized reordering models play a crucial role in phrase-based translation systems. They are usually learned from the word-aligned bilingual corpus by examining the reordering relations of adjacent phrases. Instead of just checking whether there is one phrase adjacent to a given phrase, we argue that it is important to take the number of adjacent phrases into account for better estimations of reordering models. We propose to use a structure named reordering graph, which represents all phrase segmentations of a sentence pair, to learn lexicalized reordering models efficiently. Experimental results on the NIST Chinese-English test sets show that our approach significantly outperforms the baseline method.

5 0.57697874 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans

Author: Fei Huang ; Alexander Yates

Abstract: Most supervised language processing systems show a significant drop-off in performance when they are tested on text that comes from a domain significantly different from the domain of the training data. Semantic role labeling techniques are typically trained on newswire text, and in tests their performance on fiction is as much as 19% worse than their performance on newswire text. We investigate techniques for building open-domain semantic role labeling systems that approach the ideal of a train-once, use-anywhere system. We leverage recently-developed techniques for learning representations of text using latent-variable language models, and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling. In experiments, our novel system reduces error by 16% relative to the previous state of the art on out-of-domain text.

6 0.57595384 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

7 0.575019 169 acl-2010-Learning to Translate with Source and Target Syntax

8 0.57401621 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

9 0.57368845 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

10 0.57337612 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

11 0.57258725 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition

12 0.57237017 133 acl-2010-Hierarchical Search for Word Alignment

13 0.57230979 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

14 0.57168436 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules

15 0.57158065 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment

16 0.57135373 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

17 0.57054269 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery

18 0.57053816 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese

19 0.56944007 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data

20 0.56903195 148 acl-2010-Improving the Use of Pseudo-Words for Evaluating Selectional Preferences