emnlp emnlp2010 emnlp2010-60 knowledge-graph by maker-knowledge-mining

60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning


Source: pdf

Author: Roi Reichart ; Ari Rappoport

Abstract: We introduce a novel training algorithm for unsupervised grammar induction, called Zoomed Learning. Given a training set T and a test set S, the goal of our algorithm is to identify subset pairs Ti, Si of T and S such that when the unsupervised parser is trained on a training subset Ti its results on its paired test subset Si are better than when it is trained on the entire training set T. A successful application of zoomed learning improves overall performance on the full test set S. We study our algorithm’s effect on the leading algorithm for the task of fully unsupervised parsing (Seginer, 2007) in three different English domains, WSJ, BROWN and GENIA, and show that it improves the parser F-score by up to 4.47%.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Given a training set T and a test set S, the goal of our algorithm is to identify subset pairs Ti, Si of T and S such that when the unsupervised parser is trained on a training subset Ti its results on its paired test subset Si are better than when it is trained on the entire training set T. [sent-5, score-0.885]

2 A successful application of zoomed learning improves overall performance on the full test set S. [sent-6, score-0.242]

3 We study our algorithm’s effect on the leading algorithm for the task of fully unsupervised parsing (Seginer, 2007) in three different English domains, WSJ, BROWN and GENIA, and show that it improves the parser F-score by up to 4.47%. [sent-7, score-0.434]

4 ... their dependency model with valence (DMV) for unsupervised dependency parsing when it is trained and tested on the same corpus (both when a sentence length restriction is imposed, such as for WSJ10, and when it is not, such as for the entire WSJ). [sent-19, score-0.337]

5 Spitkovsky et al. (2010) demonstrated that training the DMV model on sentences of up to 15 words in length yields better results on the entire section 23 of WSJ (with no sentence length restriction) than training with the entire WSJ corpus. [sent-23, score-0.224]

6 In contrast to these dependency models, the Seginer constituency parser achieves its best performance when trained on the entire WSJ corpus, whether or not a sentence length restriction is imposed on the test corpus. [sent-24, score-0.547]

7 When the parser is trained with the entire WSJ corpus, its F-score results on the WSJ10, WSJ20 and the entire WSJ corpora are 76, 64. [sent-27, score-0.477]

8 A successful application of zoomed learning improves performance on the full test set S. [sent-38, score-0.242]

9 In the simplest algorithm the subsets are randomly selected, while in the more sophisticated versions subset selection is done using a fully unsupervised measure of constituency parse tree quality. [sent-40, score-0.497]

10 We apply ZL to the Seginer parser, the best algorithm for fully unsupervised constituency parsing. [sent-41, score-0.203]

11 We experiment in three different English domains: WSJ (economic newspaper), GENIA (biological articles) and BROWN (heterogeneous domains), and show that ZL improves the parser F-score by as much as 4.47%. [sent-43, score-0.243]

12 Our confidence-based ZL algorithms use the PUPA unsupervised parsing quality score (Reichart and Rappoport, 2009b). [sent-52, score-0.257]

13 As far as we know, PUPA is the only unsupervised quality assessment algorithm for syntactic parsers that has been proposed. [sent-53, score-0.216]

14 Combining PUPA with Seginer’s parser thus preserves the fully unsupervised nature of the task. [sent-54, score-0.37]

15 We experiment with the Seginer parser for two reasons. [sent-61, score-0.243]

16 Second, this is the only publicly available unsupervised parser that induces constituency trees. [sent-63, score-0.373]

17 Interestingly, the results for other constituency models (the CCM model (Klein and Manning, 2002) and the U-DOP model (Bod, 2006a; Bod, 2006b)) are reported when the parser is trained on its test corpus, even if the sentences in that corpus are of bounded length (e.g. ... [sent-66, score-0.494]

18 These approaches have been applied to the DMV unsupervised dependency parser (Klein and Manning, 2004) and improved its performance. [sent-74, score-0.347]

19 Moreover, after class decomposition a classifier is trained with the entire training data while the subsets identified by a ZL algorithm are parsed by a parser trained only with the sentences they contain. [sent-95, score-0.614]

20 Finally, we briefly discuss the PUPA quality measure that we use to evaluate the quality of a parse tree. [sent-117, score-0.237]

21 Each sentence drawn from the set is parsed by a parser that is trained ... [sent-126, score-0.297]

22 Consequently, the best way to learn the syntactic patterns of any given set of sentences might be to train the parser on the sentences contained in the set. [sent-132, score-0.364]

23 The idea of the basic ZL algorithm is that sentences for which the parser provides low quality parses manifest different syntactic patterns than the sentences for which the parser provides high quality parses. [sent-135, score-0.882]

24 In the first, we create the fully-trained model by training the parser using all of the N sentences of T. [sent-138, score-0.306]

25 We divide the training sentences into two subsets: a high quality subset H consisting of the top scored NH sentences, and a lower quality subset L consisting of the other NL = N − NH sentences. [sent-141, score-0.468]

26 Our test set is thus naturally divided into two subsets, a high quality subset HT consisting of the test set sentences contained in H and a lower quality subset LT consisting of the test set sentences contained in L. [sent-144, score-0.639]

27 In the third stage, each of the test subsets is parsed by a model trained only on its corresponding training subset. [sent-145, score-0.225]

28 This stage is motivated by our assumption that the high and low quality subsets manifest dissimilar syntactic patterns, and consequently the statistics of the parser’s parameters suitable for one subset differ from those suitable for another. [sent-146, score-0.354]
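
Sentences 23-28 above describe the three stages of the basic ZL algorithm. The sketch below puts them together; it is a minimal Python rendering under assumed interfaces, with sentences represented as plain strings, and with train_parser, parse and pupa_score as hypothetical stand-ins for the Seginer parser and the PUPA scorer (neither exposes such an API in the paper).

    def basic_zl(train_sents, test_sents, n_h, train_parser, parse, pupa_score):
        # Stage 1: create the fully-trained model from all N training sentences.
        full_model = train_parser(train_sents)
        # Stage 2: parse the training sentences, score each parse with PUPA,
        # and split the training set into a high quality subset H (the top
        # n_h scores) and a lower quality subset L (the other N - n_h).
        trees = dict(zip(train_sents, parse(full_model, train_sents)))
        ranked = sorted(train_sents,
                        key=lambda s: pupa_score(trees[s], list(trees.values())),
                        reverse=True)
        H, L = set(ranked[:n_h]), set(ranked[n_h:])
        # The test set divides accordingly: HT holds the test sentences
        # contained in H, LT those contained in L.
        HT = [s for s in test_sents if s in H]
        LT = [s for s in test_sents if s in L]
        # Stage 3: parse each test subset with a model trained only on its
        # corresponding training subset.
        h_parses = parse(train_parser([s for s in train_sents if s in H]), HT)
        l_parses = parse(train_parser([s for s in train_sents if s in L]), LT)
        return h_parses + l_parses

Random-selection ZL (RZL, sentence 9) replaces the PUPA ranking with a random permutation; everything else is unchanged.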

29 POS tags for it are induced using the fully unsupervised algorithm of Clark (2003). [sent-148, score-0.201]

30 The parser we experiment with is the incremental parser of Seginer (2007), whose input consists of raw sentences and does not include any kind of supervised POS tags (created either manually or by a supervised algorithm). [sent-149, score-0.587]

31 The only parameter it has is NH, but ZL improves parser performance for most NH values. [sent-151, score-0.243]

32 BZL does something similar: it uses PUPA to estimate which sentences are given high quality parse trees, and down-weights examples with high (low) PUPA score to 0 when training the L-trained (H-trained) model. [sent-154, score-0.238]

33 However, in boosting the entire test set is annotated by the same learning model, while ZL parses each test subset with a model trained on its corresponding training subset. [sent-155, score-0.386]
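
The boosting analogy of sentences 32-33 reduces to binary example weights. A minimal sketch, assuming the training sentences are already ranked by PUPA score (highest first):

    def bzl_example_weights(num_sents, n_h):
        # For the H-trained model only the n_h top-scored sentences keep
        # weight 1 (low-scored examples are down-weighted to 0); for the
        # L-trained model the weighting is reversed.
        w_h = [1.0 if i < n_h else 0.0 for i in range(num_sents)]
        w_l = [0.0 if i < n_h else 1.0 for i in range(num_sents)]
        return w_h, w_l

Training with w_h is equivalent to training on H alone, which is why BZL is only boosting-like: unlike boosting, each test subset is then parsed by its own model rather than by a single reweighted one.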

34 The basic algorithm produces an ensemble of two parsing experts: the one trained on H and the one trained on L. [sent-157, score-0.214]

35 In addition, even if parse trees generated by the experts are better with high probability than those of the fully trained parser, they are not guaranteed to be so. [sent-160, score-0.204]

36 The fully trained parser is therefore also a valuable member of the ensemble. [sent-161, score-0.333]

37 Consequently, we introduce an extended zoomed learning algorithm (EZL). [sent-162, score-0.234]

38 In this stage, the two test subsets are parsed by the fully trained parsing model, in addition to being parsed by the zooming parsing models. [sent-164, score-0.384]

39 We now have two parses for each test sentence s: PZ(s), the parse created by a parser trained with the sentences contained in its corresponding training subset, and PF(s), created by the fully trained parser. [sent-165, score-0.69]

40 Therefore, there are two sources for a difference between the scores of the two parse trees of a given test sentence: the difference between the trees themselves, and the difference between the parses of the other sentences in the set. [sent-168, score-0.253]

41 The PUPA score for PZ(s) is computed using the parses created for the sentences contained in the test subset of s by a parser trained with the corresponding training subset. [sent-169, score-0.612]

42 The PUPA score for PF(s) is computed using the parses created for the entire test set by the fully trained parser. [sent-170, score-0.302]
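
Sentences 39-42 describe how EZL chooses the final parse of each test sentence. A sketch under the same assumed interfaces; for brevity it scores all PZ parses in one shared context, whereas in the paper each PZ(s) is scored only against the parses of its own test subset:

    def ezl_select(test_sents, zoom_parses, full_parses, pupa_score):
        # zoom_parses[s] is PZ(s), from the model trained on the training
        # subset paired with s; full_parses[s] is PF(s), from the fully
        # trained model. Because PUPA is context-dependent, each parse is
        # scored against the parse set it was created with.
        zoom_context = list(zoom_parses.values())
        full_context = list(full_parses.values())
        final = {}
        for s in test_sents:
            score_z = pupa_score(zoom_parses[s], zoom_context)
            score_f = pupa_score(full_parses[s], full_context)
            final[s] = zoom_parses[s] if score_z >= score_f else full_parses[s]
        return final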

43 In the second and fourth stages of the confidence-based algorithms, an unsupervised confidence score is computed for each of the induced parse trees. [sent-173, score-0.228]

44 We follow Reichart and Rappoport (2009b) and induce the POS tags using the fully unsupervised POS induction algorithm of Clark (2003). [sent-179, score-0.201]

45 The resulting score was shown to be strongly correlated with the extrinsic quality of the parse tree, defined to be its F-score similarity to the manually created (gold standard) parse tree of the sentence. [sent-184, score-0.292]

46 For all corpora we report the parser performance on the entire corpus (WSJ: 49206 sentences, BROWN: 24243 sentences, GENIA: 4661 sentences). [sent-191, score-0.365]

47 For WSJ we also provide an analysis of the performance of the parser when applied to sentences of bounded length. [sent-192, score-0.316]

48 Seginer’s parser achieves its best reported results when trained on the full WSJ corpus. [sent-194, score-0.285]

49 Consequently, for all corpora, we compare the performance of the parser when trained with the ZL algorithms to its performance when trained with the full corpus. [sent-195, score-0.36]

50 The POS tags required as input by the PUPA algorithm are induced by the fully unsupervised POS induction algorithm of Clark (2003). [sent-196, score-0.254]

51 Reichart and Rappoport (2009b) demonstrated an unsupervised technique for the estimation of the number of induced POS tags with which the correlation between PUPA’s score and the parse F-score is maximized. [sent-197, score-0.219]

52 In each experiment the size of the high quality H and lower quality L training subsets is different. [sent-202, score-0.264]

53 We report the parser performance on the test corpus for each training protocol. [sent-206, score-0.323]

54 Results are presented for four test corpora (WSJ10, ...). Table: Results for various values of NH (the number of sentences in the high quality training subset). [sent-239, score-0.206]

55 Evaluation: For each algorithm, the top line is its F-score performance; the bottom line is the difference from the F-score of the fully-trained Seginer parser (denoted by F(Full)). [sent-241, score-0.362]

56 To evaluate the quality of a parse tree with respect to its gold standard, the unlabeled parsing F-score is used. [sent-249, score-0.214]
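
Sentence 56 uses the standard unlabeled bracketing F-score. For reference, a minimal implementation over constituent spans (the textbook definition, not code from the paper):

    def unlabeled_fscore(test_spans, gold_spans):
        # Each tree is reduced to a set of (start, end) constituent spans;
        # category labels are ignored, hence "unlabeled".
        matched = len(test_spans & gold_spans)
        if matched == 0:
            return 0.0
        precision = matched / len(test_spans)
        recall = matched / len(gold_spans)
        return 2.0 * precision * recall / (precision + recall)

    # Example: two of three proposed brackets match the gold tree, so
    # precision = recall = 2/3 and the F-score is 2/3:
    # unlabeled_fscore({(0, 5), (0, 2), (3, 5)}, {(0, 5), (1, 2), (3, 5)})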

57 We start by discussing the effect of ZL on the performance of the Seginer parser when no length restriction is imposed on the test corpus sentences (WSJ, BROWN and GENIA). [sent-251, score-0.375]

58 For all test corpora and sizes of the high quality training subset (NH), zoomed learning improves the parser performance. [sent-253, score-0.701]

59 Note that for all three corpora zoomed learning with random selection (RZL) improves the parser performance on the entire test corpus, although to a lesser extent than confidence-based ZL. [sent-259, score-0.624]

60 Results are presented for the entire corpus (left column section), the low quality test subset (middle column section, LT) and the high quality test subset (right column section, HT) of each corpus, as a function of the high quality training set size (NH). [sent-295, score-0.625]

61 Since the tables present entire corpus results, the training and test subsets are identical. [sent-296, score-0.225]

62 A key principle of ZL is the selection of subsets that are better parsed by a parser trained only with the sentences they contain than with a parser trained with the entire training corpus. [sent-299, score-0.875]

63 In the first, the entire test corpus is parsed with a parser that was trained with a subset of randomly selected sentences from the training set. [sent-301, score-0.616]

64 We ran this protocol for all three corpora (and for the WSJ subcorpora) with various training set sizes and observed substantial degradation in the parser performance. [sent-302, score-0.436]

65 We conclude that using less training material harms the parser performance if a test subset is not carefully selected. [sent-304, score-0.434]

66 We also followed the protocol of Spitkovsky et al. (2010): we parsed each test corpus using a parser that was trained with all training sentences of a bounded length. [sent-306, score-0.492]

67 Unlike in their paper, in which this protocol improves the performance of the DMV unsupervised dependency parser (Klein and Manning, 2004), for the Seginer parser the protocol harms the results. [sent-307, score-0.792]
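
The control protocol of sentences 66-67 is easy to state in code. A sketch, again with train_parser and parse as assumed stand-ins and sentence length measured in words:

    def bounded_length_protocol(train_sents, test_sents, max_len,
                                train_parser, parse):
        # Train only on sentences of at most max_len words, then parse
        # the full test corpus. This helps the DMV dependency parser in
        # the cited 2010 work but harms the Seginer parser.
        short = [s for s in train_sents if len(s.split()) <= max_len]
        return parse(train_parser(short), test_sents)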

68 When parsing the entire WSJ with a WSJ10-trained parser or with a WSJ20-trained parser, the F-score results are 59. [sent-308, score-0.352]

69 ... while the F-score of the fully-trained parser on this corpus is 76. [sent-321, score-0.269]

70 For GENIA, however, while parsing GENIA10 with a GENIA10-trained parser harms the performance (45. [sent-331, score-0.334]

71 ... parsing GENIA20 with a GENIA20-trained parser enhances the performance (53. [sent-334, score-0.282]

72 Table 1 (bottom), the middle and right sections of Table 2 (both tables) and Figure 1 (second and third lines) present the performance of the ZL algorithms on the lower quality and higher quality test subsets (LT and HT). [sent-342, score-0.309]

73 For WSJ (and its sub-corpora) and BROWN, ... (Footnote 3: We repeated this protocol multiple times for each corpus, training the parser with sentences of length 5 to 45 in steps of 5.) [sent-344, score-0.381]

74 Top Three Lines: Difference in F-score performance of the Seginer parser between training with ZL and training with the entire WSJ corpus. [sent-350, score-0.355]

75 Results are presented for the entire corpus (top line), the lower quality test subset (LT, middle line) and the higher quality test subset (HT, bottom line) as a function of the size of the high quality training subset X = NH, measured in sentences. [sent-351, score-0.722]

76 The curve with triangles is for the extended zoomed learning algorithm (EZL), the solid curve is for the basic zoomed learning algorithm (BZL) and the dashed curve is for zoomed learning with random selection (RZL). [sent-352, score-0.789]

77 Bottom line: Comparison between the performance of the Seginer parser with the EZL algorithm (curves with triangles) and when subset selection is performed using the oracle F-score of the trees (solid curves). [sent-353, score-0.448]

78 F-score differences from the performance of the fully trained parser are presented for the WSJ test corpus as a function of NH, the high quality training subset size. [sent-354, score-0.582]

79 Oracle selection is superior for the lower quality subset but inferior for the high quality subset. [sent-355, score-0.296]

80 For GENIA, EZL and BZL improve the parser performance on both LT and HT for most NH values. [sent-367, score-0.243]

81 Our initial hypothesis is that due to the relatively small size of the GENIA corpus (4661 sentences, compared to the 49206 and 24243 sentences of WSJ and BROWN respectively), there is more room for improvement in the parser performance on this corpus, and consequently ZL improves on both sets. [sent-369, score-0.408]

82 Confidence-based ZL is based on the idea that sentences for which the fully-trained parser provides parses of similar quality manifest similar syntactic patterns. [sent-371, score-0.468]

83 Consequently, the parser performance on a set of such sentences can be improved if it is trained only with the sentences contained in the set. [sent-372, score-0.406]

84 Figure 1 (bottom line) compares the performance of EZL with that of the oracle-based zoomed learning algorithm when the test corpus is the entire WSJ. [sent-374, score-0.363]

85 For the low quality test subset, oracle selection is dramatically better than confidence-based selection. [sent-375, score-0.208]

86 For the high quality test subset the opposite pattern holds, that is, EZL is superior. [sent-376, score-0.202]

87 Oracle-based and confidence-based zoomed learning demonstrate the same trend: they improve over the baseline for LT much more than for HT. [sent-378, score-0.209]

88 The magnitude of the effect of oracle-based zoomed learning is much stronger. [sent-380, score-0.209]

89 These results support our idea that training the parser on a set selected by a well-designed confidence test leads to improvement of the parser performance for the selected sentences when the fully-trained parser produces parses of mediocre quality for them. [sent-381, score-1.023]

90 Integrating the experimental results for zoomed learning with the three selection methods (random, confidence-based and oracle-based) leads to an important conclusion that should guide future research. [sent-382, score-0.252]

91 In BZL, the L-trained parser and the H-trained parser generate parse trees for LT and HT sentences respectively. [sent-386, score-0.621]

92 In EZL, for each sentence the final parse is selected between the parse created by a parser trained with the sentences contained in its corresponding training subset, and the parse created by the fully trained parser. [sent-387, score-0.734]

93 While for all corpora it is beneficial to use the L-trained parser for the low quality test subset (LT), the results for WSJ and BROWN imply that it might be better to use the fully-trained parser or the EZL algorithm to parse the high quality test subset (HT). [sent-389, score-1.03]

94 We also explored a ZL scenario in which the entire test set is parsed either by the H-trained parser or by the L-trained parser. [sent-393, score-0.4]

95 These protocols result in substantial degradation in parser performance (compared to the fully-trained parser) since the performance of the H-trained parser on LT and the performance of the L-trained parser on HT are poor. [sent-394, score-0.8]

96 6 Conclusions: We introduced zoomed learning, a training algorithm for unsupervised parsers. [sent-395, score-0.334]

97 We applied three variants of ZL to the best fully unsupervised parsing algorithm (Seginer, 2007) and showed an improvement of up to 4.47%. [sent-396, score-0.214]

98 Future research should focus on the development of more accurate estimators of parser output quality, and experimentation with different corpora, languages and parsers. [sent-398, score-0.243]

99 Developing a quality assessment algorithm for dependency trees will allow us to apply confidencebased ZL to unsupervised dependency parsing. [sent-399, score-0.29]

100 Automatic selection of high quality parses created by a fully unsupervised parser. [sent-496, score-0.341]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('zl', 0.574), ('pupa', 0.293), ('seginer', 0.293), ('ezl', 0.244), ('parser', 0.243), ('zoomed', 0.209), ('wsj', 0.202), ('genia', 0.17), ('nh', 0.165), ('bzl', 0.159), ('lt', 0.1), ('rappoport', 0.098), ('reichart', 0.098), ('brown', 0.087), ('rzl', 0.086), ('subset', 0.085), ('quality', 0.084), ('bod', 0.084), ('unsupervised', 0.079), ('ht', 0.077), ('protocol', 0.075), ('subsets', 0.075), ('entire', 0.07), ('parse', 0.069), ('ensemble', 0.066), ('parses', 0.061), ('parsed', 0.054), ('harms', 0.052), ('spitkovsky', 0.052), ('constituency', 0.051), ('fully', 0.048), ('degradation', 0.045), ('klein', 0.044), ('dmv', 0.044), ('selection', 0.043), ('sentences', 0.042), ('trained', 0.042), ('boosting', 0.041), ('parsing', 0.039), ('manifest', 0.038), ('contained', 0.037), ('clark', 0.035), ('roi', 0.035), ('bracketing', 0.035), ('test', 0.033), ('ari', 0.033), ('manning', 0.033), ('algorithms', 0.033), ('bottom', 0.032), ('cohen', 0.032), ('line', 0.032), ('consequently', 0.032), ('rens', 0.031), ('bounded', 0.031), ('restriction', 0.031), ('pos', 0.03), ('confidence', 0.03), ('induced', 0.028), ('bagging', 0.028), ('assessment', 0.028), ('induction', 0.028), ('oracle', 0.028), ('created', 0.026), ('substantial', 0.026), ('corpus', 0.026), ('corpora', 0.026), ('algorithm', 0.025), ('dependency', 0.025), ('caruana', 0.024), ('cfneirfsc', 0.024), ('csiorfdecnref', 0.024), ('economic', 0.024), ('kucera', 0.024), ('rish', 0.024), ('rofed', 0.024), ('vilalta', 0.024), ('trees', 0.024), ('smith', 0.023), ('top', 0.023), ('improvement', 0.023), ('headden', 0.023), ('curve', 0.023), ('al', 0.023), ('consisting', 0.022), ('score', 0.022), ('tree', 0.022), ('tags', 0.021), ('training', 0.021), ('yoav', 0.021), ('experts', 0.021), ('plain', 0.02), ('stage', 0.02), ('low', 0.02), ('mcclosky', 0.019), ('supervised', 0.019), ('kawahara', 0.019), ('henderson', 0.019), ('pz', 0.019), ('freund', 0.019), ('vided', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

Author: Roi Reichart ; Ari Rappoport

Abstract: We introduce a novel training algorithm for unsupervised grammar induction, called Zoomed Learning. Given a training set T and a test set S, the goal of our algorithm is to identify subset pairs Ti, Si of T and S such that when the unsupervised parser is trained on a training subset Ti its results on its paired test subset Si are better than when it is trained on the entire training set T. A successful application of zoomed learning improves overall performance on the full test set S. We study our algorithm’s effect on the leading algorithm for the task of fully unsupervised parsing (Seginer, 2007) in three different English domains, WSJ, BROWN and GENIA, and show that it improves the parser F-score by up to 4.47%.

2 0.15043038 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

Author: Slav Petrov ; Pi-Chuan Chang ; Michael Ringgaard ; Hiyan Alshawi

Abstract: It is well known that parsing accuracies drop significantly on out-of-domain data. What is less known is that some parsers suffer more from domain shifts than others. We show that dependency parsers have more difficulty parsing questions than constituency parsers. In particular, deterministic shift-reduce dependency parsers, which are of highest interest for practical applications because of their linear running time, drop to 60% labeled accuracy on a question test set. We propose an uptraining procedure in which a deterministic parser is trained on the output of a more accurate, but slower, latent variable constituency parser (converted to dependencies). Uptraining with 100K unlabeled questions achieves results comparable to having 2K labeled questions for training. With 100K unlabeled and 2K labeled questions, uptraining is able to improve parsing accuracy to 84%, closing the gap between in-domain and out-of-domain performance.

3 0.10899707 114 emnlp-2010-Unsupervised Parse Selection for HPSG

Author: Rebecca Dridan ; Timothy Baldwin

Abstract: Parser disambiguation with precision grammars generally takes place via statistical ranking of the parse yield of the grammar using a supervised parse selection model. In the standard process, the parse selection model is trained over a hand-disambiguated treebank, meaning that without a significant investment of effort to produce the treebank, parse selection is not possible. Furthermore, as treebanking is generally streamlined with parse selection models, creating the initial treebank without a model requires more resources than subsequent treebanks. In this work, we show that, by taking advantage of the constrained nature of these HPSG grammars, we can learn a discriminative parse selection model from raw text in a purely unsupervised fashion. This allows us to bootstrap the treebanking process and provide better parsers faster, and with less resources.

4 0.10559292 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.

5 0.093952529 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

Author: Jackie Chi Kit Cheung ; Gerald Penn

Abstract: Syntactic consistency is the preference to reuse a syntactic construction shortly after its appearance in a discourse. We present an analysis of the WSJ portion of the Penn Treebank, and show that syntactic consistency is pervasive across productions with various lefthand side nonterminals. Then, we implement a reranking constituent parser that makes use of extra-sentential context in its feature set. Using a linear-chain conditional random field, we improve parsing accuracy over the generative baseline parser on the Penn Treebank WSJ corpus, rivalling a similar model that does not make use of context. We show that the context-aware and the context-ignorant rerankers perform well on different subsets of the evaluation data, suggesting a combined approach would provide further improvement. We also compare parses made by models, and suggest that context can be useful for parsing by capturing structural dependencies between sentences as opposed to lexically governed dependencies.

6 0.086345226 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

7 0.076108567 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

8 0.071521081 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

9 0.067314103 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

10 0.064364769 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

11 0.06323465 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars

12 0.060735531 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

13 0.052005947 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

14 0.050092578 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

15 0.045994654 88 emnlp-2010-On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

16 0.045860063 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

17 0.042323899 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

18 0.037277907 61 emnlp-2010-Improving Gender Classification of Blog Authors

19 0.037012037 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

20 0.033145268 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.149), (1, 0.085), (2, 0.181), (3, 0.018), (4, 0.003), (5, 0.105), (6, 0.016), (7, 0.063), (8, 0.155), (9, 0.016), (10, 0.096), (11, 0.027), (12, 0.134), (13, 0.143), (14, 0.036), (15, 0.034), (16, 0.088), (17, 0.084), (18, -0.003), (19, 0.029), (20, 0.068), (21, 0.055), (22, -0.044), (23, 0.078), (24, 0.173), (25, 0.055), (26, -0.11), (27, -0.013), (28, 0.123), (29, -0.023), (30, -0.1), (31, 0.004), (32, 0.202), (33, 0.214), (34, 0.127), (35, 0.011), (36, -0.017), (37, -0.012), (38, 0.044), (39, 0.035), (40, 0.063), (41, 0.06), (42, 0.004), (43, -0.038), (44, -0.0), (45, -0.222), (46, -0.069), (47, 0.005), (48, -0.055), (49, 0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94813496 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

Author: Roi Reichart ; Ari Rappoport

Abstract: We introduce a novel training algorithm for unsupervised grammar induction, called Zoomed Learning. Given a training set T and a test set S, the goal of our algorithm is to identify subset pairs Ti, Si of T and S such that when the unsupervised parser is trained on a training subset Ti its results on its paired test subset Si are better than when it is trained on the entire training set T. A successful application of zoomed learning improves overall performance on the full test set S. We study our algorithm’s effect on the leading algorithm for the task of fully unsupervised parsing (Seginer, 2007) in three different English domains, WSJ, BROWN and GENIA, and show that it improves the parser F-score by up to 4.47%.

2 0.67616051 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

Author: Slav Petrov ; Pi-Chuan Chang ; Michael Ringgaard ; Hiyan Alshawi

Abstract: It is well known that parsing accuracies drop significantly on out-of-domain data. What is less known is that some parsers suffer more from domain shifts than others. We show that dependency parsers have more difficulty parsing questions than constituency parsers. In particular, deterministic shift-reduce dependency parsers, which are of highest interest for practical applications because of their linear running time, drop to 60% labeled accuracy on a question test set. We propose an uptraining procedure in which a deterministic parser is trained on the output of a more accurate, but slower, latent variable constituency parser (converted to dependencies). Uptraining with 100K unlabeled questions achieves results comparable to having 2K labeled questions for training. With 100K unlabeled and 2K labeled questions, uptraining is able to improve parsing accuracy to 84%, closing the gap between in-domain and out-of-domain performance.

3 0.50841618 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

Author: Jackie Chi Kit Cheung ; Gerald Penn

Abstract: Syntactic consistency is the preference to reuse a syntactic construction shortly after its appearance in a discourse. We present an analysis of the WSJ portion of the Penn Treebank, and show that syntactic consistency is pervasive across productions with various lefthand side nonterminals. Then, we implement a reranking constituent parser that makes use of extra-sentential context in its feature set. Using a linear-chain conditional random field, we improve parsing accuracy over the generative baseline parser on the Penn Treebank WSJ corpus, rivalling a similar model that does not make use of context. We show that the context-aware and the context-ignorant rerankers perform well on different subsets of the evaluation data, suggesting a combined approach would provide further improvement. We also compare parses made by models, and suggest that context can be useful for parsing by capturing structural dependencies between sentences as opposed to lexically governed dependencies.

4 0.49581668 114 emnlp-2010-Unsupervised Parse Selection for HPSG

Author: Rebecca Dridan ; Timothy Baldwin

Abstract: Parser disambiguation with precision grammars generally takes place via statistical ranking of the parse yield of the grammar using a supervised parse selection model. In the standard process, the parse selection model is trained over a hand-disambiguated treebank, meaning that without a significant investment of effort to produce the treebank, parse selection is not possible. Furthermore, as treebanking is generally streamlined with parse selection models, creating the initial treebank without a model requires more resources than subsequent treebanks. In this work, we show that, by taking advantage of the constrained nature of these HPSG grammars, we can learn a discriminative parse selection model from raw text in a purely unsupervised fashion. This allows us to bootstrap the treebanking process and provide better parsers faster, and with less resources.

5 0.48546371 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.

6 0.40302286 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

7 0.40214118 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

8 0.29925573 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

9 0.26609293 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars

10 0.26240471 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

11 0.26055354 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

12 0.24369699 61 emnlp-2010-Improving Gender Classification of Blog Authors

13 0.24325667 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

14 0.23993888 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

15 0.2135625 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

16 0.2012471 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

17 0.1896553 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

18 0.18637761 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

19 0.18594091 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

20 0.18570469 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.018), (5, 0.011), (10, 0.013), (12, 0.021), (29, 0.114), (30, 0.017), (32, 0.017), (52, 0.013), (56, 0.06), (62, 0.016), (66, 0.124), (72, 0.052), (76, 0.057), (77, 0.011), (83, 0.027), (87, 0.029), (89, 0.012), (90, 0.301)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71736562 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

Author: Roi Reichart ; Ari Rappoport

Abstract: We introduce a novel training algorithm for unsupervised grammar induction, called Zoomed Learning. Given a training set T and a test set S, the goal of our algorithm is to identify subset pairs Ti, Si of T and S such that when the unsupervised parser is trained on a training subset Ti its results on its paired test subset Si are better than when it is trained on the entire training set T. A successful application of zoomed learning improves overall performance on the full test set S. We study our algorithm’s effect on the leading algorithm for the task of fully unsupervised parsing (Seginer, 2007) in three different English domains, WSJ, BROWN and GENIA, and show that it improves the parser F-score by up to 4.47%.

2 0.61400431 104 emnlp-2010-The Necessity of Combining Adaptation Methods

Author: Ming-Wei Chang ; Michael Connor ; Dan Roth

Abstract: Problems stemming from domain adaptation continue to plague the statistical natural language processing community. There has been continuing work trying to find general purpose algorithms to alleviate this problem. In this paper we argue that existing general purpose approaches usually only focus on one of two issues related to the difficulties faced by adaptation: 1) difference in base feature statistics or 2) task differences that can be detected with labeled data. We argue that it is necessary to combine these two classes of adaptation algorithms, using evidence collected through theoretical analysis and simulated and real-world data experiments. We find that the combined approach often outperforms the individual adaptation approaches. By combining simple approaches from each class of adaptation algorithm, we achieve state-of-the-art results for both Named Entity Recognition adaptation task and the Preposition Sense Disambiguation adaptation task. Second, we also show that applying an adaptation algorithm that finds shared representation between domains often impacts the choice in adaptation algorithm that makes use of target labeled data.

3 0.52596909 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

Author: Hui Zhang ; Min Zhang ; Haizhou Li ; Eng Siong Chng

Abstract: This paper studies two issues, non-isomorphic structure translation and target syntactic structure usage, for statistical machine translation in the context of forest-based tree to tree sequence translation. For the first issue, we propose a novel non-isomorphic translation framework to capture more non-isomorphic structure mappings than traditional tree-based and tree-sequence-based translation methods. For the second issue, we propose a parallel space searching method to generate hypothesis using tree-to-string model and evaluate its syntactic goodness using tree-to-tree/tree sequence model. This not only reduces the search complexity by merging spurious-ambiguity translation paths and solves the data sparseness issue in training, but also serves as a syntax-based target language model for better grammatical generation. Experiment results on the benchmark data show our proposed two solutions are very effective, achieving significant performance improvement over baselines when applying to different translation models.

4 0.52514541 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment.

5 0.522147 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

Author: Samidh Chatterjee ; Nicola Cancedda

Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic improvement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.

6 0.51998872 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

7 0.51874995 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

8 0.51865685 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

9 0.51794928 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

10 0.51788914 40 emnlp-2010-Effects of Empty Categories on Machine Translation

11 0.5163433 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

12 0.51622051 114 emnlp-2010-Unsupervised Parse Selection for HPSG

13 0.51621389 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task

14 0.51608795 26 emnlp-2010-Classifying Dialogue Acts in One-on-One Live Chats

15 0.51585412 84 emnlp-2010-NLP on Spoken Documents Without ASR

16 0.51426059 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

17 0.51335329 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

18 0.51206619 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

19 0.51184601 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

20 0.51074034 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning