41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

Source: pdf

Author: Amarnag Subramanya ; Slav Petrov ; Fernando Pereira

Abstract: We describe a new scalable algorithm for semi-supervised training of conditional random fields (CRF) and its application to part-of-speech (POS) tagging. The algorithm uses a similarity graph to encourage similar ngrams to have similar POS tags. We demonstrate the efficacy of our approach on a domain adaptation task, where we assume that we have access to large amounts of unlabeled data from the target domain, but no additional labeled data. The similarity graph is used during training to smooth the state posteriors on the target domain. Standard inference can be used at test time. Our approach is able to scale to very large problems and yields significantly improved target domain accuracy.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The algorithm uses a similarity graph to encourage similar ngrams to have similar POS tags. [sent-3, score-0.338]

2 We demonstrate the efficacy of our approach on a domain adaptation task, where we assume that we have access to large amounts of unlabeled data from the target domain, but no additional labeled data. [sent-4, score-0.628]

3 The similarity graph is used during training to smooth the state posteriors on the target domain. [sent-5, score-0.607]

4 Our approach is able to scale to very large problems and yields significantly improved target domain accuracy. [sent-7, score-0.238]

5 1 Introduction Semi-supervised learning (SSL) is the use of small amounts of labeled data with relatively large amounts of unlabeled data to train predictors. [sent-8, score-0.386]

6 Annotating training data for all sub-domains of a varied domain such as all of Web text is impractical, giving impetus to the development of SSL techniques that can learn from unlabeled data to perform well across domains. [sent-10, score-0.383]

7 The earliest SSL algorithm is self-training (Scudder, 1965), where one makes use of a previously trained model to annotate unlabeled data which is then used to re-train the model. [sent-11, score-0.251]

8 Thus we have a conflict between wanting to use SSL with large unlabeled data sets for best accuracy, but being unable to do so because of computational complexity. [sent-20, score-0.283]

9 Here one assumes that the data (both labeled and unlabeled) is represented by vertices in a graph. [sent-27, score-0.253]

10 , 2005) and they make use of a graph as a smoothness regularizer. [sent-39, score-0.339]

11 Our method is scalable because it trains with efficient standard building blocks for CRF inference and learning and also standard graph label propagation machinery. [sent-51, score-0.455]

12 Graph regularizer computations are only used for training, so at test time, standard CRF inference can be used, unlike in graph-based transductive methods. [sent-52, score-0.296]

13 Briefly, our approach starts by training a CRF on the source domain labeled data, and then uses it to decode unlabeled data from the target domain. [sent-53, score-0.697]

14 The state posteriors on the target domain are then smoothed using the graph regularizer. [sent-54, score-0.666]

15 Best state sequences for the unlabeled target data are then created by Viterbi decoding with the smoothed state posteriors, and this automatic target domain annotation is combined with the labeled source domain data to retrain the CRF. [sent-55, score-0.898]

16 For example, on the question domain used in this paper, the tagging accuracy of a supervised CRF is only 84%. [sent-60, score-0.262]

17 2 Supervised CRF We assume that we have a set of labeled source domain examples $D_l = \{(x_i, y_i)\}_{i=1}^{l}$, but only unlabeled target domain examples $D_u = \{x_i\}_{i=l+1}^{l+u}$. [sent-62, score-0.371]

18 Here $x_i = x_i^{(1)} x_i^{(2)} \cdots x_i^{(|x_i|)}$ is the sequence of words in sentence $i$ and $y_i = y_i^{(1)} y_i^{(2)} \cdots y_i^{(|x_i|)}$ is the corresponding POS tag sequence, with $y_i^{(j)} \in Y$, where $Y$ is the set of POS tags. [sent-63, score-0.362]

19 In our case, we also have access to the unlabeled data Du from the target domain which we would like to use for training the CRF. [sent-73, score-0.321]

20 We first describe how we construct a similarity graph over the unlabeled data, which will be used in our algorithm as a graph regularizer. [sent-74, score-0.902]

21 The standard approach for unstructured problems is to construct a graph whose vertices are labeled and unlabeled examples, and whose weighted edges encode the degree to which the examples they link should have the same label (Zhu et al. [sent-76, score-0.936]

22 Then the main graph construction choice is what similarity function to use for the weighted edges between examples. [sent-78, score-0.338]

23 While we might be able to choose some appropriate sequence similarity to construct the graph, such as edit distance or a string kernel, it is not clear how to use whole sequence similarity to constrain whole tag sequences assigned to linked examples in the learning algorithm. [sent-81, score-0.325]

24 However, their approach is too demanding computationally (see Section 5), so instead we use local sequence contexts as graph vertices, exploiting the empirical observation that the part of speech of a word occurrence is mostly determined by its local context. [sent-84, score-0.336]

25 Specifically, the set V of graph vertices consists of all the word n-grams (types) that have occurrences (tokens) in training sentences (labeled and unlabeled). [sent-85, score-0.399]

26 We partition V = Vl ∪ Vu where Vl corresponds to n-grams that occur at least once in the labeled data, and Vu corresponds to n-grams that occur only in the unlabeled data. [sent-86, score-0.48]
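The partition into Vl and Vu can be illustrated with a small sketch. It assumes trigram types (as used later in the text), sentences given as lists of words, and sentence-boundary padding symbols "<s>" and "</s>"; the padding and the function names are illustrative choices, not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple

Trigram = Tuple[str, str, str]


def trigram_types(sentences: Iterable[List[str]]) -> Set[Trigram]:
    """Collect the distinct trigram types, one centred on every word."""
    types: Set[Trigram] = set()
    for words in sentences:
        # Boundary padding is an assumption made for this sketch.
        padded = ["<s>"] + list(words) + ["</s>"]
        for j in range(1, len(padded) - 1):
            types.add((padded[j - 1], padded[j], padded[j + 1]))
    return types


def partition_vertices(labeled: Iterable[List[str]],
                       unlabeled: Iterable[List[str]]):
    """Return (V_l, V_u): types seen in labeled data, and types seen only in unlabeled data."""
    v_l = trigram_types(labeled)
    v_u = trigram_types(unlabeled) - v_l
    return v_l, v_u
```

A type that occurs in both data sets lands in Vl, which matches the "at least once in the labeled data" wording above.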

27 To define the similarity function, for each token of a given type in the labeled and unlabeled data, we extract a set of context features. [sent-91, score-0.523]

28 We have thus circumvented the problem of defining similarities over sequences by defining the graph over types that represent local sequence contexts. [sent-97, score-0.336]
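A graph over types can be sketched as a k-nearest-neighbor graph on per-type feature vectors. The sketch below assumes sparse feature vectors stored as dictionaries and cosine similarity as the edge weight; both the similarity function and the value of k are assumptions here, and the brute-force all-pairs loop is only suitable for toy inputs, not for millions of types.

```python
import math
from typing import Dict, Hashable, List, Tuple

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def knn_graph(vectors: Dict[Hashable, Vector],
              k: int) -> Dict[Hashable, List[Tuple[Hashable, float]]]:
    """For every vertex keep its k most similar vertices with positive similarity."""
    graph = {}
    for u, vec_u in vectors.items():
        scored = [(v, cosine(vec_u, vec_v))
                  for v, vec_v in vectors.items() if v != u]
        scored = [(v, s) for v, s in scored if s > 0.0]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        graph[u] = scored[:k]
    return graph
```

Note that a k-nearest-neighbor graph built this way is not necessarily symmetric; whether to symmetrize it is a separate design choice.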

29 Since our CRF tagger only uses local features of the input to score tag pairs, we believe that the graph we construct captures all significant context information. [sent-98, score-0.382]

30 We expect the similarity graph to provide information that cannot be expressed directly in a sequence model. [sent-106, score-0.393]

31 First, the graph allows new features to be discovered. [sent-110, score-0.281]

32 Many words occur only in the unlabeled data and a purely supervised CRF would not be able to learn feature weights for those observations. [sent-111, score-0.355]

33 The similarity graph on the other hand can link events that occur only in the unlabeled data to similar events in the labeled data. [sent-113, score-0.808]

34 Furthermore, because the graph is built over types rather than tokens, it will encourage the same interpretation to be chosen for similar trigrams occurring in different sentences. [sent-114, score-0.426]

35 For example, the word ‘unrar’ will most likely not occur in the labeled training data. [sent-115, score-0.182]

36 Second, the graph propagates adjustments to the weights of known features. [sent-117, score-0.281]

37 Many words occur only a handful of times in our labeled data, resulting in poor estimates of their contributions. [sent-118, score-0.182]

38 Even for frequently occurring events, their distribution in the target domain might be different from their distribution in the source domain. [sent-119, score-0.236]

39 In contrast, labeled vertices in the similarity graph can help disambiguate ambiguous contexts and correct (some of) the errors of the supervised model. [sent-121, score-0.648]

40 4 Semi-Supervised CRF Given unlabeled data Du, we only have access to the prior p(x). [sent-122, score-0.251]

41 As the CRF is a discriminative model, the lack of label information renders the CRF weights independent of p(x) and thus we cannot directly utilize the unlabeled data when training the CRF. [sent-123, score-0.297]

42 Therefore, semi-supervised approaches to training discriminative models typically use the unlabeled data to construct a regularizer that is used to guide the learning process (Joachims, 1999; Lawrence and Jordan, 2005). [sent-124, score-0.422]

43 Here we use the graph as a smoothness regularizer to train CRFs in a semi-supervised manner. [sent-125, score-0.535]

44 The marginals over tokens are then aggregated to marginals over types (token to type), which are used to initialize the graph label distributions. [sent-127, score-0.839]

45 After running label propagation (graph propagate), the posteriors from the graph are used to smooth the state posteriors. [sent-128, score-0.6]

46 Decoding the unlabeled data (viterbi decode) produces a new set of automatic annotations that can be combined with the labeled data to retrain the CRF using the supervised CRF training objective (crftrain). [sent-129, score-0.524]
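The training loop described in the preceding sentences can be sketched as a short driver. The five step functions are passed in as arguments and are hypothetical placeholders named after the steps in the text (crftrain, posterior decode, token to type, graph propagate, viterbi decode); their signatures are assumptions made for this sketch.

```python
from typing import Any, Callable, List, Sequence


def semi_supervised_crf(labeled: Sequence[Any],
                        unlabeled: Sequence[Any],
                        crf_train: Callable,
                        posterior_decode: Callable,
                        token_to_type: Callable,
                        graph_propagate: Callable,
                        viterbi_decode: Callable,
                        n_iterations: int) -> Any:
    """Alternate between decoding the target data and re-training the CRF."""
    # Start from a CRF trained on the labeled source domain data only.
    params = crf_train(list(labeled))
    for _ in range(n_iterations):
        # Token-level state posteriors on the unlabeled target data.
        token_marginals = posterior_decode(unlabeled, params)
        # Aggregate token marginals into type marginals.
        type_marginals = token_to_type(unlabeled, token_marginals)
        # Smooth the type marginals over the similarity graph.
        smoothed = graph_propagate(type_marginals)
        # Produce automatic annotations for the unlabeled sentences.
        auto_labeled = viterbi_decode(unlabeled, token_marginals, smoothed, params)
        # Re-train on source labels plus automatic target annotations.
        params = crf_train(list(labeled) + list(auto_labeled))
    return params
```

Passing the steps in as callables keeps the driver independent of any particular CRF library.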

47 1 Posterior Decoding Let $\Lambda_n^{(t)}$ (t refers to target domain) represent the estimate of the CRF parameters for the target domain after the n-th iteration. [sent-132, score-0.272]

48 2 Token-to-Type Mapping Recall that our graph is defined over types while the posteriors computed above involve particular tokens. [sent-135, score-0.429]

49 We accumulate token-based marginals to create type marginals as follows. [sent-136, score-0.515]

50 For a sentence i and word position j in that sentence, let T(i, j) be the [...] (Footnote 2: In the first iteration, we initialize the target domain parameters to the source domain parameters: $\Lambda_0^{(t)} = \Lambda^{(s)}$.) [sent-137, score-0.411]

51 Conversely, for a trigram type u, let $T^{-1}(u)$ be the set of actual occurrences (tokens) of that trigram u; that is, all pairs (i, j) where i is the index of a sentence where u occurs and j is the position of the center word of an occurrence of u in that sentence. [sent-139, score-0.339]
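The token-to-type accumulation can be sketched as follows. The sketch assumes that the type marginal is the plain average of the token marginals over all occurrences of the trigram, that each token marginal is a dictionary from tag to probability, and that sentence boundaries are padded with "<s>" and "</s>"; these details are assumptions, since the excerpt only says that token marginals are accumulated.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Trigram = Tuple[str, str, str]
Dist = Dict[str, float]


def token_to_type(sentences: List[List[str]],
                  token_marginals: List[List[Dist]]) -> Dict[Trigram, Dist]:
    """Average the token marginals of every occurrence of each trigram type."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for words, marginals in zip(sentences, token_marginals):
        padded = ["<s>"] + list(words) + ["</s>"]
        for j, dist in enumerate(marginals):
            # Trigram type centred on word j of this sentence.
            u = (padded[j], padded[j + 1], padded[j + 2])
            counts[u] += 1
            for tag, p in dist.items():
                sums[u][tag] += p
    return {u: {tag: total / counts[u] for tag, total in tags.items()}
            for u, tags in sums.items()}
```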

52 3 Graph Propagation We now use our similarity graph (Section 3) to smooth the type-level marginals by minimizing the following convex objective: C(q) [...] + µ [...] s.t. [...] [sent-143, score-0.726]

53 The setting of the hyperparameters µ and ν will be discussed in Section 6, N(u) is the set of neighbors of node u, and ru is the empirical marginal label distribution for trigram u in the labeled data. [sent-149, score-0.207]

54 Our graph propagation objective can be seen as a multi-class generalization of the quadratic cost criterion (Bengio et al. [sent-153, score-0.397]

55 The first term in the above objective requires that we respect the information in our labeled data. [sent-155, score-0.177]

56 The second term is the graph smoothness regularizer which requires that the qi’s be smooth with respect to the graph. [sent-156, score-0.529]

57 In other words, if wuv is large, then qu and qv should be close in the squared-error sense. [sent-157, score-0.206]

58 This implies that vertices u and v are likely to have similar marginals over POS tags. [sent-158, score-0.359]

59 The last term is a regularizer and encourages all type marginals to be uniform to the extent that is allowed by the first two terms. [sent-159, score-0.413]

60 If an unlabeled vertex does not have a path to any labeled vertex, this term ensures that the converged marginal for this vertex will be uniform over all tags, ensuring that our algorithm performs at least as well as a standard self-training based algorithm, as we will see later. [sent-160, score-0.563]
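For reference, one form of the objective that is consistent with the three terms described above is given below. It is a reconstruction from the prose, not a copy of the paper's equation: the symbol U for the uniform distribution over tags and the exact summation ranges are assumptions.

```latex
C(q) = \sum_{u \in V_l} \lVert r_u - q_u \rVert^2
     + \mu \sum_{u \in V} \sum_{v \in N(u)} w_{uv} \lVert q_u - q_v \rVert^2
     + \nu \sum_{u \in V} \lVert q_u - U \rVert^2
\quad \text{s.t.} \quad \sum_{y} q_u(y) = 1, \;\; q_u(y) \ge 0 \;\; \forall u, y
```

The first sum ranges only over labeled vertices, so unlabeled vertices are influenced solely by their neighbors and by the pull toward U.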

61 In all our experiments we run 10 iterations of the above algorithm, and we denote the type marginals at completion by qu∗(y). [sent-163, score-0.274]
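An iterative minimization of such an objective can be sketched with a Jacobi-style fixed-point update, in which every vertex is re-estimated from the previous marginals of its neighbors. Whether the paper uses exactly this update is not stated in this excerpt, so the update rule, the function name, and the dictionary-based data layout are assumptions; mu and nu correspond to the hyperparameters µ and ν. The sketch requires nu > 0 so that isolated vertices have a nonzero normalizer.

```python
from typing import Dict, Hashable, List, Tuple

Dist = Dict[str, float]


def graph_propagate(q0: Dict[Hashable, Dist],
                    r: Dict[Hashable, Dist],
                    neighbors: Dict[Hashable, List[Tuple[Hashable, float]]],
                    tags: List[str],
                    mu: float,
                    nu: float,
                    n_iterations: int = 10) -> Dict[Hashable, Dist]:
    """Smooth type marginals q0 over the graph; r holds labeled-data marginals."""
    uniform = 1.0 / len(tags)
    q = {u: dict(dist) for u, dist in q0.items()}
    for _ in range(n_iterations):
        new_q = {}
        for u in q:
            is_labeled = 1.0 if u in r else 0.0
            edges = neighbors.get(u, [])
            # Normalizer: labeled indicator + uniform prior weight + edge mass.
            kappa = is_labeled + nu + mu * sum(w for _, w in edges)
            new_q[u] = {}
            for y in tags:
                gamma = nu * uniform
                if is_labeled:
                    gamma += r[u].get(y, 0.0)
                # Weighted vote of the neighbors' previous marginals.
                gamma += mu * sum(w * q[v].get(y, 0.0) for v, w in edges)
                new_q[u][y] = gamma / kappa
        q = new_q
    return q
```

With this update, an unlabeled vertex with no edges is set to the uniform distribution, which agrees with the behavior described for disconnected vertices.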

62 4 Viterbi Decoding Given the type marginals computed in the previous step, we interpolate them with the original CRF token marginals. [sent-165, score-0.321]

63 This interpolation between type and token marginals encourages similar n-grams to have similar posteriors, while still allowing n-grams in different sentences to differ in their posteriors. [sent-166, score-0.321]
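The interpolation step can be sketched as a per-position mixture of the two distributions. The linear form and the mixing weight alpha are assumptions for this sketch; the excerpt only states that type and token marginals are interpolated.

```python
from typing import Dict

Dist = Dict[str, float]


def interpolate(token_dist: Dist, type_dist: Dist, alpha: float) -> Dist:
    """Mix a token marginal with the smoothed marginal of its trigram type.

    alpha = 1.0 keeps only the CRF token marginal; alpha = 0.0 keeps only
    the graph-smoothed type marginal.
    """
    tags = set(token_dist) | set(type_dist)
    return {y: alpha * token_dist.get(y, 0.0)
               + (1.0 - alpha) * type_dist.get(y, 0.0)
            for y in tags}
```

Because the mixture is computed separately at every position, two occurrences of the same trigram can still end up with different posteriors when their token marginals differ.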

64 The interpolated marginals summarize all the information obtained so far about the tag distribution at each position. [sent-169, score-0.35]

65 This happens because the type marginals obtained from the graph after label propagation will have lost most of the sequence information. [sent-171, score-0.73]

66 To enforce the first-order tag dependencies we therefore use Viterbi decoding over the combined interpolated marginals and the CRF transition potentials to compute the best POS tag sequence for each unlabeled sentence. [sent-172, score-0.76]
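A minimal Viterbi sketch over interpolated marginals and transition scores is shown below. It assumes that the score of a tag sequence is the sum of log marginals plus additive transition scores (log-potentials) given as a dictionary keyed by (previous tag, tag), with missing pairs scored as 0; how the paper combines the two quantities exactly is not stated in this excerpt.

```python
import math
from typing import Dict, List, Tuple


def viterbi(marginals: List[Dict[str, float]],
            transition: Dict[Tuple[str, str], float],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence."""
    if not marginals:
        return []

    def log(p: float) -> float:
        return math.log(p) if p > 0.0 else float("-inf")

    # best[y]: score of the best partial sequence ending in tag y.
    best = {y: log(marginals[0].get(y, 0.0)) for y in tags}
    back = []
    for dist in marginals[1:]:
        new_best, pointers = {}, {}
        for y in tags:
            prev = max(tags, key=lambda yp: best[yp] + transition.get((yp, y), 0.0))
            new_best[y] = (best[prev] + transition.get((prev, y), 0.0)
                           + log(dist.get(y, 0.0)))
            pointers[y] = prev
        best = new_best
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda y: best[y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

The transition scores are what restore the first-order tag dependencies that the type marginals alone no longer carry.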

67 5 Re-training the CRF Now that we have successfully labeled the unlabeled target domain data, we can use it in conjunction with the source domain labeled data to re-train the CRF: $\Lambda_{n+1}^{(t)} = \operatorname{argmin}_{\Lambda \in \mathbb{R}^K}$ [...] [sent-175, score-0.889]

68 Unlike Daumé III (2007), we do not require target domain labeled data. [sent-184, score-0.337]

69 , 2006) has been evaluated without target domain labeled data, that evaluation was to some extent transductive in that the target test data (unlabeled) was included in the unsupervised stage of SCL training that creates the structural correspondence between the two domains. [sent-186, score-0.564]

70 (2005), which is unlikely to scale up because its dual formulation requires the inversion of a matrix whose size depends on the graph size. [sent-188, score-0.281]

71 (2009) also constrain similar trigrams to have similar POS tags by forming cliques of similar trigrams and maximizing the agreement score over these cliques. [sent-190, score-0.32]

72 We achieve similar effects by using our simple, scalable convex graph regularization framework. [sent-192, score-0.431]

73 6 Experiments and Results We use the Wall Street Journal (WSJ) section of the Penn Treebank as our labeled source domain training set. [sent-196, score-0.301]

74 To evaluate our domain-adaptation approach, we consider two different target domains: questions and biomedical data. [sent-198, score-0.262]

75 Both target domains are relatively far from the source domain (newswire), making this a very challenging task. [sent-199, score-0.236]

76 As our unlabeled data, we use a set of 10 million questions collected from anonymized Internet search queries. [sent-208, score-0.392]

77 Because the graph nodes and the features used in the similarity function are based on n-grams, data sparsity can be a serious problem, and we therefore use the entire unlabeled data set for graph construction. [sent-211, score-0.904]

78 We estimate the mutual information-based features for each trigram type over all the 10 million questions, and then construct the graph over only the set of trigram types that actually occurs in the 100,000 random subset and the WSJ training set. [sent-212, score-0.726]
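Mutual information-based features of this kind can be sketched with pointwise mutual information (PMI) between a trigram type and a context feature. The sketch assumes the input is a list of (type, feature) co-occurrence pairs, one per observed token and feature, and uses the plain PMI definition; the actual feature set and any smoothing or thresholding used in the paper are not given in this excerpt.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def pmi_vectors(pairs: Iterable[Tuple[Hashable, str]]) -> Dict[Hashable, Dict[str, float]]:
    """Build one sparse PMI vector per type from (type, feature) co-occurrences."""
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    type_counts = Counter(u for u, _ in pairs)
    feat_counts = Counter(f for _, f in pairs)
    total = len(pairs)
    vectors: Dict[Hashable, Dict[str, float]] = {}
    for (u, f), c in pair_counts.items():
        # PMI(u, f) = log( p(u, f) / (p(u) * p(f)) )
        pmi = math.log(c * total / (type_counts[u] * feat_counts[f]))
        vectors.setdefault(u, {})[f] = pmi
    return vectors
```

Counts estimated on more data give less noisy PMI values, which is one reason to compute the statistics over the full unlabeled set even when the graph itself covers fewer types.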

79 Furthermore, the POS tag set for this data is a super-set of the Penn Treebank’s, including the two new tags HYPH (for hyphens) and AFX (for common post-modifiers of biomedical entities such as genes). [sent-218, score-0.224]

80 For unlabeled data we used 100,000 sentences that were chosen by searching MEDLINE for abstracts pertaining to cancer, in particular genomic variations and mutations (Blitzer et al. [sent-222, score-0.251]

81 Since we did not have access to additional unlabeled data, we used the same set of sentences as target domain unlabeled data, Du. [sent-224, score-0.704]

82 The graph here was constructed over the 100,000 unlabeled sentences and the WSJ training set. [sent-225, score-0.532]

83 Finally, we remind the reader that we did not use label information for graph construction in either corpus. [sent-226, score-0.327]

84 Both supervised and semi-supervised models are regularized with a squared ℓ2-norm regularizer with weight set to 0. [sent-242, score-0.233]

85 In this approach, we first train a supervised CRF on the labeled data and then do semisupervised training without label propagation. [sent-245, score-0.325]

86 This is different from plain self-training because it aggregates the posteriors over tokens into posteriors over types. [sent-246, score-0.326]

87 This aggregation step allows instances of the same trigram in different sentences to share information and works better in practice than direct self-training on the output of the supervised CRF. [sent-247, score-0.21]

88 2 Domain Adaptation Results The data set obtained concatenating the WSJ training set with the 10 million questions had about 20 million trigram types. [sent-249, score-0.368]

89 1 million trigram types occurred in the WSJ training set or in the 100,000 sentence sub-sample. [sent-251, score-0.227]

90 For the biomedical domain, the graph had about 2. [sent-252, score-0.406]

91 For all our experiments we set hyperparameters as follows: for graph propagation, µ = 0. [sent-254, score-0.281]

92 We hypothesize that this is caused by sparsity in the graph generated from the biomedical dataset. [sent-276, score-0.44]

93 For the questions graph, the PMI statistics were estimated over 10 million sentences while in the case of the biomedical dataset, the same statistics were computed over just 100,000 sentences. [sent-277, score-0.266]

94 For the biomedical data, close to 50% of the trigrams from the target data do not have a path to a trigram from the source data. [sent-281, score-0.589]

95 Labeled trigrams occur at least once in the WSJ training data. [sent-285, score-0.192]

96 On the other hand, for the question corpus, only about 12% of the target domain trigrams are disconnected, and the average path length is about 9. [sent-286, score-0.409]
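Connectivity statistics of this kind can be computed with a multi-source breadth-first search from the labeled vertices. The sketch below assumes an undirected graph stored as adjacency lists and measures path length in hops to the nearest labeled vertex; how the paper defines and averages path length is not stated in this excerpt.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List


def path_lengths_to_labeled(neighbors: Dict[Hashable, List[Hashable]],
                            labeled: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Hops from every reachable vertex to its nearest labeled vertex."""
    dist = {u: 0 for u in labeled}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        for v in neighbors.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def disconnected_fraction(neighbors: Dict[Hashable, List[Hashable]],
                          labeled: Iterable[Hashable],
                          unlabeled: List[Hashable]) -> float:
    """Fraction of unlabeled vertices with no path to any labeled vertex."""
    if not unlabeled:
        return 0.0
    dist = path_lengths_to_labeled(neighbors, labeled)
    return sum(1 for u in unlabeled if u not in dist) / len(unlabeled)
```

Vertices missing from the returned distance map are exactly the ones that label propagation cannot reach from labeled data.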

97 We believe that it is this sparsity that causes the graph propagation to not have a more noticeable effect on the final performance. [sent-288, score-0.389]

98 It is noteworthy that making use of even such a sparse graph does not lead to any degradation in results, which we attribute to the choice of graph-propagation regularizer (Section 4. [sent-289, score-0.281]

99 We presented a simple, scalable algorithm for training structured prediction models in a semisupervised manner. [sent-291, score-0.211]

100 The approach is based on using as a regularizer a nearest-neighbor graph constructed over trigram types. [sent-292, score-0.573]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('crf', 0.37), ('ssl', 0.316), ('graph', 0.281), ('unlabeled', 0.251), ('marginals', 0.241), ('transductive', 0.157), ('trigram', 0.153), ('posteriors', 0.148), ('trigrams', 0.145), ('regularizer', 0.139), ('labeled', 0.135), ('domain', 0.132), ('biomedical', 0.125), ('vertices', 0.118), ('xi', 0.103), ('convex', 0.096), ('pos', 0.093), ('yi', 0.092), ('chapelle', 0.09), ('wuv', 0.09), ('semisupervised', 0.087), ('dl', 0.081), ('qu', 0.081), ('altun', 0.077), ('nt', 0.077), ('decode', 0.075), ('million', 0.074), ('propagation', 0.074), ('tagging', 0.073), ('wsj', 0.073), ('blitzer', 0.071), ('target', 0.07), ('ij', 0.07), ('structured', 0.07), ('tag', 0.069), ('belkin', 0.068), ('subramanya', 0.068), ('questions', 0.067), ('vl', 0.064), ('viterbi', 0.063), ('path', 0.062), ('zhu', 0.061), ('smoothness', 0.058), ('supervised', 0.057), ('similarity', 0.057), ('sequence', 0.055), ('scalable', 0.054), ('haffari', 0.052), ('neighborhoods', 0.052), ('smooth', 0.051), ('occur', 0.047), ('du', 0.047), ('token', 0.047), ('label', 0.046), ('alexandrescu', 0.045), ('brefeld', 0.045), ('corduneanu', 0.045), ('grandvalet', 0.045), ('inverting', 0.045), ('niyogi', 0.045), ('sindhwani', 0.045), ('vu', 0.045), ('pmi', 0.045), ('bengio', 0.045), ('google', 0.044), ('iand', 0.043), ('objective', 0.042), ('marginal', 0.041), ('adaptation', 0.04), ('interpolated', 0.04), ('clique', 0.039), ('collobert', 0.039), ('gupta', 0.039), ('manifold', 0.039), ('questionbank', 0.039), ('retrain', 0.039), ('scl', 0.039), ('link', 0.037), ('atomic', 0.037), ('squared', 0.037), ('vertex', 0.037), ('problems', 0.036), ('absolute', 0.036), ('smoothed', 0.035), ('potentials', 0.035), ('qv', 0.035), ('icml', 0.035), ('sparsity', 0.034), ('joachims', 0.034), ('source', 0.034), ('type', 0.033), ('mountain', 0.032), ('conflict', 0.032), ('construct', 0.032), ('neighbors', 0.031), ('tags', 0.03), ('blum', 0.03), ('tokens', 0.03), ('com', 0.03), ('toutanova', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

Author: Amarnag Subramanya ; Slav Petrov ; Fernando Pereira

2 0.15701807 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

Author: Christina Sauper ; Aria Haghighi ; Regina Barzilay

Abstract: In this paper, we investigate how modeling content structure can benefit text analysis applications such as extractive summarization and sentiment analysis. This follows the linguistic intuition that rich contextual information should be useful in these tasks. We present a framework which combines a supervised text analysis application with the induction of latent content structure. Both of these elements are learned jointly using the EM algorithm. The induced content structure is learned from a large unannotated corpus and biased by the underlying text analysis task. We demonstrate that exploiting content structure yields significant improvements over approaches that rely only on local context.

3 0.15042952 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models

Author: Avihai Mejer ; Koby Crammer

Abstract: Confidence-Weighted linear classifiers (CW) and its successors were shown to perform well on binary and multiclass NLP problems. In this paper we extend the CW approach for sequence learning and show that it achieves state-of-the-art performance on four noun phrase chunking and named entity recognition tasks. We then derive few algorithmic approaches to estimate the prediction’s correctness of each label in the output sequence. We show that our approach provides a reliable relative correctness information as it outperforms other alternatives in ranking label-predictions according to their error. We also show empirically that our methods output close to absolute estimation of error. Finally, we show how to use this information to improve active learning.

4 0.15040284 104 emnlp-2010-The Necessity of Combining Adaptation Methods

Author: Ming-Wei Chang ; Michael Connor ; Dan Roth

Abstract: Problems stemming from domain adaptation continue to plague the statistical natural language processing community. There has been continuing work trying to find general purpose algorithms to alleviate this problem. In this paper we argue that existing general purpose approaches usually only focus on one of two issues related to the difficulties faced by adaptation: 1) difference in base feature statistics or 2) task differences that can be detected with labeled data. We argue that it is necessary to combine these two classes of adaptation algorithms, using evidence collected through theoretical analysis and simulated and real-world data experiments. We find that the combined approach often outperforms the individual adaptation approaches. By combining simple approaches from each class of adaptation algorithm, we achieve state-of-the-art results for both Named Entity Recognition adaptation task and the Preposition Sense Disambiguation adaptation task. Second, we also show that applying an adaptation algorithm that finds shared representation between domains often impacts the choice in adaptation algorithm that makes use of target labeled data.

5 0.14941168 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

Author: Slav Petrov ; Pi-Chuan Chang ; Michael Ringgaard ; Hiyan Alshawi

Abstract: It is well known that parsing accuracies drop significantly on out-of-domain data. What is less known is that some parsers suffer more from domain shifts than others. We show that dependency parsers have more difficulty parsing questions than constituency parsers. In particular, deterministic shift-reduce dependency parsers, which are of highest interest for practical applications because of their linear running time, drop to 60% labeled accuracy on a question test set. We propose an uptraining procedure in which a deterministic parser is trained on the output of a more accurate, but slower, latent variable constituency parser (converted to dependencies). Uptraining with 100K unlabeled questions achieves results comparable to having 2K labeled questions for training. With 100K unlabeled and 2K labeled questions, uptraining is able to improve parsing accuracy to 84%, closing the gap between in-domain and out-of-domain performance.

6 0.13608755 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields

7 0.11654307 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

8 0.10983597 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

9 0.10960101 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

10 0.10766554 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

11 0.10375385 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs

12 0.10358308 88 emnlp-2010-On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

13 0.10051903 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

14 0.099147573 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

15 0.097021297 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

16 0.091419749 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

17 0.081676327 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

18 0.078139573 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

19 0.077830993 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

20 0.077315673 39 emnlp-2010-EMNLP 044

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.279), (1, 0.158), (2, 0.027), (3, -0.071), (4, -0.206), (5, 0.088), (6, 0.147), (7, 0.201), (8, -0.069), (9, 0.229), (10, -0.044), (11, 0.026), (12, -0.011), (13, 0.051), (14, -0.067), (15, 0.032), (16, -0.039), (17, -0.077), (18, 0.06), (19, 0.059), (20, 0.045), (21, 0.001), (22, 0.025), (23, -0.173), (24, -0.006), (25, -0.105), (26, -0.142), (27, 0.077), (28, -0.111), (29, -0.013), (30, -0.046), (31, 0.045), (32, 0.1), (33, -0.013), (34, -0.06), (35, -0.075), (36, -0.097), (37, -0.083), (38, 0.026), (39, -0.072), (40, -0.041), (41, -0.092), (42, 0.088), (43, 0.049), (44, 0.039), (45, 0.054), (46, 0.105), (47, -0.045), (48, -0.033), (49, 0.185)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96050304 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

Author: Amarnag Subramanya ; Slav Petrov ; Fernando Pereira

2 0.5354104 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models

Author: Avihai Mejer ; Koby Crammer

3 0.51721793 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

Author: Andre Martins ; Noah Smith ; Eric Xing ; Pedro Aguiar ; Mario Figueiredo

Abstract: We present a unified view of two state-of-the-art non-projective dependency parsers, both approximate: the loopy belief propagation parser of Smith and Eisner (2008) and the relaxed linear program of Martins et al. (2009). By representing the model assumptions with a factor graph, we shed light on the optimization problems tackled in each method. We also propose a new aggressive online algorithm to learn the model parameters, which makes use of the underlying variational representation. The algorithm does not require a learning rate parameter and provides a single framework for a wide family of convex loss functions, including CRFs and structured SVMs. Experiments show state-of-the-art performance for 14 languages.

4 0.50741148 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields

Author: Wei Lu ; Hwee Tou Ng

Abstract: This paper focuses on the task of inserting punctuation symbols into transcribed conversational speech texts, without relying on prosodic cues. We investigate limitations associated with previous methods, and propose a novel approach based on dynamic conditional random fields. Different from previous work, our proposed approach is designed to jointly perform both sentence boundary and sentence type prediction, and punctuation prediction on speech utterances. We performed evaluations on a transcribed conversational speech domain consisting of both English and Chinese texts. Empirical results show that our method outperforms an approach based on linear-chain conditional random fields and other previous approaches.

5 0.49160212 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs

Author: Ioannis Klapaftis ; Suresh Manandhar

Abstract: Graph-based methods have gained attention in many areas of Natural Language Processing (NLP) including Word Sense Disambiguation (WSD), text summarization, keyword extraction and others. Most of the work in these areas formulate their problem in a graph-based setting and apply unsupervised graph clustering to obtain a set of clusters. Recent studies suggest that graphs often exhibit a hierarchical structure that goes beyond simple flat clustering. This paper presents an unsupervised method for inferring the hierarchical grouping of the senses of a polysemous word. The inferred hierarchical structures are applied to the problem of word sense disambiguation, where we show that our method performs significantly better than traditional graph-based methods and agglomerative clustering yielding improvements over state-of-the-art WSD systems based on sense induction.

6 0.46530142 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

7 0.46132761 104 emnlp-2010-The Necessity of Combining Adaptation Methods

8 0.458556 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

9 0.44770655 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

10 0.39899519 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

11 0.37789097 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

12 0.35114455 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

13 0.34775072 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

14 0.34392515 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

15 0.3213892 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

16 0.30142671 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

17 0.28903759 39 emnlp-2010-EMNLP 044

18 0.28058884 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

19 0.27455238 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

20 0.27413052 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.014), (12, 0.031), (29, 0.106), (30, 0.013), (32, 0.03), (52, 0.027), (56, 0.082), (62, 0.028), (66, 0.15), (72, 0.051), (76, 0.018), (79, 0.01), (87, 0.012), (89, 0.354)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93451589 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective spelling correction of personal names in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions: the first method uses a set of names in a given language/script, whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.

same-paper 2 0.79727107 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

Author: Amarnag Subramanya ; Slav Petrov ; Fernando Pereira

3 0.79228288 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

4 0.55180979 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

Author: Chen Zhang ; Joyce Chai

Abstract: While a significant amount of research has been devoted to textual entailment, automated entailment from conversational scripts has received less attention. To address this limitation, this paper investigates the problem of conversation entailment: automated inference of hypotheses from conversation scripts. We examine two levels of semantic representations: a basic representation based on syntactic parsing of conversation utterances and an augmented representation that takes conversation structures into consideration. For each of these levels, we further explore two ways of capturing long distance relations between language constituents: implicit modeling based on the length of distance and explicit modeling based on actual patterns of relations. Our empirical findings show that the augmented representation with conversation structures is important, and that it achieves the best performance when combined with explicit modeling of long distance relations.

5 0.54763377 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment. Sentiment analysis (Pang and Lee, 2008) offers the promise of automatically discerning how people feel about a product, person, organization, or issue based on what they write online, which is potentially of great value to businesses and other organizations. However, the vast majority of sentiment resources and algorithms are limited to a single language, usually English (Wilson, 2008; Baccianella and Sebastiani, 2010). Since no single language captures a majority of the content online, adopting such a limited approach in an increasingly global community risks missing important details and trends that might only be available when text in multiple languages is taken into account. 
Philip Resnik, Department of Linguistics and UMIACS, University of Maryland, College Park, MD, resnik@umd.edu
Up to this point, multiple languages have been addressed in sentiment analysis primarily by transferring knowledge from a resource-rich language to a less rich language (Banea et al., 2008), or by ignoring differences in languages via translation into English (Denecke, 2008). These approaches are limited to a view of sentiment that takes place through an English-centric lens, and they ignore the potential to share information between languages. Ideally, learning sentiment cues holistically, across languages, would result in a richer and more globally consistent picture.
In this paper, we introduce Multilingual Supervised Latent Dirichlet Allocation (MLSLDA), a model for sentiment analysis on a multilingual corpus. MLSLDA discovers a consistent, unified picture of sentiment across multiple languages by learning “topics,” probabilistic partitions of the vocabulary that are consistent in terms of both meaning and relevance to observed sentiment. Our approach makes few assumptions about available resources, requiring neither parallel corpora nor machine translation.
The rest of the paper proceeds as follows. In Section 1, we describe the probabilistic tools that we use to create consistent topics bridging across languages and the MLSLDA model. In Section 2, we present the inference process. We discuss our set of semantic bridges between languages in Section 3, and our experiments in Section 4 demonstrate that this approach functions as an effective multilingual topic model, discovers sentiment-biased topics, and uses multilingual corpora to make better sentiment predictions across languages. Sections 5 and 6 discuss related research and future work, respectively.
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 45–5…, MIT, Massachusetts, USA, 9-11 October 2010. ©2010 Association for Computational Linguistics
1 Predictions from Multilingual Topics
As its name suggests, MLSLDA is an extension of Latent Dirichlet allocation (LDA) (Blei et al., 2003), a modeling approach that takes a corpus of unannotated documents as input and produces two outputs, a set of “topics” and assignments of documents to topics. Both the topics and the assignments are probabilistic: a topic is represented as a probability distribution over words in the corpus, and each document is assigned a probability distribution over all the topics. Topic models built on the foundations of LDA are appealing for sentiment analysis because the learned topics can cluster together sentiment-bearing words, and because topic distributions are a parsimonious way to represent a document.¹ LDA has been used to discover latent structure in text (e.g. for discourse segmentation (Purver et al., 2006) and authorship (Rosen-Zvi et al., 2004)). MLSLDA extends the approach by ensuring that this latent structure (the underlying topics) is consistent across languages. We discuss multilingual topic modeling in Section 1.1, and in Section 1.2 we show how this enables supervised regression regardless of a document’s language.
1.1 Capturing Semantic Correlations
Topic models posit a straightforward generative process that creates an observed corpus. For each document d, some distribution θ_d over unobserved topics is chosen. Then, for each word position in the document, a topic z is selected. Finally, the word for that position is generated by selecting from the topic indexed by z. (Recall that in LDA, a “topic” is a distribution over words.) In monolingual topic models, the topic distribution is usually drawn from a Dirichlet distribution. Using Dirichlet distributions makes it easy to specify sparse priors, and it also simplifies posterior inference because Dirichlet distributions are conjugate to multinomial distributions.
However, drawing topics from Dirichlet distributions will not suffice if our vocabulary includes multiple languages. If we are working with English, German, and Chinese at the same time, a Dirichlet prior has no way to favor distributions z such that p(good|z), p(gut|z), and p(hǎo|z) all tend to be high at the same time, or low at the same time. More generally, the structure of our model must encourage topics to be consistent across languages, and Dirichlet distributions cannot encode correlations between elements.
¹ The latter property has also made LDA popular for information retrieval (Wei and Croft, 2006).
One possible solution to this problem is to use the multivariate normal distribution, which can produce correlated multinomials (Blei and Lafferty, 2005), in place of the Dirichlet distribution. This has been done successfully in multilingual settings (Cohen and Smith, 2009). However, such models complicate inference by not being conjugate.
Instead, we appeal to tree-based extensions of the Dirichlet distribution, which have been used to induce correlation in semantic ontologies (Boyd-Graber et al., 2007) and to encode clustering constraints (Andrzejewski et al., 2009). The key idea in this approach is to assume the vocabularies of all languages are organized according to some shared semantic structure that can be represented as a tree. For concreteness in this section, we will use WordNet (Miller, 1990) as the representation of this multilingual semantic bridge, since it is well known, offers convenient and intuitive terminology, and demonstrates the full flexibility of our approach. However, the model we describe generalizes to any tree-structured representation of multilingual knowledge; we discuss some alternatives in Section 3.
WordNet organizes a vocabulary into a rooted, directed acyclic graph of nodes called synsets, short for “synonym sets.” A synset is a child of another synset if it satisfies a hyponymy relationship; each child “is a” more specific instantiation of its parent concept (thus, hyponymy is often called an “isa” relationship). For example, a “dog” is a “canine” is an “animal” is a “living thing,” etc. As an approximation, it is not unreasonable to assume that WordNet’s structure of meaning is language independent, i.e. the concept encoded by a synset can be realized using terms in different languages that share the same meaning. In practice, this organization has been used to create many alignments of international WordNets to the original English WordNet (Ordan and Wintner, 2007; Sagot and Fišer, 2008; Isahara et al., 2008).
Using the structure of WordNet, we can now describe a generative process that produces a distribution over a multilingual vocabulary, which encourages correlations between words with similar meanings regardless of what language each word is in. For each synset h, we create a multilingual word distribution for that synset as follows:
1. Draw transition probabilities β_h ∼ Dir(τ_h).
2. Draw stop probabilities ω_h ∼ Dir(κ_h).
3. For each language l, draw emission probabilities for that synset φ_{h,l} ∼ Dir(π_{h,l}).
For conciseness in the rest of the paper, we will refer to this generative process as the multilingual Dirichlet hierarchy, or MULTDIRHIER(τ, κ, π).²
Each observed token can be viewed as the end result of a sequence of visited synsets λ. At each node in the tree, the path can end at node i with probability ω_{i,1}, or it can continue to a child synset with probability ω_{i,0}. If the path continues to another child synset, it visits child j with probability β_{i,j}.
If the path ends at a synset, it generates word k with probability φ_{i,l,k}.³ The probability of a word being emitted from a path with visited synsets r and final synset h in language l is therefore
\[ p(w, \lambda = r, h \mid l, \beta, \omega, \phi) = \prod_{(i,j) \in r} \beta_{i,j}\, \omega_{i,0}\, (1 - \omega_{h,1})\, \phi_{h,l,w}. \tag{1} \]
Note that the stop probability ω_h is independent of language, but the emission φ_{h,l} is dependent on the language. This is done to prevent the following scenario: while synset A is highly probable in a topic and words in language 1 attached to that synset have high probability, words in language 2 have low probability. If this could happen for many synsets in a topic, an entire language would be effectively silenced, which would lead to inconsistent topics (e.g. Topic 1 is about baseball in English and about travel in German). Separating path from emission helps ensure that topics are consistent across languages.
² Variables τ_h, π_{h,l}, and κ_h are hyperparameters. Their mean is fixed, but their magnitude is sampled during inference (i.e. τ_{h,i} / Σ_k τ_{h,k} is constant, but τ_{h,i} is not). For the bushier bridges (e.g. dictionary and flat), their mean is uniform. For GermaNet, we took frequencies from two balanced corpora of German and English: the British National Corpus (University of Oxford, 2006) and the Kern Corpus of the Digitales Wörterbuch der Deutschen Sprache des 20. Jahrhunderts project (Geyken, 2007). We took these frequencies and propagated them through the multilingual hierarchy, following LDAWN’s (Boyd-Graber et al., 2007) formulation of information content (Resnik, 1995) as a Bayesian prior. The variance of the priors was initialized to be 1.0, but could be sampled during inference.
³ Note that the language and word are taken as given, but the path through the semantic hierarchy is a latent random variable.
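The path-and-emission factorization described above can be illustrated with a small sketch. It follows the prose description (continue and transition along each edge, stop at the final synset, then emit a word in the given language); the `Synset` container, the function name, and the toy tree are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Synset:
    """Hypothetical holder for one synset's parameters within a single topic."""
    p_stop: float                            # probability that a path ends at this synset
    children: Dict[str, float]               # beta: child synset id -> transition probability
    emissions: Dict[str, Dict[str, float]]   # phi: language -> word -> emission probability


def path_word_probability(tree: Dict[str, Synset], path: List[str],
                          language: str, word: str) -> float:
    """Probability of following `path`, stopping at its last synset, and emitting `word`."""
    prob = 1.0
    # Each edge on the path: continue past the parent, then pick the child.
    for parent, child in zip(path, path[1:]):
        node = tree[parent]
        prob *= (1.0 - node.p_stop) * node.children.get(child, 0.0)
    # Final synset: stop (language independent), then emit (language dependent).
    final = tree[path[-1]]
    prob *= final.p_stop * final.emissions.get(language, {}).get(word, 0.0)
    return prob


# Toy two-level hierarchy with English and German emissions (illustrative values only).
tree = {
    "animal": Synset(0.2, {"canine": 0.5, "feline": 0.5},
                     {"en": {"animal": 1.0}, "de": {"Tier": 1.0}}),
    "canine": Synset(1.0, {}, {"en": {"dog": 0.75, "hound": 0.25}, "de": {"Hund": 1.0}}),
    "feline": Synset(1.0, {}, {"en": {"cat": 1.0}, "de": {"Katze": 1.0}}),
}

p_en = path_word_probability(tree, ["animal", "canine"], "en", "dog")
p_de = path_word_probability(tree, ["animal", "canine"], "de", "Hund")
```

In this toy setup the path to "canine" has the same probability in both languages, and only the final emission factor differs, which is the separation of path from emission that the text motivates.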
Having defined topic distributions in a way that can preserve cross-language correspondences, we now use this distribution within a larger model that can discover cross-language patterns of use that predict sentiment.
1.2 The MLSLDA Model
We will view sentiment analysis as a regression problem: given an input document, we want to predict a real-valued observation y that represents the sentiment of a document. Specifically, we build on supervised latent Dirichlet allocation (SLDA, (Blei and McAuliffe, 2007)), which makes predictions based on the topics expressed in a document; this can be thought of as projecting the words in a document to a low-dimensional space of dimension equal to the number of topics. Blei et al. showed that using this latent topic structure can offer improved predictions over regressions based on words alone, and the approach fits well with our current goals, since word-level cues are unlikely to be identical across languages. In addition to text, SLDA has been successfully applied to other domains such as social networks (Chang and Blei, 2009) and image classification (Wang et al., 2009). The key innovation in this paper is to extend SLDA by creating topics that are globally consistent across languages, using the bridging approach above.
We express our model in the form of a probabilistic generative latent-variable model that generates documents in multiple languages and assigns a real-valued score to each document. The score comes from a normal distribution whose mean is the dot product between a regression parameter η, which encodes the influence of each topic on the observation, and the document's average topic assignment z̄, and whose variance is σ². With this model in hand, we use statistical inference to determine the distribution over latent variables that, given the model, best explains observed data. The generative model is as follows:
1. For each topic i = 1 … K, draw a topic distribution {β_i, ω_i, φ_i} from MULTDIRHIER(τ, κ, π).
2. For each document d = 1 … M with language l_d:
(a) Choose a distribution over topics θ_d ∼ Dir(α).
(b) For each word in the document n = 1 … N_d, choose a topic assignment z_{d,n} ∼ Mult(θ_d) and a path λ_{d,n} ending at word w_{d,n} according to Equation 1 using {β_{z_{d,n}}, ω_{z_{d,n}}, φ_{z_{d,n}}}.
3. Choose a response variable y ∼ Norm(η^⊤ z̄, σ²), where z̄_d ≡ (1/N) Σ_{n=1}^{N} z_{d,n}.
Crucially, note that the topics are not independent of the sentiment task; the regression encourages terms with similar effects on the observation y to be in the same topic. The consistency of topics described above allows the same regression to be done for the entire corpus regardless of the language of the underlying document.
2 Inference
Finding the model parameters most likely to explain the data is a problem of statistical inference. We employ stochastic EM (Diebolt and Ip, 1996), using a Gibbs sampler for the E-step to assign words to paths and topics. After randomly initializing the topics, we alternate between sampling the topic and path of a word (z_{d,n}, λ_{d,n}) and finding the regression parameters η that maximize the likelihood. We jointly sample the topic and path conditioning on all of the other path and document assignments in the corpus, selecting a path and topic with probability
\[ p(z_n = k, \lambda_n = r \mid z_{-n}, \lambda_{-n}, w_n, \eta, \sigma, \Theta) = p(y_d \mid z, \eta, \sigma)\; p(\lambda_n = r \mid z_n = k, \lambda_{-n}, w_n, \tau, \kappa, \pi)\; p(z_n = k \mid z_{-n}, \alpha). \tag{2} \]
Each of these three terms reflects a different influence on the topics from the vocabulary structure, the document’s topics, and the response variable. In the next paragraphs, we will expand each of them to derive the full conditional topic distribution.
As discussed in Section 1.1, the structure of the topic distribution encourages terms with the same meaning to be in the same topic, even across languages.
During inference, we marginalize over possible multinomial distributions β, ω, and φ, using the observed transitions from i to j in topic k, T_{k,i,j} (written B_{k,i,j} in Equation 3); stop counts in synset i in topic k, O_{k,i,0}; continue counts in synset i in topic k, O_{k,i,1}; and emission counts in synset i in language l in topic k, F_{k,i,l}. The probability of taking a path r is then
\[ p(\lambda_n = r \mid z_n = k, \lambda_{-n}) = \underbrace{\prod_{(i,j) \in r} \frac{B_{k,i,j} + \tau_{i,j}}{\sum_{j'} B_{k,i,j'} + \tau_{i,j'}} \cdot \frac{O_{k,i,1} + \omega_i}{\sum_{s \in \{0,1\}} O_{k,i,s} + \omega_{i,s}}}_{\text{Transition}} \; \underbrace{\frac{O_{k,r_{end},0} + \omega_{r_{end}}}{\sum_{s \in \{0,1\}} O_{k,r_{end},s} + \omega_{r_{end},s}} \cdot \frac{F_{k,r_{end},w_n} + \pi_{r_{end},l}}{\sum_{w'} F_{r_{end},w'} + \pi_{r_{end},w'}}}_{\text{Emission}} \tag{3} \]
Figure 1: Graphical model representing MLSLDA. Shaded nodes represent observations, plates denote replication, and lines show probabilistic dependencies. (Panel labels: Multilingual Topics, Text Documents, Sentiment Prediction.)
Equation 3 reflects the multilingual aspect of this model. The conditional topic distribution for SLDA (Blei and McAuliffe, 2007) replaces this term with the standard Multinomial-Dirichlet. However, we believe this is the first published SLDA-style model using MCMC inference, as prior work has used variational inference (Blei and McAuliffe, 2007; Chang and Blei, 2009; Wang et al., 2009).
Because the observed response variable depends on the topic assignments of a document, the conditional topic distribution is shifted toward topics that explain the observed response. Topics that move the predicted response ŷ_d toward the true y_d will be favored. We drop terms that are constant across all topics for the effect of the response variable,
\[ p(y_d \mid z, \eta, \sigma) \propto \underbrace{\exp\left[ \frac{1}{\sigma^2} \left( y_d - \frac{\sum_{k'} N_{d,k'}\, \eta_{k'}}{\sum_{k'} N_{d,k'}} \right) \frac{\eta_{z_k}}{\sum_{k'} N_{d,k'}} \right]}_{\text{Other words' influence}} \exp \dots \]

6 0.54582614 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

7 0.53546417 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

8 0.5350219 84 emnlp-2010-NLP on Spoken Documents Without ASR

9 0.53449988 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

10 0.53299308 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

11 0.53104866 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

12 0.53016526 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

13 0.52747893 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

14 0.52534854 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

15 0.52463168 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

16 0.52287662 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

17 0.52245224 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

18 0.52203465 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

19 0.52039188 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

20 0.5164144 63 emnlp-2010-Improving Translation via Targeted Paraphrasing