emnlp emnlp2010 emnlp2010-113 knowledge-graph by maker-knowledge-mining

113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing


Source: pdf

Author: Phil Blunsom ; Trevor Cohn

Abstract: Inducing a grammar directly from text is one of the oldest and most challenging tasks in Computational Linguistics. Significant progress has been made for inducing dependency grammars, however the models employed are overly simplistic, particularly in comparison to supervised parsing models. In this paper we present an approach to dependency grammar induction using tree substitution grammar which is capable of learning large dependency fragments and thereby better modelling the text. We define a hierarchical non-parametric Pitman-Yor Process prior which biases towards a small grammar with simple productions. This approach significantly improves the state-of-the-art, when measured by head attachment accuracy.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Inducing a grammar directly from text is one of the oldest and most challenging tasks in Computational Linguistics. [sent-5, score-0.121]

2 Significant progress has been made for inducing dependency grammars, however the models employed are overly simplistic, particularly in comparison to supervised parsing models. [sent-6, score-0.127]

3 In this paper we present an approach to dependency grammar induction using tree substitution grammar which is capable of learning large dependency fragments and thereby better modelling the text. [sent-7, score-0.802]

4 We define a hierarchical non-parametric Pitman-Yor Process prior which biases towards a small grammar with simple productions. [sent-8, score-0.155]

5 This approach significantly improves the state-of-the-art, when measured by head attachment accuracy. [sent-9, score-0.313]

6 In particular the constituent labels are highly ambiguous: firstly, we don't know a priori how many there are, and secondly, labels that appear high in a tree (e. [sent-13, score-0.096]

7 However, recent work on the induction of dependency grammars has proved ... [sent-16, score-0.248]

8 Dependency grammars (Mel'čuk, 1988) should be easier to induce from text compared to phrase-structure grammars because the set of labels (heads) is directly observed as the words in the sentence. [sent-21, score-0.142]

9 Approaches to unsupervised grammar induction, both for phrase-structure and dependency grammars, have typically used very simplistic models (Clark, 2001; Klein and Manning, 2004), especially in comparison to supervised parsing models (Collins, 2003; Clark and Curran, 2004; McDonald, 2006). [sent-22, score-0.286]

10 Simple models are attractive for grammar induction because they have a limited capacity to overfit; however, they are incapable of modelling many known linguistic phenomena. [sent-23, score-0.264]

11 We posit that more complex grammars could be used to better model the unsupervised task, provided that active measures are taken to prevent overfitting. [sent-24, score-0.109]

12 In this paper we present an approach to dependency grammar induction using a tree-substitution grammar (TSG) with a Bayesian non-parametric prior. [sent-25, score-0.419]

13 This allows the model to learn large dependency fragments to best describe the text, with the prior biasing the model towards fewer and smaller grammar productions. [sent-26, score-0.277]

14 We adopt the split-head construction (Eisner, 2000; Johnson, 2007) to map dependency parses to context free grammar (CFG) derivations, over which we apply a model of TSG induction (Cohn et al. [sent-27, score-0.298]

15 The model uses a hierarchical Pitman-Yor process to encode a backoff path from TSG to CFG rules, and from lexicalised to unlexicalised rules. [sent-29, score-0.453]

16 Our best lexicalised model achieves a head attachment accuracy of 55. [sent-30, score-0.476]

17 Table 1 (header and first row): CFG Rule, DMV Distribution, Description; the S rule carries the distribution p(root = H), meaning the head of the sentence is H. [sent-35, score-0.172]

18 Table 1 rows pairing CFG rules with the DMV distributions p(C|dir = L, head = H), p(C|dir = R, head = H), p(STOP|dir = R, head = H, val = 0/1) and p(CONT|dir = R, head = H, val = 0/1); the description column notes, for example, that H has no left children. [sent-40, score-1.61]

19 Further Table 1 rows, over non-terminals such as HR, L∗H and L1H, pairing CFG rules with p(STOP|dir = L, head = H, val = 0/1) and p(CONT|dir = L, head = H, val = 0/1); the description column notes, for example, that H has no more right children. [sent-42, score-1.009]

20 Table 1: The CFG-DMV grammar schema. [sent-46, score-0.121]

21 Valency (val) can take the value 0 (no attachment in the direction (dir) d) and 1 (one or more attachment). [sent-48, score-0.141]

22 L and R indicate child dependents left or right of the parent; superscripts encode the stopping and valency distributions: X1 indicates that the head will continue to attach more children and X∗ that it has already attached a child. [sent-49, score-0.31]
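To make the schema above concrete, here is a minimal sketch of the generative story that these valence-conditioned STOP/CONT and attachment distributions encode. The probability values, the tag inventory and the depth cap are placeholder assumptions for illustration only; in the paper these distributions are learned parameters of the CFG-DMV, not hand-set constants.

import random

# Hypothetical DMV parameters; in the paper these distributions are learned, not hand-set.
def p_stop(head, direction, has_child):
    """Probability of attaching no (more) dependents; val=0 vs. val=1 in Table 1."""
    return 0.8 if has_child else 0.7

def p_attach(head, direction):
    """Distribution over dependent tags given head and direction (placeholder values)."""
    return {"N": 0.6, "V": 0.2, "D": 0.2}

def sample_side(head, direction, depth):
    children, has_child = [], False
    # The depth cap is only a safety guard for this sketch.
    while depth < 5 and random.random() > p_stop(head, direction, has_child):
        tags, weights = zip(*p_attach(head, direction).items())
        child_tag = random.choices(tags, weights)[0]
        children.append(sample_tree(child_tag, depth + 1))
        has_child = True  # valence switches from 0 to 1 after the first attachment
    return children

def sample_tree(head, depth=0):
    return {"head": head,
            "left": sample_side(head, "L", depth),
            "right": sample_side(head, "R", depth)}

if __name__ == "__main__":
    random.seed(0)
    root = random.choices(["V", "N"], [0.7, 0.3])[0]  # p(root = H)
    print(sample_tree(root))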

23 2 Background The most successful framework for unsupervised dependency induction is the Dependency Model with Valence (DMV) (Klein and Manning, 2004). [sent-50, score-0.215]

24 This model has been adapted and extended by a number of authors and currently represents the state-of-the-art for dependency induction (Cohen and Smith, 2009; Headden III et al. [sent-51, score-0.177]

25 Eisner (2000) introduced the split-head algorithm which permits efficient O(|w|^3) parsing complexity by replicating (splitting) each terminal and processing left and right dependents separately. [sent-53, score-0.114]

26 We employ the related fold-unfold representation of Johnson (2007) that defines a CFG equivalent of the splithead parsing algorithm, allowing us to easily adapt CFG-based grammar models to dependency grammar. [sent-54, score-0.248]
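Sentences 25-26 rely on the split-head idea: every terminal is replicated into a left-facing and a right-facing copy so that left and right dependents can be collected independently. The snippet below illustrates only that preprocessing step, not the full fold-unfold transform of Johnson (2007); the _l/_r naming is an assumption made for the example.

def split_terminals(tags):
    """Replicate each terminal w into w_l and w_r, as in the split-head
    construction (Eisner, 2000). Indices are kept so spans can be mapped back."""
    split = []
    for i, tag in enumerate(tags):
        split.append((i, tag + "_l"))
        split.append((i, tag + "_r"))
    return split

print(split_terminals(["N", "V", "N"]))
# [(0, 'N_l'), (0, 'N_r'), (1, 'V_l'), (1, 'V_r'), (2, 'N_l'), (2, 'N_r')]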

27 Table 1 shows the equivalent CFG grammar for the DMV model (CFG-DMV) using the unfold-fold transformation. [sent-55, score-0.121]

28 The key insight to understanding the non-terminals in this grammar is that the subscripts encode the terminals at the boundaries of the span of that non-terminal. [sent-56, score-0.155]

29 The ∗ and 1 superscripts are used to encode the valency of the head, both indicate that the head has at least one attached dependent in the specified direction. [sent-58, score-0.265]

30 The transform is illustrated in figures 1a and 1c which show the CFG tree for an example sentence and the equivalent dependency tree. [sent-60, score-0.189]

31 We are also able to show that this basic approach to lexicalisation improves the performance of our models. [sent-65, score-0.121]

32 Figure 1a: A TSG-DMV derivation for the sentence George hates broccoli. [sent-66, score-0.252]

33 George and broccoli occur fewer times than the lexicalisation cutoff and are thus represented by the part-of-speech N, while hates is common and is therefore represented by a word/tag pair. [sent-67, score-0.313]

34 Figure 1b: A TSG-DMV elementary rule from Figure 1a. [sent-69, score-0.194]

35 This rule encodes a dependency between the subject and object of hates that is not present in the CFG-DMV. [sent-70, score-0.33]

36 More dependents can be inserted using additional rules below the M/L/R frontier non-terminals. [sent-72, score-0.156]

37 Figure 1c: A traditional dependency tree representation (George hates broccoli, plus ROOT) of the parse tree in Figure 1a before applying the lexicalisation cutoff. [sent-73, score-0.406]

38 A TSG is a 4-tuple, G = (T, N, S, R), where T is a set of terminal symbols, N is a set of non-terminal symbols, S ∈ N is the distinguished root non-terminal and R is a set of productions (rules). [sent-76, score-0.116]

39 The productions take the form of elementary trees: tree fragments of height ≥ 1, where each internal node is labelled with a non-terminal and each leaf is labelled with either a terminal or a non-terminal. [sent-77, score-0.426]

40 Nonterminal leaves are called frontier non-terminals and form the substitution sites in the generative process of creating trees with the grammar. [sent-78, score-0.203]

41 A derivation creates a tree by starting with the root symbol and rewriting (substituting) it with an elementary tree, then continuing to rewrite frontier non-terminals with elementary trees until there are no remaining frontier non-terminals. [sent-79, score-0.714]

42 We can represent derivations as sequences of elementary trees, e, by specifying that during the generation of the tree each elementary tree is substituted for the left-most frontier non-terminal. [sent-80, score-0.63]

43 Figure 1a shows a TSG derivation for the dependency tree in Figure 1c where bold nonterminal labels denote substitution sites (root/frontier nodes in the elementary trees). [sent-81, score-0.47]

44 The probability of a tree, t, and a string of words, w, are P(t) = Σ_{e : tree(e) = t} P(e) and P(w) = Σ_{t : yield(t) = w} P(t), respectively, where tree(e) returns the tree for the derivation e and yield(t) returns the string of terminal symbols at the leaves of t. [sent-84, score-0.236]
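Sentences 38-44 define TSG derivations operationally: start from the root symbol, repeatedly rewrite the left-most frontier non-terminal with an elementary tree, and score a derivation as the product of its elementary tree probabilities, with tree(e) and yield(t) mapping derivations to trees and trees to strings. The sketch below mirrors that definition; the Node and ElementaryTree containers, the non-terminal inventory and the probabilities are illustrative assumptions, not the paper's implementation (where elementary tree probabilities come from the PYP model).

from dataclasses import dataclass, field
from typing import List

NONTERMINALS = {"S", "LH", "HR"}  # hypothetical non-terminal inventory

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)  # no children: a leaf

@dataclass
class ElementaryTree:
    root: Node    # height >= 1: the root always has children
    prob: float   # P(e | root label); illustrative, not learned here

def leftmost_frontier(node):
    """Depth-first, left-to-right search for the first frontier non-terminal leaf."""
    if not node.children:
        return node if node.label in NONTERMINALS else None
    for child in node.children:
        hit = leftmost_frontier(child)
        if hit is not None:
            return hit
    return None

def derive(elementary_trees):
    """Build tree(e) by substituting each elementary tree at the left-most frontier
    non-terminal, accumulating P(e) as the product of elementary tree probabilities."""
    first, *rest = elementary_trees
    tree, prob = first.root, first.prob
    for e in rest:
        site = leftmost_frontier(tree)
        assert site is not None and site.label == e.root.label, "no matching substitution site"
        site.children = e.root.children  # graft in place (mutates the growing tree)
        prob *= e.prob
    return tree, prob

def yield_of(node):
    """Return the terminal string at the leaves of a completed tree, i.e. yield(t)."""
    if not node.children:
        return [node.label]
    return [w for child in node.children for w in yield_of(child)]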

45 Parsing involves finding the most probable tree for a given string (arg max_t P(t|w)). [sent-88, score-0.096]

46 We define a hierarchical non-parametric TSG model on the space of parse trees licensed by the CFG grammar in Table 1. [sent-93, score-0.193]

47 Teh (2006) used a hierarchical PYP to model backoff in language models; we leverage this same capability to model backoff in TSG rules. [sent-100, score-0.3]

48 This effectively allows smoothing from lexicalised to unlexicalised grammars, and from TSG to CFG rules. [sent-101, score-0.252]

49 The topmost level of our model describes lexicalised elementary fragments (e) as produced by a PYP: e | c ∼ Gc and Gc | ac, bc, Plcfg ∼ PYP(ac, bc, Plcfg(·|c)), where ac and bc control the strength of the backoff distribution Plcfg. [sent-104, score-0.816]

50 The space of lexicalised TSG rules will inevitably be very sparse, so the base distribution Plcfg backs off to calculating the probability of a TSG rule as the product of the CFG rules it contains, multiplied by a geometric distribution over the size of the rule. [sent-105, score-0.352]

51 Each internal CFG rule of e, of the form c0 → α, is drawn from the backoff distribution Ac0. [sent-108, score-0.178]
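Sentences 46-51 describe a hierarchical Pitman-Yor prior in which a lexicalised TSG level backs off through progressively simpler distributions (unlexicalised TSG, products of CFG rules, and so on). One common way to realise such a prior is the Chinese restaurant representation sketched below; the discount and concentration values, the rule strings and the flat bottom-level base are assumptions for the example, and the sketch omits the seating-arrangement resampling, the delexicalising backoff mapping and the hyperparameter inference used in the paper.

import random
from collections import defaultdict

class PYP:
    """Pitman-Yor restaurant with discount a, concentration b and a backoff base.
    Chaining restaurants through `parent` gives a hierarchical backoff path."""
    def __init__(self, a, b, base, parent=None):
        self.a, self.b, self.base, self.parent = a, b, base, parent
        self.tables = defaultdict(list)  # dish -> customer count per table
        self.customers = 0
        self.total_tables = 0

    def prob(self, dish):
        if self.customers == 0:
            return self.base(dish)
        c, t = sum(self.tables[dish]), len(self.tables[dish])
        new = (self.b + self.a * self.total_tables) * self.base(dish)
        return (c - self.a * t + new) / (self.b + self.customers)

    def add(self, dish):
        # Seat a customer: join an existing table or open a new one.
        weights = [c - self.a for c in self.tables[dish]]
        weights.append((self.b + self.a * self.total_tables) * self.base(dish))
        i = random.choices(range(len(weights)), weights)[0]
        if i == len(self.tables[dish]):
            self.tables[dish].append(1)
            self.total_tables += 1
            if self.parent is not None:
                # A new table sends the dish to the backoff level; the real model would
                # transform the key (e.g. delexicalise it) before backing off.
                self.parent.add(dish)
        else:
            self.tables[dish][i] += 1
        self.customers += 1

# Hypothetical two-level backoff: lexicalised rules back off to an unlexicalised level,
# which backs off to a flat placeholder standing in for the CFG-product base.
flat_base = lambda dish: 1e-3
unlex_level = PYP(a=0.5, b=1.0, base=flat_base)
lex_level = PYP(a=0.8, b=1.0, base=unlex_level.prob, parent=unlex_level)
lex_level.add("L_hates -> L1_hates")  # hypothetical elementary-rule key
print(lex_level.prob("L_hates -> L1_hates"))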

52 This model showed that smoothing the DMV by removing the heads from the CFG rules significantly improved performance. [sent-114, score-0.11]

53 Each stage of backoff is illustrated in Table 2, showing the rules generated from the TSG elementary tree in Figure 1b. [sent-116, score-0.441]

54 In this application to dependency grammar our model is capable of learning tree fragments which group CFG parameters. [sent-121, score-0.373]

55 As such the model can learn to condition dependency links on the valence, e. [sent-122, score-0.093]

56 by combining LH → L1H and L1H → LC CMH∗ rules into a single fragment the model can learn a parameter that the leftmost child of H is C. [sent-124, score-0.108]

57 tree fragments representing the complete preferred argument frame of a verb. [sent-127, score-0.159]

58 (..., 2009) permit an efficient local sampler, but the lack of an observed parse tree in our unsupervised model makes this sampler inapplicable. [sent-131, score-0.195]

59 Instead we use a recently proposed blocked Metropolis-Hastings (MH) sampler (Cohn and Blunsom, 2010) which exploits a factorisation of the derivation probabilities such that whole trees can be sampled efficiently. [sent-132, score-0.159]

60 Klein and Manning (2004) emphasised the importance of the initialiser for achieving good performance with their model. [sent-136, score-0.104]

61 We employ the same harmonic initialiser as described in that work. [sent-137, score-0.104]

62 The initial derivations for our sampler are the Viterbi derivations under the CFG parameterised according to this initialiser. [sent-138, score-0.155]
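Sentences 60-62 mention the harmonic initialiser of Klein and Manning (2004), which favours short dependencies at initialisation. The function below is only a rough sketch of that idea under the assumption that attachment weight decays as 1/(distance + constant); the real initialiser also sets stop probabilities and feeds a full CFG parameterisation, which is omitted here.

def harmonic_attachment_init(sentence_length, const=1.0):
    """Initial attachment preferences: a head at position h favours a dependent at
    position d with weight proportional to 1 / (|h - d| + const), normalised per head."""
    scores = {}
    for h in range(sentence_length):
        weights = {d: 1.0 / (abs(h - d) + const)
                   for d in range(sentence_length) if d != h}
        total = sum(weights.values())
        for d, w in weights.items():
            scores[(h, d)] = w / total
    return scores

print(harmonic_attachment_init(4))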

63 We place prior distributions on the PYP discount ac and concentration bc hyperparameters and sample their values using a slice sampler. [sent-142, score-0.212]

64 Similarly, we treat the concentration parameters, bc, as being generated by a vague gamma prior, bc ∼ Gamma(1, 1), and sample a new value b′c using the same slice-sampling approach as for ac: P(bc|z) ∝ P(z|bc) × Gamma(bc|1, 1). [sent-145, score-0.174]
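Sentences 63-64 describe slice sampling the PYP concentration parameters under a vague Gamma(1, 1) prior. Below is a generic univariate slice sampler in the style of Neal (2003) for a positive parameter; the toy log posterior stands in for log P(z|bc) + log Gamma(bc|1, 1) and is an assumption for the example, not the model's actual likelihood.

import random

def slice_sample(log_post, x, width=1.0, n_steps=10, lower=1e-10):
    """Univariate slice sampling for a positive parameter such as a PYP concentration."""
    for _ in range(n_steps):
        log_y = log_post(x) - random.expovariate(1.0)   # log of a uniform height under the curve
        left = max(lower, x - width * random.random())  # bracket containing x
        right = left + width
        while left > lower and log_post(left) > log_y:  # step out
            left = max(lower, left - width)
        while log_post(right) > log_y:
            right += width
        while True:                                     # sample and shrink
            cand = random.uniform(left, right)
            if log_post(cand) > log_y:
                x = cand
                break
            if cand < x:
                left = cand
            else:
                right = cand
    return x

# Toy posterior: a Gaussian-shaped stand-in for log P(z|b) plus the Gamma(1,1) log prior (-b).
log_post = lambda b: -0.5 * (b - 2.0) ** 2 - b if b > 0 else float("-inf")
random.seed(0)
print(slice_sample(log_post, x=1.0))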

65 We made use of the slice sampler included in Mark Johnson's Adaptor Grammar implementation, http://www. ... [sent-146, score-0.114]

66 Parsing. Unfortunately, finding the maximising parse tree for a string under our TSG-DMV model is intractable due to the inter-rule dependencies created by the PYP formulation. [sent-156, score-0.096]

67 Like previous work we pre-process the training and test data to remove punctuation, training our unlexicalised models on the gold-standard part-of-speech tags, and including words occurring more than 100 times in our lexicalised models (Headden III et al. [sent-165, score-0.252]

68 The models are evaluated in terms of head attachment accuracy (the percentage of correctly predicted head indexes for each token in the test data), on two subsets of the testing data. [sent-173, score-0.485]
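Sentence 68 defines the evaluation metric. A minimal implementation, under the assumption that gold and predicted analyses are given as per-sentence lists of head indexes (with, say, 0 marking the root), could be:

def head_attachment_accuracy(gold_heads, predicted_heads):
    """Percentage of tokens whose predicted head index matches the gold head index."""
    correct = total = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        assert len(gold) == len(pred), "sentences must align token-for-token"
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return 100.0 * correct / total

# "George hates broccoli": hates (token 2) is the root, George and broccoli attach to it.
print(head_attachment_accuracy([[2, 0, 2]], [[2, 0, 3]]))  # 66.66...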

69 Subsequent to the evaluation reported in Table 4 we use section 22 to report the correlation between heldout accuracy and the model log-likelihood (LLH) for analytic purposes. [sent-179, score-0.093]

70 Discussion. Table 4 shows the head attachment accuracy results for our TSG-DMV, plus many other significant previously proposed models. [sent-183, score-0.313]

71 We can identify a number of differences that may impact these results: the Adaptor Grammar model is trained using variational inference with the space of tree fragments ... [sent-194, score-0.159]

72 ...ing data versus the head attachment accuracy on the Hypertext Markup (Spitkovsky et al. [sent-224, score-0.313]

73 This graph indicates that ... [Table 4 rows: TSG-DMV (Pcfg, Psh); LexTSG-DMV (Plcfg, Pcfg, Psh)]. [sent-235, score-0.178]

74 ...order to use LLH as a model selection criterion similar ... Table 4: Mean and variance for the head attachment accuracy of our TSG-DMV models (highlighted) with varying backoff paths, and many other high-performing models. [sent-241, score-0.446]

75 Our models labelled TSG used an unlexicalised top level Gc PYP, while those labelled LexTSG used the full lexicalised Gc. [sent-243, score-0.326]

76 (2009) shows that the random initialiser is crucial for good performance; however, this initialiser requires training 1000 models to select a single best model for evaluation and results in considerable variance in test set performance. [sent-246, score-0.208]

77 Note also that our model exhibits considerably less variance than those induced using this random initialiser, suggesting that the combination of the harmonic initialiser and blocked-MH sampling may be a more practicable training regime. [sent-247, score-0.139]

78 For further analysis Table 5 shows the accuracy of the model at predicting the head for frequent types, while Table 6 shows the performance on dependencies of various lengths. [sent-256, score-0.172]

79 We emphasise that these results are for the single best performing sampler run on the heldout corpus and there is considerable variation in the analyses produced by each sampler. [sent-257, score-0.154]

80 Conjunctions such as “and” pose a particular difficulty when evaluating dependency models, as the correct modelling of these remains a ... [sent-260, score-0.152]

81 ... between the training LLH of the PYP model and heldout directed head attachment accuracy (WSJ Section 22, |w| ≤ 10) for LexTSG-DMV (Plcfg, Pcfg, Psh). [sent-263, score-0.406]

82 (b) Mean heldout directed head attachment accuracy (WSJ Section 22, |w| ≤ 10) versus the number of samples used during training for LexTSG-DMV (Plcfg, Pcfg, Psh). [sent-265, score-0.45]

83 Table 7 lists the most frequent TSG rules lexicalised with has. [sent-268, score-0.226]

84 The most frequent rule is simply the single level equivalent of the DMV terminal rule for has. [sent-269, score-0.17]

85 Almost as frequent is rule 3; here the grammar incorporates the terminal into a larger elementary fragment, encoding that it is the head of the past participle occurring immediately to its right. [sent-270, score-0.567]

86 This shows the model’s ability to learn the verb’s argument position conditioned on both the head and child type, something lacking in DMV. [sent-271, score-0.217]

87 Rule 7 further refines this preferred analysis for has been by lexicalising both the head and child. [sent-272, score-0.172]

88 Conclusion. In this paper we have made two significant contributions to probabilistic modelling and grammar induction. [sent-276, score-0.18]

89 We have shown that it is possible to successfully learn hierarchical Pitman-Yor models that encode deep and complex backoff paths over highly structured latent spaces. [sent-277, score-0.201]

90 By applying these models to the induction of dependency grammars we have also been able to advance the state-of-the-art, increasing the head attachment accuracy on section 23 of the Wall Street Journal Corpus by more than 5%. [sent-278, score-0.561]

91 In particular more extensive experimentation with alternate priors and larger training data may allow the removal of the lexicalisation cutoff which is currently in place to counter sparsity. [sent-280, score-0.121]

92 Unsupervised induction of stochastic context-free grammars using distributional clustering. [sent-296, score-0.155]

93 Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. [sent-303, score-0.159]

94 Bilexical grammars and their cubic-time parsing algorithms. [sent-340, score-0.105]

95 Improving unsupervised dependency parsing with richer contexts and smoothing. [sent-358, score-0.165]

96 Table 5: Per tag type predicted count and accuracy, for the most frequent 15 un/lexicalised tokens on the WSJ Section 22 |w| ≤ 10 heldout set (LexTSG-DMV (Plcfg, Pcfg, Psh)). [sent-389, score-0.093]

97 Transforming projective bilexical dependency grammars into efficiently-parsable CFGs with unfold-fold. [sent-392, score-0.164]

98 Corpus-based induction of syntactic structure: models of dependency and constituency. [sent-404, score-0.177]

99 Table 6: Link distance precision, recall and f-score, on the WSJ Section 22 |w| ≤ 10 heldout set. [sent-435, score-0.29]

100 From Baby Steps to Leapfrog: How “Less is More” in unsupervised dependency parsing. [sent-460, score-0.131]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bz', 0.362), ('tsg', 0.345), ('pyp', 0.224), ('hates', 0.192), ('cfg', 0.178), ('head', 0.172), ('lexicalised', 0.163), ('dir', 0.15), ('elementary', 0.149), ('dmv', 0.148), ('attachment', 0.141), ('backoff', 0.133), ('grammar', 0.121), ('lexicalisation', 0.121), ('plcfg', 0.121), ('spitkovsky', 0.111), ('cont', 0.107), ('val', 0.107), ('bzr', 0.104), ('initialiser', 0.104), ('hr', 0.098), ('tree', 0.096), ('cohn', 0.094), ('heldout', 0.093), ('frontier', 0.093), ('dependency', 0.093), ('bc', 0.093), ('unlexicalised', 0.089), ('headden', 0.087), ('nnmhas', 0.086), ('induction', 0.084), ('terminal', 0.08), ('pcfg', 0.075), ('substitution', 0.072), ('grammars', 0.071), ('lhates', 0.069), ('llh', 0.069), ('psh', 0.069), ('bn', 0.069), ('cohen', 0.068), ('lh', 0.067), ('ac', 0.066), ('nr', 0.065), ('rules', 0.063), ('fragments', 0.063), ('adaptor', 0.061), ('mn', 0.061), ('sampler', 0.061), ('derivation', 0.06), ('valency', 0.059), ('modelling', 0.059), ('ln', 0.054), ('slice', 0.053), ('iii', 0.053), ('lv', 0.052), ('nnpmhas', 0.052), ('unlex', 0.052), ('beta', 0.049), ('heads', 0.047), ('derivations', 0.047), ('phil', 0.046), ('blunsom', 0.046), ('child', 0.045), ('rule', 0.045), ('gamma', 0.044), ('cmh', 0.044), ('mv', 0.044), ('samples', 0.044), ('wsj', 0.043), ('geman', 0.04), ('gc', 0.04), ('trees', 0.038), ('klein', 0.038), ('unsupervised', 0.038), ('trevor', 0.038), ('lc', 0.037), ('vague', 0.037), ('labelled', 0.037), ('hiyan', 0.037), ('valence', 0.037), ('sc', 0.036), ('root', 0.036), ('sampling', 0.035), ('alccfg', 0.035), ('asch', 0.035), ('blccfg', 0.035), ('bsch', 0.035), ('generalisation', 0.035), ('mbeen', 0.035), ('mmp', 0.035), ('mpd', 0.035), ('nmhates', 0.035), ('nnr', 0.035), ('ptsg', 0.035), ('encode', 0.034), ('parsing', 0.034), ('hierarchical', 0.034), ('annual', 0.034), ('association', 0.033), ('shay', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999923 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

Author: Phil Blunsom ; Trevor Cohn

Abstract: Inducing a grammar directly from text is one of the oldest and most challenging tasks in Computational Linguistics. Significant progress has been made for inducing dependency grammars, however the models employed are overly simplistic, particularly in comparison to supervised parsing models. In this paper we present an approach to dependency grammar induction using tree substitution grammar which is capable of learning large dependency fragments and thereby better modelling the text. We define a hierarchical non-parametric Pitman-Yor Process prior which biases towards a small grammar with simple productions. This approach significantly improves the state-of-the-art, when measured by head attachment accuracy.

2 0.19087192 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

Author: Samuel Brody

Abstract: We reveal a previously unnoticed connection between dependency parsing and statistical machine translation (SMT), by formulating the dependency parsing task as a problem of word alignment. Furthermore, we show that two well known models for these respective tasks (DMV and the IBM models) share common modeling assumptions. This motivates us to develop an alignment-based framework for unsupervised dependency parsing. The framework (which will be made publicly available) is flexible, modular and easy to extend. Using this framework, we implement several algorithms based on the IBM alignment models, which prove surprisingly effective on the dependency parsing task, and demonstrate the potential of the alignment-based approach.

3 0.16240303 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

Author: Tahira Naseem ; Harr Chen ; Regina Barzilay ; Mark Johnson

Abstract: We present an approach to grammar induction that utilizes syntactic universals to improve dependency parsing across a range of languages. Our method uses a single set of manually-specified language-independent rules that identify syntactic dependencies between pairs of syntactic categories that commonly occur across languages. During inference of the probabilistic model, we use posterior expectation constraints to require that a minimum proportion of the dependencies we infer be instances of these rules. We also automatically refine the syntactic categories given in our coarsely tagged input. Across six languages our approach outperforms state-of-theart unsupervised methods by a significant margin.1

4 0.10913027 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

Author: Adria de Gispert ; Juan Pino ; William Byrne

Abstract: We report on investigations into hierarchical phrase-based translation grammars based on rules extracted from posterior distributions over alignments of the parallel text. Rather than restrict rule extraction to a single alignment, such as Viterbi, we instead extract rules based on posterior distributions provided by the HMM word-to-word alignmentmodel. We define translation grammars progressively by adding classes of rules to a basic phrase-based system. We assess these grammars in terms of their expressive power, measured by their ability to align the parallel text from which their rules are extracted, and the quality of the translations they yield. In Chinese-to-English translation, we find that rule extraction from posteriors gives translation improvements. We also find that grammars with rules with only one nonterminal, when extracted from posteri- ors, can outperform more complex grammars extracted from Viterbi alignments. Finally, we show that the best way to exploit source-totarget and target-to-source alignment models is to build two separate systems and combine their output translation lattices.

5 0.10542607 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

Author: Hui Zhang ; Min Zhang ; Haizhou Li ; Eng Siong Chng

Abstract: This paper studies two issues, non-isomorphic structure translation and target syntactic structure usage, for statistical machine translation in the context of forest-based tree to tree sequence translation. For the first issue, we propose a novel non-isomorphic translation framework to capture more non-isomorphic structure mappings than traditional tree-based and tree-sequence-based translation methods. For the second issue, we propose a parallel space searching method to generate hypothesis using tree-to-string model and evaluate its syntactic goodness using tree-to-tree/tree sequence model. This not only reduces the search complexity by merging spurious-ambiguity translation paths and solves the data sparseness issue in training, but also serves as a syntax-based target language model for better grammatical generation. Experiment results on the benchmark data show our proposed two solutions are very effective, achieving significant performance improvement over baselines when applying to different translation models.

6 0.098485813 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars

7 0.09290877 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

8 0.090011358 114 emnlp-2010-Unsupervised Parse Selection for HPSG

9 0.087964565 94 emnlp-2010-SCFG Decoding Without Binarization

10 0.083279483 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

11 0.081570044 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

12 0.081386641 88 emnlp-2010-On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

13 0.075156875 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

14 0.0653954 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

15 0.064710371 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

16 0.064364769 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

17 0.063065358 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

18 0.060368109 38 emnlp-2010-Dual Decomposition for Parsing with Non-Projective Head Automata

19 0.059326239 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

20 0.05724065 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.205), (1, 0.056), (2, 0.255), (3, -0.069), (4, 0.171), (5, -0.022), (6, -0.064), (7, 0.054), (8, 0.071), (9, -0.069), (10, 0.056), (11, -0.012), (12, 0.065), (13, 0.057), (14, -0.003), (15, -0.063), (16, 0.005), (17, -0.031), (18, -0.003), (19, -0.007), (20, 0.098), (21, -0.234), (22, -0.194), (23, 0.018), (24, -0.091), (25, 0.166), (26, 0.021), (27, 0.061), (28, -0.195), (29, -0.041), (30, 0.041), (31, 0.055), (32, -0.118), (33, 0.083), (34, 0.083), (35, 0.065), (36, 0.076), (37, -0.019), (38, 0.016), (39, -0.088), (40, -0.077), (41, -0.007), (42, -0.023), (43, 0.004), (44, -0.02), (45, -0.057), (46, -0.008), (47, -0.023), (48, -0.04), (49, -0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94846541 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

Author: Phil Blunsom ; Trevor Cohn

Abstract: Inducing a grammar directly from text is one of the oldest and most challenging tasks in Computational Linguistics. Significant progress has been made for inducing dependency grammars, however the models employed are overly simplistic, particularly in comparison to supervised parsing models. In this paper we present an approach to dependency grammar induction using tree substitution grammar which is capable of learning large dependency fragments and thereby better modelling the text. We define a hierarchical non-parametric Pitman-Yor Process prior which biases towards a small grammar with simple productions. This approach significantly improves the state-of-the-art, when measured by head attachment accuracy.

2 0.72611845 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

Author: Tahira Naseem ; Harr Chen ; Regina Barzilay ; Mark Johnson

Abstract: We present an approach to grammar induction that utilizes syntactic universals to improve dependency parsing across a range of languages. Our method uses a single set of manually-specified language-independent rules that identify syntactic dependencies between pairs of syntactic categories that commonly occur across languages. During inference of the probabilistic model, we use posterior expectation constraints to require that a minimum proportion of the dependencies we infer be instances of these rules. We also automatically refine the syntactic categories given in our coarsely tagged input. Across six languages our approach outperforms state-of-theart unsupervised methods by a significant margin.1

3 0.54665023 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

Author: Samuel Brody

Abstract: We reveal a previously unnoticed connection between dependency parsing and statistical machine translation (SMT), by formulating the dependency parsing task as a problem of word alignment. Furthermore, we show that two well known models for these respective tasks (DMV and the IBM models) share common modeling assumptions. This motivates us to develop an alignment-based framework for unsupervised dependency parsing. The framework (which will be made publicly available) is flexible, modular and easy to extend. Using this framework, we implement several algorithms based on the IBM alignment models, which prove surprisingly effective on the dependency parsing task, and demonstrate the potential of the alignment-based approach.

4 0.52922577 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

Author: Eric Hardisty ; Jordan Boyd-Graber ; Philip Resnik

Abstract: Strong indications of perspective can often come from collocations of arbitrary length; for example, someone writing get the government out of my X is typically expressing a conservative rather than progressive viewpoint. However, going beyond unigram or bigram features in perspective classification gives rise to problems of data sparsity. We address this problem using nonparametric Bayesian modeling, specifically adaptor grammars (Johnson et al., 2006). We demonstrate that an adaptive na¨ ıve Bayes model captures multiword lexical usages associated with perspective, and establishes a new state-of-the-art for perspective classification results using the Bitter Lemons corpus, a collection of essays about mid-east issues from Israeli and Palestinian points of view.

5 0.42129913 94 emnlp-2010-SCFG Decoding Without Binarization

Author: Mark Hopkins ; Greg Langmead

Abstract: Conventional wisdom dictates that synchronous context-free grammars (SCFGs) must be converted to Chomsky Normal Form (CNF) to ensure cubic time decoding. For arbitrary SCFGs, this is typically accomplished via the synchronous binarization technique of (Zhang et al., 2006). A drawback to this approach is that it inflates the constant factors associated with decoding, and thus the practical running time. (DeNero et al., 2009) tackle this problem by defining a superset of CNF called Lexical Normal Form (LNF), which also supports cubic time decoding under certain implicit assumptions. In this paper, we make these assumptions explicit, and in doing so, show that LNF can be further expanded to a broader class of grammars (called “scope3”) that also supports cubic-time decoding. By simply pruning non-scope-3 rules from a GHKM-extracted grammar, we obtain better translation performance than synchronous binarization.

6 0.37373218 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

7 0.37076619 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars

8 0.36509296 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

9 0.34633583 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

10 0.34471986 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

11 0.33210143 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

12 0.32860342 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

13 0.32078868 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

14 0.31506065 114 emnlp-2010-Unsupervised Parse Selection for HPSG

15 0.30307105 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

16 0.29753065 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

17 0.28429097 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

18 0.28419459 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

19 0.27547744 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs

20 0.26207146 88 emnlp-2010-On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.014), (12, 0.029), (29, 0.107), (30, 0.019), (32, 0.017), (52, 0.028), (56, 0.036), (62, 0.037), (66, 0.093), (72, 0.034), (76, 0.035), (77, 0.019), (79, 0.014), (83, 0.408), (87, 0.026), (89, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76405072 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

Author: Phil Blunsom ; Trevor Cohn

Abstract: Inducing a grammar directly from text is one of the oldest and most challenging tasks in Computational Linguistics. Significant progress has been made for inducing dependency grammars, however the models employed are overly simplistic, particularly in comparison to supervised parsing models. In this paper we present an approach to dependency grammar induction using tree substitution grammar which is capable of learning large dependency fragments and thereby better modelling the text. We define a hierarchical non-parametric Pitman-Yor Process prior which biases towards a small grammar with simple productions. This approach significantly improves the state-of-the-art, when measured by head attachment accuracy.

2 0.6650914 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

Author: Hideki Isozaki ; Tsutomu Hirao ; Kevin Duh ; Katsuhito Sudoh ; Hajime Tsukada

Abstract: Automatic evaluation of Machine Translation (MT) quality is essential to developing highquality MT systems. Various evaluation metrics have been proposed, and BLEU is now used as the de facto standard metric. However, when we consider translation between distant language pairs such as Japanese and English, most popular metrics (e.g., BLEU, NIST, PER, and TER) do not work well. It is well known that Japanese and English have completely different word orders, and special care must be paid to word order in translation. Otherwise, translations with wrong word order often lead to misunderstanding and incomprehensibility. For instance, SMT-based Japanese-to-English translators tend to translate ‘A because B’ as ‘B because A.’ Thus, word order is the most important problem for distant language translation. However, conventional evaluation metrics do not significantly penalize such word order mistakes. Therefore, locally optimizing these metrics leads to inadequate translations. In this paper, we propose an automatic evaluation metric based on rank correlation coefficients modified with precision. Our meta-evaluation of the NTCIR-7 PATMT JE task data shows that this metric outperforms conventional metrics.

3 0.52711844 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment. Sentiment analysis (Pang and Lee, 2008) offers the promise of automatically discerning how people feel about a product, person, organization, or issue based on what they write online, which is potentially of great value to businesses and other organizations. However, the vast majority of sentiment resources and algorithms are limited to a single language, usually English (Wilson, 2008; Baccianella and Sebastiani, 2010). Since no single language captures a majority of the content online, adopting such a limited approach in an increasingly global community risks missing important details and trends that might only be available when text in multiple languages is taken into account. 45 Philip Resnik Department of Linguistics and UMIACS University of Maryland College Park, MD re snik@umd .edu Up to this point, multiple languages have been addressed in sentiment analysis primarily by transferring knowledge from a resource-rich language to a less rich language (Banea et al., 2008), or by ignoring differences in languages via translation into English (Denecke, 2008). These approaches are limited to a view of sentiment that takes place through an English-centric lens, and they ignore the potential to share information between languages. Ideally, learning sentiment cues holistically, across languages, would result in a richer and more globally consistent picture. In this paper, we introduce Multilingual Supervised Latent Dirichlet Allocation (MLSLDA), a model for sentiment analysis on a multilingual corpus. MLSLDA discovers a consistent, unified picture of sentiment across multiple languages by learning “topics,” probabilistic partitions of the vocabulary that are consistent in terms of both meaning and relevance to observed sentiment. Our approach makes few assumptions about available resources, requiring neither parallel corpora nor machine translation. The rest of the paper proceeds as follows. In Section 1, we describe the probabilistic tools that we use to create consistent topics bridging across languages and the MLSLDA model. In Section 2, we present the inference process. We discuss our set of semantic bridges between languages in Section 3, and our experiments in Section 4 demonstrate that this approach functions as an effective multilingual topic model, discovers sentiment-biased topics, and uses multilingual corpora to make better sentiment predictions across languages. Sections 5 and 6 discuss related research and discusses future work, respectively. 
ProcMe IdTi,n Mgsas ofsa tchehu 2se0t1t0s, C UoSnAfe,r 9e-n1ce1 o Onc Etombepri 2ic0a1l0 M. ?ec th2o0d1s0 i Ans Nsaotcuiartaioln La fonrg Cuaogmep Purtoatcieosnsainlg L,in pgagueis ti 4c5s–5 , 1 Predictions from Multilingual Topics As its name suggests, MLSLDA is an extension of Latent Dirichlet allocation (LDA) (Blei et al., 2003), a modeling approach that takes a corpus of unannotated documents as input and produces two outputs, a set of “topics” and assignments of documents to topics. Both the topics and the assignments are probabilistic: a topic is represented as a probability distribution over words in the corpus, and each document is assigned a probability distribution over all the topics. Topic models built on the foundations of LDA are appealing for sentiment analysis because the learned topics can cluster together sentimentbearing words, and because topic distributions are a parsimonious way to represent a document.1 LDA has been used to discover latent structure in text (e.g. for discourse segmentation (Purver et al., 2006) and authorship (Rosen-Zvi et al., 2004)). MLSLDA extends the approach by ensuring that this latent structure the underlying topics is consistent across languages. We discuss multilingual topic modeling in Section 1. 1, and in Section 1.2 we show how this enables supervised regression regardless of a document’s language. — — 1.1 Capturing Semantic Correlations Topic models posit a straightforward generative process that creates an observed corpus. For each docu- ment d, some distribution θd over unobserved topics is chosen. Then, for each word position in the document, a topic z is selected. Finally, the word for that position is generated by selecting from the topic indexed by z. (Recall that in LDA, a “topic” is a distribution over words). In monolingual topic models, the topic distribution is usually drawn from a Dirichlet distribution. Using Dirichlet distributions makes it easy to specify sparse priors, and it also simplifies posterior inference because Dirichlet distributions are conjugate to multinomial distributions. However, drawing topics from Dirichlet distributions will not suffice if our vocabulary includes multiple languages. If we are working with English, German, and Chinese at the same time, a Dirichlet prior has no way to favor distributions z such that p(good|z), p(gut|z), and 1The latter property has also made LDA popular for information retrieval (Wei and Croft, 2006)). 46 p(h aˇo|z) all tend to be high at the same time, or low at hth ˇaeo same lti tmened. tMoo bree generally, et sheam structure oorf our model must encourage topics to be consistent across languages, and Dirichlet distributions cannot encode correlations between elements. One possible solution to this problem is to use the multivariate normal distribution, which can produce correlated multinomials (Blei and Lafferty, 2005), in place of the Dirichlet distribution. This has been done successfully in multilingual settings (Cohen and Smith, 2009). However, such models complicate inference by not being conjugate. Instead, we appeal to tree-based extensions of the Dirichlet distribution, which has been used to induce correlation in semantic ontologies (Boyd-Graber et al., 2007) and to encode clustering constraints (Andrzejewski et al., 2009). The key idea in this approach is to assume the vocabularies of all languages are organized according to some shared semantic structure that can be represented as a tree. 
For concreteness in this section, we will use WordNet (Miller, 1990) as the representation of this multilingual semantic bridge, since it is well known, offers convenient and intuitive terminology, and demonstrates the full flexibility of our approach. However, the model we describe generalizes to any tree-structured rep- resentation of multilingual knowledge; we discuss some alternatives in Section 3. WordNet organizes a vocabulary into a rooted, directed acyclic graph of nodes called synsets, short for “synonym sets.” A synset is a child of another synset if it satisfies a hyponomy relationship; each child “is a” more specific instantiation of its parent concept (thus, hyponomy is often called an “isa” relationship). For example, a “dog” is a “canine” is an “animal” is a “living thing,” etc. As an approximation, it is not unreasonable to assume that WordNet’s structure of meaning is language independent, i.e. the concept encoded by a synset can be realized using terms in different languages that share the same meaning. In practice, this organization has been used to create many alignments of international WordNets to the original English WordNet (Ordan and Wintner, 2007; Sagot and Fiˇ ser, 2008; Isahara et al., 2008). Using the structure of WordNet, we can now describe a generative process that produces a distribution over a multilingual vocabulary, which encourages correlations between words with similar meanings regardless of what language each word is in. For each synset h, we create a multilingual word distribution for that synset as follows: 1. Draw transition probabilities βh ∼ Dir (τh) 2. Draw stop probabilities ωh ∼ Dir∼ (κ Dhi)r 3. For each language l, draw emission probabilities for that synset φh,l ∼ Dir (πh,l) . For conciseness in the rest of the paper, we will refer to this generative process as multilingual Dirichlet hierarchy, or MULTDIRHIER(τ, κ, π) .2 Each observed token can be viewed as the end result of a sequence of visited synsets λ. At each node in the tree, the path can end at node iwith probability ωi,1, or it can continue to a child synset with probability ωi,0. If the path continues to another child synset, it visits child j with probability βi,j. If the path ends at a synset, it generates word k with probability φi,l,k.3 The probability of a word being emitted from a path with visited synsets r and final synset h in language lis therefore p(w, λ = r, h|l, β, ω, φ) = (iY,j)∈rβi,jωi,0(1 − ωh,1)φh,l,w. Note that the stop probability ωh (1) is independent of language, but the emission φh,l is dependent on the language. This is done to prevent the following scenario: while synset A is highly probable in a topic and words in language 1attached to that synset have high probability, words in language 2 have low probability. If this could happen for many synsets in a topic, an entire language would be effectively silenced, which would lead to inconsistent topics (e.g. 2Variables τh, πh,l, and κh are hyperparameters. Their mean is fixed, but their magnitude is sampled during inference (i.e. Pkτhτ,ih,k is constant, but τh,i is not). For the bushier bridges, (Pe.g. dictionary and flat), their mean is uniform. For GermaNet, we took frequencies from two balanced corpora of German and English: the British National Corpus (University of Oxford, 2006) and the Kern Corpus of the Digitales Wo¨rterbuch der Deutschen Sprache des 20. Jahrhunderts project (Geyken, 2007). 
We took these frequencies and propagated them through the multilingual hierarchy, following LDAWN’s (Boyd-Graber et al., 2007) formulation of information content (Resnik, 1995) as a Bayesian prior. The variance of the priors was initialized to be 1.0, but could be sampled during inference. 3Note that the language and word are taken as given, but the path through the semantic hierarchy is a latent random variable. 47 Topic 1 is about baseball in English and about travel in German). Separating path from emission helps ensure that topics are consistent across languages. Having defined topic distributions in a way that can preserve cross-language correspondences, we now use this distribution within a larger model that can discover cross-language patterns of use that predict sentiment. 1.2 The MLSLDA Model We will view sentiment analysis as a regression problem: given an input document, we want to predict a real-valued observation y that represents the sentiment of a document. Specifically, we build on supervised latent Dirichlet allocation (SLDA, (Blei and McAuliffe, 2007)), which makes predictions based on the topics expressed in a document; this can be thought of projecting the words in a document to low dimensional space of dimension equal to the number of topics. Blei et al. showed that using this latent topic structure can offer improved predictions over regressions based on words alone, and the approach fits well with our current goals, since word-level cues are unlikely to be identical across languages. In addition to text, SLDA has been successfully applied to other domains such as social networks (Chang and Blei, 2009) and image classification (Wang et al., 2009). The key innovation in this paper is to extend SLDA by creating topics that are globally consistent across languages, using the bridging approach above. We express our model in the form of a probabilistic generative latent-variable model that generates documents in multiple languages and assigns a realvalued score to each document. The score comes from a normal distribution whose sum is the dot product between a regression parameter η that encodes the influence of each topic on the observation and a variance σ2. With this model in hand, we use statistical inference to determine the distribution over latent variables that, given the model, best explains observed data. The generative model is as follows: 1. For each topic i= 1. . . K, draw a topic distribution {βi, ωi, φi} from MULTDIRHIER(τ, κ, π). 2. {Foβr each do}cuf mroemn tM Md = 1. . . M with language ld: (a) CDihro(oαse). a distribution over topics θd ∼ (b) For each word in the document n = 1. . . Nd, choose a topic assignment zd,n ∼ Mult (θd) and a path λd,n ending at word wd,n according to Equation 1using {βzd,n , ωzd,n , φzd,n }. 3. Choose a re?sponse variable from y Norm ?η> z¯, σ2?, where z¯ d ≡ N1 PnN=1 zd,n. ∼ Crucially, note that the topics are not independent of the sentiment task; the regression encourages terms with similar effects on the observation y to be in the same topic. The consistency of topics described above allows the same regression to be done for the entire corpus regardless of the language of the underlying document. 2 Inference Finding the model parameters most likely to explain the data is a problem of statistical inference. We employ stochastic EM (Diebolt and Ip, 1996), using a Gibbs sampler for the E-step to assign words to paths and topics. 
After randomly initializing the topics, we alternate between sampling the topic and path of a word (zd,n, λd,n) and finding the regression parameters η that maximize the likelihood. We jointly sample the topic and path conditioning on all of the other path and document assignments in the corpus, selecting a path and topic with probability p(zn = k, λn = r|z−n , λ−n, wn , η, σ, Θ) = p(yd|z, η, σ)p(λn = r|zn = k, λ−n, wn, τ, p(zn = k|z−n, α) . κ, π) (2) Each of these three terms reflects a different influence on the topics from the vocabulary structure, the document’s topics, and the response variable. In the next paragraphs, we will expand each of them to derive the full conditional topic distribution. As discussed in Section 1.1, the structure of the topic distribution encourages terms with the same meaning to be in the same topic, even across languages. During inference, we marginalize over possible multinomial distributions β, ω, and φ, using the observed transitions from ito j in topic k; Tk,i,j, stop counts in synset iin topic k, Ok,i,0; continue counts in synsets iin topic k, Ok,i,1 ; and emission counts in synset iin language lin topic k, Fk,i,l. The 48 Multilingual Topics Text Documents Sentiment Prediction Figure 1: Graphical model representing MLSLDA. Shaded nodes represent observations, plates denote replication, and lines show probabilistic dependencies. probability of taking a path r is then p(λn = r|zn = k, λ−n) = (iY,j)∈r PBj0Bk,ik,j,i,+j0 τ+i,j τi,jPs∈0O,1k,Oi,1k,+i,s ω+i ωi,s! |(iY,j)∈rP{zP} Tran{szitiPon Ok,rend,0 + ωrend Fk,rend,wn + πrend,}l Ps∈0,1Ok,rend,s+ ωrend,sPw0Frend,w0+ πrend,w0 |PEmi{szsiPon} (3) Equation 3 reflects the multilingual aspect of this model. The conditional topic distribution for SLDA (Blei and McAuliffe, 2007) replaces this term with the standard Multinomial-Dirichlet. However, we believe this is the first published SLDA-style model using MCMC inference, as prior work has used variational inference (Blei and McAuliffe, 2007; Chang and Blei, 2009; Wang et al., 2009). Because the observed response variable depends on the topic assignments of a document, the conditional topic distribution is shifted toward topics that explain the observed response. Topics that move the predicted response yˆd toward the true yd will be favored. We drop terms that are constant across all topics for the effect of the response variable, p(yd|z, η, σ) ∝ exp?σ12?yd−PPk0kN0Nd,dk,0kη0k0?Pkη0Nzkd,k0? |??PP{z?P?} . Other wPord{zs’ influence exp

4 0.38454443 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

Author: Samuel Brody

Abstract: We reveal a previously unnoticed connection between dependency parsing and statistical machine translation (SMT), by formulating the dependency parsing task as a problem of word alignment. Furthermore, we show that two well known models for these respective tasks (DMV and the IBM models) share common modeling assumptions. This motivates us to develop an alignment-based framework for unsupervised dependency parsing. The framework (which will be made publicly available) is flexible, modular and easy to extend. Using this framework, we implement several algorithms based on the IBM alignment models, which prove surprisingly effective on the dependency parsing task, and demonstrate the potential of the alignment-based approach.

5 0.37842962 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

Author: Tahira Naseem ; Harr Chen ; Regina Barzilay ; Mark Johnson

Abstract: We present an approach to grammar induction that utilizes syntactic universals to improve dependency parsing across a range of languages. Our method uses a single set of manually-specified language-independent rules that identify syntactic dependencies between pairs of syntactic categories that commonly occur across languages. During inference of the probabilistic model, we use posterior expectation constraints to require that a minimum proportion of the dependencies we infer be instances of these rules. We also automatically refine the syntactic categories given in our coarsely tagged input. Across six languages our approach outperforms state-of-theart unsupervised methods by a significant margin.1

6 0.37041253 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

7 0.35928828 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

8 0.35607693 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

9 0.35439301 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

10 0.35420144 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

11 0.35378 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

12 0.35167444 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

13 0.35107255 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

14 0.35096186 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

15 0.3503812 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

16 0.35034722 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

17 0.35001263 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

18 0.34728736 84 emnlp-2010-NLP on Spoken Documents Without ASR

19 0.34690195 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

20 0.34688437 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text