acl acl2012 acl2012-38 knowledge-graph by maker-knowledge-mining

38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing


Source: pdf

Author: Hiroyuki Shindo ; Yusuke Miyao ; Akinori Fujino ; Masaaki Nagata

Abstract: We propose Symbol-Refined Tree Substitution Grammars (SR-TSGs) for syntactic parsing. An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. [sent-10, score-0.597]

2 We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. [sent-11, score-0.539]

3 We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. [sent-12, score-0.32]

4 Probabilistic context-free grammar (PCFG) underlies many statistical parsers; however, it is well known that the PCFG rules extracted from treebank data via maximum likelihood estimation do not perform well due to unrealistic context freedom assumptions (Klein and Manning, 2003). [sent-20, score-0.239]

5 440 In recent years, there has been an increasing interest in tree substitution grammar (TSG) as an alternative to CFG for modeling syntax trees (Post and Gildea, 2009; Tenenbaum et al. [sent-21, score-0.363]

6 TSG is a natural extension of CFG in which nonterminal symbols can be rewritten (substituted) with arbitrarily large tree fragments. [sent-24, score-0.371]

7 These tree fragments have great advantages over tiny CFG rules since they can explicitly capture non-local contexts such as predicate-argument structures, idioms and grammatical agreements (Cohn et al. [sent-25, score-0.265]

8 One major drawback of TSG is that the context freedom assumptions still remain at substitution sites; that is, each TSG tree fragment is generated conditionally independently of all the others given its root nonterminal symbol. [sent-32, score-0.545]

9 Furthermore, when a sentence is unparsable with large tree fragments, the PTSG parser usually uses naive CFG rules derived from its backoff model, which diminishes the benefits obtained from large tree fragments. [sent-33, score-0.544]

10 On the other hand, current state-of-the-art parsers use symbol refinement techniques (Johnson, 1998; Collins, 2003; Matsuzaki et al. [sent-34, score-0.456]

11 Symbol refinement is a successful approach for weakening context freedom assumptions by dividing coarse treebank symbols (e. [sent-36, score-0.289]

12 ... tree fragments and symbol refinement work complementarily for syntactic parsing. [sent-41, score-0.621]

13 For example, Bansal and Klein (2010) have reported that deterministic symbol refinement with heuristics helps improve the accuracy of a TSG parser. [sent-42, score-0.419]

14 SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. [sent-44, score-0.597]

15 Our work differs from previous studies in that we focus on a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. [sent-45, score-0.539]

16 2 Background and Related Work Our SR-TSG work is built upon recent work on Bayesian TSG induction from parse trees (Post and Gildea, 2009; Cohn et al. [sent-50, score-0.202]

17 We firstly review the Bayesian TSG model used in that work, and then present related work on TSGs and symbol refinement. [sent-52, score-0.296]

18 A TSG consists of a 4-tuple, G = (T, N, S, R), where T is a set of terminal symbols, N is a set of nonterminal symbols, S ∈ N is the distinguished start nonterminal symbol, and R is a set of productions (a.k.a. rules). [sent-53, score-0.434]

19 The productions take the form of elementary trees, i.e., tree fragments. [sent-57, score-0.206]

20 The root and internal nodes of an elementary tree are labeled with nonterminal symbols, and leaf nodes are labeled with either terminal or nonterminal symbols. [sent-61, score-0.278]

21 Nonterminal leaves are referred to as frontier nonterminals, and form the substitution sites to be combined with other elementary trees. [sent-62, score-0.343]

22 It starts with a root symbol and rewrites (substitutes) nonterminal symbols with elementary trees until there are no remaining frontier nonterminals. [sent-64, score-0.778]
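
To make this generative process concrete, here is a minimal Python sketch (the fragment inventory, the tuple encoding, and names such as ELEMENTARY_TREES and expand are invented for illustration; this is not the authors' code): starting from the start symbol, frontier nonterminals are repeatedly substituted with sampled elementary trees until none remain.

import random

# Toy inventory of elementary trees, keyed by their root nonterminal.
# Trees are nested tuples: (label, child1, child2, ...); a bare string leaf is
# either a terminal or a frontier nonterminal (a nonterminal awaiting substitution).
ELEMENTARY_TREES = {
    "S":   [("S", ("NP", "NNP"), "VP")],      # fragment with frontier leaves NNP and VP
    "VP":  [("VP", ("VBZ", "runs"))],
    "NNP": [("NNP", "Mary")],
}

def is_frontier(node):
    # A frontier nonterminal is a string leaf that names a nonterminal we can expand.
    return isinstance(node, str) and node in ELEMENTARY_TREES

def expand(node):
    """Recursively substitute frontier nonterminals with sampled elementary trees."""
    if isinstance(node, str):
        if is_frontier(node):
            return expand(random.choice(ELEMENTARY_TREES[node]))
        return node                               # terminal symbol: stop rewriting
    label, *children = node
    return (label, *[expand(child) for child in children])

if __name__ == "__main__":
    # Start from the distinguished start symbol S and derive a complete tree.
    print(expand("S"))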

23 , 2010) employs a probabilistic model of a TSG and predicts derivations from observed parse trees in an unsupervised way. [sent-67, score-0.2]

24 The probability of a derivation is defined as the product of the probabilities of its component elementary trees as follows. [sent-69, score-0.29]

25 Here, e = (e1 , e2 , . . . ) is a sequence of elementary trees used for the derivation, x = root (e) is the root symbol of e, and p (e |x ) is the probability of generating e given its root symbol x. [sent-73, score-0.898]
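
As a small illustration of this definition, the sketch below scores a derivation as the product of p (e | root(e)) over its elementary trees; the bracketed tree encoding and the probability table P_E_GIVEN_ROOT are hypothetical, and the product is computed in log space for numerical stability.

import math

# Hypothetical probabilities p(e | root(e)), grouped by root symbol; the numbers
# are illustrative and not estimates from the paper.
P_E_GIVEN_ROOT = {
    "S":   {"(S (NP NNP) VP)": 0.3, "(S NP VP)": 0.7},
    "VP":  {"(VP (VBZ runs))": 0.5, "(VP VBZ NP)": 0.5},
    "NNP": {"(NNP Mary)": 1.0},
}

def root(elementary_tree):
    # The root symbol is the first label inside the outermost bracket.
    return elementary_tree[1:].split()[0].strip("()")

def derivation_log_prob(derivation):
    """log p(derivation) = sum over elementary trees e of log p(e | root(e))."""
    return sum(math.log(P_E_GIVEN_ROOT[root(e)][e]) for e in derivation)

derivation = ["(S (NP NNP) VP)", "(NNP Mary)", "(VP (VBZ runs))"]
print(derivation_log_prob(derivation))   # log(0.3) + log(1.0) + log(0.5)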

26 The posterior distribution over elementary trees given a parse tree t can be computed by using Bayes’ rule: p (e |t ) ∝ p (t |e ) p (e). [sent-75, score-0.367]

27 Therefore, the task of TSG induction from parse trees turns out to consist of modeling the prior distribution p (e). [sent-77, score-0.202]

28 Several studies have combined TSG induction and symbol refinement. [sent-79, score-0.364]

29 , 2007a) is a sort of nonparametric Bayesian TSG model with symbol refinement, and is thus closely related to our SR-TSG model. [sent-81, score-0.343]

30 However, an adaptor grammar differs from ours in that all its rules are complete: all leaf nodes must be terminal symbols, while our model permits nonterminal symbols as leaf nodes. [sent-82, score-0.477]

31 Furthermore, adaptor grammars have largely been applied to the task of unsupervised structural induction from raw texts such as ... [Figure 1: (a) Example parse tree. (b) ...] [sent-83, score-0.269]

32 The refinement annotation is hyphenated with a nonterminal symbol. [sent-86, score-0.261]

33 The manual symbol refinement described in (Klein and Manning, 2003) was applied to an all-fragments grammar and this improved accuracy in the English WSJ parsing task. [sent-91, score-0.531]

34 As mentioned in the introduction, our model focuses on the automatic learning of a TSG and symbol refinement without heuristics. [sent-92, score-0.419]

35 Our SR-TSG model is an extension of the conventional TSG model where every symbol of the elementary trees can be refined to fit the training data. [sent-94, score-0.665]

36 As with previous work on TSG induction, our task is the induction of SR-TSG derivations from a corpus of parse trees in an unsupervised fashion. [sent-96, score-0.242]

37 That is, we wish to infer the symbol subcategories of every node and substitution site (i. [sent-97, score-0.6]

38 One major issue as regards modeling an SR-TSG is that the space of the grammar rules will be very sparse since SR-TSG allows for arbitrarily large tree fragments and also an arbitrarily large set of symbol subcategories. [sent-105, score-0.712]

39 To address the sparseness problem, we employ a hierarchical PYP to encode a backoff scheme from the SR-TSG rules to simpler CFG rules, inspired by recent work on dependency parsing (Blunsom and Cohn, 2010). [sent-106, score-0.391]

40 Here e is a symbol-refined elementary tree, while x is a raw nonterminal symbol in the corpus and k = 0, 1, . . . is an index of the symbol subcategory. [sent-114, score-0.667]

41 Suppose x is NP and its symbol subcategory is 0, then xk is NP0. [sent-118, score-0.498]

42 ... is a base distribution over the infinite space of symbol-refined elementary trees rooted with xk, which provides the backoff probability of e. [sent-121, score-0.48]

43 The backoff probability Psr-tsg (e |xk ) is given by the product of the symbol-refined CFG (SR-CFG) rules that e contains, as follows. [sent-123, score-0.312]

44 cf (e) and ci (e) are the nonterminal symbols of nodes f and i, respectively. [sent-129, score-0.261]

45 SR-CFG rules are CFG rules where every symbol is refined, as shown in Table 1. [sent-131, score-0.484]
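
The backoff idea here, in which an elementary tree backs off to the product of the CFG rules it contains, can be sketched as follows; the tuple tree encoding, the subcategory-suffix naming, and the uniform rule probability are assumptions for illustration only. The number of rules yielded by cfg_rules is also the fragment "size" used later in the analysis.

# Sketch of decomposing a symbol-refined elementary tree into its CFG rules and
# multiplying their probabilities; not the authors' implementation.

def cfg_rules(tree):
    """Yield an (lhs, rhs) CFG rule for every internal node of an elementary tree."""
    if isinstance(tree, str):            # leaf: terminal or frontier nonterminal
        return
    lhs, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (lhs, rhs)
    for child in children:
        yield from cfg_rules(child)

def backoff_prob(tree, p_cfg):
    """P_backoff(e | root) = product over CFG rules r contained in e of p_cfg(r)."""
    prob = 1.0
    for rule in cfg_rules(tree):
        prob *= p_cfg(rule)
    return prob

# Example with a symbol-refined fragment (subcategory indices appended to labels)
# and a made-up uniform rule probability.
e = ("S0", ("NP1", "NNP0"), "VP2")
uniform = lambda rule: 0.1
print(list(cfg_rules(e)))        # [('S0', ('NP1', 'VP2')), ('NP1', ('NNP0',))]
print(backoff_prob(e, uniform))  # 0.1 * 0.1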

46 Each SR-CFG rule α rooted with xk is drawn from the backoff distribution Hxk . [sent-134, score-0.232]

47 Hxk is in turn produced by a PYP with its own parameters. [sent-135, score-0.188]

48 The backoff probability of the SR-CFG rule, Psr-cfg (α |xk ), is given by the root-unrefined CFG (RU-CFG) rule as follows: Psr-cfg (α |xk ) = Ix (root-unrefine (α |xk )), where α |x ∼ Ix and Ix ∼ PYP. [sent-141, score-0.263]

49 The RU-CFG rule of α is a CFG rule in which the root symbol is unrefined and all leaf nonterminal symbols are refined, as shown in Table 1. [sent-144, score-0.543]

50 Each RU-CFG rule α rooted with x is drawn from the backoff distribution Ix, and Ix is produced by a PYP. [sent-145, score-0.258]

51 Finally, we set the backoff probability of the RU-CFG rule, Pru-cfg (α |x ), so that it is uniform as follows. [sent-147, score-0.218]

52 Overall, our hierarchical model encodes backoff smoothing consistently from the SR-TSG rules to the SR-CFG rules, and from the SR-CFG rules to the RU-CFG rules. [sent-150, score-0.402]
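
A rough sketch of how such a backoff chain can be realized with Chinese-restaurant-style Pitman-Yor predictive probabilities is given below; the class name, the discount/strength values, the uniform bottom-level base, and the simplification of passing the same rule key down the chain (rather than first mapping it to its SR-CFG or RU-CFG form) are all assumptions, not the authors' implementation.

class PYPRestaurant:
    """Chinese-restaurant representation of a Pitman-Yor process over rules."""

    def __init__(self, discount, strength, base):
        self.d, self.theta, self.base = discount, strength, base
        self.customers = {}   # rule -> number of customers
        self.tables = {}      # rule -> number of tables serving that rule
        self.n = 0            # total customers
        self.t = 0            # total tables

    def prob(self, rule):
        """Predictive probability of `rule` given the current seating arrangement."""
        c = self.customers.get(rule, 0)
        tk = self.tables.get(rule, 0)
        new_table_mass = (self.theta + self.d * self.t) * self.base(rule)
        return (max(c - self.d * tk, 0.0) + new_table_mass) / (self.theta + self.n)

    def add(self, rule, new_table=False):
        # Simplified bookkeeping: the caller decides whether to open a new table.
        self.customers[rule] = self.customers.get(rule, 0) + 1
        self.n += 1
        if new_table or rule not in self.tables:
            self.tables[rule] = self.tables.get(rule, 0) + 1
            self.t += 1

# Backoff chain for one root symbol: SR-TSG -> SR-CFG -> RU-CFG -> uniform base.
ru_cfg = PYPRestaurant(0.5, 1.0, base=lambda rule: 1e-3)
sr_cfg = PYPRestaurant(0.5, 1.0, base=ru_cfg.prob)
sr_tsg = PYPRestaurant(0.5, 1.0, base=sr_cfg.prob)

sr_tsg.add("(S0 (NP1 NNP0) VP2)")
print(sr_tsg.prob("(S0 (NP1 NNP0) VP2)"))   # seen rule: relatively high probability
print(sr_tsg.prob("(S0 NP0 VP0)"))          # unseen rule: mass comes from the backoff chain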

53 , 2010), the parsing accuracy of the TSG model is strongly affected by its backoff model. [sent-152, score-0.243]

54 The effects of our hierarchical backoff model on parsing performance are evaluated in Section 5. [sent-153, score-0.269]

55 Inferring SR-TSG derivations corresponds to inferring two kinds of latent variables: latent symbol subcategories and latent substitution sites. [sent-158, score-0.765]

56 We first infer latent symbol subcategories for every symbol in the parse trees, and then infer latent substitution sites stepwise. [sent-159, score-1.169]

57 During the inference of symbol subcategories, every internal node is fixed as a substitution site. [sent-160, score-0.481]

58 After that, we relax that assumption and infer latent substitution sites given the symbol-refined parse trees. [sent-161, score-0.368]

59 1 Inference of Symbol Subcategories For the inference of latent symbol subcategories, we adopt split and merge training (Petrov et al. [sent-165, score-0.435]

60 In each split-merge step, each symbol is split into at most two subcategories. [sent-167, score-0.322]

61 For example, every NP symbol in the training data is split into either NP0 or NP1 to maximize the posterior probability. [sent-168, score-0.348]

62 After convergence, we measure the loss in likelihood incurred by removing each split symbol, and then merge the 50% of newly split symbols with the smallest loss in order to avoid overfitting. [sent-169, score-0.46]
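
A schematic version of this merge criterion, with hypothetical loss values, is sketched below: split symbols are ranked by the likelihood loss of undoing their split, and the half with the smallest loss is merged back.

def select_merges(loss_by_symbol, merge_fraction=0.5):
    """Return the split symbols whose splits should be undone (smallest loss first)."""
    ranked = sorted(loss_by_symbol, key=loss_by_symbol.get)
    n_merge = int(len(ranked) * merge_fraction)
    return ranked[:n_merge]

# Hypothetical log-likelihood losses incurred when undoing each symbol's split.
losses = {"NP": 120.4, "VP": 98.1, "PP": 3.2, "ADVP": 1.1}
print(select_merges(losses))   # ['ADVP', 'PP'] -> these splits contribute little and are merged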

63 In each splitting step, we use two types of blocked MCMC algorithms: the sentence-level blocked Metropolis-Hastings (MH) sampler and the tree-level blocked Gibbs sampler, while (Petrov et al. [sent-171, score-0.322]

64 Our sampler iterates sentence-level sampling and tree-level sampling alternately. [sent-173, score-0.225]

65 The sentence-level MH sampler is a recently proposed algorithm for grammar induction (Johnson et al. [sent-174, score-0.276]

66 In this work, we apply it to the training of symbol splitting. [sent-177, score-0.322]

67 The MH sampler consists of the following three steps: for each sentence, 1) calculate the inside probability (Lari and Young, 1991) in a bottom-up manner, 2) sample a derivation tree in a top-down manner, and 3) accept or reject the derivation sample by using the MH test. [sent-178, score-0.384]
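
The sketch below covers only step 3, the MH accept/reject test for a block proposal over one sentence's derivation; the inside pass and the top-down sampling of steps 1 and 2 are not shown, and the model and proposal scoring functions are placeholders rather than the paper's distributions.

import math
import random

def mh_accept(current, proposed, log_p_model, log_p_proposal):
    """Metropolis-Hastings test: accept the proposed derivation with probability
    min(1, [p(proposed) q(current)] / [p(current) q(proposed)])."""
    log_ratio = (log_p_model(proposed) - log_p_model(current)) \
              + (log_p_proposal(current) - log_p_proposal(proposed))
    return random.random() < math.exp(min(0.0, log_ratio))

# Toy usage with made-up scoring functions over lists of rule identifiers.
log_p_model = lambda derivation: -2.0 * len(derivation)
log_p_proposal = lambda derivation: -1.5 * len(derivation)
current, proposed = ["r1", "r2", "r3"], ["r1", "r4"]
print(mh_accept(current, proposed, log_p_model, log_p_proposal))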

68 This sampler simultaneously updates blocks of latent variables associated with a sentence, thus it can find MAP solutions efficiently. [sent-181, score-0.235]

69 The tree-level blocked Gibbs sampler focuses on the type of SR-TSG rules and simultaneously updates all root and child nodes that are annotated with the same SR-TSG rule. [sent-182, score-0.382]

70 For example, the sampler collects all nodes that are annotated with S0 → NP1 VP2, then updates those nodes to another subcategory such as S0 → NP2 VP0 according to the posterior distribution. [sent-183, score-0.281]

71 This sampler is similar to table label resampling (Johnson and Goldwater, 2009), but differs in that our sampler can update multiple table labels simultaneously when multiple tables are labeled with the same elementary tree. [sent-184, score-0.44]

72 The tree-level sampler also simultaneously updates blocks of latent variables associated with the type of SR-TSG rules, thus it can find MAP solutions efficiently. [sent-185, score-0.235]

73 2 Inference of Substitution Sites After the inference of symbol subcategories, we use Gibbs sampling to infer the substitution sites of parse trees as described in (Cohn and Lapata, 2009; Post and Gildea, 2009). [sent-187, score-0.717]

74 For each iteration, the Gibbs sampler works by sampling the value of each binary variable in random order. [sent-189, score-0.188]
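
A schematic Gibbs sweep over such binary substitution-site indicators might look like the following; the per-node indicator dictionary and the stand-in likelihood function are assumptions for illustration, not the authors' code.

import math
import random

def gibbs_sweep(sites, log_likelihood):
    """sites: dict node_id -> bool (True = substitution site), updated in place."""
    order = list(sites)
    random.shuffle(order)                 # visit the binary variables in random order
    for node in order:
        scores = []
        for value in (False, True):
            sites[node] = value
            scores.append(log_likelihood(sites))
        # Sample the indicator from its normalized conditional distribution.
        p_true = 1.0 / (1.0 + math.exp(scores[0] - scores[1]))
        sites[node] = random.random() < p_true
    return sites

# Toy usage: this stand-in likelihood simply prefers fewer substitution sites,
# i.e., larger tree fragments.
toy_log_likelihood = lambda s: -0.5 * sum(s.values())
print(gibbs_sweep({1: True, 2: False, 3: True}, toy_log_likelihood))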

75 During the inference, our sampler ignores the symbol subcategories of internal nodes of elementary trees since they do not affect the derivation of the SR-TSG. [sent-192, score-0.935]

76 For example, the elementary trees “(S0 (NP0 NNP0) VP0)” and “(S0 (NP1 NNP0) VP0)” are regarded as being the same when we calculate the generation probabilities according to our model. [sent-193, score-0.206]

77 2 Training and Parsing For the inference of symbol subcategories, we trained our model with the MCMC sampler by using 6 split-merge steps for the full training set and 3 split-merge steps for the small training set. [sent-213, score-0.532]

78 Therefore, each symbol can be subdivided into a maximum of 2^6 = 64 and 2^3 = 8 subcategories, respectively. [sent-214, score-0.296]

79 In each split-merge step, we initialized the sampler by randomly splitting every symbol into two subcategories and ran the MCMC sampler for 1000 iterations. [sent-215, score-0.747]

80 After that, to infer the substitution sites, we initialized the model with the final sample from a run on the small training set, and used the Gibbs sampler for 2000 iterations. [sent-216, score-0.332]

81 The size is defined as the number of CFG rules that the elementary tree contains. [sent-240, score-0.327]

82 We also tested our model with three backoff hierarchy settings to evaluate the effects of backoff smoothing on parsing accuracy. [sent-245, score-0.457]

83 This result suggests that the conventional TSG model trained from the vanilla treebank is insufficient to resolve [Table: Model, F1 (≤ 40), F1 — rows include TSG (no symbol refinement), Post and Gildea (2009), Cohn et al.] [sent-250, score-0.395]

84 structural ambiguities caused by coarse symbol annotations in a training corpus. [sent-275, score-0.322]

85 As we expected, symbol refinement can be helpful with the TSG model for further fitting the training set and improving the parsing accuracy. [sent-276, score-0.5]

86 The performance of the SR-TSG parser was strongly affected by its backoff models. [sent-277, score-0.26]

87 Therefore, sophisticated backoff modeling is essential for the SR-TSG parser. [sent-280, score-0.188]

88 Our hierarchical PYP modeling technique is a successful way to achieve backoff smoothing from sparse SR-TSG rules to simpler CFG rules, and offers the advantage of automatically estimating the optimal backoff probabilities from the training set. [sent-281, score-0.388]

89 The rule sizes of SR-TSG and TSG are defined as the number of CFG rules that the elementary tree contains. [sent-283, score-0.398]

90 In Figure 2, we can see that there are almost the same number of SR-TSG rules and TSG rules with size = 1. [sent-286, score-0.188]

91 ... tree fragments depending on the context, which is specified by the symbol subcategories. [sent-289, score-0.372]

92 Our model can be viewed as an extension of Cohn’s work by the incorporation of symbol refinement. [sent-299, score-0.326]

93 Therefore, this result confirms that a TSG and symbol refinement work complementarily in improving parsing accuracy. [sent-300, score-0.505]

94 Indeed, the few very large rules of our model memorized full parse trees of sentences, which were repeated in the training set. [sent-303, score-0.254]

95 The reranking parser takes the k-best lists of candidate trees or a packed forest produced by a baseline parser (usually a generative model), and then reranks the candidates using arbitrary features. [sent-306, score-0.256]

96 Then, we erased the subcategory information of parse trees and selected the best tree that achieved the highest likelihood under the product of sixteen models. [sent-311, score-0.303]
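
A sketch of this selection step under a product of independently trained models is given below; the hyphenated subcategory notation, the helper names, and the stand-in scoring functions are assumptions, and the number of models (sixteen in the paper) is arbitrary here.

import re

def erase_subcategories(tree_str):
    """Strip refinement annotations such as '-1' from nonterminal labels."""
    return re.sub(r"(\([A-Z]+)-\d+", r"\1", tree_str)

def best_tree(candidates, models):
    # Product of models = sum of their log-likelihoods; pick the best candidate,
    # then erase its subcategory information.
    scored = [(sum(model(t) for model in models), t) for t in candidates]
    _, best = max(scored)
    return erase_subcategories(best)

# Toy usage with two stand-in models that simply prefer shorter candidate strings.
models = [lambda t: -len(t), lambda t: -0.5 * len(t)]
candidates = ["(S (NP-1 (NNP-0 Mary)) (VP-2 (VBZ-0 runs)))",
              "(S (NP-0 Mary) (VP-0 runs))"]
print(best_tree(candidates, models))   # prints the shorter tree with subcategories erased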

97 The search space of learning a TSG given a parse tree is O (2^n), where n is the number of internal nodes of the parse tree. [sent-322, score-0.306]

98 as latent variables and the search space is larger than that of a TSG when the symbol refinement model allows for more than two subcategories for each symbol. [sent-326, score-0.652]

99 6 Conclusion We have presented an SR-TSG, which is an extension of the conventional TSG model where each symbol of tree fragments can be automatically subcategorized to address the problem of the conditional independence assumptions of a TSG. [sent-328, score-0.639]

100 We proposed a novel backoff modeling of an SR-TSG based on the hierarchical Pitman-Yor Process and sentence-level and tree-level blocked MCMC sampling for training our model. [sent-329, score-0.334]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tsg', 0.644), ('symbol', 0.296), ('backoff', 0.188), ('cfg', 0.172), ('xk', 0.162), ('cohn', 0.154), ('sampler', 0.151), ('subcategories', 0.149), ('pyp', 0.139), ('nonterminal', 0.138), ('elementary', 0.138), ('refinement', 0.123), ('substitution', 0.118), ('tree', 0.095), ('rules', 0.094), ('bayesian', 0.091), ('mcmc', 0.087), ('symbols', 0.078), ('fragments', 0.076), ('parser', 0.072), ('grammars', 0.07), ('conventional', 0.069), ('induction', 0.068), ('trees', 0.068), ('parse', 0.066), ('johnson', 0.065), ('adaptor', 0.065), ('petrov', 0.063), ('sites', 0.062), ('srtsg', 0.062), ('grammar', 0.057), ('blocked', 0.057), ('bansal', 0.057), ('parsing', 0.055), ('latent', 0.054), ('derivation', 0.054), ('hxk', 0.054), ('nonparametric', 0.047), ('dxk', 0.046), ('ptsg', 0.046), ('subcategorized', 0.046), ('gildea', 0.046), ('rule', 0.045), ('nodes', 0.045), ('reranking', 0.044), ('post', 0.044), ('mh', 0.043), ('wsj', 0.043), ('pitman', 0.04), ('subcategory', 0.04), ('derivations', 0.04), ('ix', 0.04), ('refined', 0.038), ('pcfg', 0.038), ('infer', 0.037), ('sampling', 0.037), ('parsers', 0.037), ('blunsom', 0.035), ('root', 0.035), ('regards', 0.034), ('sixteen', 0.034), ('internal', 0.034), ('klein', 0.033), ('inference', 0.033), ('discriminative', 0.033), ('sharon', 0.032), ('freedom', 0.031), ('complementarily', 0.031), ('gxk', 0.031), ('npnnp', 0.031), ('symbolrefined', 0.031), ('topmost', 0.031), ('unrefined', 0.031), ('yor', 0.031), ('extension', 0.03), ('treebank', 0.03), ('probability', 0.03), ('variables', 0.03), ('arbitrarily', 0.03), ('matsuzaki', 0.029), ('simpler', 0.028), ('lari', 0.027), ('shindo', 0.027), ('zuidema', 0.027), ('assumptions', 0.027), ('np', 0.027), ('smoothing', 0.026), ('probabilistic', 0.026), ('hierarchical', 0.026), ('gibbs', 0.026), ('charniak', 0.026), ('sizes', 0.026), ('split', 0.026), ('training', 0.026), ('yusuke', 0.025), ('monte', 0.025), ('rooted', 0.025), ('syntax', 0.025), ('frontier', 0.025), ('conditionally', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing

Author: Hiroyuki Shindo ; Yusuke Miyao ; Akinori Fujino ; Masaaki Nagata

Abstract: We propose Symbol-Refined Tree Substitution Grammars (SR-TSGs) for syntactic parsing. An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.

2 0.45102981 154 acl-2012-Native Language Detection with Tree Substitution Grammars

Author: Benjamin Swanson ; Eugene Charniak

Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.

3 0.35341579 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

Author: Elif Yamangil ; Stuart Shieber

Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.

4 0.14779189 109 acl-2012-Higher-order Constituent Parsing and Parser Combination

Author: Xiao Chen ; Chunyu Kit

Abstract: This paper presents a higher-order model for constituent parsing aimed at utilizing more local structural context to decide the score of a grammar rule instance in a parse tree. Experiments on English and Chinese treebanks confirm its advantage over its first-order version. It achieves its best F1 scores of 91.86% and 85.58% on the two languages, respectively, and further pushes them to 92.80% and 85.60% via combination with other highperformance parsers.

5 0.1460723 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

Author: Adam Pauls ; Dan Klein

Abstract: We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context. We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours. We evaluate on perplexity and a range of grammaticality tasks, and find that we perform as well or better than n-gram models and other generative baselines. Our model even competes with state-of-the-art discriminative models hand-designed for the grammaticality tasks, despite training on positive data alone. We also show fluency improvements in a preliminary machine translation experiment.

6 0.10334387 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars

7 0.093959391 108 acl-2012-Hierarchical Chunk-to-String Translation

8 0.092376098 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

9 0.091300026 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

10 0.090530328 57 acl-2012-Concept-to-text Generation via Discriminative Reranking

11 0.082946107 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction

12 0.079538248 196 acl-2012-The OpenGrm open-source finite-state grammar software libraries

13 0.077721529 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction

14 0.069321193 139 acl-2012-MIX Is Not a Tree-Adjoining Language

15 0.067606464 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation

16 0.066274546 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content

17 0.063878983 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies

18 0.062640093 105 acl-2012-Head-Driven Hierarchical Phrase-based Translation

19 0.062135313 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees

20 0.061770234 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.192), (1, -0.038), (2, -0.141), (3, -0.156), (4, -0.229), (5, -0.049), (6, -0.027), (7, 0.196), (8, 0.076), (9, 0.033), (10, -0.08), (11, -0.282), (12, -0.156), (13, 0.102), (14, 0.118), (15, -0.313), (16, 0.061), (17, -0.207), (18, 0.087), (19, 0.13), (20, 0.111), (21, 0.181), (22, -0.054), (23, 0.01), (24, 0.192), (25, 0.068), (26, -0.054), (27, 0.069), (28, 0.0), (29, 0.003), (30, 0.019), (31, -0.102), (32, 0.067), (33, 0.029), (34, 0.003), (35, -0.078), (36, 0.017), (37, 0.035), (38, 0.05), (39, 0.013), (40, -0.08), (41, -0.041), (42, -0.008), (43, -0.012), (44, -0.003), (45, 0.033), (46, -0.035), (47, 0.023), (48, 0.0), (49, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96236694 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

Author: Elif Yamangil ; Stuart Shieber

Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.

same-paper 2 0.94569731 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing

Author: Hiroyuki Shindo ; Yusuke Miyao ; Akinori Fujino ; Masaaki Nagata

Abstract: We propose Symbol-Refined Tree Substitution Grammars (SR-TSGs) for syntactic parsing. An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.

3 0.88120574 154 acl-2012-Native Language Detection with Tree Substitution Grammars

Author: Benjamin Swanson ; Eugene Charniak

Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.

4 0.44873878 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars

Author: Andreas Maletti ; Joost Engelfriet

Abstract: Recently, it was shown (KUHLMANN, SATTA: Tree-adjoining grammars are not closed under strong lexicalization. Comput. Linguist., 2012) that finitely ambiguous tree adjoining grammars cannot be transformed into a normal form (preserving the generated tree language), in which each production contains a lexical symbol. A more powerful model, the simple context-free tree grammar, admits such a normal form. It can be effectively constructed and the maximal rank of the nonterminals only increases by 1. Thus, simple context-free tree grammars strongly lexicalize tree adjoining grammars and themselves.

5 0.44193038 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts

Author: Stephen Tyndall

Abstract: This paper presents the problem within Hittite and Ancient Near Eastern studies of fragmented and damaged cuneiform texts, and proposes to use well-known text classification metrics, in combination with some facts about the structure of Hittite-language cuneiform texts, to help classify a number offragments of clay cuneiform-script tablets into more complete texts. In particular, Ipropose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. The performance in some cases is improved, and in some cases very much not, suggesting that the variable frequency of occurrence of these ideograms in individual fragments makes considerable difference in the ideal choice for a classification method. Further, complexities of the writing system and the digital availability ofHittite texts complicate the problem.

6 0.40822917 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

7 0.40389332 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

8 0.3942605 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

9 0.37309653 109 acl-2012-Higher-order Constituent Parsing and Parser Combination

10 0.30955359 196 acl-2012-The OpenGrm open-source finite-state grammar software libraries

11 0.30812857 108 acl-2012-Hierarchical Chunk-to-String Translation

12 0.27674729 83 acl-2012-Error Mining on Dependency Trees

13 0.26061916 139 acl-2012-MIX Is Not a Tree-Adjoining Language

14 0.25008202 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs

15 0.24417722 57 acl-2012-Concept-to-text Generation via Discriminative Reranking

16 0.23578654 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing

17 0.23477629 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation

18 0.22220784 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing

19 0.22134775 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars

20 0.21537539 181 acl-2012-Spectral Learning of Latent-Variable PCFGs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.015), (12, 0.196), (26, 0.027), (28, 0.041), (30, 0.032), (36, 0.017), (37, 0.03), (39, 0.064), (71, 0.016), (74, 0.027), (82, 0.033), (84, 0.017), (85, 0.021), (90, 0.108), (92, 0.19), (94, 0.036), (99, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81749612 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing

Author: Hiroyuki Shindo ; Yusuke Miyao ; Akinori Fujino ; Masaaki Nagata

Abstract: We propose Symbol-Refined Tree Substitution Grammars (SR-TSGs) for syntactic parsing. An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.

2 0.76270354 205 acl-2012-Tweet Recommendation with Graph Co-Ranking

Author: Rui Yan ; Mirella Lapata ; Xiaoming Li

Abstract: Mirella Lapata‡ Xiaoming Li†, \ ‡Institute for Language, \State Key Laboratory of Software Cognition and Computation, Development Environment, University of Edinburgh, Beihang University, Edinburgh EH8 9AB, UK Beijing 100083, China mlap@ inf .ed .ac .uk lxm@pku .edu .cn 2012.1 Twitter enables users to send and read textbased posts ofup to 140 characters, known as tweets. As one of the most popular micro-blogging services, Twitter attracts millions of users, producing millions of tweets daily. Shared information through this service spreads faster than would have been possible with traditional sources, however the proliferation of user-generation content poses challenges to browsing and finding valuable information. In this paper we propose a graph-theoretic model for tweet recommendation that presents users with items they may have an interest in. Our model ranks tweets and their authors simultaneously using several networks: the social network connecting the users, the network connecting the tweets, and a third network that ties the two together. Tweet and author entities are ranked following a co-ranking algorithm based on the intuition that that there is a mutually reinforcing relationship between tweets and their authors that could be reflected in the rankings. We show that this framework can be parametrized to take into account user preferences, the popularity of tweets and their authors, and diversity. Experimental evaluation on a large dataset shows that our model out- performs competitive approaches by a large margin.

3 0.75785905 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks

Author: Tsung-Ting Kuo ; San-Chuan Hung ; Wei-Shih Lin ; Nanyun Peng ; Shou-De Lin ; Wei-Fen Lin

Abstract: This paper brings a marriage of two seemly unrelated topics, natural language processing (NLP) and social network analysis (SNA). We propose a new task in SNA which is to predict the diffusion of a new topic, and design a learning-based framework to solve this problem. We exploit the latent semantic information among users, topics, and social connections as features for prediction. Our framework is evaluated on real data collected from public domain. The experiments show 16% AUC improvement over baseline methods. The source code and dataset are available at http://www.csie.ntu.edu.tw/~d97944007/dif fusion/ 1 Background The diffusion of information on social networks has been studied for decades. Generally, the proposed strategies can be categorized into two categories, model-driven and data-driven. The model-driven strategies, such as independent cascade model (Kempe et al., 2003), rely on certain manually crafted, usually intuitive, models to fit the diffusion data without using diffusion history. The data-driven strategies usually utilize learning-based approaches to predict the future propagation given historical records of prediction (Fei et al., 2011; Galuba et al., 2010; Petrovic et al., 2011). Data-driven strategies usually perform better than model-driven approaches because the past diffusion behavior is used during learning (Galuba et al., 2010). Recently, researchers started to exploit content information in data-driven diffusion models (Fei et al., 2011; Petrovic et al., 2011; Zhu et al., 2011). 344 However, most of the data-driven approaches assume that in order to train a model and predict the future diffusion of a topic, it is required to obtain historical records about how this topic has propagated in a social network (Petrovic et al., 2011; Zhu et al., 2011). We argue that such assumption does not always hold in the real-world scenario, and being able to forecast the propagation of novel or unseen topics is more valuable in practice. For example, a company would like to know which users are more likely to be the source of ‘viva voce’ of a newly released product for advertising purpose. A political party might want to estimate the potential degree of responses of a half-baked policy before deciding to bring it up to public. To achieve such goal, it is required to predict the future propagation behavior of a topic even before any actual diffusion happens on this topic (i.e., no historical propagation data of this topic are available). Lin et al. also propose an idea aiming at predicting the inference of implicit diffusions for novel topics (Lin et al., 2011). The main difference between their work and ours is that they focus on implicit diffusions, whose data are usually not available. Consequently, they need to rely on a model-driven approach instead of a datadriven approach. On the other hand, our work focuses on the prediction of explicit diffusion behaviors. Despite the fact that no diffusion data of novel topics is available, we can still design a data- driven approach taking advantage of some explicit diffusion data of known topics. Our experiments show that being able to utilize such information is critical for diffusion prediction. 2 The Novel-Topic Diffusion Model We start by assuming an existing social network G = (V, E), where V is the set of nodes (or user) v, and E is the set of link e. 
The set of topics is Proce dJienjgus, R ofep thueb 5lic0t hof A Knonruea ,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fso rc Ciatoiomnp fuotart Cio nmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi3c 4s4–348, denoted as T. Among them, some are considered as novel topics (denoted as N), while the rest (R) are used as the training records. We are also given a set of diffusion records D = {d | d = (src, dest, t) }, where src is the source node (or diffusion source), dest is the destination node, and t is the topic of the diffusion that belongs to R but not N. We assume that diffusions cannot occur between nodes without direct social connection; any diffusion pair implies the existence of a link e = (src, dest) ∈ E. Finally, we assume there are sets of keywords or tags that relevant to each topic (including existing and novel topics). Note that the set of keywords for novel topics should be seen in that of existing topics. From these sets of keywords, we construct a topicword matrix TW = (P(wordj | topici))i,j of which the elements stand for the conditional probabilities that a word appears in the text of a certain topic. Similarly, we also construct a user-word matrix UW= (P(wordj | useri))i,j from these sets of keywords. Given the above information, the goal is to predict whether a given link is active (i.e., belongs to a diffusion link) for topics in N. 2.1 The Framework The main challenge of this problem lays in that the past diffusion behaviors of new topics are missing. To address this challenge, we propose a supervised diffusion discovery framework that exploits the latent semantic information among users, topics, and their explicit / implicit interactions. Intuitively, four kinds of information are useful for prediction: • Topic information: Intuitively, knowing the signatures of a topic (e.g., is it about politics?) is critical to the success of the prediction. • User information: The information of a user such as the personality (e.g., whether this user is aggressive or passive) is generally useful. • User-topic interaction: Understanding the users' preference on certain topics can improve the quality of prediction. • Global information: We include some global features (e.g., topology info) of social network. Below we will describe how these four kinds of information can be modeled in our framework. 2.2 Topic Information We extract hidden topic category information to model topic signature. In particular, we exploit the 345 Latent Dirichlet Allocation (LDA) method (Blei et al., 2003), which is a widely used topic modeling technique, to decompose the topic-word matrix TW into hidden topic categories: TW = TH * HW , where TH is a topic-hidden matrix, HW is hiddenword matrix, and h is the manually-chosen parameter to determine the size of hidden topic categories. TH indicates the distribution of each topic to hidden topic categories, and HW indicates the distribution of each lexical term to hidden topic categories. Note that TW and TH include both existing and novel topics. We utilize THt,*, the row vector of the topic-hidden matrix TH for a topic t, as a feature set. In brief, we apply LDA to extract the topic-hidden vector THt,* to model topic signature (TG) for both existing and novel topics. Topic information can be further exploited. To predict whether a novel topic will be propagated through a link, we can first enumerate the existing topics that have been propagated through this link. 
For each such topic, we can calculate its similarity with the new topic based on the hidden vectors generated above (e.g., using cosine similarity between feature vectors). Then, we sum up the similarity values as a new feature: topic similarity (TS). For example, a link has previously propagated two topics for a total of three times {ACL, KDD, ACL}, and we would like to know whether a new topic, EMNLP, will propagate through this link. We can use the topic-hidden vector to generate the similarity values between EMNLP and the other topics (e.g., {0.6, 0.4, 0.6}), and then sum them up (1.6) as the value of TS. 2.3 User Information Similar to topic information, we extract latent personal information to model user signature (the users are anonymized already). We apply LDA on the user-word matrix UW: UW = UM * MW , where UM is the user-hidden matrix, MW is the hidden-word matrix, and m is the manually-chosen size of hidden user categories. UM indicates the distribution of each user to the hidden user categories (e.g., age). We then use UMu,*, the row vector of UM for the user u, as a feature set. In brief, we apply LDA to extract the user-hidden vector UMu,* for both source and destination nodes of a link to model user signature (UG). 2.4 User-Topic Interaction Modeling user-topic interaction turns out to be non-trivial. It is not useful to exploit latent semantic analysis directly on the user-topic matrix UR = UQ * QR , where UR represents how many times each user is diffused for existing topic R (R ∈ T), because UR does not contain information of novel topics, and neither do UQ and QR. Given no propagation record about novel topics, we propose a method that allows us to still extract implicit user-topic information. First, we extract from the matrix TH (described in Section 2.2) a subset RH that contains only information about existing topics. Next we apply left division to derive another userhidden matrix UH: UH = (RH \ URT)T = ((RHT RH )-1 RHT URT)T Using left division, we generate the UH matrix using existing topic information. Finally, we exploit UHu,*, the row vector of the user-hidden matrix UH for the user u, as a feature set. Note that novel topics were included in the process of learning the hidden topic categories on RH; therefore the features learned here do implicitly utilize some latent information of novel topics, which is not the case for UM. Experiments confirm the superiority of our approach. Furthermore, our approach ensures that the hidden categories in topic-hidden and user-hidden matrices are identical. Intuitively, our method directly models the user’s preference to topics’ signature (e.g., how capable is this user to propagate topics in politics category?). In contrast, the UM mentioned in Section 2.3 represents the users’ signature (e.g., aggressiveness) and has nothing to do with their opinions on a topic. In short, we obtain the user-hidden probability vector UHu,* as a feature set, which models user preferences to latent categories (UPLC). 2.5 Global Features Given a candidate link, we can extract global social features such as in-degree (ID) and outdegree (OD). We tried other features such as PageRank values but found them not useful. Moreover, we extract the number of distinct topics (NDT) for a link as a feature. The intuition behind this is that the more distinct topics a user has diffused to another, the more likely the diffusion will happen for novel topics. 
346 2.6 Complexity Analysis The complexity to produce each feature is as below: (1) Topic information: O(I * |T| * h * Bt) for LDA using Gibbs sampling, where Iis # of the iterations in sampling, |T| is # of topics, and Bt is the average # of tokens in a topic. (2) User information: O(I * |V| * m * Bu) , where |V| is # of users, and Bu is the average # of tokens for a user. (3) User-topic interaction: the time complexity is O(h3 + h2 * |T| + h * |T| * |V|). (4) Global features: O(|D|), where |D| is # of diffusions. 3 Experiments For evaluation, we try to use the diffusion records of old topics to predict whether a diffusion link exists between two nodes given a new topic. 3.1 Dataset and Evaluation Metric We first identify 100 most popular topic (e.g., earthquake) from the Plurk micro-blog site between 01/201 1 and 05/201 1. Plurk is a popular micro-blog service in Asia with more than 5 million users (Kuo et al., 2011). We manually separate the 100 topics into 7 groups. We use topic-wise 4-fold cross validation to evaluate our method, because there are only 100 available topics. For each group, we select 3/4 of the topics as training and 1/4 as validation. The positive diffusion records are generated based on the post-response behavior. That is, if a person x posts a message containing one of the selected topic t, and later there is a person y responding to this message, we consider a diffusion of t has occurred from x to y (i.e., (x, y, t) is a positive instance). Our dataset contains a total of 1,642,894 positive instances out of 100 distinct topics; the largest and smallest topic contains 303,424 and 2,166 diffusions, respectively. Also, the same amount of negative instances for each topic (totally 1,642,894) is sampled for binary classification (similar to the setup in KDD Cup 2011 Track 2). The negative links of a topic t are sampled randomly based on the absence of responses for that given topic. The underlying social network is created using the post-response behavior as well. We assume there is an acquaintance link between x and y if and only if x has responded to y (or vice versa) on at least one topic. Eventually we generated a social network of 163,034 nodes and 382,878 links. Furthermore, the sets of keywords for each topic are required to create the TW and UW matrices for latent topic analysis; we simply extract the content of posts and responses for each topic to create both matrices. We set the hidden category number h = m = 7, which is equal to the number of topic groups. We use area under ROC curve (AUC) to evaluate our proposed framework (Davis and Goadrich, 2006); we rank the testing instances based on their likelihood of being positive, and compare it with the ground truth to compute AUC. 3.2 Implementation and Baseline After trying many classifiers and obtaining similar results for all of them, we report only results from LIBLINEAR with c=0.0001 (Fan et al., 2008) due to space limitation. We remove stop-words, use SCWS (Hightman, 2012) for tokenization, and MALLET (McCallum, 2002) and GibbsLDA++ (Phan and Nguyen, 2007) for LDA. There are three baseline models we compare the result with. First, we simply use the total number of existing diffusions among all topics between two nodes as the single feature for prediction. Second, we exploit the independent cascading model (Kempe et al., 2003), and utilize the normalized total number of diffusions as the propagation probability of each link. 
Third, we try the heat diffusion model (Ma et al., 2008), set initial heat proportional to out-degree, and tune the diffusion time parameter until the best results are obtained. Note that we did not compare with any data-driven approaches, as we have not identified one that can predict diffusion of novel topics. 3.3 Results The result of each model is shown in Table 1. All except two features outperform the baseline. The best single feature is TS. Note that UPLC performs better than UG, which verifies our hypothesis that maintaining the same hidden features across different LDA models is better. We further conduct experiments to evaluate different combinations of features (Table 2), and found that the best one (TS + ID + NDT) results in about 16% improvement over the baseline, and outperforms the combination of all features. As stated in (Witten et al., 2011), 347 adding useless features may cause the performance of classifiers to deteriorate. Intuitively, TS captures both latent topic and historical diffusion information, while ID and NDT provide complementary social characteristics of users. 4 Conclusions The main contributions of this paper are as below: 1. We propose a novel task of predicting the diffusion of unseen topics, which has wide applications in real-world. 2. Compared to the traditional model-driven or content-independent data-driven works on diffusion analysis, our solution demonstrates how one can bring together ideas from two different but promising areas, NLP and SNA, to solve a challenging problem. 3. Promising experiment result (74% in AUC) not only demonstrates the usefulness of the proposed models, but also indicates that predicting diffusion of unseen topics without historical diffusion data is feasible. Acknowledgments This work was also supported by National Science Council, National Taiwan University and Intel Corporation under Grants NSC 100-291 1-I-002-001, and 101R7501. References David M. Blei, Andrew Y. Ng & Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3.993-1022. Jesse Davis & Mark Goadrich. 2006. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning, Pittsburgh, Pennsylvania. Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, XiangRui Wang & Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res., 9.1871-74. Hongliang Fei, Ruoyi Jiang, Yuhao Yang, Bo Luo & Jun Huan. 2011. Content based social behavior prediction: a multi-task learning approach. Proceedings of the 20th ACM international conference on Information and knowledge management, Glasgow, Scotland, UK. Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty, Zoran Despotovic & Wolfgang Kellerer. 2010. Outtweeting the twitterers - predicting information cascades in microblogs. Proceedings of the 3rd conference on Online social networks, Boston, MA. Hightman. 2012. Simple Chinese Words Segmentation (SCWS). David Kempe, Jon Kleinberg & Eva Tardos. 2003. Maximizing the spread of influence through a social network. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, Washington, D.C. Tsung-Ting Kuo, San-Chuan Hung, Wei-Shih Lin, Shou-De Lin, Ting-Chun Peng & Chia-Chun Shih. 2011. Assessing the Quality of Diffusion Models Using Real-World Social Network Data. Conference on Technologies and Applications of Artificial Intelligence, 2011. C.X. Lin, Q.Z. Mei, Y.L. Jiang, J.W. Han & S.X. Qi. 2011. 
Inferring the Diffusion and Evolution of Topics in Social Communities. Proceedings of the IEEE International Conference on Data Mining, 2011. Hao Ma, Haixuan Yang, Michael R. Lyu & Irwin King. 2008. Mining social networks using heat diffusion processes for marketing candidates selection. Proceeding of the 17th ACM conference on Information and knowledge management, Napa Valley, California, USA. Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. Sasa Petrovic, Miles Osborne & Victor Lavrenko. 2011. RT to Win! Predicting Message Propagation in Twitter. International AAAI Conference on Weblogs and Social Media, 2011. 348 Xuan-Hieu Phan & Cam-Tu Nguyen. 2007. GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA). Ian H. Witten, Eibe Frank & Mark A. Hall. 2011. Data Mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann Publishers Inc. Jiang Zhu, Fei Xiong, Dongzhen Piao, Yun Liu & Ying Zhang. 2011. Statistically Modeling the Effectiveness of Disaster Information in Social Media. Proceedings of the 2011 IEEE Global Humanitarian Technology Conference.

4 0.75684303 154 acl-2012-Native Language Detection with Tree Substitution Grammars

Author: Benjamin Swanson ; Eugene Charniak

Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.

5 0.7491951 78 acl-2012-Efficient Search for Transformation-based Inference

Author: Asher Stern ; Roni Stern ; Ido Dagan ; Ariel Felner

Abstract: This paper addresses the search problem in textual inference, where systems need to infer one piece of text from another. A prominent approach to this task is attempts to transform one text into the other through a sequence of inference-preserving transformations, a.k.a. a proof, while estimating the proof’s validity. This raises a search challenge of finding the best possible proof. We explore this challenge through a comprehensive investigation of prominent search algorithms and propose two novel algorithmic components specifically designed for textual inference: a gradient-style evaluation function, and a locallookahead node expansion method. Evaluations, using the open-source system, BIUTEE, show the contribution of these ideas to search efficiency and proof quality.

6 0.74345529 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation

7 0.70871711 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing

8 0.69349253 31 acl-2012-Authorship Attribution with Author-aware Topic Models

9 0.68352544 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

10 0.67907929 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

11 0.67472047 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

12 0.65370381 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition

13 0.65057486 98 acl-2012-Finding Bursty Topics from Microblogs

14 0.6472376 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation

15 0.64292836 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning

16 0.63827842 167 acl-2012-QuickView: NLP-based Tweet Search

17 0.63750756 171 acl-2012-SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations

18 0.63021439 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale

19 0.62944829 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction

20 0.61180294 139 acl-2012-MIX Is Not a Tree-Adjoining Language