acl acl2012 acl2012-84 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Elif Yamangil ; Stuart Shieber
Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. [sent-4, score-0.647]
2 Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). [sent-5, score-0.374]
3 We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data. [sent-6, score-0.306]
4 1 Introduction There is a deep tension in statistical modeling of grammatical structure between providing good expressivity, to allow accurate modeling of the data with sparse grammars, and low complexity, making induction of the grammars and parsing of novel sentences computationally practical. [sent-7, score-0.359]
5 Recent work that incorporated Dirichlet process (DP) nonparametric models into TSGs has provided an efficient solution to the problem of segmenting training data trees into elementary parse tree fragments to form the grammar (Cohn et al. [sent-8, score-0.388]
6 The elementary trees combined in a TSG are, intuitively, primitives of the language, yet certain linguistic phenomena (notably various forms of modification) “split them up”, preventing their reuse, leading to less sparse grammars than might be ideal. [sent-11, score-0.262]
7 TSGs are a special case of the more flexible grammar formalism of tree adjoining grammar (TAG) (Joshi et al. [sent-13, score-0.286]
8 TAG augments TSG with an adjunction operator and a set of auxiliary trees in addition to the substitution operator and initial trees of TSG, allowing for “splicing in” of syntactic fragments within trees. [sent-15, score-0.7]
9 In the example, by augmenting a TSG with an operation of adjunction, a grammar that hypothesizes auxiliary trees corresponding to adjoining “[NN former NN]”, “[NN NN of the university]”, and “[NN NN who resigned yesterday]” would be able to reuse the basic structure “[NP the [NN president]]”. [sent-16, score-0.504]
10 Unfortunately, TAG’s expressivity comes at the cost of greatly increased complexity. [sent-17, score-0.057]
11 Parsing complexity for unconstrained TAG scales as O(n^6). [sent-18, score-0.021]
12 1 This has led researchers to resort to heuristic grammar extraction techniques (Chiang, 2000; Carreras et al. [sent-23, score-0.063]
13 , 2008) or using a very small number of grammar categories (Hwa, 1998). [sent-24, score-0.063]
14 Hwa (1998) first proposed to use tree-insertion grammars (TIG), a kind of expressive compromise between TSG and TAG, as a substrate on which to build grammatical inference. [sent-25, score-0.116]
15 TIG constrains the adjunction operation so that spliced-in material falls completely to the left or completely to the right of the splice point. [sent-26, score-0.319]
16 By restricting the form of possible auxiliary trees to only left or right auxiliary trees in this way, TIG remains within the realm of context-free formalisms (with cubic complexity) while still modeling rich linguistic phenomena (Schabes and Waters, 1995). [sent-27, score-0.725]
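To make the constraint concrete, here is a hedged sketch (the tree encoding and function names are illustrative, not from the paper) that checks whether an auxiliary tree is a left or right auxiliary tree by inspecting the position of its foot node on the frontier.

```python
# Minimal sketch (assumed encoding): a tree is a ("label", [children]) tuple,
# and the foot node's label ends with "*".

def frontier(tree):
    """Return the frontier (leaf labels) of a tree, left to right."""
    label, children = tree
    if not children:
        return [label]
    leaves = []
    for child in children:
        leaves.extend(frontier(child))
    return leaves

def classify_auxiliary(tree):
    """Classify an auxiliary tree as 'left', 'right', or 'wrapping'.

    Convention assumed here: a left auxiliary tree contributes material to the
    left of the splice point, so its foot node is the rightmost frontier node;
    a right auxiliary tree is the mirror image. TIG admits only the first two;
    wrapping auxiliary trees (material on both sides of the foot) are excluded.
    """
    leaves = frontier(tree)
    feet = [i for i, leaf in enumerate(leaves) if leaf.endswith("*")]
    assert len(feet) == 1, "an auxiliary tree has exactly one foot node"
    if feet[0] == len(leaves) - 1:
        return "left"
    if feet[0] == 0:
        return "right"
    return "wrapping"

# "[NN former NN*]" from the running example inserts "former" to the left:
print(classify_auxiliary(("NN", [("former", []), ("NN*", [])])))  # -> left
```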
17 Shindo et al. (2011) have provided a previous attempt at combining TIG and Bayesian nonparametric principles, albeit with severe limitations. [sent-30, score-0.126]
18 Their TIG variant (which we will refer to as TIG0) is highly constrained in the following ways. [sent-31, score-0.04]
19 The foot node in an auxiliary tree must be the immediate child of the root node. [sent-33, score-0.373]
20 (Schabes and Waters, 1995) Figure 2: TIG-to-TSG transform: (a) and (b) illustrate transformed TSG derivations for two different TIG derivations of the same parse tree structure. [sent-37, score-0.19]
21 The TIG nodes where we illustrate the transformation are in bold. [sent-38, score-0.031]
22 Even modeling multiple adjunction with root adjunction is disallowed. [sent-41, score-0.413]
23 There is thus no recursion possibility with adjunction, no stacking of auxiliary trees. [sent-42, score-0.193]
24 As a consequence of the prior two constraints, no adjunction along the spines of auxiliary trees is allowed. [sent-44, score-0.54]
25 As a consequence of the first constraint, all nonterminals along the spine of an auxiliary tree are identical. [sent-46, score-0.305]
26 In this paper we explore a Bayesian nonparametric model for estimating a far more expressive version of TIG, and compare its performance against TSG and the restricted TIG0 variant. [sent-47, score-0.176]
27 Auxiliary trees may have the foot node at depth greater than one. [sent-50, score-0.184]
28 Both left and right adjunctions may occur at the same node. [sent-52, score-0.17]
29 Simultaneous adjunction (that is, more than one left or right adjunction per node) is allowed via root adjunction. [sent-54, score-0.486]
30 Adjunctions may occur along the spines of auxiliary trees. [sent-56, score-0.268]
31 The increased expressivity of our TIG variant is motivated both linguistically and practically. [sent-57, score-0.13]
32 From a linguistic point of view: Deeper auxiliary trees can help model large patterns of insertion and potential correlations between lexical items that extend over multiple levels of tree. [sent-58, score-0.368]
33 Combining left and right auxiliary trees can help model modifiers of the same node from left and right (combination of adjectives). [Footnote 2: Throughout the paper, we will refer to the depth of an auxiliary tree to indicate the length of its spine.] [sent-59, score-0.815]
34 Simultaneous insertion allows us to deal with multiple independent modifiers for the same constituent (for example, a series of adjectives). [sent-61, score-0.124]
35 From a practical point of view, we show that an induced TIG provides modeling performance superior to TSG and comparable with TIG0. [sent-62, score-0.022]
36 However, we show that the grammars we induce are compact yet rich, in that they succinctly represent complex linguistic structures. [sent-63, score-0.217]
37 2 Probabilistic Model In the basic nonparametric TSG model, there is an independent DP for every grammar category (such as c = NP), each of which uses a base distribution P0 that generates an initial tree by making stepwise decisions. [sent-64, score-0.267]
38 $G_c^{\mathrm{init}} \sim \mathrm{DP}(\alpha_c^{\mathrm{init}},\; P_0^{\mathrm{init}}(\cdot \mid c))$. The canonical $P_0$ uses a probabilistic CFG that is fixed a priori to sample CFG rules top-down and Bernoulli variables for determining where substitutions should occur (Cohn et al. [sent-65, score-0.024]
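A hedged sketch of this base distribution (the toy PCFG, parameter values, and function names are placeholders; in the model the PCFG is fixed a priori from treebank counts): nonterminals are expanded top-down, and each nonterminal child is either left as a substitution site or expanded further, by a Bernoulli draw.

```python
import random

# Toy PCFG: lhs -> list of (rhs, probability).  Placeholder values only.
PCFG = {
    "NP": [(("DT", "NN"), 0.7), (("NNP",), 0.3)],
    "DT": [(("the",), 1.0)],
    "NN": [(("president",), 0.6), (("university",), 0.4)],
    "NNP": [(("Sharon",), 1.0)],
}
STOP = 0.5  # Bernoulli probability of leaving a nonterminal as a substitution site

def sample_rhs(lhs):
    r, acc = random.random(), 0.0
    for rhs, p in PCFG[lhs]:
        acc += p
        if r <= acc:
            return rhs
    return PCFG[lhs][-1][0]

def sample_initial_tree(root):
    """Sample one initial elementary tree rooted at `root` from P0."""
    children = []
    for sym in sample_rhs(root):
        if sym not in PCFG:                    # terminal: always kept in the fragment
            children.append((sym, []))
        elif random.random() < STOP:           # Bernoulli: leave as substitution site
            children.append((sym + "@", []))
        else:                                  # otherwise keep expanding top-down
            children.append(sample_initial_tree(sym))
    return (root, children)

print(sample_initial_tree("NP"))
```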
39 We extend this model by adding specialized DPs for left and right auxiliary trees. [sent-67, score-0.288]
40 $G_c^{\mathrm{right}} \sim \mathrm{DP}(\alpha_c^{\mathrm{right}},\; P_0^{\mathrm{right}}(\cdot \mid c))$. Therefore, we have an exchangeable process for generating right auxiliary trees $p(a_j \mid a_1, \ldots, a_{j-1})$. [sent-68, score-0.338]
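The exchangeable process takes the standard Dirichlet-process predictive form; the following is a hedged sketch (assumed names, with the base distribution passed in as a callable) rather than the authors' code.

```python
from collections import Counter

def dp_predictive(aux_tree, previous, alpha, p0):
    """p(a_j | a_1, ..., a_{j-1}) under DP(alpha, p0) for one grammar category.

    `previous` holds the right auxiliary trees already generated for category c,
    and `p0` is the base distribution P0^right(. | c).  Trees seen before are
    reused with probability proportional to their count; otherwise mass goes to
    the base distribution.
    """
    counts = Counter(previous)
    n = len(previous)
    return (counts[aux_tree] + alpha * p0(aux_tree)) / (n + alpha)

# Example with a trivial base distribution over three candidate trees:
uniform_p0 = lambda tree: 1.0 / 3
history = ["[NN former NN*]", "[NN former NN*]", "[NN NN* of the university]"]
print(dp_predictive("[NN former NN*]", history, alpha=1.0, p0=uniform_p0))
```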
41 Fortunately, Schabes and Waters (1995) provide an (exact) transformation from a fully general TIG into a TSG that generates the same string languages. [sent-70, score-0.031]
42 It is then straightforward to represent this TSG as a CFG using the Goodman transform (Goodman, 2002; Cohn and Blunsom, 2010). [sent-71, score-0.033]
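A hedged, simplified sketch of that encoding (illustrative names; rule probabilities and the TIG-specific transformation of Schabes and Waters are omitted): each elementary tree is flattened into CFG rules whose internal nodes carry unique indices, so that a CFG derivation identifies exactly one TSG derivation.

```python
import itertools

_fresh = itertools.count()

def add_elementary_tree(tree, rules):
    """Encode one TSG elementary tree as CFG rules, Goodman-style.

    Trees are ("label", [children]) tuples; substitution sites carry a trailing
    "@" and rewrite as the bare category, so substitution becomes ordinary CFG
    rewriting.  The bare root category rewrites to an indexed copy of the root,
    which lets a CFG derivation recover which elementary tree was used.
    """
    def walk(node, lhs):
        label, children = node
        rhs = []
        for c_label, c_children in children:
            if not c_children:                         # terminal or substitution site
                rhs.append(c_label.rstrip("@"))
            else:                                      # internal node: fresh indexed symbol
                indexed = "%s_%d" % (c_label, next(_fresh))
                rhs.append(indexed)
                walk((c_label, c_children), indexed)
        rules.append((lhs, tuple(rhs)))

    root = tree[0]
    indexed_root = "%s_%d" % (root, next(_fresh))
    rules.append((root, (indexed_root,)))              # selecting this tree at category `root`
    walk(tree, indexed_root)
    return rules

rules = []
# An illustrative elementary tree "[NP [DT the] NN@]" with a substitution site at NN:
add_elementary_tree(("NP", [("DT", [("the", [])]), ("NN@", [])]), rules)
print(rules)
```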
43 4 Evaluation Results We use the standard Penn treebank methodology of training on sections 2–21 and testing on section 23. [sent-73, score-0.021]
44 As has become standard, we carried out a small treebank experiment where we train on Section 2, and a large one where we train on the full training set. [sent-75, score-0.021]
45 Parsing results are based on the maximum probability parse which was obtained by sampling derivations under the transform CFG. [sent-78, score-0.116]
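A hedged sketch of that decoding step (assumed names; the derivation sampler itself is not shown): because many derivations can yield the same parse tree, the sampled derivations are mapped to trees and each tree's probability is estimated by its sample frequency.

```python
from collections import Counter

def max_probability_parse(sampled_derivations, derivation_to_tree):
    """Return the maximum-probability parse estimated from sampled derivations.

    Marginalizes over derivations by mapping each sample to its parse tree and
    counting; the most frequent tree is a Monte Carlo estimate of the maximum
    probability parse.  Sketch only, not the authors' implementation.
    """
    tree_counts = Counter(derivation_to_tree(d) for d in sampled_derivations)
    best_tree, _count = tree_counts.most_common(1)[0]
    return best_tree
```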
46 We compare our system (referred to as TIG) to our implementation of the TSG system of (Cohn and Blunsom, 2010) (referred to as TSG) and the constrained TIG variant of (Shindo et al. [sent-79, score-0.04]
47 4 for TSG, TIG0 and TIG respectively), on the small dataset insertion helps the nonparametric model find more compact and generalizable representations for the data, which affects parsing performance (Figure 4). [sent-84, score-0.365]
48 Although TIG0 has performance close to TIG, note that TIG achieves this performance using a more succinct representation and extracting a rich set of auxiliary trees. [sent-85, score-0.257]
49 As a result, TIG finds many chances to apply insertions to test sentences, whereas TIG0 depends mostly on TSG rules. [sent-86, score-0.067]
50 If we look at the most likely derivations for the test data, TIG0 assigns 663 insertions (351 left insertions) in parsing the entire Section 23, while TIG assigns 3924 (2100 left insertions). [sent-87, score-0.266]
51 Some of these linguistically sophisticated auxiliary trees that apply to test data are listed in Figure 3. [sent-88, score-0.305]
52 5 Conclusion We described a nonparametric Bayesian inference scheme for estimating TIG grammars and showed the power of the TIG formalism over TSG for returning rich, generalizable, yet compact representations of data. [sent-89, score-0.427]
53 The nonparametric inference scheme presents a principled way of addressing the difficult model selection problem with TIG, which has been prohibitive in this area of research. [sent-90, score-0.129]
54 TIG still remains context-free, and both our sampling and parsing techniques are highly scalable. [sent-91, score-0.096]
55 An empirical evaluation of probabilistic lexicalized tree insertion grammars. [sent-127, score-0.174]
56 Bayesian inference for PCFGs via Markov chain Monte Carlo. [sent-132, score-0.025]
57 Tree insertion grammar: a cubic-time parsable formalism that lexicalizes context-free grammar without changing the trees produced. [sent-152, score-0.295]
58 In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT ’11, pages 206–211, Stroudsburg, PA, USA. [sent-159, score-0.025]
wordName wordTfidf (topN-words)
[('tig', 0.716), ('tsg', 0.361), ('auxiliary', 0.193), ('adjunction', 0.183), ('nn', 0.164), ('cohn', 0.121), ('np', 0.115), ('nonparametric', 0.104), ('insertion', 0.096), ('bayesian', 0.091), ('grammars', 0.088), ('trees', 0.079), ('tree', 0.078), ('schabes', 0.076), ('blunsom', 0.073), ('cfg', 0.071), ('president', 0.069), ('insertions', 0.067), ('yesterday', 0.067), ('shindo', 0.067), ('resigned', 0.067), ('compact', 0.064), ('grammar', 0.063), ('waters', 0.061), ('dp', 0.061), ('expressivity', 0.057), ('derivations', 0.056), ('left', 0.051), ('adjunctions', 0.051), ('foot', 0.051), ('spines', 0.051), ('adjoining', 0.047), ('substitution', 0.046), ('tag', 0.044), ('estimating', 0.044), ('right', 0.044), ('yet', 0.043), ('morristown', 0.043), ('operator', 0.043), ('rich', 0.042), ('parsing', 0.041), ('tsgs', 0.041), ('variant', 0.04), ('proposal', 0.038), ('dop', 0.038), ('generalizable', 0.036), ('nj', 0.035), ('formalism', 0.035), ('reuse', 0.034), ('shieber', 0.034), ('consequence', 0.034), ('fragments', 0.034), ('transform', 0.033), ('linguistically', 0.033), ('transformation', 0.031), ('simultaneous', 0.031), ('hwa', 0.03), ('elementary', 0.03), ('goodman', 0.029), ('carreras', 0.028), ('modifiers', 0.028), ('depth', 0.028), ('free', 0.028), ('expressive', 0.028), ('sharon', 0.027), ('sampling', 0.027), ('association', 0.026), ('node', 0.026), ('stroudsburg', 0.026), ('joshi', 0.026), ('trevor', 0.026), ('phil', 0.026), ('inference', 0.025), ('root', 0.025), ('papers', 0.025), ('occur', 0.024), ('representations', 0.024), ('modeling', 0.022), ('succinct', 0.022), ('succinctly', 0.022), ('severe', 0.022), ('stepwise', 0.022), ('ferred', 0.022), ('dps', 0.022), ('contextfree', 0.022), ('ern', 0.022), ('exchangeable', 0.022), ('lexicalizes', 0.022), ('masako', 0.022), ('npr', 0.022), ('primitives', 0.022), ('sity', 0.022), ('suppress', 0.022), ('treebank', 0.021), ('complexity', 0.021), ('post', 0.021), ('operation', 0.021), ('relaxation', 0.02), ('constrains', 0.02), ('tension', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
Author: Elif Yamangil ; Stuart Shieber
Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.
2 0.35341579 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
Author: Hiroyuki Shindo ; Yusuke Miyao ; Akinori Fujino ; Masaaki Nagata
Abstract: We propose Symbol-Refined Tree Substitution Grammars (SR-TSGs) for syntactic parsing. An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.
3 0.28077096 154 acl-2012-Native Language Detection with Tree Substitution Grammars
Author: Benjamin Swanson ; Eugene Charniak
Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.
4 0.080835238 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
Author: Xiao Chen ; Chunyu Kit
Abstract: This paper presents a higher-order model for constituent parsing aimed at utilizing more local structural context to decide the score of a grammar rule instance in a parse tree. Experiments on English and Chinese treebanks confirm its advantage over its first-order version. It achieves its best F1 scores of 91.86% and 85.58% on the two languages, respectively, and further pushes them to 92.80% and 85.60% via combination with other highperformance parsers.
5 0.071939051 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
Author: Adam Pauls ; Dan Klein
Abstract: We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context. We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours. We evaluate on perplexity and a range of grammaticality tasks, and find that we perform as well or better than n-gram models and other generative baselines. Our model even competes with state-of-the-art discriminative models hand-designed for the grammaticality tasks, despite training on positive data alone. We also show fluency improvements in a preliminary machine translation experiment.
6 0.064510927 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars
7 0.063714363 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
8 0.059864726 108 acl-2012-Hierarchical Chunk-to-String Translation
9 0.059792422 83 acl-2012-Error Mining on Dependency Trees
10 0.05337863 139 acl-2012-MIX Is Not a Tree-Adjoining Language
11 0.052763537 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees
12 0.049711943 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
13 0.046635326 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction
14 0.043754973 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
15 0.042395234 140 acl-2012-Machine Translation without Words through Substring Alignment
16 0.041609034 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies
17 0.040754724 181 acl-2012-Spectral Learning of Latent-Variable PCFGs
18 0.039339554 71 acl-2012-Dependency Hashing for n-best CCG Parsing
19 0.036764689 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation
20 0.035700794 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
topicId topicWeight
[(0, -0.124), (1, -0.013), (2, -0.119), (3, -0.114), (4, -0.163), (5, -0.023), (6, -0.018), (7, 0.154), (8, 0.031), (9, 0.031), (10, -0.08), (11, -0.249), (12, -0.128), (13, 0.08), (14, 0.1), (15, -0.274), (16, 0.059), (17, -0.192), (18, 0.068), (19, 0.114), (20, 0.089), (21, 0.153), (22, -0.026), (23, -0.017), (24, 0.194), (25, 0.073), (26, -0.046), (27, 0.087), (28, 0.016), (29, 0.023), (30, 0.021), (31, -0.105), (32, 0.087), (33, -0.006), (34, 0.015), (35, -0.112), (36, 0.03), (37, 0.049), (38, 0.075), (39, -0.015), (40, -0.104), (41, -0.084), (42, 0.021), (43, 0.015), (44, -0.021), (45, 0.03), (46, -0.07), (47, 0.007), (48, -0.039), (49, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.94959217 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
Author: Elif Yamangil ; Stuart Shieber
Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.
2 0.87928629 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
Author: Hiroyuki Shindo ; Yusuke Miyao ; Akinori Fujino ; Masaaki Nagata
Abstract: We propose Symbol-Refined Tree Substitution Grammars (SR-TSGs) for syntactic parsing. An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.
3 0.8398295 154 acl-2012-Native Language Detection with Tree Substitution Grammars
Author: Benjamin Swanson ; Eugene Charniak
Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.
4 0.40134403 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts
Author: Stephen Tyndall
Abstract: This paper presents the problem within Hittite and Ancient Near Eastern studies of fragmented and damaged cuneiform texts, and proposes to use well-known text classification metrics, in combination with some facts about the structure of Hittite-language cuneiform texts, to help classify a number of fragments of clay cuneiform-script tablets into more complete texts. In particular, I propose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. The performance in some cases is improved, and in some cases very much not, suggesting that the variable frequency of occurrence of these ideograms in individual fragments makes considerable difference in the ideal choice for a classification method. Further, complexities of the writing system and the digital availability of Hittite texts complicate the problem.
5 0.36466202 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars
Author: Andreas Maletti ; Joost Engelfriet
Abstract: Recently, it was shown (KUHLMANN, SATTA: Tree-adjoining grammars are not closed under strong lexicalization. Comput. Linguist., 2012) that finitely ambiguous tree adjoining grammars cannot be transformed into a normal form (preserving the generated tree language), in which each production contains a lexical symbol. A more powerful model, the simple context-free tree grammar, admits such a normal form. It can be effectively constructed and the maximal rank of the nonterminals only increases by 1. Thus, simple context-free tree grammars strongly lexicalize tree adjoining grammars and themselves.
6 0.32140201 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
7 0.30972081 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
8 0.30834696 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction
9 0.26417243 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
10 0.25443476 196 acl-2012-The OpenGrm open-source finite-state grammar software libraries
11 0.23937857 83 acl-2012-Error Mining on Dependency Trees
12 0.21860656 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
13 0.18503472 108 acl-2012-Hierarchical Chunk-to-String Translation
14 0.18281022 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
15 0.17148218 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs
16 0.16812764 112 acl-2012-Humor as Circuits in Semantic Networks
17 0.16650146 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
18 0.16094416 139 acl-2012-MIX Is Not a Tree-Adjoining Language
19 0.14743349 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
20 0.14600417 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies
topicId topicWeight
[(7, 0.014), (13, 0.022), (26, 0.034), (28, 0.026), (30, 0.031), (37, 0.032), (39, 0.045), (49, 0.011), (57, 0.011), (58, 0.226), (74, 0.016), (82, 0.027), (84, 0.023), (85, 0.028), (90, 0.084), (92, 0.174), (94, 0.015), (99, 0.085)]
simIndex simValue paperId paperTitle
same-paper 1 0.794523 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
Author: Elif Yamangil ; Stuart Shieber
Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.
2 0.69207728 145 acl-2012-Modeling Sentences in the Latent Space
Author: Weiwei Guo ; Mona Diab
Abstract: Sentence Similarity is the process of computing a similarity score between two sentences. Previous sentence similarity work finds that latent semantics approaches to the problem do not perform well due to insufficient information in single sentences. In this paper, we show that by carefully handling words that are not in the sentences (missing words), we can train a reliable latent variable model on sentences. In the process, we propose a new evaluation framework for sentence similarity: Concept Definition Retrieval. The new framework allows for large scale tuning and testing of Sentence Similarity models. Experiments on the new task and previous data sets show significant improvement of our model over baselines and other traditional latent variable models. Our results indicate comparable and even better performance than current state of the art systems addressing the problem of sentence similarity.
3 0.67889315 205 acl-2012-Tweet Recommendation with Graph Co-Ranking
Author: Rui Yan ; Mirella Lapata ; Xiaoming Li
Abstract: Twitter enables users to send and read text-based posts of up to 140 characters, known as tweets. As one of the most popular micro-blogging services, Twitter attracts millions of users, producing millions of tweets daily. Shared information through this service spreads faster than would have been possible with traditional sources, however the proliferation of user-generation content poses challenges to browsing and finding valuable information. In this paper we propose a graph-theoretic model for tweet recommendation that presents users with items they may have an interest in. Our model ranks tweets and their authors simultaneously using several networks: the social network connecting the users, the network connecting the tweets, and a third network that ties the two together. Tweet and author entities are ranked following a co-ranking algorithm based on the intuition that there is a mutually reinforcing relationship between tweets and their authors that could be reflected in the rankings. We show that this framework can be parametrized to take into account user preferences, the popularity of tweets and their authors, and diversity. Experimental evaluation on a large dataset shows that our model outperforms competitive approaches by a large margin.
4 0.67625648 154 acl-2012-Native Language Detection with Tree Substitution Grammars
Author: Benjamin Swanson ; Eugene Charniak
Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.
5 0.67452234 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks
Author: Tsung-Ting Kuo ; San-Chuan Hung ; Wei-Shih Lin ; Nanyun Peng ; Shou-De Lin ; Wei-Fen Lin
Abstract: This paper brings a marriage of two seemly unrelated topics, natural language processing (NLP) and social network analysis (SNA). We propose a new task in SNA which is to predict the diffusion of a new topic, and design a learning-based framework to solve this problem. We exploit the latent semantic information among users, topics, and social connections as features for prediction. Our framework is evaluated on real data collected from public domain. The experiments show 16% AUC improvement over baseline methods. The source code and dataset are available at http://www.csie.ntu.edu.tw/~d97944007/dif fusion/ 1 Background The diffusion of information on social networks has been studied for decades. Generally, the proposed strategies can be categorized into two categories, model-driven and data-driven. The model-driven strategies, such as independent cascade model (Kempe et al., 2003), rely on certain manually crafted, usually intuitive, models to fit the diffusion data without using diffusion history. The data-driven strategies usually utilize learning-based approaches to predict the future propagation given historical records of prediction (Fei et al., 2011; Galuba et al., 2010; Petrovic et al., 2011). Data-driven strategies usually perform better than model-driven approaches because the past diffusion behavior is used during learning (Galuba et al., 2010). Recently, researchers started to exploit content information in data-driven diffusion models (Fei et al., 2011; Petrovic et al., 2011; Zhu et al., 2011). 344 However, most of the data-driven approaches assume that in order to train a model and predict the future diffusion of a topic, it is required to obtain historical records about how this topic has propagated in a social network (Petrovic et al., 2011; Zhu et al., 2011). We argue that such assumption does not always hold in the real-world scenario, and being able to forecast the propagation of novel or unseen topics is more valuable in practice. For example, a company would like to know which users are more likely to be the source of ‘viva voce’ of a newly released product for advertising purpose. A political party might want to estimate the potential degree of responses of a half-baked policy before deciding to bring it up to public. To achieve such goal, it is required to predict the future propagation behavior of a topic even before any actual diffusion happens on this topic (i.e., no historical propagation data of this topic are available). Lin et al. also propose an idea aiming at predicting the inference of implicit diffusions for novel topics (Lin et al., 2011). The main difference between their work and ours is that they focus on implicit diffusions, whose data are usually not available. Consequently, they need to rely on a model-driven approach instead of a datadriven approach. On the other hand, our work focuses on the prediction of explicit diffusion behaviors. Despite the fact that no diffusion data of novel topics is available, we can still design a data- driven approach taking advantage of some explicit diffusion data of known topics. Our experiments show that being able to utilize such information is critical for diffusion prediction. 2 The Novel-Topic Diffusion Model We start by assuming an existing social network G = (V, E), where V is the set of nodes (or user) v, and E is the set of link e. 
The set of topics is Proce dJienjgus, R ofep thueb 5lic0t hof A Knonruea ,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fso rc Ciatoiomnp fuotart Cio nmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi3c 4s4–348, denoted as T. Among them, some are considered as novel topics (denoted as N), while the rest (R) are used as the training records. We are also given a set of diffusion records D = {d | d = (src, dest, t) }, where src is the source node (or diffusion source), dest is the destination node, and t is the topic of the diffusion that belongs to R but not N. We assume that diffusions cannot occur between nodes without direct social connection; any diffusion pair implies the existence of a link e = (src, dest) ∈ E. Finally, we assume there are sets of keywords or tags that relevant to each topic (including existing and novel topics). Note that the set of keywords for novel topics should be seen in that of existing topics. From these sets of keywords, we construct a topicword matrix TW = (P(wordj | topici))i,j of which the elements stand for the conditional probabilities that a word appears in the text of a certain topic. Similarly, we also construct a user-word matrix UW= (P(wordj | useri))i,j from these sets of keywords. Given the above information, the goal is to predict whether a given link is active (i.e., belongs to a diffusion link) for topics in N. 2.1 The Framework The main challenge of this problem lays in that the past diffusion behaviors of new topics are missing. To address this challenge, we propose a supervised diffusion discovery framework that exploits the latent semantic information among users, topics, and their explicit / implicit interactions. Intuitively, four kinds of information are useful for prediction: • Topic information: Intuitively, knowing the signatures of a topic (e.g., is it about politics?) is critical to the success of the prediction. • User information: The information of a user such as the personality (e.g., whether this user is aggressive or passive) is generally useful. • User-topic interaction: Understanding the users' preference on certain topics can improve the quality of prediction. • Global information: We include some global features (e.g., topology info) of social network. Below we will describe how these four kinds of information can be modeled in our framework. 2.2 Topic Information We extract hidden topic category information to model topic signature. In particular, we exploit the 345 Latent Dirichlet Allocation (LDA) method (Blei et al., 2003), which is a widely used topic modeling technique, to decompose the topic-word matrix TW into hidden topic categories: TW = TH * HW , where TH is a topic-hidden matrix, HW is hiddenword matrix, and h is the manually-chosen parameter to determine the size of hidden topic categories. TH indicates the distribution of each topic to hidden topic categories, and HW indicates the distribution of each lexical term to hidden topic categories. Note that TW and TH include both existing and novel topics. We utilize THt,*, the row vector of the topic-hidden matrix TH for a topic t, as a feature set. In brief, we apply LDA to extract the topic-hidden vector THt,* to model topic signature (TG) for both existing and novel topics. Topic information can be further exploited. To predict whether a novel topic will be propagated through a link, we can first enumerate the existing topics that have been propagated through this link. 
For each such topic, we can calculate its similarity with the new topic based on the hidden vectors generated above (e.g., using cosine similarity between feature vectors). Then, we sum up the similarity values as a new feature: topic similarity (TS). For example, a link has previously propagated two topics for a total of three times {ACL, KDD, ACL}, and we would like to know whether a new topic, EMNLP, will propagate through this link. We can use the topic-hidden vector to generate the similarity values between EMNLP and the other topics (e.g., {0.6, 0.4, 0.6}), and then sum them up (1.6) as the value of TS. 2.3 User Information Similar to topic information, we extract latent personal information to model user signature (the users are anonymized already). We apply LDA on the user-word matrix UW: UW = UM * MW , where UM is the user-hidden matrix, MW is the hidden-word matrix, and m is the manually-chosen size of hidden user categories. UM indicates the distribution of each user to the hidden user categories (e.g., age). We then use UMu,*, the row vector of UM for the user u, as a feature set. In brief, we apply LDA to extract the user-hidden vector UMu,* for both source and destination nodes of a link to model user signature (UG). 2.4 User-Topic Interaction Modeling user-topic interaction turns out to be non-trivial. It is not useful to exploit latent semantic analysis directly on the user-topic matrix UR = UQ * QR , where UR represents how many times each user is diffused for existing topic R (R ∈ T), because UR does not contain information of novel topics, and neither do UQ and QR. Given no propagation record about novel topics, we propose a method that allows us to still extract implicit user-topic information. First, we extract from the matrix TH (described in Section 2.2) a subset RH that contains only information about existing topics. Next we apply left division to derive another userhidden matrix UH: UH = (RH \ URT)T = ((RHT RH )-1 RHT URT)T Using left division, we generate the UH matrix using existing topic information. Finally, we exploit UHu,*, the row vector of the user-hidden matrix UH for the user u, as a feature set. Note that novel topics were included in the process of learning the hidden topic categories on RH; therefore the features learned here do implicitly utilize some latent information of novel topics, which is not the case for UM. Experiments confirm the superiority of our approach. Furthermore, our approach ensures that the hidden categories in topic-hidden and user-hidden matrices are identical. Intuitively, our method directly models the user’s preference to topics’ signature (e.g., how capable is this user to propagate topics in politics category?). In contrast, the UM mentioned in Section 2.3 represents the users’ signature (e.g., aggressiveness) and has nothing to do with their opinions on a topic. In short, we obtain the user-hidden probability vector UHu,* as a feature set, which models user preferences to latent categories (UPLC). 2.5 Global Features Given a candidate link, we can extract global social features such as in-degree (ID) and outdegree (OD). We tried other features such as PageRank values but found them not useful. Moreover, we extract the number of distinct topics (NDT) for a link as a feature. The intuition behind this is that the more distinct topics a user has diffused to another, the more likely the diffusion will happen for novel topics. 
346 2.6 Complexity Analysis The complexity to produce each feature is as below: (1) Topic information: O(I * |T| * h * Bt) for LDA using Gibbs sampling, where Iis # of the iterations in sampling, |T| is # of topics, and Bt is the average # of tokens in a topic. (2) User information: O(I * |V| * m * Bu) , where |V| is # of users, and Bu is the average # of tokens for a user. (3) User-topic interaction: the time complexity is O(h3 + h2 * |T| + h * |T| * |V|). (4) Global features: O(|D|), where |D| is # of diffusions. 3 Experiments For evaluation, we try to use the diffusion records of old topics to predict whether a diffusion link exists between two nodes given a new topic. 3.1 Dataset and Evaluation Metric We first identify 100 most popular topic (e.g., earthquake) from the Plurk micro-blog site between 01/201 1 and 05/201 1. Plurk is a popular micro-blog service in Asia with more than 5 million users (Kuo et al., 2011). We manually separate the 100 topics into 7 groups. We use topic-wise 4-fold cross validation to evaluate our method, because there are only 100 available topics. For each group, we select 3/4 of the topics as training and 1/4 as validation. The positive diffusion records are generated based on the post-response behavior. That is, if a person x posts a message containing one of the selected topic t, and later there is a person y responding to this message, we consider a diffusion of t has occurred from x to y (i.e., (x, y, t) is a positive instance). Our dataset contains a total of 1,642,894 positive instances out of 100 distinct topics; the largest and smallest topic contains 303,424 and 2,166 diffusions, respectively. Also, the same amount of negative instances for each topic (totally 1,642,894) is sampled for binary classification (similar to the setup in KDD Cup 2011 Track 2). The negative links of a topic t are sampled randomly based on the absence of responses for that given topic. The underlying social network is created using the post-response behavior as well. We assume there is an acquaintance link between x and y if and only if x has responded to y (or vice versa) on at least one topic. Eventually we generated a social network of 163,034 nodes and 382,878 links. Furthermore, the sets of keywords for each topic are required to create the TW and UW matrices for latent topic analysis; we simply extract the content of posts and responses for each topic to create both matrices. We set the hidden category number h = m = 7, which is equal to the number of topic groups. We use area under ROC curve (AUC) to evaluate our proposed framework (Davis and Goadrich, 2006); we rank the testing instances based on their likelihood of being positive, and compare it with the ground truth to compute AUC. 3.2 Implementation and Baseline After trying many classifiers and obtaining similar results for all of them, we report only results from LIBLINEAR with c=0.0001 (Fan et al., 2008) due to space limitation. We remove stop-words, use SCWS (Hightman, 2012) for tokenization, and MALLET (McCallum, 2002) and GibbsLDA++ (Phan and Nguyen, 2007) for LDA. There are three baseline models we compare the result with. First, we simply use the total number of existing diffusions among all topics between two nodes as the single feature for prediction. Second, we exploit the independent cascading model (Kempe et al., 2003), and utilize the normalized total number of diffusions as the propagation probability of each link. 
Third, we try the heat diffusion model (Ma et al., 2008), set initial heat proportional to out-degree, and tune the diffusion time parameter until the best results are obtained. Note that we did not compare with any data-driven approaches, as we have not identified one that can predict diffusion of novel topics. 3.3 Results The result of each model is shown in Table 1. All except two features outperform the baseline. The best single feature is TS. Note that UPLC performs better than UG, which verifies our hypothesis that maintaining the same hidden features across different LDA models is better. We further conduct experiments to evaluate different combinations of features (Table 2), and found that the best one (TS + ID + NDT) results in about 16% improvement over the baseline, and outperforms the combination of all features. As stated in (Witten et al., 2011), 347 adding useless features may cause the performance of classifiers to deteriorate. Intuitively, TS captures both latent topic and historical diffusion information, while ID and NDT provide complementary social characteristics of users. 4 Conclusions The main contributions of this paper are as below: 1. We propose a novel task of predicting the diffusion of unseen topics, which has wide applications in real-world. 2. Compared to the traditional model-driven or content-independent data-driven works on diffusion analysis, our solution demonstrates how one can bring together ideas from two different but promising areas, NLP and SNA, to solve a challenging problem. 3. Promising experiment result (74% in AUC) not only demonstrates the usefulness of the proposed models, but also indicates that predicting diffusion of unseen topics without historical diffusion data is feasible. Acknowledgments This work was also supported by National Science Council, National Taiwan University and Intel Corporation under Grants NSC 100-291 1-I-002-001, and 101R7501. References David M. Blei, Andrew Y. Ng & Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3.993-1022. Jesse Davis & Mark Goadrich. 2006. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning, Pittsburgh, Pennsylvania. Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, XiangRui Wang & Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res., 9.1871-74. Hongliang Fei, Ruoyi Jiang, Yuhao Yang, Bo Luo & Jun Huan. 2011. Content based social behavior prediction: a multi-task learning approach. Proceedings of the 20th ACM international conference on Information and knowledge management, Glasgow, Scotland, UK. Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty, Zoran Despotovic & Wolfgang Kellerer. 2010. Outtweeting the twitterers - predicting information cascades in microblogs. Proceedings of the 3rd conference on Online social networks, Boston, MA. Hightman. 2012. Simple Chinese Words Segmentation (SCWS). David Kempe, Jon Kleinberg & Eva Tardos. 2003. Maximizing the spread of influence through a social network. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, Washington, D.C. Tsung-Ting Kuo, San-Chuan Hung, Wei-Shih Lin, Shou-De Lin, Ting-Chun Peng & Chia-Chun Shih. 2011. Assessing the Quality of Diffusion Models Using Real-World Social Network Data. Conference on Technologies and Applications of Artificial Intelligence, 2011. C.X. Lin, Q.Z. Mei, Y.L. Jiang, J.W. Han & S.X. Qi. 2011. 
Inferring the Diffusion and Evolution of Topics in Social Communities. Proceedings of the IEEE International Conference on Data Mining, 2011. Hao Ma, Haixuan Yang, Michael R. Lyu & Irwin King. 2008. Mining social networks using heat diffusion processes for marketing candidates selection. Proceeding of the 17th ACM conference on Information and knowledge management, Napa Valley, California, USA. Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. Sasa Petrovic, Miles Osborne & Victor Lavrenko. 2011. RT to Win! Predicting Message Propagation in Twitter. International AAAI Conference on Weblogs and Social Media, 2011. 348 Xuan-Hieu Phan & Cam-Tu Nguyen. 2007. GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA). Ian H. Witten, Eibe Frank & Mark A. Hall. 2011. Data Mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann Publishers Inc. Jiang Zhu, Fei Xiong, Dongzhen Piao, Yun Liu & Ying Zhang. 2011. Statistically Modeling the Effectiveness of Disaster Information in Social Media. Proceedings of the 2011 IEEE Global Humanitarian Technology Conference.
6 0.66960871 78 acl-2012-Efficient Search for Transformation-based Inference
7 0.66500384 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation
8 0.64829701 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition
9 0.62448728 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment
10 0.60966474 31 acl-2012-Authorship Attribution with Author-aware Topic Models
11 0.6032964 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
12 0.59609532 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
13 0.57630503 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
14 0.56890941 167 acl-2012-QuickView: NLP-based Tweet Search
15 0.56473225 139 acl-2012-MIX Is Not a Tree-Adjoining Language
16 0.56408584 98 acl-2012-Finding Bursty Topics from Microblogs
18 0.55802345 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale
19 0.55219114 191 acl-2012-Temporally Anchored Relation Extraction
20 0.55191314 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars