acl acl2012 acl2012-154 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benjamin Swanson ; Eugene Charniak
Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.
Reference: text
sentIndex sentText sentNum sentScore
1 Native Language Detection with Tree Substitution Grammars Ben Swanson Brown University chonge r @ c s brown . [sent-1, score-0.037]
2 edu Abstract We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. [sent-3, score-0.496]
3 We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. [sent-4, score-0.702]
4 Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set. [sent-5, score-0.36]
5 1 Introduction The correlation between a person’s native language (L1) and aspects of their writing in a second lan- guage (L2) can be exploited to predict L1label given L2 text. [sent-6, score-0.229]
6 In this work we explore the possibility of automatically induced Tree Substitution Grammar (TSG) rules as features for a logistic regression model1 trained to predict these L1 labels. [sent-8, score-0.279]
7 Automatic TSG induction is made difficult by the exponential number of possible TSG rules given a corpus. [sent-9, score-0.209]
8 The first uses a nonparametric Bayesian model to handle the large number 1a. [sent-11, score-0.049]
9 edu of rules (Cohn and Blunsom, 2010), while the second is inspired by tree kernel methods and extracts common subtrees from pairs of parse trees (Sangati and Zuidema, 2011). [sent-16, score-0.364]
10 While both are effective, we show that the Bayesian method of TSG induction produces superior features and achieves a new best result at the task of native language detection. [sent-17, score-0.505]
11 1 Native Language Detection Work in automatic native language detection has been mainly associated with the ICLE, published in 2002. [sent-19, score-0.269]
12 Koppel et al (2005) first constructed such a system with a feature set consisting of function words, POS bi-grams, and character n-grams. [sent-20, score-0.116]
13 These features provide a strong baseline but cannot capture many linguistic phenomena. [sent-21, score-0.038]
14 More recently, Wong and Dras (201 1a) considered syntactic features for this task, using logistic regression with features extracted from parse trees produced by a state of the art statistical parser. [sent-22, score-0.445]
15 They investigated two classes of features: reranking features from the Charniak parser and CFG features. [sent-23, score-0.08]
16 They showed that while reranking features capture long range dependencies in parse trees that CFG rules cannot, they do not produce classification performance superior to simple CFG rules. [sent-24, score-0.346]
17 Their CFG feature approach represents the best performing model to date for the task of native language detection. [sent-25, score-0.27]
18 Wong and Dras (201 1b) also investigated the use of LDA topic modeling to produce a latent feature set of reduced dimensionality, but failed to outperform baseline systems with this approach. [sent-26, score-0.111]
19 2 TSG induction One inherent difficulty in the use of TSGs is controlling the size of grammars automatically induced from data, which with any reasonable corpus quickly becomes too large for modern workstations to handle. [sent-30, score-0.336]
20 When automatically induced TSGs were first proposed by Bod (1991), the problem of grammar induction was tackled with random selection of fragments or weak constraints that led to massive grammars. [sent-31, score-0.634]
21 A more principled technique is to use a sparse nonparametric prior, as was recently presented by Cohn et al (2009) and Post and Gildea (2009). [sent-32, score-0.124]
22 They provide a local Gibbs sampling algorithm, and Cohn and Blunsom (2010) later developed a block sampling algorithm with better convergence behavior. [sent-33, score-0.123]
23 While this Bayesian method has yet to produce state of the art parsing results, it has achieved state of the art results for unsupervised grammar induction (Blunsom and Cohn, 2010) and has been extended to synchronous grammars for use in sentence compression (Yamangil and Shieber, 2010). [sent-34, score-0.643]
24 More recently, (Sangati and Zuidema, 2011) presented an elegantly simple heuristic inspired by tree kernels that they call DoubleDOP. [sent-35, score-0.34]
25 They showed that manageable grammar sizes can be obtained from a corpus the size of the Penn Treebank by recording all fragments that occur at least twice, subject to a pairwise constraint of maximality. [sent-36, score-0.438]
26 Using an additional heuristic to provide a distribution over fragments, DoubleDOP achieved the current state of the art for TSG parsing, competing closely with the absolute best results set by refinement based parsers. [sent-37, score-0.15]
27 3 Fragment Based Classification The use of parse tree fragments for classification began with Collins and Duffy (2001). [sent-39, score-0.616]
28 They used the number of common subtrees between two parse trees as a convolution kernel in a voted perceptron and applied it as a parse reranker. [sent-40, score-0.257]
29 Syntactic features have also been used in non194 kernelized classifiers, such as in the work of Wong and Dras (201 1a) mentioned in Section 2. [sent-42, score-0.038]
30 Additional examples include Raghavan et al (2010), which uses a CFG language model to perform authorship attribution, and Post (201 1), which uses TSG features in a logistic regression model to perform grammaticality detection. [sent-44, score-0.384]
31 3 Tree Substitution Grammars Tree Substitution Grammars are similar to Context Free Grammars, differing in that they allow rewrite rules of arbitrary parse tree structure with any number of nonterminal or terminal leaves. [sent-45, score-0.295]
32 We adopt the common term fragment2 to refer to these rules, as they are easily visualised as fragments of a complete parse tree. [sent-46, score-0.418]
33 S NP NP VBZ hates NP NP NNP VP NP NN NNS George broccoli shoes Figure 1: Fragments from a Tree Substitution Grammar capable ofderiving the sentences “George hates broccoli” and “George hates shoes”. [sent-47, score-0.421]
34 One method of TSG induction is to represent a probabilistic TSG with Dirichlet Process priors and sample derivations of a corpus using MCMC. [sent-50, score-0.207]
35 Under this model the posterior probability of a fragment e is given as P(e|e−,α,P0) =##e++ α αP0 (1) where e− is the multiset of fragments in the current derivations excluding e, #e is the count of the fragment e in e−, and #• is the total number of fragments in e− with the same root node as e. [sent-51, score-1.0]
36 P0 is 2As opposed to elementary tree, often used in related work a PCFG distribution over fragments with a bias towards small fragments. [sent-52, score-0.315]
37 α is the concentration parameter of the DP, and can be used to roughly tune the number of fragments that appear in the sampled derivations. [sent-53, score-0.388]
38 With this posterior distribution the derivations of a corpus can be sampled tree by tree using the block sampling algorithm of Cohn and Blunsom (2010), converging eventually on a sample from the true posterior of all derivations. [sent-54, score-0.546]
39 2 DoubleDOP Induction DoubleDOP uses a heuristic inspired by tree kernels, which are commonly used to measure similarity between two parse trees by counting the number of fragments they share. [sent-56, score-0.664]
40 DoubleDOP uses the same underlying technique, but caches the shared fragments instead of simply counting them. [sent-57, score-0.346]
41 This yields a set of fragments where each member is guaranteed to appear at least twice in the training set. [sent-58, score-0.352]
42 In order to avoid unmanageably large grammars only maximal fragments are retained in each pairwise extraction, which is to say that any shared fragment that occurs inside another shared fragment is discarded. [sent-59, score-0.719]
43 The main disadvantage of this method is that the complexity scales quadratically with the training set size, as all pairs of sentences must be considered. [sent-60, score-0.041]
44 It is fully parallelizable, however, which mediates this disadvantage to some extent. [sent-61, score-0.041]
45 All reported results are averaged over 5 subsamplings of the full data set. [sent-66, score-0.033]
46 Our data preproccesing pipeline is as follows: First we perform sentence segmentation with OpenNLP and then parse each sentence with a 6 split grammar for the Berkeley Parser (Petrov et al, 2006). [sent-67, score-0.17]
47 We then replace all terminal symbols which 195 do not occur in a list of 598 function words3 with a single UNK terminal. [sent-68, score-0.031]
48 This aggressive removal of lexical items is standard in this task and mitigates the effect of other unwanted information sources such as topic and geographic location that are correlated with native language in the data. [sent-69, score-0.27]
49 We contrast three different TSG feature sets in our experiments. [sent-70, score-0.041]
50 First, to provide a baseline, we simply read off the CFG rules from the data set (note that a CFG can be taken as a TSG with all fragments having depth one). [sent-71, score-0.359]
51 Second, in the method we call BTSG, we use the Bayesian induction model with the Dirichlet process’ concentration parameters tuned to 100 and run for 1000 iterations of sampling. [sent-72, score-0.202]
52 We take as our resulting finite grammar the fragments that appear in the sampled derivations. [sent-73, score-0.446]
53 Third, we run the parameterless DoubleDOP (2DOP) induction method. [sent-74, score-0.165]
54 Using the full 2DOP feature set produces over 400k features, which heavily taxes the resources of a single modern workstation. [sent-75, score-0.041]
55 To balance the feature set sizes between 2DOP and BTSG we pass back over the training data and count the actual number of times each fragment recovered by 2DOP appears. [sent-76, score-0.23]
56 We then limit the list to the n most common fragments, where n is the average number of fragments recovered by the BTSG method (around 7k). [sent-77, score-0.358]
57 We refer to results using this trimmed feature set with the label 2DOP, using 2DOP(F) to refer to DoubleDOP with the full set of features. [sent-78, score-0.041]
58 Given each TSG, we create a binary feature function for each fragment e in the grammar such that the feature fe(d) is active for a document d if there exists a derivation of some tree t ∈ d that uses e. [sent-79, score-0.499]
59 Clasissitfisc aat dioenri visa performed ew triethe tth e∈ M da thllaett u package faosrlogistic regression using the default initialized MaxEntTrainer. [sent-80, score-0.07]
60 1 Predictive Power The resulting classification accuracies are shown in Table 1. [sent-82, score-0.081]
61 The BTSG feature set gives the highest performance, and both true TSG induction techniques outperform the CFG baseline. [sent-83, score-0.234]
62 While in their work they report 80% accuracy with the CFG model, this is for a single sampling of the full data set. [sent-87, score-0.045]
63 The numbers we report are from our own implementation of their CFG technique, and all results are averaged over 5 random samplings from the full corpus. [sent-89, score-0.098]
64 For 2DOP we limit the 2DOP(F) fragments by choosing the 7k with maximum frequency, but there may exist superior methods. [sent-90, score-0.388]
65 Instead of experimenting with different limiting metrics, we note that when all 400k rules are used, the averaged accuracy is only 76. [sent-93, score-0.077]
66 2 Robustness We also investigated different classification strategies, as binary indicators of fragment occurrence over an entire document may lead to noisy results. [sent-96, score-0.3]
67 Consider a single outlier sentence in a document with a single fragment that is indicative of the incorrect L1 label. [sent-97, score-0.211]
68 Note that it is just as important in the eyes of the classifier as a fragment indicative of the correct label that appears many times. [sent-98, score-0.18]
69 To investigate this phenomena we classified individual sentences, and used these results to vote for each document level label in the test set. [sent-99, score-0.074]
70 In the first, VoteOne, each sentence contributes one vote to its maximum probability label. [sent-101, score-0.043]
71 55 Table 2: Sentence based classification accuracy for BTSG or 2DOP, but what is more interesting is that in both cases the CFG model outperforms 2DOP (with less than half of the features). [sent-109, score-0.081]
72 The robust behavior of the BTSG method shows that it finds correctly discriminative features across several sentences in each document to a greater extent than other methods. [sent-110, score-0.069]
73 3 Concision One possible explanation for the superior performance of BTSG is that DDOP is prone to yielding multiple fragments that represent the same linguistic phenomena, leading to sets of highly correlated features. [sent-112, score-0.429]
74 While correlated features are not crippling to a logistic regression model, they add computational complexity without contributing to higher classification accuracy. [sent-113, score-0.326]
75 To address this hypothesis empirically, we considered pairs of fragments eA and eB and calculated the pointwise mutual information (PMI) between events signifying their occurrence in a sentence. [sent-114, score-0.356]
76 For BTSG, the average pointwise mutual information over all pairs (eA, eB) is −. [sent-115, score-0.041]
77 e0 exclusivity, atshiinsg supports vthee callauimes that DDOP’s comparative weakness is to some extent due to feature redundancy. [sent-119, score-0.041]
78 6 Conclusion In this work we investigate automatically induced TSG fragments as classification features for native language detection. [sent-120, score-0.722]
79 We compare Bayesian and DoubleDOP induced features and find that the former represents the data with less redundancy, is more robust to classification strategy, and gives higher classification accuracy. [sent-121, score-0.259]
80 Additionally, the Bayesian TSG features give a new best result for the task of native language detection. [sent-122, score-0.267]
81 Determining an author’s native language by mining a text for errors. [sent-158, score-0.229]
wordName wordTfidf (topN-words)
[('tsg', 0.359), ('fragments', 0.315), ('btsg', 0.258), ('native', 0.229), ('doubledop', 0.226), ('dras', 0.225), ('cfg', 0.215), ('induction', 0.165), ('wong', 0.153), ('fragment', 0.146), ('tree', 0.145), ('cohn', 0.135), ('substitution', 0.129), ('bayesian', 0.126), ('grammars', 0.112), ('blunsom', 0.111), ('kernels', 0.104), ('hates', 0.097), ('icle', 0.097), ('sangati', 0.097), ('grammar', 0.095), ('authorship', 0.09), ('zuidema', 0.084), ('art', 0.081), ('classification', 0.081), ('parse', 0.075), ('al', 0.075), ('superior', 0.073), ('bod', 0.072), ('regression', 0.07), ('logistic', 0.068), ('broccoli', 0.065), ('ddop', 0.065), ('jojo', 0.065), ('samplings', 0.065), ('shoes', 0.065), ('induced', 0.059), ('attribution', 0.057), ('granger', 0.056), ('np', 0.056), ('tsgs', 0.051), ('yamangil', 0.051), ('nonparametric', 0.049), ('trevor', 0.049), ('phil', 0.049), ('sampling', 0.045), ('rules', 0.044), ('vote', 0.043), ('koppel', 0.043), ('shieber', 0.043), ('duffy', 0.043), ('grammaticality', 0.043), ('eb', 0.043), ('recovered', 0.043), ('investigated', 0.042), ('derivations', 0.042), ('kim', 0.041), ('correlated', 0.041), ('disadvantage', 0.041), ('convolution', 0.041), ('pointwise', 0.041), ('feature', 0.041), ('compact', 0.04), ('learner', 0.04), ('state', 0.04), ('detection', 0.04), ('post', 0.039), ('moschitti', 0.038), ('isozaki', 0.038), ('features', 0.038), ('concentration', 0.037), ('twice', 0.037), ('raghavan', 0.037), ('brown', 0.037), ('george', 0.037), ('posterior', 0.036), ('sampled', 0.036), ('suzuki', 0.036), ('trees', 0.035), ('inspired', 0.034), ('indicative', 0.034), ('averaged', 0.033), ('block', 0.033), ('matt', 0.032), ('subtrees', 0.031), ('terminal', 0.031), ('counting', 0.031), ('document', 0.031), ('accurate', 0.031), ('synchronous', 0.029), ('heuristic', 0.029), ('outperform', 0.028), ('visualised', 0.028), ('dagneaux', 0.028), ('converging', 0.028), ('elegantly', 0.028), ('crippling', 0.028), ('willem', 0.028), ('unk', 0.028), ('manageable', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 154 acl-2012-Native Language Detection with Tree Substitution Grammars
Author: Benjamin Swanson ; Eugene Charniak
Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.
2 0.45102981 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
Author: Hiroyuki Shindo ; Yusuke Miyao ; Akinori Fujino ; Masaaki Nagata
Abstract: We propose Symbol-Refined Tree Substitution Grammars (SR-TSGs) for syntactic parsing. An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.
3 0.28077096 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
Author: Elif Yamangil ; Stuart Shieber
Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.
4 0.13650265 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization
Author: Alessandro Moschitti ; Qi Ju ; Richard Johansson
Abstract: In this paper, we encode topic dependencies in hierarchical multi-label Text Categorization (TC) by means of rerankers. We represent reranking hypotheses with several innovative kernels considering both the structure of the hierarchy and the probability of nodes. Additionally, to better investigate the role ofcategory relationships, we consider two interesting cases: (i) traditional schemes in which node-fathers include all the documents of their child-categories; and (ii) more general schemes, in which children can include documents not belonging to their fathers. The extensive experimentation on Reuters Corpus Volume 1 shows that our rerankers inject effective structural semantic dependencies in multi-classifiers and significantly outperform the state-of-the-art.
Author: Zhaopeng Tu ; Yifan He ; Jennifer Foster ; Josef van Genabith ; Qun Liu ; Shouxun Lin
Abstract: Convolution kernels support the modeling of complex syntactic information in machinelearning tasks. However, such models are highly sensitive to the type and size of syntactic structure used. It is therefore an important challenge to automatically identify high impact sub-structures relevant to a given task. In this paper we present a systematic study investigating (combinations of) sequence and convolution kernels using different types of substructures in document-level sentiment classification. We show that minimal sub-structures extracted from constituency and dependency trees guided by a polarity lexicon show 1.45 pointabsoluteimprovementinaccuracy overa bag-of-words classifier on a widely used sentiment corpus. 1
6 0.11285498 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts
7 0.11224917 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
8 0.11057505 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
9 0.10766809 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
10 0.099842116 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
11 0.082946032 183 acl-2012-State-of-the-Art Kernels for Natural Language Processing
12 0.078838363 15 acl-2012-A Meta Learning Approach to Grammatical Error Correction
13 0.072160065 140 acl-2012-Machine Translation without Words through Substring Alignment
14 0.069258951 31 acl-2012-Authorship Attribution with Author-aware Topic Models
15 0.067229003 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
16 0.065898448 83 acl-2012-Error Mining on Dependency Trees
17 0.064444646 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
18 0.064219251 64 acl-2012-Crosslingual Induction of Semantic Roles
19 0.059727095 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
20 0.059257269 57 acl-2012-Concept-to-text Generation via Discriminative Reranking
topicId topicWeight
[(0, -0.203), (1, 0.01), (2, -0.111), (3, -0.151), (4, -0.195), (5, -0.022), (6, -0.031), (7, 0.224), (8, -0.02), (9, -0.006), (10, -0.114), (11, -0.314), (12, -0.077), (13, 0.153), (14, 0.116), (15, -0.243), (16, 0.025), (17, -0.219), (18, 0.12), (19, 0.098), (20, 0.067), (21, 0.128), (22, 0.022), (23, -0.016), (24, 0.164), (25, 0.043), (26, -0.072), (27, 0.124), (28, -0.003), (29, -0.045), (30, 0.024), (31, -0.064), (32, 0.077), (33, 0.017), (34, 0.036), (35, -0.083), (36, 0.052), (37, 0.052), (38, 0.049), (39, 0.035), (40, -0.063), (41, -0.022), (42, -0.075), (43, 0.003), (44, -0.018), (45, 0.02), (46, -0.041), (47, 0.019), (48, -0.063), (49, -0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.94149166 154 acl-2012-Native Language Detection with Tree Substitution Grammars
Author: Benjamin Swanson ; Eugene Charniak
Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.
2 0.91759795 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
Author: Elif Yamangil ; Stuart Shieber
Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.
3 0.87676787 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
Author: Hiroyuki Shindo ; Yusuke Miyao ; Akinori Fujino ; Masaaki Nagata
Abstract: We propose Symbol-Refined Tree Substitution Grammars (SR-TSGs) for syntactic parsing. An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.
Author: Stephen Tyndall
Abstract: This paper presents the problem within Hittite and Ancient Near Eastern studies of fragmented and damaged cuneiform texts, and proposes to use well-known text classification metrics, in combination with some facts about the structure of Hittite-language cuneiform texts, to help classify a number offragments of clay cuneiform-script tablets into more complete texts. In particular, Ipropose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. The performance in some cases is improved, and in some cases very much not, suggesting that the variable frequency of occurrence of these ideograms in individual fragments makes considerable difference in the ideal choice for a classification method. Further, complexities of the writing system and the digital availability ofHittite texts complicate the problem.
5 0.38477913 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization
Author: Alessandro Moschitti ; Qi Ju ; Richard Johansson
Abstract: In this paper, we encode topic dependencies in hierarchical multi-label Text Categorization (TC) by means of rerankers. We represent reranking hypotheses with several innovative kernels considering both the structure of the hierarchy and the probability of nodes. Additionally, to better investigate the role ofcategory relationships, we consider two interesting cases: (i) traditional schemes in which node-fathers include all the documents of their child-categories; and (ii) more general schemes, in which children can include documents not belonging to their fathers. The extensive experimentation on Reuters Corpus Volume 1 shows that our rerankers inject effective structural semantic dependencies in multi-classifiers and significantly outperform the state-of-the-art.
6 0.37909254 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
7 0.37656978 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction
8 0.3656581 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars
9 0.34953576 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
10 0.31549725 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
11 0.29734299 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
12 0.2953684 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
13 0.28901616 115 acl-2012-Identifying High-Impact Sub-Structures for Convolution Kernels in Document-level Sentiment Classification
14 0.28560433 83 acl-2012-Error Mining on Dependency Trees
15 0.27332333 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs
16 0.26256934 15 acl-2012-A Meta Learning Approach to Grammatical Error Correction
17 0.23709899 196 acl-2012-The OpenGrm open-source finite-state grammar software libraries
18 0.23312427 108 acl-2012-Hierarchical Chunk-to-String Translation
19 0.23229879 68 acl-2012-Decoding Running Key Ciphers
20 0.22976208 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
topicId topicWeight
[(7, 0.011), (26, 0.032), (28, 0.031), (30, 0.017), (37, 0.042), (39, 0.038), (74, 0.016), (82, 0.015), (84, 0.012), (85, 0.028), (90, 0.077), (92, 0.544), (94, 0.017), (99, 0.037)]
simIndex simValue paperId paperTitle
1 0.93842137 78 acl-2012-Efficient Search for Transformation-based Inference
Author: Asher Stern ; Roni Stern ; Ido Dagan ; Ariel Felner
Abstract: This paper addresses the search problem in textual inference, where systems need to infer one piece of text from another. A prominent approach to this task is attempts to transform one text into the other through a sequence of inference-preserving transformations, a.k.a. a proof, while estimating the proof’s validity. This raises a search challenge of finding the best possible proof. We explore this challenge through a comprehensive investigation of prominent search algorithms and propose two novel algorithmic components specifically designed for textual inference: a gradient-style evaluation function, and a locallookahead node expansion method. Evaluations, using the open-source system, BIUTEE, show the contribution of these ideas to search efficiency and proof quality.
2 0.92155963 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks
Author: Tsung-Ting Kuo ; San-Chuan Hung ; Wei-Shih Lin ; Nanyun Peng ; Shou-De Lin ; Wei-Fen Lin
Abstract: This paper brings a marriage of two seemly unrelated topics, natural language processing (NLP) and social network analysis (SNA). We propose a new task in SNA which is to predict the diffusion of a new topic, and design a learning-based framework to solve this problem. We exploit the latent semantic information among users, topics, and social connections as features for prediction. Our framework is evaluated on real data collected from public domain. The experiments show 16% AUC improvement over baseline methods. The source code and dataset are available at http://www.csie.ntu.edu.tw/~d97944007/dif fusion/ 1 Background The diffusion of information on social networks has been studied for decades. Generally, the proposed strategies can be categorized into two categories, model-driven and data-driven. The model-driven strategies, such as independent cascade model (Kempe et al., 2003), rely on certain manually crafted, usually intuitive, models to fit the diffusion data without using diffusion history. The data-driven strategies usually utilize learning-based approaches to predict the future propagation given historical records of prediction (Fei et al., 2011; Galuba et al., 2010; Petrovic et al., 2011). Data-driven strategies usually perform better than model-driven approaches because the past diffusion behavior is used during learning (Galuba et al., 2010). Recently, researchers started to exploit content information in data-driven diffusion models (Fei et al., 2011; Petrovic et al., 2011; Zhu et al., 2011). 344 However, most of the data-driven approaches assume that in order to train a model and predict the future diffusion of a topic, it is required to obtain historical records about how this topic has propagated in a social network (Petrovic et al., 2011; Zhu et al., 2011). We argue that such assumption does not always hold in the real-world scenario, and being able to forecast the propagation of novel or unseen topics is more valuable in practice. For example, a company would like to know which users are more likely to be the source of ‘viva voce’ of a newly released product for advertising purpose. A political party might want to estimate the potential degree of responses of a half-baked policy before deciding to bring it up to public. To achieve such goal, it is required to predict the future propagation behavior of a topic even before any actual diffusion happens on this topic (i.e., no historical propagation data of this topic are available). Lin et al. also propose an idea aiming at predicting the inference of implicit diffusions for novel topics (Lin et al., 2011). The main difference between their work and ours is that they focus on implicit diffusions, whose data are usually not available. Consequently, they need to rely on a model-driven approach instead of a datadriven approach. On the other hand, our work focuses on the prediction of explicit diffusion behaviors. Despite the fact that no diffusion data of novel topics is available, we can still design a data- driven approach taking advantage of some explicit diffusion data of known topics. Our experiments show that being able to utilize such information is critical for diffusion prediction. 2 The Novel-Topic Diffusion Model We start by assuming an existing social network G = (V, E), where V is the set of nodes (or user) v, and E is the set of link e. The set of topics is Proce dJienjgus, R ofep thueb 5lic0t hof A Knonruea ,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fso rc Ciatoiomnp fuotart Cio nmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi3c 4s4–348, denoted as T. Among them, some are considered as novel topics (denoted as N), while the rest (R) are used as the training records. We are also given a set of diffusion records D = {d | d = (src, dest, t) }, where src is the source node (or diffusion source), dest is the destination node, and t is the topic of the diffusion that belongs to R but not N. We assume that diffusions cannot occur between nodes without direct social connection; any diffusion pair implies the existence of a link e = (src, dest) ∈ E. Finally, we assume there are sets of keywords or tags that relevant to each topic (including existing and novel topics). Note that the set of keywords for novel topics should be seen in that of existing topics. From these sets of keywords, we construct a topicword matrix TW = (P(wordj | topici))i,j of which the elements stand for the conditional probabilities that a word appears in the text of a certain topic. Similarly, we also construct a user-word matrix UW= (P(wordj | useri))i,j from these sets of keywords. Given the above information, the goal is to predict whether a given link is active (i.e., belongs to a diffusion link) for topics in N. 2.1 The Framework The main challenge of this problem lays in that the past diffusion behaviors of new topics are missing. To address this challenge, we propose a supervised diffusion discovery framework that exploits the latent semantic information among users, topics, and their explicit / implicit interactions. Intuitively, four kinds of information are useful for prediction: • Topic information: Intuitively, knowing the signatures of a topic (e.g., is it about politics?) is critical to the success of the prediction. • User information: The information of a user such as the personality (e.g., whether this user is aggressive or passive) is generally useful. • User-topic interaction: Understanding the users' preference on certain topics can improve the quality of prediction. • Global information: We include some global features (e.g., topology info) of social network. Below we will describe how these four kinds of information can be modeled in our framework. 2.2 Topic Information We extract hidden topic category information to model topic signature. In particular, we exploit the 345 Latent Dirichlet Allocation (LDA) method (Blei et al., 2003), which is a widely used topic modeling technique, to decompose the topic-word matrix TW into hidden topic categories: TW = TH * HW , where TH is a topic-hidden matrix, HW is hiddenword matrix, and h is the manually-chosen parameter to determine the size of hidden topic categories. TH indicates the distribution of each topic to hidden topic categories, and HW indicates the distribution of each lexical term to hidden topic categories. Note that TW and TH include both existing and novel topics. We utilize THt,*, the row vector of the topic-hidden matrix TH for a topic t, as a feature set. In brief, we apply LDA to extract the topic-hidden vector THt,* to model topic signature (TG) for both existing and novel topics. Topic information can be further exploited. To predict whether a novel topic will be propagated through a link, we can first enumerate the existing topics that have been propagated through this link. For each such topic, we can calculate its similarity with the new topic based on the hidden vectors generated above (e.g., using cosine similarity between feature vectors). Then, we sum up the similarity values as a new feature: topic similarity (TS). For example, a link has previously propagated two topics for a total of three times {ACL, KDD, ACL}, and we would like to know whether a new topic, EMNLP, will propagate through this link. We can use the topic-hidden vector to generate the similarity values between EMNLP and the other topics (e.g., {0.6, 0.4, 0.6}), and then sum them up (1.6) as the value of TS. 2.3 User Information Similar to topic information, we extract latent personal information to model user signature (the users are anonymized already). We apply LDA on the user-word matrix UW: UW = UM * MW , where UM is the user-hidden matrix, MW is the hidden-word matrix, and m is the manually-chosen size of hidden user categories. UM indicates the distribution of each user to the hidden user categories (e.g., age). We then use UMu,*, the row vector of UM for the user u, as a feature set. In brief, we apply LDA to extract the user-hidden vector UMu,* for both source and destination nodes of a link to model user signature (UG). 2.4 User-Topic Interaction Modeling user-topic interaction turns out to be non-trivial. It is not useful to exploit latent semantic analysis directly on the user-topic matrix UR = UQ * QR , where UR represents how many times each user is diffused for existing topic R (R ∈ T), because UR does not contain information of novel topics, and neither do UQ and QR. Given no propagation record about novel topics, we propose a method that allows us to still extract implicit user-topic information. First, we extract from the matrix TH (described in Section 2.2) a subset RH that contains only information about existing topics. Next we apply left division to derive another userhidden matrix UH: UH = (RH \ URT)T = ((RHT RH )-1 RHT URT)T Using left division, we generate the UH matrix using existing topic information. Finally, we exploit UHu,*, the row vector of the user-hidden matrix UH for the user u, as a feature set. Note that novel topics were included in the process of learning the hidden topic categories on RH; therefore the features learned here do implicitly utilize some latent information of novel topics, which is not the case for UM. Experiments confirm the superiority of our approach. Furthermore, our approach ensures that the hidden categories in topic-hidden and user-hidden matrices are identical. Intuitively, our method directly models the user’s preference to topics’ signature (e.g., how capable is this user to propagate topics in politics category?). In contrast, the UM mentioned in Section 2.3 represents the users’ signature (e.g., aggressiveness) and has nothing to do with their opinions on a topic. In short, we obtain the user-hidden probability vector UHu,* as a feature set, which models user preferences to latent categories (UPLC). 2.5 Global Features Given a candidate link, we can extract global social features such as in-degree (ID) and outdegree (OD). We tried other features such as PageRank values but found them not useful. Moreover, we extract the number of distinct topics (NDT) for a link as a feature. The intuition behind this is that the more distinct topics a user has diffused to another, the more likely the diffusion will happen for novel topics. 346 2.6 Complexity Analysis The complexity to produce each feature is as below: (1) Topic information: O(I * |T| * h * Bt) for LDA using Gibbs sampling, where Iis # of the iterations in sampling, |T| is # of topics, and Bt is the average # of tokens in a topic. (2) User information: O(I * |V| * m * Bu) , where |V| is # of users, and Bu is the average # of tokens for a user. (3) User-topic interaction: the time complexity is O(h3 + h2 * |T| + h * |T| * |V|). (4) Global features: O(|D|), where |D| is # of diffusions. 3 Experiments For evaluation, we try to use the diffusion records of old topics to predict whether a diffusion link exists between two nodes given a new topic. 3.1 Dataset and Evaluation Metric We first identify 100 most popular topic (e.g., earthquake) from the Plurk micro-blog site between 01/201 1 and 05/201 1. Plurk is a popular micro-blog service in Asia with more than 5 million users (Kuo et al., 2011). We manually separate the 100 topics into 7 groups. We use topic-wise 4-fold cross validation to evaluate our method, because there are only 100 available topics. For each group, we select 3/4 of the topics as training and 1/4 as validation. The positive diffusion records are generated based on the post-response behavior. That is, if a person x posts a message containing one of the selected topic t, and later there is a person y responding to this message, we consider a diffusion of t has occurred from x to y (i.e., (x, y, t) is a positive instance). Our dataset contains a total of 1,642,894 positive instances out of 100 distinct topics; the largest and smallest topic contains 303,424 and 2,166 diffusions, respectively. Also, the same amount of negative instances for each topic (totally 1,642,894) is sampled for binary classification (similar to the setup in KDD Cup 2011 Track 2). The negative links of a topic t are sampled randomly based on the absence of responses for that given topic. The underlying social network is created using the post-response behavior as well. We assume there is an acquaintance link between x and y if and only if x has responded to y (or vice versa) on at least one topic. Eventually we generated a social network of 163,034 nodes and 382,878 links. Furthermore, the sets of keywords for each topic are required to create the TW and UW matrices for latent topic analysis; we simply extract the content of posts and responses for each topic to create both matrices. We set the hidden category number h = m = 7, which is equal to the number of topic groups. We use area under ROC curve (AUC) to evaluate our proposed framework (Davis and Goadrich, 2006); we rank the testing instances based on their likelihood of being positive, and compare it with the ground truth to compute AUC. 3.2 Implementation and Baseline After trying many classifiers and obtaining similar results for all of them, we report only results from LIBLINEAR with c=0.0001 (Fan et al., 2008) due to space limitation. We remove stop-words, use SCWS (Hightman, 2012) for tokenization, and MALLET (McCallum, 2002) and GibbsLDA++ (Phan and Nguyen, 2007) for LDA. There are three baseline models we compare the result with. First, we simply use the total number of existing diffusions among all topics between two nodes as the single feature for prediction. Second, we exploit the independent cascading model (Kempe et al., 2003), and utilize the normalized total number of diffusions as the propagation probability of each link. Third, we try the heat diffusion model (Ma et al., 2008), set initial heat proportional to out-degree, and tune the diffusion time parameter until the best results are obtained. Note that we did not compare with any data-driven approaches, as we have not identified one that can predict diffusion of novel topics. 3.3 Results The result of each model is shown in Table 1. All except two features outperform the baseline. The best single feature is TS. Note that UPLC performs better than UG, which verifies our hypothesis that maintaining the same hidden features across different LDA models is better. We further conduct experiments to evaluate different combinations of features (Table 2), and found that the best one (TS + ID + NDT) results in about 16% improvement over the baseline, and outperforms the combination of all features. As stated in (Witten et al., 2011), 347 adding useless features may cause the performance of classifiers to deteriorate. Intuitively, TS captures both latent topic and historical diffusion information, while ID and NDT provide complementary social characteristics of users. 4 Conclusions The main contributions of this paper are as below: 1. We propose a novel task of predicting the diffusion of unseen topics, which has wide applications in real-world. 2. Compared to the traditional model-driven or content-independent data-driven works on diffusion analysis, our solution demonstrates how one can bring together ideas from two different but promising areas, NLP and SNA, to solve a challenging problem. 3. Promising experiment result (74% in AUC) not only demonstrates the usefulness of the proposed models, but also indicates that predicting diffusion of unseen topics without historical diffusion data is feasible. Acknowledgments This work was also supported by National Science Council, National Taiwan University and Intel Corporation under Grants NSC 100-291 1-I-002-001, and 101R7501. References David M. Blei, Andrew Y. Ng & Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3.993-1022. Jesse Davis & Mark Goadrich. 2006. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning, Pittsburgh, Pennsylvania. Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, XiangRui Wang & Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res., 9.1871-74. Hongliang Fei, Ruoyi Jiang, Yuhao Yang, Bo Luo & Jun Huan. 2011. Content based social behavior prediction: a multi-task learning approach. Proceedings of the 20th ACM international conference on Information and knowledge management, Glasgow, Scotland, UK. Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty, Zoran Despotovic & Wolfgang Kellerer. 2010. Outtweeting the twitterers - predicting information cascades in microblogs. Proceedings of the 3rd conference on Online social networks, Boston, MA. Hightman. 2012. Simple Chinese Words Segmentation (SCWS). David Kempe, Jon Kleinberg & Eva Tardos. 2003. Maximizing the spread of influence through a social network. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, Washington, D.C. Tsung-Ting Kuo, San-Chuan Hung, Wei-Shih Lin, Shou-De Lin, Ting-Chun Peng & Chia-Chun Shih. 2011. Assessing the Quality of Diffusion Models Using Real-World Social Network Data. Conference on Technologies and Applications of Artificial Intelligence, 2011. C.X. Lin, Q.Z. Mei, Y.L. Jiang, J.W. Han & S.X. Qi. 2011. Inferring the Diffusion and Evolution of Topics in Social Communities. Proceedings of the IEEE International Conference on Data Mining, 2011. Hao Ma, Haixuan Yang, Michael R. Lyu & Irwin King. 2008. Mining social networks using heat diffusion processes for marketing candidates selection. Proceeding of the 17th ACM conference on Information and knowledge management, Napa Valley, California, USA. Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. Sasa Petrovic, Miles Osborne & Victor Lavrenko. 2011. RT to Win! Predicting Message Propagation in Twitter. International AAAI Conference on Weblogs and Social Media, 2011. 348 Xuan-Hieu Phan & Cam-Tu Nguyen. 2007. GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA). Ian H. Witten, Eibe Frank & Mark A. Hall. 2011. Data Mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann Publishers Inc. Jiang Zhu, Fei Xiong, Dongzhen Piao, Yun Liu & Ying Zhang. 2011. Statistically Modeling the Effectiveness of Disaster Information in Social Media. Proceedings of the 2011 IEEE Global Humanitarian Technology Conference.
same-paper 3 0.90870321 154 acl-2012-Native Language Detection with Tree Substitution Grammars
Author: Benjamin Swanson ; Eugene Charniak
Abstract: We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author’s native language from text in a different language. We compare two state of the art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state of the art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.
4 0.89378959 205 acl-2012-Tweet Recommendation with Graph Co-Ranking
Author: Rui Yan ; Mirella Lapata ; Xiaoming Li
Abstract: Mirella Lapata‡ Xiaoming Li†, \ ‡Institute for Language, \State Key Laboratory of Software Cognition and Computation, Development Environment, University of Edinburgh, Beihang University, Edinburgh EH8 9AB, UK Beijing 100083, China mlap@ inf .ed .ac .uk lxm@pku .edu .cn 2012.1 Twitter enables users to send and read textbased posts ofup to 140 characters, known as tweets. As one of the most popular micro-blogging services, Twitter attracts millions of users, producing millions of tweets daily. Shared information through this service spreads faster than would have been possible with traditional sources, however the proliferation of user-generation content poses challenges to browsing and finding valuable information. In this paper we propose a graph-theoretic model for tweet recommendation that presents users with items they may have an interest in. Our model ranks tweets and their authors simultaneously using several networks: the social network connecting the users, the network connecting the tweets, and a third network that ties the two together. Tweet and author entities are ranked following a co-ranking algorithm based on the intuition that that there is a mutually reinforcing relationship between tweets and their authors that could be reflected in the rankings. We show that this framework can be parametrized to take into account user preferences, the popularity of tweets and their authors, and diversity. Experimental evaluation on a large dataset shows that our model out- performs competitive approaches by a large margin.
5 0.87867332 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation
Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum
Abstract: To discover relation types from text, most methods cluster shallow or syntactic patterns of relation mentions, but consider only one possible sense per pattern. In practice this assumption is often violated. In this paper we overcome this issue by inducing clusters of pattern senses from feature representations of patterns. In particular, we employ a topic model to partition entity pairs associated with patterns into sense clusters using local and global features. We merge these sense clusters into semantic relations using hierarchical agglomerative clustering. We compare against several baselines: a generative latent-variable model, a clustering method that does not disambiguate between path senses, and our own approach but with only local features. Experimental results show our proposed approach discovers dramatically more accurate clusters than models without sense disambiguation, and that incorporating global features, such as the document theme, is crucial.
6 0.65484804 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment
7 0.65277386 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
8 0.65200692 31 acl-2012-Authorship Attribution with Author-aware Topic Models
9 0.63687027 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
10 0.62458557 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
11 0.58248311 98 acl-2012-Finding Bursty Topics from Microblogs
12 0.57485557 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition
13 0.5467394 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation
14 0.5454126 167 acl-2012-QuickView: NLP-based Tweet Search
15 0.54221547 79 acl-2012-Efficient Tree-Based Topic Modeling
16 0.53924286 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
19 0.51983005 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars
20 0.51473027 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale