
191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

Source: pdf

Author: Jun Zhu ; Xun Zheng ; Bo Zhang

Abstract: Supervised topic models with a logistic likelihood have two issues that potentially limit their practical use: 1) response variables are usually over-weighted by document word counts; and 2) existing variational inference methods make strict mean-field assumptions. We address these issues by: 1) introducing a regularization constant to better balance the two parts based on an optimization formulation of Bayesian inference; and 2) developing a simple Gibbs sampling algorithm by introducing auxiliary Polya-Gamma variables and collapsing out Dirichlet variables. Our augment-and-collapse sampling algorithm has analytical forms of each conditional distribution without making any restricting assumptions and can be easily parallelized. Empirical results demonstrate significant improvements on prediction performance and time efficiency.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Supervised topic models with a logistic likelihood have two issues that potentially limit their practical use: 1) response variables are usually over-weighted by document word counts; and 2) existing variational inference methods make strict mean-field assumptions. [sent-2, score-0.816]

2 Our augment-and-collapse sampling algorithm has analytical forms of each conditional distribution without making any restricting assumptions and can be easily parallelized. [sent-4, score-0.265]

3 , 2009), one way to improve the predictive power of LDA is to define a likelihood model for the widely available documentlevel response variables, in addition to the likelihood model for document words. [sent-7, score-0.199]

4 For example, the logistic likelihood model is commonly used for binary or multinomial responses. [sent-8, score-0.381]

5 By imposing some priors, posterior inference is done with Bayes’ rule. [sent-9, score-0.182]

6 Though powerful, one issue that could limit the use of existing logistic supervised LDA models is that they treat the document-level response variable as one additional word via a normalized likelihood model. [sent-10, score-0.46]

7 response variable, it is normally of a much smaller scale than the likelihood of the usually tens or hundreds of words in each document. [sent-14, score-0.119]

8 , 2012) and observed in our experiments, this model imbalance could result in a weak influence of response variables on the topic representations and thus non-satisfactory prediction performance. [sent-16, score-0.384]

9 Another difficulty that arises when dealing with categorical response variables is that the commonly used normal priors are no longer conjugate to the logistic likelihood and thus lead to hard inference problems. [sent-17, score-0.525]

10 Existing approaches rely on variational approximation techniques which normally make strict mean-field assumptions. [sent-18, score-0.178]

11 First, we present a general framework of Bayesian logistic supervised topic models with a regularization parameter to better balance response variables and words. [sent-20, score-0.695]

12 , 2013b) via solving an optimization problem, where the posterior regularization is defined as an expectation of a logistic loss, a surrogate loss of the expected misclassification error; and a regularization parameter is introduced to balance the surrogate classification loss (i. [sent-23, score-0.823]

13 Second, to solve the intractable posterior inference problem of the generalized Bayesian logistic supervised topic models, we present a simple Gibbs sampling algorithm by exploring the ideas of data augmentation (Tanner and Wong, 1987; van Dyk and Meng, 2001 ; Holmes and Held, 2006). [sent-27, score-0.906]

14 More specifically, we extend Polson’s method for Bayesian logistic regression (Polson et al. [sent-28, score-0.237]

15 , 2012) to the generalized logistic supervised topic models, which are much more challenging ... (Page 187. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, August 4-9, 2013.) [sent-29, score-0.512]

16 Technically, we introduce a set of Polya-Gamma variables, one per document, to reformulate the generalized logistic pseudo-likelihood model (with the regularization parameter) as a scale mixture, where the mixture component is conditionally normal for classifier parameters. [sent-32, score-0.422]

17 Then, we develop a simple and efficient Gibbs sampling algorithm with analytic conditional distributions and no Metropolis-Hastings accept/reject steps. [sent-33, score-0.177]

18 , topics and mixing proportions) to do collapsed Gibbs sampling, which can have better mixing rates (Griffiths and Steyvers, 2004). [sent-36, score-0.146]

19 The classification performance is also significantly improved by using appropriate regularization parameters. [sent-38, score-0.128]

20 2 introduces logistic supervised topic models as a general optimization problem. [sent-43, score-0.449]

21 2 Logistic Supervised Topic Models. We now present the generalized Bayesian logistic supervised topic models. [sent-50, score-0.512]

22 1 The Generalized Models. We consider binary classification with a training set $\mathcal{D} = \{(\mathbf{w}_d, y_d)\}_{d=1}^{D}$, where the response variable $Y$ takes values from the output space $\mathcal{Y} = \{0, 1\}$. [sent-52, score-0.161]

23 A logistic supervised topic model consists of two parts: an LDA model (Blei et al. [sent-53, score-0.449]

24 , 2003) for describing the words $W = \{\mathbf{w}_d\}_{d=1}^{D}$, where $\mathbf{w}_d = \{w_{dn}\}_{n=1}^{N_d}$ denotes the words within document $d$, and a logistic classifier for considering the supervising signal $\mathbf{y} = \{y_d\}_{d=1}^{D}$. [sent-54, score-0.343]

25 LDA: LDA is a hierarchical Bayesian model that posits each document as an admixture of $K$ topics, where each topic $\Phi_k$ is a multinomial distribution over a $V$-word vocabulary. [sent-56, score-0.256]

26 , $N_d$: (a) draw a topic $z_{dn} \sim \mathrm{Mult}(\theta_d)$ (see footnote 1); (b) draw the word $w_{dn} \sim \mathrm{Mult}(\Phi_{z_{dn}})$, where $\mathrm{Dir}(\cdot)$ is a Dirichlet distribution; $\mathrm{Mult}(\cdot)$ is a multinomial distribution; and $\Phi_{z_{dn}}$ denotes the topic selected by the non-zero entry of $z_{dn}$. [sent-62, score-0.446]

27 Let $z_d = \{z_{dn}\}_{n=1}^{N_d}$ denote the set of topic assignments for document $d$. [sent-64, score-0.263]

28 Let $Z$ and $\Theta$ denote the topic assignments and mixing proportions for the entire corpus. [sent-66, score-0.225]
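The per-word sampling steps above can be illustrated with a minimal sketch of the LDA generative process. The start of the generative description is truncated in this listing, so the draws `theta_d ~ Dir(alpha)` and `Phi_k ~ Dir(beta)` are filled in from the standard LDA definition (Blei et al., 2003) and should be read as assumptions; the function names, hyperparameter values and corpus sizes are hypothetical. Topics are stored as integer indices rather than as the K-binary vectors the paper uses.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(phi, alpha, n_words, rng):
    """Generate one document under the (assumed) standard LDA process.

    theta_d ~ Dir(alpha); then for each word:
    z_dn ~ Mult(theta_d) and w_dn ~ Mult(phi[z_dn]).
    """
    num_topics = len(phi)
    vocab_size = len(phi[0])
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(num_topics), weights=theta)[0]
        w = rng.choices(range(vocab_size), weights=phi[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical sizes and hyperparameters, chosen only for illustration.
rng = random.Random(0)
K, V = 3, 8
phi = [sample_dirichlet([0.5] * V, rng) for _ in range(K)]  # one topic per row
theta, z, w = generate_document(phi, [0.5] * K, n_words=20, rng=rng)
```

Each row of `phi` plays the role of one topic $\Phi_k$, and `z[n]` is the index of the non-zero entry of $z_{dn}$.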

30 , 2012), the posterior distribution by Bayes’ rule is equivalent to the solution of an information-theoretical optimization problem $\min_{q(\Theta,Z,\Phi)} \mathrm{KL}(q(\Theta,Z,\Phi)\,\|\,p_0(\Theta,Z,\Phi)) - \mathbb{E}_q[\log p(W|Z,\Phi)]$ s. [sent-69, score-0.181]
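A short worked derivation may help show why this objective recovers Bayes' rule. It is not part of the extracted text; it assumes the truncated constraint only requires $q$ to be a valid probability distribution, and it writes $p(\Theta,Z,\Phi \mid W) = p_0(\Theta,Z,\Phi)\,p(W \mid Z,\Phi)/p(W)$ for the usual posterior.

```latex
\begin{aligned}
&\mathrm{KL}\big(q \,\|\, p_0\big) - \mathbb{E}_q[\log p(W \mid Z, \Phi)] \\
&\quad= \mathbb{E}_q\!\left[\log \frac{q(\Theta,Z,\Phi)}{p_0(\Theta,Z,\Phi)\,p(W \mid Z,\Phi)}\right] \\
&\quad= \mathrm{KL}\big(q \,\|\, p(\Theta,Z,\Phi \mid W)\big) - \log p(W).
\end{aligned}
```

Since $\log p(W)$ does not depend on $q$ and the KL divergence is non-negative and zero only when its arguments agree, the minimizer is $q = p(\Theta,Z,\Phi \mid W)$.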

31 Logistic classifier: To consider binary supervising information, a logistic supervised topic model (e. [sent-72, score-0.549]

32 , sLDA) builds a logistic classifier using the topic representations as input features: $p(y = 1|\eta, z) = \frac{\exp(\eta^\top \bar{z})}{1 + \exp(\eta^\top \bar{z})}$, (2) where $\bar{z}$ is a $K$-vector with $\bar{z}_k = \frac{1}{N}\sum_{n=1}^{N} I(z_n^k = 1)$, and $I(\cdot)$ is an indicator function that equals 1 if a predicate holds and 0 otherwise. [sent-74, score-0.417]

33 If the classifier weights η and topic assignments z are given, the prediction rule is ŷ|η,z = I(p(y = 1|η, z) > 0. [sent-75, score-0.217]
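As an illustration of the classifier in (2), the following sketch computes the average topic assignment $\bar{z}$ and the logistic probability. The function names are hypothetical, topics are stored as integer indices, and the decision threshold of 0.5 is an assumption: the threshold in the sentence above is cut off in this listing, and 0.5 is simply the conventional choice for a binary logistic classifier.

```python
import math


def topic_proportions(z, num_topics):
    """Return z_bar, where z_bar[k] is the fraction of words assigned to topic k."""
    counts = [0] * num_topics
    for k in z:
        counts[k] += 1
    n = len(z)
    return [c / n for c in counts]


def predict_proba(eta, z_bar):
    """p(y = 1 | eta, z) = exp(eta^T z_bar) / (1 + exp(eta^T z_bar))."""
    score = sum(e * zb for e, zb in zip(eta, z_bar))
    # Branch on the sign of the score to avoid overflow in exp().
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def predict_label(eta, z_bar, threshold=0.5):
    """Predict 1 when the probability exceeds the (assumed) threshold."""
    return 1 if predict_proba(eta, z_bar) > threshold else 0
```

For example, a four-word document with topic indices `[0, 0, 1, 2]` and `K = 3` gives `z_bar = [0.5, 0.25, 0.25]`, which is then scored against the weight vector `eta`.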

34 In fact, this choice is motivated by the observation that logistic loss has been widely used as a convex surrogate loss for the misclassification error. (Footnote 1: a K-binary vector with only one entry equal to 1.) [sent-79, score-0.437]

35 Also, note that the logistic classifier and the LDA likelihood are coupled by sharing the latent topic assignments z. [sent-82, score-0.546]

36 The strong coupling makes it possible to learn a posterior distribution that can describe the observed words well and make accurate predictions. [sent-83, score-0.181]

37 Regularized Bayesian Inference: To integrate the above two components for hybrid learning, a logistic supervised topic model solves the joint Bayesian inference problem $\min_{q(\eta,\Theta,Z,\Phi)} \mathcal{L}(q(\eta,\Theta,Z,\Phi)) + c\,\mathcal{R}(q(\eta,Z))$ s. [sent-84, score-0.497]

38 Then, the generalized inference problem (5) of logistic supervised topic models can be written in the “standard” Bayesian inference form (1): $\min_{q(\eta,\Theta,Z,\Phi)} \mathcal{L}(q(\eta,\Theta,Z,\Phi)) - \mathbb{E}_q[\log \psi(y|Z,\eta)]$ s. [sent-90, score-0.608]

39 It is easy to show that the optimum solution of problem (5) or the equivalent problem (7) is the posterior distribution with supervising information, i. [sent-93, score-0.226]

40 We can see that when c = 1, the model reduces to the standard sLDA, which in practice has the imbalance issue that the response variable (which can be viewed as one additional word) is usually dominated by the words. [sent-97, score-0.151]
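The role of the regularization constant can be sketched numerically. The exact pseudo-likelihood is not spelled out in this listing, so the sketch assumes it is the logistic likelihood raised to the power `c`, which is one form consistent with the statement that `c = 1` recovers standard sLDA. The function names and the example numbers (100 words, each with probability 0.01) are hypothetical.

```python
import math


def log_logistic_likelihood(y, score):
    """log p(y | score) for y in {0, 1}, i.e. y*score - log(1 + exp(score))."""
    # log(1 + exp(s)) computed stably as max(s, 0) + log1p(exp(-|s|)).
    log_norm = max(score, 0.0) + math.log1p(math.exp(-abs(score)))
    return y * score - log_norm


def log_pseudo_likelihood(y, score, c):
    """Assumed generalized form: log psi = c * log p(y | score)."""
    return c * log_logistic_likelihood(y, score)


# Hypothetical document: 100 words, each with probability 0.01 under the model.
word_log_lik = 100 * math.log(0.01)
response_c1 = log_pseudo_likelihood(1, 0.0, c=1.0)    # standard sLDA scale
response_c25 = log_pseudo_likelihood(1, 0.0, c=25.0)  # response weighted 25 times
```

Under these made-up numbers the word term is several hundred times larger in magnitude than the response term at `c = 1`, which is the imbalance the sentence above describes; raising `c` scales the response term linearly.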

41 Comparison with MedLDA: The above formulation of logistic supervised topic models as an instance of regularized Bayesian inference provides a direct comparison with the max-margin supervised topic model (MedLDA) (Jiang et al. [sent-101, score-0.734]

42 The difference lies in the posterior regularization, for which MedLDA uses a hinge loss of an expected classifier while the logistic supervised topic model uses an expected log-logistic loss. [sent-103, score-0.709]
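The distinction between a loss of an expected classifier and an expected loss can be made concrete with a small sketch. It uses margin-based losses with labels mapped to {-1, +1} and a unit-margin hinge, both of which are simplifying assumptions rather than the exact MedLDA or gSLDA formulations; the function names and sample margins are hypothetical.

```python
import math


def hinge_loss(margin):
    """Unit-margin hinge loss (assumed form)."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """log(1 + exp(-margin)), computed stably."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def loss_of_expected(loss, margins):
    """Apply the loss to the average margin (loss of an expected classifier)."""
    return loss(sum(margins) / len(margins))


def expected_loss(loss, margins):
    """Average the loss over margins (expected loss)."""
    return sum(loss(m) for m in margins) / len(margins)


# Hypothetical margins, standing in for posterior samples of y * eta^T z_bar.
margins = [-1.0, 0.5, 3.0]
```

Because both losses are convex, Jensen's inequality guarantees that the expected loss is never smaller than the loss of the expected margin, so the two regularizers penalize posterior uncertainty differently.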

43 , 2013a) is another max-margin model that adopts the expected hinge loss as posterior regularization. [sent-105, score-0.226]

44 As we shall see in the experiments, by using appropriate regularization constants, logistic supervised topic models achieve performance comparable to that of max-margin methods. [sent-106, score-0.579]

45 We note that the relationship between a logistic loss and a hinge loss has been discussed extensively in various settings (Rosasco et al. [sent-107, score-0.393]

46 But the presence of latent variables poses additional challenges in carrying out a formal theoretical analysis of these surrogate losses (Lin, 2001) in the topic model setting. [sent-110, score-0.312]

47 2 Variational Approximation Algorithms. The commonly used normal prior for η is non-conjugate to the logistic likelihood, which makes the posterior inference hard. [sent-112, score-0.419]

48 Moreover, the latent variables Z make the inference problem harder than that of Bayesian logistic regression models (Chen et al. [sent-113, score-0.415]

49 Previous algorithms to solve problem (5) rely on variational approximation techniques. [sent-116, score-0.147]

50 It is easy to show that the variational method (Wang et al. [sent-117, score-0.116]

51 , 2009) results in an EM algorithm, which still needs to make strict mean-field assumptions together with a variational bound of the expectation of the log-logistic likelihood. [sent-120, score-0.182]

52 In this paper, we consider the full Bayesian treatment, which can principally consider prior distributions and infer the posterior covariance. [sent-121, score-0.161]

53 3 A Gibbs Sampling Algorithm. Now, we present a simple and efficient Gibbs sampling algorithm for the generalized Bayesian logistic supervised topic models. [sent-122, score-0.662]

54 1 Formulation with Data Augmentation. Since the logistic pseudo-likelihood $\psi(y|Z, \eta)$ is not conjugate with normal priors, it is not easy to derive the sampling algorithms directly. [sent-124, score-0.387]
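For orientation, the scale-mixture identity that Polya-Gamma data augmentation relies on can be written out. The form below is the general identity of Polson et al.; the specialization to this model assumes the per-document pseudo-likelihood is $(e^{\omega_d})^{c y_d} / (1 + e^{\omega_d})^{c}$ with $\omega_d = \eta^\top \bar{z}_d$, which yields $\kappa_d = c\,(y_d - 1/2)$. These definitions are not stated in the sentences extracted here and should be read as assumptions.

```latex
\frac{(e^{\omega_d})^{c\,y_d}}{(1 + e^{\omega_d})^{c}}
  = 2^{-c}\, e^{\kappa_d \omega_d}
    \int_0^{\infty} \exp\!\left(-\frac{\lambda_d\,\omega_d^{2}}{2}\right)
    p(\lambda_d \mid c, 0)\, d\lambda_d,
\qquad
\kappa_d = c\left(y_d - \tfrac{1}{2}\right),
```

where $p(\lambda_d \mid c, 0)$ is the density of a Polya-Gamma $\mathrm{PG}(c, 0)$ variable. Conditioned on $\lambda_d$, the integrand is Gaussian in $\omega_d$, which is what makes the classifier weights conditionally normal.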

55 This result indicates that the posterior distribution of the generalized Bayesian logistic supervised topic models, i. [sent-134, score-0.693]

56 , q(η, Θ, Z, Φ), can be expressed as the marginal of a higher dimensional distribution that includes the augmented variables λ. [sent-136, score-0.139]

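57 The complete posterior distribution is $q(\eta,\lambda,\Theta,Z,\Phi) = \frac{p_0(\eta,\Theta,Z,\Phi)\,p(W|Z,\Phi)\,\phi(y,\lambda|Z,\eta)}{\psi(y,W)}$, where the pseudo-joint distribution of $y$ and $\lambda$ is $\phi(y,\lambda|Z,\eta) = \prod_d \exp\left(\kappa_d\omega_d - \frac{\lambda_d\omega_d^2}{2}\right) p(\lambda_d|c,0)$. [sent-137, score-0.228]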

58 2 Inference with Collapsed Gibbs Sampling. Although we can do Gibbs sampling to infer the complete posterior distribution q(η, λ, Θ, Z, Φ) and thus q(η, Θ, Z, Φ) by ignoring λ, the mixing rate would be slow due to the large sample space. [sent-139, score-0.372]

59 One way to effectively improve mixing rates is to integrate out the intermediate variables (Θ, Φ) and build a Markov chain whose equilibrium distribution is the marginal distribution q(η, λ, Z). [sent-140, score-0.227]

60 Then, the conditional distributions used in collapsed Gibbs sampling are as follows. [sent-143, score-0.241]

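61 For η: for the commonly used isotropic Gaussian prior $p_0(\eta) = \prod_k \mathcal{N}(\eta_k; 0, \nu^2)$, we have $q(\eta|Z,\lambda) \propto p_0(\eta)\prod_d \exp\left(\kappa_d\omega_d - \frac{\lambda_d\omega_d^2}{2}\right) = \mathcal{N}(\eta; \mu, \Sigma)$, (8) where the posterior mean is $\mu = \Sigma\left(\sum_d \kappa_d \bar{z}_d\right)$ and the covariance is $\Sigma = \left(\frac{1}{\nu^2} I + \sum_d \lambda_d \bar{z}_d \bar{z}_d^\top\right)^{-1}$. [sent-144, score-0.134]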
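The Gaussian conditional for η can be sketched in plain Python as follows. The sketch builds the precision matrix $\Sigma^{-1} = \nu^{-2} I + \sum_d \lambda_d \bar{z}_d \bar{z}_d^\top$, solves for the mean with a Cholesky factorization, and draws one sample. All function names are hypothetical, and the values `kappas` are taken as inputs because their definition is not given in this listing. This is not the authors' C++ implementation.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve_lower(l, b):
    """Solve L x = b by forward substitution."""
    x = [0.0] * len(b)
    for i in range(len(b)):
        x[i] = (b[i] - sum(l[i][k] * x[k] for k in range(i))) / l[i][i]
    return x


def solve_lower_transposed(l, b):
    """Solve L^T x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


def eta_posterior(z_bars, lambdas, kappas, nu2):
    """Return (mu, L), where L is the Cholesky factor of the precision matrix."""
    k_dim = len(z_bars[0])
    precision = [[(1.0 / nu2 if i == j else 0.0) for j in range(k_dim)]
                 for i in range(k_dim)]
    rhs = [0.0] * k_dim
    for zb, lam, kap in zip(z_bars, lambdas, kappas):
        for i in range(k_dim):
            rhs[i] += kap * zb[i]
            for j in range(k_dim):
                precision[i][j] += lam * zb[i] * zb[j]
    l = cholesky(precision)
    mu = solve_lower_transposed(l, solve_lower(l, rhs))
    return mu, l


def sample_eta(z_bars, lambdas, kappas, nu2, rng):
    """Draw eta ~ N(mu, Sigma) using the precision factor."""
    mu, l = eta_posterior(z_bars, lambdas, kappas, nu2)
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    # If the precision is L L^T, then L^{-T} eps has covariance (L L^T)^{-1}.
    dev = solve_lower_transposed(l, eps)
    return [m + d for m, d in zip(mu, dev)]
```

Working with the precision matrix avoids forming $\Sigma$ explicitly; the cost is cubic in the number of topics but linear in the number of documents.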

62 For λ: the conditional distribution of $\lambda_d$ is $q(\lambda_d|Z,\eta) \propto \exp\left(-\frac{\lambda_d\omega_d^2}{2}\right) p(\lambda_d|c,0) = \mathrm{PG}(\lambda_d; c, \omega_d)$, (10) [sent-151, score-0.329]

63 Algorithm 1 for collapsed Gibbs sampling: [sent-152, score-0.552]
1: Initialization: set λ = 1 and randomly draw $z_{dn}$ from a uniform distribution.
2: for m = 1 to M do
3:   draw a classifier from the distribution (8)
4:   for d = 1 to D do
5:     for each word n in document d do
6:       draw the topic using distribution (9)
7:     end for
8:     draw $\lambda_d$ from distribution (10).
9:   end for
10: end for

64 Here (10) is a Polya-Gamma distribution. [sent-153, score-0.119]
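Drawing from a Polya-Gamma distribution is the least familiar step, so a rough sketch may help. It uses the truncated infinite-sum-of-gammas representation of PG(b, c), which yields only approximate draws; this is a swapped-in illustration and not necessarily the sampler the authors used. The function names and the truncation level `num_terms` are hypothetical.

```python
import math
import random


def sample_polya_gamma(b, c, rng, num_terms=100):
    """Approximate draw from PG(b, c) by truncating the sum-of-gammas series.

    PG(b, c) is distributed as
    (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Gamma(b, 1) independent. Truncation slightly underestimates
    the tail of the series.
    """
    shift = (c * c) / (4.0 * math.pi ** 2)
    total = 0.0
    for k in range(1, num_terms + 1):
        total += rng.gammavariate(b, 1.0) / ((k - 0.5) ** 2 + shift)
    return total / (2.0 * math.pi ** 2)


def polya_gamma_mean(b, c):
    """Exact mean of PG(b, c): b / (2c) * tanh(c / 2), with limit b / 4 at c = 0."""
    if c == 0.0:
        return b / 4.0
    return b / (2.0 * c) * math.tanh(c / 2.0)
```

In step 8 of the algorithm this would be called once per document with `b` set to the regularization constant and `c` set to $\omega_d$, assuming $\omega_d = \eta^\top \bar{z}_d$.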

65 3 Prediction. To apply the classifier $\hat{\eta}$ on testing data, we need to infer their topic assignments. [sent-169, score-0.18]

66 We implemented the sampling algorithm in C++ together with our topic model sampler. [sent-176, score-0.296]

67 $C^k_{\neg n}$ is the number of times that the terms in this document w are assigned to topic k, with the n-th term excluded. [sent-177, score-0.173]
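The excluded-term count can be computed directly, as in this small sketch. The function name is hypothetical, and topics are stored as integer indices.

```python
def topic_counts_excluding(z, n, num_topics):
    """Count topic assignments in one document, leaving out position n."""
    counts = [0] * num_topics
    for i, k in enumerate(z):
        if i != n:
            counts[k] += 1
    return counts
```

A collapsed sampler would normally keep running counts and decrement the entry for the current word rather than recounting, but the result is the same.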

68 4 Experiments. We present empirical results and sensitivity analysis to demonstrate the efficiency and prediction performance (see footnote 3) of the generalized logistic supervised topic models on the 20Newsgroups (20NG) data set, which contains about 20,000 postings within 20 news groups. [sent-178, score-0.571]

69 We compare the generalized logistic supervised LDA using Gibbs sampling (denoted by gSLDA) with various competitors, including the standard sLDA using variational mean-field methods (denoted by vSLDA) (Wang et al. [sent-195, score-0.632]

70 , 2009), the MedLDA model using variational mean-field methods (denoted by vMedLDA) (Zhu et al. [sent-196, score-0.116]

71 , 2012), and the MedLDA model using collapsed Gibbs sampling algorithms (denoted by gMedLDA) (Jiang et al. [sent-197, score-0.214]

72 We also include the unsupervised LDA using collapsed Gibbs sampling as a baseline, denoted by gLDA. [sent-199, score-0.214]

73 For gLDA, we learn a binary linear SVM on its topic representations using SVMLight (Joachims, 1999). [sent-200, score-0.201]

74 As we shall see, gSLDA is insensitive to α, ... (Footnote 3: Due to space limits, the topic visualization, similar to that of MedLDA, is deferred to a longer version.) [sent-207, score-0.188]

75 Figure 1: Accuracy, training time (in log-scale) and testing time on the 20NG binary data set. Panel (c): testing time; x-axis: # Topics. [sent-210, score-0.136]

76 In contrast, the well-balanced gSLDA+ model successfully outperforms the two-stage approach, gLDA+SVM, by performing topic discovery and prediction jointly (see footnote 4). [sent-216, score-0.179]

77 For testing time, gSLDA and gSLDA+ are comparable with gMedLDA and the unsupervised gLDA, but faster than the variational vMedLDA and vSLDA, especially when K is large. [sent-220, score-0.116]

78 For multiclass classification, one possible extension is to use a multinomial logistic regression model for categorical variables Y by using topic representations z as input features. [sent-223, score-0.547]

79 Footnote 4: The variational sLDA with a well-tuned c is significantly better than the standard sLDA, but a bit inferior to gSLDA+. [sent-224, score-0.116]

80 However, it is non-trivial to develop a Gibbs sampling algorithm using a similar data augmentation idea, due to the presence of latent variables and the nonlinearity of the soft-max function. [sent-225, score-0.342]

81 In fact, this is harder than the multinomial Bayesian logistic regression, which can be done via a coordinate strategy (Polson et al. [sent-226, score-0.273]

82 But given the performance gain in the binary task, we believe that the Gibbs sampling algorithm ... [sent-239, score-0.205]

83 For training time, gSLDA models are about 10 times faster than variational vSLDA. [sent-267, score-0.116]

84 Table 1 shows in detail the percentages of the training time (see the numbers in brackets) spent at each sampling step for gSLDA+. [sent-268, score-0.177]

85 We can see that: 1) sampling the global variables η is very efficient, while sampling the local variables (λ, Z) is much more expensive; and 2) sampling λ is relatively stable as K increases, while sampling Z takes more time as K becomes larger. [sent-269, score-0.851]

86 But the good news is that our Gibbs sampling algorithm can be easily parallelized to speed up the sampling of local variables, following architectures similar to those used for LDA. [sent-270, score-0.326]

87 Since GAS has been successfully applied to several machine learning algorithms (see footnote 5), including Gibbs sampling of LDA, we choose it as a preliminary attempt to parallelize our Gibbs sampling algorithm. [sent-275, score-0.3]

88 Each node is responsible for learning one binary gSLDA classifier with a parallel implementation on its 12 cores. [sent-286, score-0.116]

89 Once started, sampling and signaling will propagate over the graph. [sent-289, score-0.15]

90 When M = 0 (see the left-most points), the models are built on random topic assignments. [sent-299, score-0.146]

91 Even when we use 40 or 60 burn-in steps, the training time is still competitive, compared with the variational vSLDA. [sent-319, score-0.143]

92 Empirical results for both binary and multi-class classification demonstrate significant improvements over the existing logistic supervised topic models. [sent-352, score-0.544]

93 Our preliminary results with GraphLab have shown promise on parallelizing the Gibbs sampling algorithm. [sent-353, score-0.15]

94 , 2009; Smola and Narayanamurthy, 2010), to make the sampling algorithm highly scalable to deal with massive data corpora. [sent-358, score-0.15]

95 Moreover, the data augmentation technique can be applied to deal with other types of response variables, such as count data with a negative-binomial likelihood (Polson et al. [sent-359, score-0.181]

96 Prior elicitation, variable selection and Bayesian computation for logistic regression models. [sent-397, score-0.275]

97 Bayesian auxiliary variable models for binary and multinomial regression. [sent-446, score-0.129]

98 Monte Carlo methods for maximum margin supervised topic models. [sent-455, score-0.212]

99 Bayesian inference for logistic models using Polya-Gamma latent variables. [sent-503, score-0.323]

100 Bayesian inference with posterior regularization and applications to infinite latent svms. [sent-579, score-0.308]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gslda', 0.698), ('logistic', 0.237), ('medlda', 0.163), ('gibbs', 0.159), ('polson', 0.157), ('slda', 0.157), ('sampling', 0.15), ('topic', 0.146), ('posterior', 0.134), ('zdn', 0.119), ('yd', 0.118), ('bayesian', 0.117), ('variational', 0.116), ('variables', 0.092), ('glda', 0.089), ('regularization', 0.088), ('zd', 0.079), ('graphlab', 0.074), ('vslda', 0.074), ('lda', 0.07), ('zhu', 0.069), ('draw', 0.068), ('supervised', 0.066), ('response', 0.066), ('collapsed', 0.064), ('loss', 0.064), ('generalized', 0.063), ('augmentation', 0.062), ('binary', 0.055), ('likelihood', 0.053), ('gonzalez', 0.048), ('inference', 0.048), ('distribution', 0.047), ('imbalance', 0.047), ('halpern', 0.045), ('parallelgslda', 0.045), ('supervising', 0.045), ('wdn', 0.045), ('shall', 0.042), ('tsinghua', 0.042), ('mixing', 0.041), ('classification', 0.04), ('pg', 0.04), ('stable', 0.04), ('dd', 0.04), ('rosasco', 0.039), ('assignments', 0.038), ('latent', 0.038), ('variable', 0.038), ('jmlr', 0.038), ('supervision', 0.037), ('svm', 0.037), ('surrogate', 0.036), ('multiclass', 0.036), ('tanner', 0.036), ('misclassification', 0.036), ('gas', 0.036), ('dirichlet', 0.036), ('multinomial', 0.036), ('eq', 0.036), ('assumptions', 0.035), ('classifier', 0.034), ('exp', 0.033), ('restricting', 0.033), ('prediction', 0.033), ('magnitudes', 0.033), ('strict', 0.031), ('approximation', 0.031), ('constant', 0.03), ('mult', 0.03), ('burn', 0.03), ('cdk', 0.03), ('dexp', 0.03), ('disclda', 0.03), ('dkn', 0.03), ('dyk', 0.03), ('germain', 0.03), ('gmedlda', 0.03), ('niart', 0.03), ('rifkin', 0.03), ('snoc', 0.03), ('vmedlda', 0.03), ('priors', 0.029), ('dc', 0.028), ('hinge', 0.028), ('samples', 0.027), ('blei', 0.027), ('dir', 0.027), ('wd', 0.027), ('document', 0.027), ('distributions', 0.027), ('time', 0.027), ('doesn', 0.027), ('parallel', 0.027), ('distrib', 0.026), ('narayanamurthy', 0.026), ('postings', 0.026), ('jiang', 0.026), ('architectures', 0.026), ('formulation', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

Author: Jun Zhu ; Xun Zheng ; Bo Zhang

2 0.12932143 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

Author: Efsun Sarioglu ; Kabir Yadav ; Hyeong-Ah Choi

Abstract: Electronic health records (EHRs) contain important clinical information about patients. Some of these data are in the form of free text and require preprocessing to be used in automated systems. Efficient and effective use of this data could be vital to the speed and quality of health care. As a case study, we analyzed classification of CT imaging reports into binary categories. In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways. Topic modeling of the corpora provides interpretable themes that exist in these reports. Representing reports according to their topic distributions is more compact than bag-of-words representation and can be processed faster than raw text in subsequent automated processes. A binary topic model was also built as an unsupervised classification approach with the assumption that each topic corresponds to a class. Finally, an aggregate topic classifier was built where reports are classified based on a single discriminative topic that is determined from the training dataset. Our proposed topic based classifier system is shown to be competitive with existing text classification techniques and provides a more efficient and interpretable representation.

3 0.11634614 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

Author: Zede Zhu ; Miao Li ; Lei Chen ; Zhenxin Yang

Abstract: Comparable corpora are important basic resources in cross-language information processing. However, the existing methods of building comparable corpora, which use intertranslate words and relative features, cannot evaluate the topical relation between document pairs. This paper adopts the bilingual LDA model to predict the topical structures of the documents and proposes three algorithms of document similarity in different languages. Experiments show that the novel method can obtain similar documents with consistent topics, with better adaptability and stability.

4 0.10901002 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

Author: Sujith Ravi

Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report, for the first time, BLEU score results for a large-scale MT task using only non-parallel data (EMEA corpus).

5 0.1015051 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsupervised accuracy of 89% across the same set of languages.

6 0.098959662 224 acl-2013-Learning to Extract International Relations from Political Context

7 0.093712673 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

8 0.087644309 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

9 0.079870522 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

10 0.0788913 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

11 0.078552179 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

12 0.078084089 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

13 0.07607305 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

14 0.072800331 121 acl-2013-Discovering User Interactions in Ideological Discussions

15 0.070151664 382 acl-2013-Variational Inference for Structured NLP Models

16 0.067947142 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression

17 0.067558408 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules

18 0.05649019 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

19 0.054639108 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

20 0.054543734 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.139), (1, 0.032), (2, 0.019), (3, -0.001), (4, 0.045), (5, -0.046), (6, 0.088), (7, -0.014), (8, -0.153), (9, -0.05), (10, 0.074), (11, -0.0), (12, 0.054), (13, -0.011), (14, -0.034), (15, -0.108), (16, -0.072), (17, 0.095), (18, 0.004), (19, -0.082), (20, 0.033), (21, 0.012), (22, 0.003), (23, 0.033), (24, -0.017), (25, 0.005), (26, -0.051), (27, -0.076), (28, 0.034), (29, -0.016), (30, 0.115), (31, -0.003), (32, 0.052), (33, 0.044), (34, 0.0), (35, 0.061), (36, -0.005), (37, -0.018), (38, 0.009), (39, 0.047), (40, -0.027), (41, -0.041), (42, -0.05), (43, 0.009), (44, -0.029), (45, -0.037), (46, -0.056), (47, 0.088), (48, 0.019), (49, 0.107)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95318776 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

Author: Jun Zhu ; Xun Zheng ; Bo Zhang

2 0.71992517 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

Author: Efsun Sarioglu ; Kabir Yadav ; Hyeong-Ah Choi

3 0.66821092 257 acl-2013-Natural Language Models for Predicting Programming Comments

Author: Dana Movshovitz-Attias ; William W. Cohen

Abstract: Statistical language models have successfully been used to describe and analyze natural language documents. Recent work applying language models to programming languages is focused on the task of predicting code, while mainly ignoring the prediction of programmer comments. In this work, we predict comments from JAVA source files of open source projects, using topic models and n-grams, and we analyze the performance of the models given varying amounts of background data on the project being predicted. We evaluate models on their comment-completion capability in a setting similar to code-completion tools built into standard code editors, and show that using a comment completion tool can save up to 47% of the comment typing.
1 Introduction and Related Work
Statistical language models have traditionally been used to describe and analyze natural language documents. Recently, software engineering researchers have adopted the use of language models for modeling software code. Hindle et al. (2012) observe that, as code is created by humans it is likely to be repetitive and predictable, similar to natural language. NLP models have thus been used for a variety of software development tasks such as code token completion (Han et al., 2009; Jacob and Tairas, 2010), analysis of names in code (Lawrie et al., 2006; Binkley et al., 2011) and mining software repositories (Gabel and Su, 2008). An important part of software programming and maintenance lies in documentation, which may come in the form of tutorials describing the code, or inline comments provided by the programmer. The documentation provides a high level description of the task performed by the code, and may include examples of use-cases for specific code segments or identifiers such as classes, methods and variables.
Well documented code is easier to read and maintain in the long run, but writing comments is a laborious task that is often overlooked or at least postponed by many programmers. Code commenting not only provides a summarization of the conceptual idea behind the code (Sridhara et al., 2010), but can also be viewed as a form of document expansion where the comment contains significant terms relevant to the described code. Accurately predicted comment words can therefore be used for a variety of linguistic uses including improved search over code bases using natural language queries, code categorization, and locating parts of the code that are relevant to a specific topic or idea (Tseng and Juang, 2003; Wan et al., 2007; Kumar and Carterette, 2013; Shepherd et al., 2007; Rastkar et al., 2011). A related and well studied NLP task is that of predicting natural language caption and commentary for images and videos (Blei and Jordan, 2003; Feng and Lapata, 2010; Feng and Lapata, 2013; Wu and Li, 2011).

In this work, our goal is to apply statistical language models for predicting class comments. We show that n-gram models are extremely successful in this task, and can lead to a saving of up to 47% in comment typing. This is expected as n-grams have been shown as a strong model for language and speech prediction that is hard to improve upon (Rosenfeld, 2000). In some cases however, for example in a document expansion task, we wish to extract important terms relevant to the code regardless of local syntactic dependencies. We hence also evaluate the use of LDA (Blei et al., 2003) and link-LDA (Erosheva et al., 2004) topic models, which are more relevant for the term extraction scenario. We find that the topic model performance can be improved by distinguishing code and text tokens in the code.

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 35–40, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics

2 Method

2.1 Models

We train n-gram models (n = 1, 2, 3) over source code documents containing sequences of combined code and text tokens from multiple training datasets (described below). We use the Berkeley Language Model package (Pauls and Klein, 2011) with absolute discounting (Kneser-Ney smoothing; Kneser and Ney, 1995) which includes a backoff strategy to lower-order n-grams. Next, we use LDA topic models (Blei et al., 2003) trained on the same data, with 1, 5, 10 and 20 topics. The joint distribution of a topic mixture θ, and a set of N topics z, for a single source code document with N observed word tokens, $d = \{w_i\}_{i=1}^{N}$, given the Dirichlet parameters α and β, is therefore

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{w} p(z \mid \theta)\, p(w \mid z, \beta) \tag{1}$$

Under the models described so far, there is no distinction between text and code tokens. Finally, we consider documents as having a mixed membership of two entity types, code and text tokens, $d = (\{w_i^{\mathrm{code}}\}_{i=1}^{C_n}, \{w_i^{\mathrm{text}}\}_{i=1}^{T_n})$, where the text words are tokens from comment and string literals, and the code words include the programming language syntax tokens (e.g., public, private, for, etc.) and all identifiers. In this case, we train link-LDA models (Erosheva et al., 2004) with 1, 5, 10 and 20 topics. Under the link-LDA model, the mixed-membership joint distribution of a topic mixture, words and topics is then

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \cdot \prod_{w^{\mathrm{text}}} p(z^{\mathrm{text}} \mid \theta)\, p(w^{\mathrm{text}} \mid z^{\mathrm{text}}, \beta) \cdot \prod_{w^{\mathrm{code}}} p(z^{\mathrm{code}} \mid \theta)\, p(w^{\mathrm{code}} \mid z^{\mathrm{code}}, \beta) \tag{2}$$

where θ is the joint topic distribution, w is the set of observed document words, $z^{\mathrm{text}}$ is a topic associated with a text word, and $z^{\mathrm{code}}$ a topic associated with a code word. The LDA and link-LDA models use Gibbs sampling (Griffiths and Steyvers, 2004) for topic inference, based on the implementation of Balasubramanyan and Cohen (2011) with single or multiple entities per document, respectively.
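To make the two joint distributions of Section 2.1 concrete, the following is a minimal sketch of how their logarithm could be evaluated for fixed values of θ and z. It is an illustration only, not the Gibbs sampler used in the paper. The function names, the list-based data layout, and the choice to give each entity type its own word-distribution table `beta` are assumptions made for this example.

```python
import math


def log_dirichlet_pdf(theta, alpha):
    """Log density of a Dirichlet(alpha) distribution evaluated at theta."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))


def log_joint(theta, alpha, entities):
    """Log of p(theta, z, w | alpha, beta) for one document.

    entities holds one (topics, words, beta) triple per entity type
    (hypothetical layout): topics[i] is the topic of token i, words[i]
    is its word id, and beta[k][v] = p(word v | topic k).
    One triple corresponds to plain LDA; two triples (text and code)
    correspond to the link-LDA factorization.
    """
    total = log_dirichlet_pdf(theta, alpha)
    for topics, words, beta in entities:
        for z_i, w_i in zip(topics, words):
            # p(z | theta) * p(w | z, beta), accumulated in log space
            total += math.log(theta[z_i]) + math.log(beta[z_i][w_i])
    return total


# Toy document: two topics, one text token and one code token.
theta = [0.5, 0.5]
alpha = [1.0, 1.0]
beta_text = [[0.5, 0.5], [0.5, 0.5]]
beta_code = [[0.5, 0.5], [0.5, 0.5]]
toy_value = log_joint(theta, alpha, [([0], [1], beta_text), ([1], [0], beta_code)])
```

Working in log space avoids numerical underflow when a document has many tokens, which is why the products in the equations become sums here.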
2.2 Testing Methodology

Our goal is to predict the tokens of the JAVA class comment (the one preceding the class definition) in each of the test files. Each of the models described above assigns a probability to the next comment token. In the case of n-grams, the probability of a token word $w_i$ is given by considering previous words, $p(w_i \mid w_{i-1}, \ldots, w_0)$. This probability is estimated given the previous n − 1 tokens as $p(w_i \mid w_{i-1}, \ldots, w_{i-(n-1)})$.

For the topic models, we separate the document tokens into the class definition and the comment we wish to predict. The set of tokens of the class comment, $w^c$, are all considered as text tokens. The rest of the tokens in the document, $w^r$, are considered to be the class definition, and they may contain both code and text tokens (from string literals and other comments in the source file). We then compute the posterior probability of document topics by solving the following inference problem conditioned on the $w^r$ tokens

$$p(\theta, z^r \mid w^r, \alpha, \beta) = \frac{p(\theta, z^r, w^r \mid \alpha, \beta)}{p(w^r \mid \alpha, \beta)} \tag{3}$$

This gives us an estimate of the document distribution, θ, with which we infer the probability of the comment tokens as

$$p(w^c \mid \theta, \beta) = \sum_{z} p(w^c \mid z, \beta)\, p(z \mid \theta) \tag{4}$$

Following Blei et al. (2003), for the case of a single entity LDA, the inference problem from equation (3) can be solved by considering $p(\theta, z, w \mid \alpha, \beta)$, as in equation (1), and by taking the marginal distribution of the document tokens as a continuous mixture distribution for the set $w = w^r$, by integrating over θ and summing over the set of topics z

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \cdot \left( \prod_{w} \sum_{z} p(z \mid \theta)\, p(w \mid z, \beta) \right) d\theta \tag{5}$$

For the case of link-LDA where the document is comprised of two entities, in our case code tokens and text tokens, we can consider the mixed-membership joint distribution θ, as in equation (2), and similarly the marginal distribution $p(w \mid \alpha, \beta)$ over both code and text tokens from $w^r$. Since comment words in $w^c$ are all considered as text tokens they are sampled using text topics, namely $z^{\mathrm{text}}$, in equation (4).

3 Experimental Settings

3.1 Data and Training Methodology

We use source code from nine open source JAVA projects: Ant, Cassandra, Log4j, Maven, MinorThird, Batik, Lucene, Xalan and Xerces. For each project, we divide the source files into a training and testing dataset. Then, for each project in turn, we consider the following three main training scenarios, leading to using three training datasets. To emulate a scenario in which we are predicting comments in the middle of project development, we can use data (documented code) from the same project. In this case, we use the in-project training dataset (IN). Alternatively, if we train a comment prediction model at the beginning of the development, we need to use source files from other, possibly related projects. To analyze this scenario, for each of the projects above we train models using an out-of-project dataset (OUT) containing data from the other eight projects. Typically, source code files contain a greater amount of code versus comment text.
Since we are interested in predicting comments, we consider a third training data source which contains more English text as well as some code segments. We use data from the popular Q&A website StackOverflow (SO) where users ask and answer technical questions about software development, tools, algorithms, etc. We downloaded a dataset of all actions performed on the site since it was launched in August 2008 until August 2012. The data includes 3,453,742 questions and 6,858,133 answers posted by 1,295,620 users. We used only posts that are tagged as JAVA related questions and answers. All the models for each project are then tested on the testing set of that project. We report results averaged over all projects in Table 1.

Source files were tokenized using the Eclipse JDT compiler tools, separating code tokens and identifiers. Identifier names (of classes, methods and variables) were further tokenized by camel case notation (e.g., 'minMargin' was converted to 'min margin'). Non alpha-numeric tokens (e.g., dot, semicolon) were discarded from the code, as well as numeric and single character literals. Text from comments or any string literals within the code was further tokenized with the Mallet statistical natural language processing package (McCallum, 2002). Posts from SO were parsed using the Apache Tika toolkit¹ and then tokenized with the Mallet package. We considered as raw code tokens anything labeled using a code markup (as indicated by the SO users who wrote the post).

3.2 Evaluation

Since our models are trained using various data sources, the vocabularies used by each of them are different, making the comment likelihood given by each model incomparable due to different sets of out-of-vocabulary tokens. We thus evaluate models using a character saving metric which aims at quantifying the percentage of characters that can be saved by using the model in a word-completion setting, similar to standard code completion tools built into code editors.
For a comment word with n characters, $w = w_1, \ldots, w_n$, we predict the two most likely words given each model filtered by the first 0, . . . , n characters of w. Let k be the minimal $k_i$ for which w is in the top two predicted word tokens where tokens are filtered by the first $k_i$ characters. Then, the number of saved characters for w is n − k. In Table 1 we report the average percentage of saved characters per comment using each of the above models. The final results are also averaged over the nine input projects. As an example, in the predicted comment shown in Table 2, taken from the project Minor-Third, the token entity is the most likely token according to the model SO trigram, out of tokens starting with the prefix 'en'. The saved characters in this case are 'tity'.

4 Results

Table 1 displays the average percentage of characters saved per class comment using each of the models. Models trained on in-project data (IN) perform significantly better than those trained on another data source, regardless of the model type, with an average saving of 47.1% characters using a trigram model. This is expected, as files from the same project are likely to contain similar comments, and identifier names that appear in the comment of one class may appear in the code of another class in the same project. Clearly, in-project data should be used when available as it improves comment prediction leading to an average increase of between 6% for the worst model (26.6 for OUT unigram versus 33.05 for IN) and 14% for the best (32.96 for OUT trigram versus 47.1 for IN).
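The character saving metric of Section 3.2 can be sketched in a few lines. This is a simplified illustration rather than the evaluation code used in the paper: it assumes the model's predictions are available as a single list `ranked_vocab` sorted from most to least likely, whereas an n-gram model would re-rank candidates at every position given the preceding words. The function names and the toy vocabulary are hypothetical.

```python
def saved_characters(word, ranked_vocab, top_k=2):
    """Number of characters saved when completing `word`.

    ranked_vocab: candidate words sorted from most to least likely
    (assumed to be supplied by the model under evaluation).
    Returns n - k, where k is the shortest prefix length for which
    `word` is among the top_k candidates sharing that prefix.
    """
    n = len(word)
    for k in range(n + 1):
        prefix = word[:k]
        suggestions = [v for v in ranked_vocab if v.startswith(prefix)][:top_k]
        if word in suggestions:
            return n - k
    return 0  # the word is never suggested, so nothing is saved


def percent_saved(comment_words, ranked_vocab):
    """Percentage of comment characters saved over a whole comment."""
    total = sum(len(w) for w in comment_words)
    saved = sum(saved_characters(w, ranked_vocab) for w in comment_words)
    return 100.0 * saved / total if total else 0.0


# Toy ranking in which 'entity' first reaches the top two after typing 'en'.
toy_ranking = ["extractor", "example", "entity", "end"]
entity_saving = saved_characters("entity", toy_ranking)
```

With this toy ranking, typing 'en' is enough for 'entity' to appear among the top two suggestions, so the four characters 'tity' are saved, mirroring the example in the text.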
¹ http://tika.apache.org/

Model      n / topics   IN             OUT            SO
n-gram     1            33.05 (3.62)   26.6 (3.37)    27.8 (3.51)
n-gram     2            43.27 (5.79)   31.52 (4.17)   33.29 (4.40)
n-gram     3            47.1 (6.87)    32.96 (4.33)   34.56 (4.78)
LDA        20           34.20 (3.63)   26.79 (3.26)   27.25 (3.67)
LDA        10           33.93 (3.67)   26.8 (3.36)    27.22 (3.44)
LDA        5            33.63 (3.67)   26.86 (3.44)   27.34 (3.55)
LDA        1            33.05 (3.62)   26.6 (3.37)    27.8 (3.51)
Link-LDA   20           35.76 (3.95)   28.03 (3.60)   28.08 (3.48)
Link-LDA   10           35.81 (4.12)   28 (3.56)      28.12 (3.58)
Link-LDA   5            35.37 (3.98)   28 (3.67)      27.94 (3.56)
Link-LDA   1            34.59 (3.92)   27.82 (3.62)   27.9 (3.45)

Table 1: Average percentage of characters saved per comment using n-gram, LDA and link-LDA models trained on three training sets: IN, OUT, and SO. The results are averaged over nine JAVA projects (with standard deviations in parentheses).

Model         Predicted Comment
IN trigram    "Train a named-entity extractor"
IN link-LDA   "Train a named-entity extractor"
OUT trigram   "Train a named-entity extractor"
SO trigram    "Train a named-entity extractor"

Table 2: Sample comment from the Minor-Third project predicted using IN, OUT and SO based models. Saved characters are underlined.

Of the out-of-project data sources, models using a greater amount of text (SO) mostly outperformed models based on more code (OUT). This increase in performance, however, comes at a cost of greater run-time due to the larger word dictionary associated with the SO data. Note that in the scope of this work we did not investigate the contribution of each of the background projects used in OUT, and how their relevance to the target prediction project affects their performance. The trigram model shows the best performance across all training data sources (47% for IN, 32% for OUT and 34% for SO). Amongst the tested topic models, link-LDA models which distinguish code and text tokens perform consistently better than simple LDA models in which all tokens are considered as text. We did not however find a correlation between the number of latent topics learned by a topic model and its performance.
In fact, for each of the data sources, a different number of topics gave the optimal character saving results. Note that in this work, all topic models are based on unigram tokens, therefore their results are most comparable with that of the unigram in Table 1, which does not benefit from the backoff strategy used by the bigram and trigram models. By this comparison, the link-LDA topic model proves more successful in the comment prediction task than the simpler models which do not distinguish code and text tokens. Using n-grams without backoff leads to results significantly worse than any of the presented models (not shown).

Dataset   n-gram    link-LDA
IN        2778.35   574.34
OUT       1865.67   670.34
SO        1898.43   638.55

Table 3: Average words per project for which each tested model completes the word better than the other. This indicates that each of the models is better at predicting a different set of comment words.

Table 2 shows a sample comment segment for which words were predicted using trigram models from all training sources and an in-project link-LDA. The comment is taken from the TrainExtractor class in the Minor-Third project, a machine learning library for annotating and categorizing text. Both IN models show a clear advantage in completing the project-specific word Train, compared to models based on out-of-project data (OUT and SO). Interestingly, in this example the trigram is better at completing the term named-entity given the prefix named. However, the topic model is better at completing the word extractor which refers to the target class. This example indicates that each model type may be more successful in predicting different comment words, and that combining multiple models may be advantageous. This can also be seen by the analysis in Table 3 where we compare the average number of words completed better by either the best n-gram or topic model given each training dataset.
Again, while n-grams generally complete more words better, a considerable portion of the words is better completed using a topic model, further motivating a hybrid solution.

5 Conclusions

We analyze the use of language models for predicting class comments for source file documents containing a mixture of code and text tokens. Our experiments demonstrate the effectiveness of using language models for comment completion, showing a saving of up to 47% of the comment characters. When available, using in-project training data proves significantly more successful than using out-of-project data. However, we find that when using out-of-project data, a dataset based on more words than code performs consistently better. The results also show that different models are better at predicting different comment words, which motivates a hybrid solution combining the advantages of multiple models.

Acknowledgments

This research was supported by the NSF under grant CCF-1247088.

References

Ramnath Balasubramanyan and William W Cohen. 2011. Block-lda: Jointly modeling entity-annotated text and entity-entity links. In Proceedings of the 7th SIAM International Conference on Data Mining. Dave Binkley, Matthew Hearn, and Dawn Lawrie. 2011. Improving identifier informativeness using part of speech information. In Proc. of the Working Conference on Mining Software Repositories. ACM. David M Blei and Michael I Jordan. 2003. Modeling annotated data. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM. David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research. Elena Erosheva, Stephen Fienberg, and John Lafferty. 2004. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences of the United States of America. Yansong Feng and Mirella Lapata. 2010. How many words is a picture worth?
automatic caption generation for news images. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. Yansong Feng and Mirella Lapata. 2013. Automatic caption generation for news images. IEEE transactions on pattern analysis and machine intelligence. Mark Gabel and Zhendong Su. 2008. Javert: fully automatic mining of general temporal properties from dynamic traces. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pages 339–349. ACM. Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proc. of the National Academy of Sciences of the United States of America. Sangmok Han, David R Wallace, and Robert C Miller. 2009. Code completion from abbreviated input. In Automated Software Engineering, 2009. ASE’09. 24th IEEE/ACM International Conference on, pages 332–343. IEEE. Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE. Ferosh Jacob and Robert Tairas. 2010. Code template inference using language models. In Proceedings of the 48th Annual Southeast Regional Conference. ACM. Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., volume 1, pages 181–184. IEEE. Naveen Kumar and Benjamin Carterette. 2013. Time based feedback and query expansion for twitter search. In Advances in Information Retrieval, pages 734–737. Springer. Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2006. Whats in a name? a study of identifiers. In Program Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on, pages 3–12. IEEE. Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. Adam Pauls and Dan Klein. 2011. Faster and smaller language models. 
In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 258–267. Sarah Rastkar, Gail C Murphy, and Alexander WJ Bradley. 2011. Generating natural language summaries for crosscutting source code concerns. In Software Maintenance (ICSM), 2011 27th IEEE International Conference on, pages 103–112. IEEE. Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278. David Shepherd, Zachary P Fry, Emily Hill, Lori Pollock, and K Vijay-Shanker. 2007. Using natural language program analysis to locate and understand action-oriented concerns. In Proceedings of the 6th international conference on Aspect-oriented software development, pages 212–224. ACM. Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering, pages 43–52. ACM. Yuen-Hsien Tseng and Da-Wei Juang. 2003. Document-self expansion for text categorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 399–400. ACM. Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Single document summarization with document expansion. In Proc. of the National Conference on Artificial Intelligence. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999. Roung-Shiunn Wu and Po-Chun Li. 2011. Video annotation using hierarchical dirichlet process mixture model. Expert Systems with Applications, 38(4):3040–3048.

4 0.64314944 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

Author: Xiaoming Lu ; Lei Xie ; Cheung-Chi Leung ; Bin Ma ; Haizhou Li

Abstract: We present an efficient approach for broadcast news story segmentation using a manifold learning algorithm on latent topic distributions. The latent topic distribution estimated by Latent Dirichlet Allocation (LDA) is used to represent each text block. We employ Laplacian Eigenmaps (LE) to project the latent topic distributions into low-dimensional semantic representations while preserving the intrinsic local geometric structure. We evaluate two approaches employing LDA and probabilistic latent semantic analysis (PLSA) distributions respectively. The effects of different amounts of training data and different numbers of latent topics on the two approaches are studied. Experimental results show that our proposed LDA-based approach can outperform the corresponding PLSA-based approach. The proposed approach provides the best performance with the highest F1-measure of 0.7860.

5 0.59898663 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

Author: Shay B. Cohen ; Mark Johnson

Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.

6 0.59713328 350 acl-2013-TopicSpam: a Topic-Model based approach for spam detection

7 0.57310212 54 acl-2013-Are School-of-thought Words Characterizable?

8 0.56983459 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

9 0.54895884 220 acl-2013-Learning Latent Personas of Film Characters

10 0.54888237 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

11 0.54259729 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

12 0.53150958 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

13 0.53092754 126 acl-2013-Diverse Keyword Extraction from Conversations

14 0.52190167 142 acl-2013-Evolutionary Hierarchical Dirichlet Process for Timeline Summarization

15 0.51824373 14 acl-2013-A Novel Classifier Based on Quantum Computation

16 0.505202 237 acl-2013-Margin-based Decomposed Amortized Inference

17 0.50476384 224 acl-2013-Learning to Extract International Relations from Political Context

18 0.50454414 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

19 0.50252813 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression

20 0.49847353 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.094), (6, 0.029), (11, 0.053), (16, 0.256), (24, 0.056), (26, 0.047), (28, 0.029), (35, 0.078), (42, 0.027), (48, 0.07), (64, 0.012), (70, 0.071), (88, 0.034), (90, 0.019), (95, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78209698 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

Author: Jun Zhu ; Xun Zheng ; Bo Zhang

2 0.76616037 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

Author: Miguel Almeida ; Andre Martins

Abstract: We present a dual decomposition framework for multi-document summarization, using a model that jointly extracts and compresses sentences. Compared with previous work based on integer linear programming, our approach does not require external solvers, is significantly faster, and is modular in the three qualities a summary should have: conciseness, informativeness, and grammaticality. In addition, we propose a multi-task learning framework to take advantage of existing data for extractive summarization and sentence compression. Experiments in the TAC2008 dataset yield the highest published ROUGE scores to date, with runtimes that rival those of extractive summarizers.

3 0.71410871 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

Author: Wenbin Jiang ; Meng Sun ; Yajuan Lu ; Yating Yang ; Qun Liu

Abstract: Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing. In this paper we propose a discriminative learning algorithm to take advantage of the linguistic knowledge in large amounts of natural annotations on the Internet. It utilizes the Internet as an external corpus with massive (although slight and sparse) natural annotations, and enables a classifier to evolve on the large-scaled and real-time updated web text. With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese Wikipedia achieves significant improvement on a series of testing sets from different domains, even with a single classifier and local features.

4 0.67438787 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

Author: Longkai Zhang ; Li Li ; Zhengyan He ; Houfeng Wang ; Ni Sun

Abstract: Micro-blog is a new kind of medium which is short and informal. While no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools do not perform as well as they do on ordinary news texts. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog. In our approach, we incorporate punctuation information of unlabeled micro-blog data by introducing characters behind or ahead of punctuations, for they indicate the beginning or end of words. Meanwhile a self-training framework to incorporate confident instances is also used, which proves to be helpful. Experiments on micro-blog data show that our approach improves performance, especially in OOV-recall.

1 INTRODUCTION

Micro-blog (also known as tweets in English) is a new kind of broadcast medium in the form of blogging. A micro-blog differs from a traditional blog in that it is typically smaller in size. Furthermore, texts in micro-blogs tend to be informal and new words occur more frequently. These new features of micro-blogs make the Chinese Word Segmentation (CWS) models trained on the source domain, such as news corpus, fail to perform equally well when transferred to texts from micro-blogs. For example, the most widely used Chinese segmenter "ICTCLAS" yields 0.95 f-score on news corpus, but only gets 0.82 f-score on micro-blog data. The poor segmentation results will hurt subsequent analysis on micro-blog text. Manually labeling the texts of micro-blog is time consuming. Luckily, punctuations provide useful information because they are used as indicators of the end of the previous sentence and the beginning of the next one, which also indicate the start and the end of a word. These "natural boundaries" appear so frequently in micro-blog texts that we can easily make good use of them. TABLE 1 shows some statistics of the news corpus vs. the micro-blogs.
Besides, English letters and digits are also more frequent than in the news corpus. They all are natural delimiters of Chinese characters and we treat them just the same as punctuations. We propose a method to enlarge the training corpus by using punctuation information. We build a semi-supervised learning (SSL) framework which can iteratively incorporate newly labeled instances from unlabeled micro-blog data during the training process. We test our method on micro-blog texts and experiments show good results. This paper is organized as follows. In section 1 we introduce the problem. Section 2 gives a detailed description of our approach. We show the experiment and analyze the results in section 3. Section 4 gives the related works and in section 5 we conclude the whole work.

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 177–182, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics

Table 1: Percentage of Chinese, English, number, punctuation in the news corpus vs. the micro-blogs.

2 Our method

2.1 Punctuations

The Chinese word segmentation problem might be treated as a character labeling problem which gives each character a label indicating its position in one word. To be simple, one can use label 'B' to indicate a character is the beginning of a word, and use 'N' to indicate a character is not the beginning of a word. We also use this 2-tag set in our work. Other tag sets like the 'BIES' tag set are not suitable because the punctuation information cannot decide whether a character after punctuation should be labeled as 'B' or 'S' (word with Single character). Punctuations can serve as implicit labels for the characters before and after them. The character right after punctuations must be the first character of a word, meanwhile the character right before punctuations must be the last character of a word.
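The 'BN' labeling and the implicit labels supplied by punctuation can be illustrated with a short sketch. This is not the authors' implementation: the function names and the `PUNCTUATION` set are hypothetical, and treating the first character of a text as an implicit boundary is an assumption of this example. The sample sentence is the one used in the paper's Table 2.

```python
# Hypothetical punctuation inventory; a real system would use a fuller set.
PUNCTUATION = set("，。！？；：、,.!?;:")


def bn_tags(words):
    """Tag each character 'B' (begins a word) or 'N' (does not).

    words: a segmented sentence given as a list of non-empty strings.
    """
    return "".join("B" + "N" * (len(word) - 1) for word in words)


def implicit_b_positions(text):
    """Indices of characters known to begin a word without any annotation.

    A non-punctuation character qualifies when it starts the text
    (an assumption of this sketch) or directly follows a punctuation mark.
    """
    positions = []
    for i, ch in enumerate(text):
        if ch in PUNCTUATION:
            continue
        if i == 0 or text[i - 1] in PUNCTUATION:
            positions.append(i)
    return positions


segmented = ["评论", "是", "风格", "，", "评论", "是", "能力", "。"]
gold_tags = bn_tags(segmented)
free_labels = implicit_b_positions("".join(segmented))
```

Only the positions returned by `implicit_b_positions` come for free from raw text; every other label in `gold_tags` requires a segmenter or a human annotator, which is why the approach harvests only 'B' instances from unlabeled data.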
An example is given in TABLE 2.

2.2 Algorithm

Our algorithm "ADD-N" is shown in TABLE 3. The initially selected character instances are those right after punctuations. By definition they are all labeled with 'B'. In this case, the number of training instances with label 'B' is increased while the number with label 'N' remains unchanged. Because of this, the model trained on this unbalanced corpus tends to be biased. This problem can become even worse when there is an inexhaustible supply of texts from the target domain. We assume that the labeled corpus of the source domain can be treated as a balanced reflection of different labels. Therefore we choose to estimate the balanced point by counting characters labeled 'B' and 'N' and calculating the ratio, which we denote as η. We assume the enlarged corpus is also balanced if and only if the ratio of 'B' to 'N' is just the same as η of the source domain. Our algorithm uses data from the source domain to make the labels balanced. When enlarging the corpus using characters behind punctuations from texts in the target domain, only characters labeled 'B' are added. We randomly reuse some characters labeled 'N' from the labeled data until ratio η is reached. We do not use characters ahead of punctuations, because the single-character words ahead of punctuations take the label of 'B' instead of 'N'. In summary our algorithm tackles the problem by duplicating labeled data in the source domain. We denote our algorithm as "ADD-N". Our baseline feature templates include the features described in previous works (Sun and Xu, 2011; Sun et al., 2012). Our algorithm is not necessarily limited to a specific tagger. For simplicity and reliability, we use a simple Maximum Entropy tagger.

3 Experiment

3.1 Data set

We evaluate our method using the data from weibo.com, which is the biggest micro-blog service in China. We use the API provided by weibo.com¹ to crawl 500,000 micro-blog texts of weibo.com, which contain 24,243,772 characters.
To keep the experiment tractable, we first randomly choose 50,000 of all the texts as unlabeled data, which contain 2,420,037 characters. We manually segment 2038 randomly selected micro-blogs. We follow the segmentation standard of the PKU corpus. In micro-blog texts, the user names and URLs have fixed format. User names start with '@', followed by Chinese characters, English letters, numbers and ’ ’, and terminated when meeting punctuations or blanks. URLs also match fixed patterns, which are shortened using "http://t.cn/" plus six random English letters or numbers. Thus user names and URLs can be pre-processed separately. We follow this principle in the following experiments. We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff² as the labeled data. We choose the PKU data in our experiment because our baseline methods use the same segmentation standard. We compare our method with three baseline methods. The first two are both famous Chinese word segmentation tools: ICTCLAS³ and Stanford Chinese word segmenter⁴, which are widely used in NLP related to word segmentation. Stanford Chinese word segmenter is a CRF-based segmentation tool and its segmentation standard is chosen as the PKU standard, which is the same as ours. ICTCLAS, on the other hand, is an HMM-based Chinese word segmenter. Another baseline is Li and Sun (2009), which also uses punctuation in their semi-supervised framework.

¹ http://open.weibo.com/wiki
² http://www.sighan.org/bakeoff2005/
³ http : / / i c l .org/ ct as
⁴ http://nlp.stanford.edu/projects/chinese-nlp.shtml#cws

评 论 是 风 格 ， 评 论 是 能 力 。
B  -  -  -  -  -  B  -  -  -  -  -
B  N  B  B  N  B  B  N  B  B  N  B

Table 2: The first line represents the original text. The second line indicates whether each character is the Beginning of sentence. The third line is the tag sequence using "BN" tag set.

F-score
is used as the accuracy measure. The recall of out-of-vocabulary (OOV) words is also taken into consideration, which measures the ability of the model to correctly segment out-of-vocabulary words.

3.2 Main results

methods on the development data. TABLE 4 summarizes the segmentation results. In TABLE 4, Li-Sun is the method in Li and Sun (2009). Maxent only uses the PKU data for training, with neither punctuation information nor self-training framework incorporated. The next 4 methods all require 100 iterations of self-training. No-punc is the method that only uses self-training while no punctuation information is added. No-balance is similar to ADD-N. The only difference between No-balance and ADD-N is that the former does not balance label 'B' and label 'N'. The comparison of Maxent and No-punc shows that naively adding confident unlabeled instances is not guaranteed to improve performance. The writing style and word formation of the source domain are different from those of the target domain. When segmenting texts of the target domain using models trained on the source domain, the performance will be hurt as more falsely segmented instances are added into the training set. The comparison of Maxent, No-balance and ADD-N shows that considering punctuation as well as self-training does improve performance. Both the f-score and OOV-recall increase. By comparing No-balance and ADD-N alone we find that we achieve a relatively high f-score if we ignore the tag balance issue, while slightly hurting the OOV-Recall. However, considering it will improve OOV-Recall by about +1.6% and the f-score by +0.2%. We also experimented on different sizes of unlabeled data to evaluate the performance when adding unlabeled target domain data. TABLE 5 shows different f-scores and OOV-Recalls on different unlabeled data sets. We note that when the number of texts changes from 0 to 50,000, the f-score and OOV-recall are both improved.
However, when the unlabeled data grows to 200,000 texts, the performance decreases slightly, while still remaining better than not using unlabeled data. This result comes from the fact that the method 'ADD-N' only uses characters behind punctuations from the target domain. Taking more texts into consideration means selecting more characters labeled 'N' from the source domain to simulate those in the target domain. If too many 'N's are introduced, the training data will be biased against the true distribution of the target domain.

Table 5: Segmentation performance with different sizes of unlabeled data. Columns: Size, P, R, F, OOV-R.

3.3 Characters ahead of punctuations

In the "BN" tagging method mentioned above, we incorporate characters after punctuations from micro-blog texts to enlarge the training set. We also try the opposite approach, an "EN" tag set, which uses 'E' to represent "End of word" and 'N' to represent "Not the end of word". In this contrasting method, we only use characters just ahead of punctuations. We find that the two methods show similar results. Experimental results with ADD-N are shown in Table 6.

Table 6: Comparison of BN and EN. Columns: unlabeled data size; F and OOV-R for the "BN" tag; F and OOV-R for the "EN" tag.

4 Related Work

Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003). These supervised methods show good results but are unable to incorporate information from a new domain, where the OOV problem is a big challenge for the research community. On the other hand, unsupervised word segmentation (Peng and Schuurmans, 2001; Goldwater et al., 2006; Jin and Tanaka-Ishii, 2006; Feng et al., 2004; Maosong et al., 1998) takes advantage of the huge amount of raw text to solve Chinese word segmentation problems. However, unsupervised methods are usually less accurate and more complicated than supervised ones. Meanwhile, semi-supervised methods have been applied to NLP applications.
Bickel et al. (2007) learn a scaling factor from source-domain data and use the distribution to resemble the target domain distribution. Wu et al. (2009) use a domain adaptive bootstrapping (DAB) framework, which shows good results on named entity recognition. Similar semi-supervised applications include Shen et al. (2004); Daumé III and Marcu (2006); Jiang and Zhai (2007); Weinberger et al. (2006). Besides, Sun and Xu (2011) use a sequence labeling framework in which unsupervised statistics are used as discrete features, which prove to be effective in Chinese word segmentation. There are previous works that use punctuation as implicit annotations. Riley (1989) uses it in sentence boundary detection. Li and Sun (2009) proposed a compromise solution that uses a classifier to select the most confident characters. We do not follow this approach because the initial errors will dramatically harm the performance. Instead, we only add to our training set the characters after punctuations, which are sure to be the beginning of words (and are therefore labeled 'B'). Sun and Xu (2011) use punctuation information as a discrete feature in a sequence labeling framework, which shows improvement over the pure sequence labeling approach. Our method differs from theirs in that we use characters after punctuations directly.

5 Conclusion

In this paper we have presented an effective yet simple approach to Chinese word segmentation on micro-blog texts. In our approach, punctuation information of unlabeled micro-blog data is used, as well as a self-training framework to incorporate confident instances. Experiments show that our approach improves performance, especially in OOV-recall. Both the punctuation information and the self-training phase contribute to this improvement.

Acknowledgments

This research was partly supported by National High Technology Research and Development Program of China (863 Program) (No.
2012AA011101), National Natural Science Foundation of China (No. 91024009) and Major National Social Science Fund of China (No. 12&ZD227).

References

Bickel, S., Brückner, M., and Scheffer, T. (2007). Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pages 81–88. ACM.
Chen, W., Zhang, Y., and Isahara, H. (2006). Chinese named entity recognition with conditional random fields. In 5th SIGHAN Workshop on Chinese Language Processing, Australia.
Daumé III, H. and Marcu, D. (2006). Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26(1):101–126.
Feng, H., Chen, K., Deng, X., and Zheng, W. (2004). Accessor variety criteria for chinese word extraction. Computational Linguistics, 30(1):75–93.
Goldwater, S., Griffiths, T., and Johnson, M. (2006). Contextual dependencies in unsupervised word segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 673–680. Association for Computational Linguistics.
Jiang, J. and Zhai, C. (2007). Instance weighting for domain adaptation in nlp. In Annual Meeting-Association For Computational Linguistics, volume 45, page 264.
Jin, Z. and Tanaka-Ishii, K. (2006). Unsupervised segmentation of chinese text by use of branching entropy. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 428–435. Association for Computational Linguistics.
Li, Z. and Sun, M. (2009). Punctuation as implicit annotations for chinese word segmentation. Computational Linguistics, 35(4):505–512.
Low, J., Ng, H., and Guo, W. (2005). A maximum entropy approach to chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, volume 164. Jeju Island, Korea.
Maosong, S., Dayang, S., and Tsou, B. (1998).
Chinese word segmentation without using lexicon and hand-crafted training data. In Proceedings of the 17th international conference on Computational linguistics-Volume 2, pages 1265–1271. Association for Computational Linguistics.
Pan, S. and Yang, Q. (2010). A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on, 22(10):1345–1359.
Peng, F. and Schuurmans, D. (2001). Self-supervised chinese word segmentation. Advances in Intelligent Data Analysis, pages 238–247.
Riley, M. (1989). Some applications of tree-based modelling to speech and language. In Proceedings of the workshop on Speech and Natural Language, pages 339–352. Association for Computational Linguistics.
Shen, D., Zhang, J., Su, J., Zhou, G., and Tan, C. (2004). Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 589. Association for Computational Linguistics.
Sun, W. and Xu, J. (2011). Enhancing chinese word segmentation using unlabeled data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 970–979. Association for Computational Linguistics.
Sun, X., Wang, H., and Li, W. (2012). Fast online training with frequency-adaptive learning rates for chinese word segmentation and new word detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 253–262, Jeju Island, Korea. Association for Computational Linguistics.
Weinberger, K., Blitzer, J., and Saul, L. (2006). Distance metric learning for large margin nearest neighbor classification. In NIPS. Citeseer.
Wu, D., Lee, W., Ye, N., and Chieu, H. (2009). Domain adaptive bootstrapping for named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pages 1523–1532. Association for Computational Linguistics.
Xue, N. (2003).
Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1):29–48.
Zhao, H., Huang, C., and Li, M. (2006a). An improved chinese word segmentation system with conditional random field. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, volume 117. Sydney: July.
Zhao, H., Huang, C., Li, M., and Lu, B. (2006b). Effective tag set selection in chinese word segmentation via conditional random field modeling. In Proceedings of PACLIC, volume 20, pages 87–94.

5 0.57521564 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

Author: Xiaodong Zeng ; Derek F. Wong ; Lidia S. Chao ; Isabel Trancoso

Abstract: This paper presents a semi-supervised Chinese word segmentation (CWS) approach that co-regularizes character-based and word-based models. Similarly to multi-view learning, the “segmentation agreements” between the two different types of view are used to overcome the scarcity of the label information on unlabeled data. The proposed approach trains a character-based and word-based model on labeled data, respectively, as the initial models. Then, the two models are constantly updated using unlabeled examples, where the learning objective is maximizing their segmentation agreements. The agreements are regarded as a set of valuable constraints for regularizing the learning of both models on unlabeled data. The segmentation for an input sentence is decoded by using a joint scoring function combining the two induced models. The evaluation on the Chinese tree bank reveals that our model results in better gains over the state-of-the-art semi-supervised models reported in the literature.

6 0.56751847 249 acl-2013-Models of Semantic Representation with Visual Attributes

7 0.55979598 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments

8 0.55726874 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

9 0.5570429 85 acl-2013-Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis

10 0.55604744 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering

11 0.55469948 380 acl-2013-VSEM: An open library for visual semantics representation

12 0.5542872 237 acl-2013-Margin-based Decomposed Amortized Inference

13 0.55322874 107 acl-2013-Deceptive Answer Prediction with User Preference Graph

14 0.55249137 275 acl-2013-Parsing with Compositional Vector Grammars

15 0.55103308 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals

16 0.55037522 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

17 0.5491333 224 acl-2013-Learning to Extract International Relations from Political Context

18 0.54810053 17 acl-2013-A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation

19 0.54639459 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

20 0.54577792 318 acl-2013-Sentiment Relevance