acl acl2013 acl2013-54 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xiaorui Jiang ; Xiaoping Sun ; Hai Zhuge
Abstract: School of thought analysis is an important yet not-well-elaborated scientific knowledge discovery task. This paper makes the first attempt at this problem. We focus on one aspect of the problem: do characteristic school-of-thought words exist and whether they are characterizable? To answer these questions, we propose a probabilistic generative School-Of-Thought (SOT) model to simulate the scientific authoring process based on several assumptions. SOT defines a school of thought as a distribution of topics and assumes that authors determine the school of thought for each sentence before choosing words to deliver scientific ideas. SOT distinguishes between two types of school-of-thought words for either the general background of a school of thought or the original ideas each paper contributes to its school of thought. Narrative and quantitative experiments show positive and promising results to the questions raised above.
Reference: text
sentIndex sentText sentNum sentScore
1 Xiaorui Jiang¶1 Xiaoping Sun¶2 Hai Zhuge¶†‡3* ¶ Key Lab of Intelligent Information Processing, Institute of Computing Technology, CAS, Beijing, China † Nanjing University of Posts and Telecommunications, Nanjing, China ‡ Aston University, Birmingham, UK 1 xiaoruijiang@gmail.com [sent-2, score-0.042]
2 Abstract School of thought analysis is an important yet not-well-elaborated scientific knowledge discovery task. [sent-8, score-0.659]
3 To answer these questions, we propose a probabilistic generative School-Of-Thought (SOT) model to simulate the scientific authoring process based on several assumptions. [sent-11, score-0.322]
4 SOT defines a school of thought as a distribution of topics and assumes that authors determine the school of thought for each sentence before choosing words to deliver scientific ideas. [sent-12, score-1.668]
5 SOT distinguishes between two types of school-of-thought words for either the general background of a school of thought or the original ideas each paper contributes to its school of thought. [sent-13, score-1.123]
6 1 Introduction With more powerful computational analysis tools, researchers are now devoting efforts to establishing a “science of better science” by analyzing the ecosystem of scientific discovery (Goth, 2012). [sent-15, score-0.231]
7 As part of this ambition, school of thought analysis has been identified as an important fine-grained scientific knowledge discovery task. [sent-16, score-0.893]
8 As mentioned by Teufel (2010), it is important for an experienced scientist to know which papers belong to which school of thought (or technical route) through years of knowledge accumulation. [sent-17, score-0.742]
9 Schools of thought typically emerge with the evolution of a research domain or scientific topic. [sent-18, score-0.693]
10 Take reachability indexing for example, which we will repeatedly turn to later: there are two schools of thought, the cover-based (since about 1990) and the hop-based (since the beginning of the 2000s) methods. [sent-23, score-0.631]
11 Most of the following works belong to either school of thought and thus two streams of innovative ideas emerge. [sent-24, score-0.882]
12 Two chains of subsequentially published papers represent two schools of thought of the reachability indexing domain. [sent-26, score-1.139]
13 However, it is not easy to gain this knowledge about school of thought. [sent-28, score-0.234]
14 Current citation indexing services are not very helpful for this kind of knowledge discovery task. [sent-29, score-0.212]
15 As explained in Figure 1, papers of different schools of thought cite each other heavily and form a rather dense citation graph. [sent-30, score-0.888]
16 An extreme example is p14, which cites more hop-based papers than its own school of thought. [sent-31, score-0.314]
17 If current citation indexing services were equipped with school of thought knowledge, they would greatly help scientists, especially novice researchers, to quickly grasp the core ideas of a scientific domain and to find their own way to innovate (Upham et al. [sent-32, score-1.243]
18 School of thought analysis is also useful for knowledge [sent-34, score-0.428]
19 , 2010) and scientific paradigm summarization (Hoang and Kan, 2010; Qazvinian et al. [sent-37, score-0.202]
20 This paper makes the first attempt at unsupervised school of thought analysis. [sent-39, score-0.662]
21 Three main aspects of school of thought analysis can be identified: determining the number of schools of thought, characterizing school-of-thought words and categorizing papers into one or several school(s) of thought (if applicable). [sent-40, score-1.448]
22 To answer these questions, we propose the probabilistic generative School-Of-Thought model (SOT for short) based on the following assumptions on the scientific authoring process. [sent-43, score-0.32]
23 The co-occurrence patterns are useful for revealing which words and sentences are school-of-thought words and which schools of thought they describe. [sent-45, score-0.706]
24 Take reachability indexing for example, hop-based papers try to get the “optimum labeling” by finding the “densest intermediate hops” to encode reachability information captured by an intermediate data structure called “transitive closure contour”. [sent-46, score-0.797]
25 To accomplish this, they solve the “densest subgraph problem” on specifically created “bipartite” graphs centered at “hops” by transforming the problem into an equivalent “minimum set cover” framework. [sent-47, score-0.055]
26 In cover-based methods, however, one or several “spanning tree(s)” are extracted and “(multiple) intervals” are assigned to each node as reachability labels by “pre-(order)” and “post-order traversals”. [sent-49, score-0.272]
27 Meanwhile, graph theory terminologies like “root”, “child” and “ancestor” etc. [sent-50, score-0.031]
28 Before writing a sentence to deliver their ideas, the authors need to determine which school of thought this sentence is to portray. [sent-53, score-0.745]
29 The one-sot-per-sentence assumption does not mean that authors intentionally write this way, but only simulates the outcome of the scientific paper organization. [sent-55, score-0.268]
30 Investigations into scientific writing reveal that sentences of different schools of thought can occur anywhere and are often interleaved. [sent-56, score-0.908]
31 This is because authors of a scientific paper not only contribute to the school of thought they follow but also discuss different schools of thought. [sent-57, score-1.176]
32 For example, in the Method part, the authors may turn to discuss another paper (possibly of a different school of thought) for comparison. [sent-58, score-0.268]
33 Besides, citation sentences often acknowledge related works of different schools of thought. [sent-60, score-0.408]
34 All the papers of a domain talk about the general domain backgrounds. [sent-62, score-0.14]
35 For example, reachability indexing aims to build “compact indices” for facilitating “reachability queries” between “source” and “target nodes”. [sent-63, score-0.353]
36 Other background words include “(complete) transitive closure”, “index size” and “reach” etc. [sent-64, score-0.093]
37 , as well as classical graph theory terminologies like “predecessors” and “successors” etc. [sent-65, score-0.031]
38 Besides contributing original ideas, papers of the same school of thought typically need to follow some general strategies that make them fall into the same school of thought. [sent-67, score-0.976]
39 For example, all the hop-based methods follow the general ideas of designing approximate algorithms for choosing good hops, while the original ideas of each paper lead to different labeling algorithms. [sent-68, score-0.332]
40 Scientific readers pay attention to the original ideas of each paper as well as the general ideas of each school of thought. [sent-69, score-0.566]
41 This assumes that a word can be either a generality word or an originality word, delivering the general or the original ideas of a school of thought respectively. [sent-70, score-1.05]
42 The plate notation follows Bishop (2006), where a shaded circle means an observed variable (in this context, word occurrence in text), a white circle denotes either a latent variable or a model parameter, and a small solid dot represents a hyperparameter of the corresponding model parameter. [sent-74, score-0.11]
43 The generative scientific authoring process illustrated in Figure 2 is elaborated as follows. [sent-75, score-0.287]
44 To simulate the one-sot-per-sentence assumption, we introduce a latent school-of-thought assignment variable c_{d,s} (1 ≤ c_{d,s} ≤ C, where C is the number of schools of thought) for each sentence s in paper d, on which the topic assignment and word occurrence variables depend. [sent-80, score-0.546]
45 As different papers and their authors have different foci, flavors and writing styles, it is appropriate to assume that each paper d has its own Dirichlet distribution over schools of thought π_d^c ~ Dir(α^c) (refer to Heinrich (2008) for Dirichlet analysis of texts). [sent-81, score-0.82]
46 Before choosing a word w_{d,s,n} to deliver scientific ideas, the authors first need to determine whether this word describes the domain backgrounds or depicts a specific school of thought. [sent-85, score-0.378]
47 This information is indicated by the latent background word indicator variable b_{d,s,n} ~ Bern(π_d^b), where π_d^b ~ Beta(α_0^b, α_1^b) is the probability of the Bernoulli test. [sent-86, score-0.145]
48 b_{d,s,n} = 1 means w_{d,s,n} is a background word. [sent-87, score-0.061]
49 Then the authors need to determine whether w_{d,s,n} talks about the general ideas of a certain school of thought (i.e. [sent-93, score-0.896]
50 a generality word when o_{d,s,n} = 0) or delivers original contributions to the specific school of thought (i.e. [sent-95, score-0.689]
51 The latent originality indicator variable o_{d,s,n} is assigned in a similar way to b_{d,s,n}. [sent-98, score-0.23]
52 SOT regards schools of thought and topics as two different levels of semantic information. [sent-101, score-0.765]
53 A school of thought is modeled as a distribution of topics discussed by the papers of a research do- main. [sent-102, score-0.801]
54 Each topic in turn is defined as a distribution of the topical words. [sent-103, score-0.095]
55 Reflected in Figure 2, θ_c^g and θ_c^o are Dirichlet distributions of generality and originality topics respectively, with γ^g and γ^o being the Dirichlet priors. [sent-104, score-0.205]
56 According to the assignment of the originality indicator, the topic t_{d,s,n} of the current token is multinomially selected from either θ_c^g (o_{d,s,n} = 0) or θ_c^o (o_{d,s,n} = 1). [sent-105, score-0.292]
57 After that, a word w_{d,s,n} is multinomially emitted from the topical word distribution ϕ^{tp}_{t_{d,s,n}}, where ϕ^{tp}_t ~ Dir(β^{tp}) for each 1 ≤ t ≤ T. [sent-106, score-0.088]
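To make the generative story above concrete, here is a minimal simulation sketch in Python. The dimensions, hyperparameter values, and function names are illustrative assumptions, not the authors' implementation; variable names loosely mirror the paper's notation (π_d^c, π_d^b, π_d^o, θ^g, θ^o, ϕ^tp).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and symmetric hyperparameters (illustrative assumptions).
C, T, V = 2, 10, 500          # schools of thought, topics, vocabulary size
alpha_c = 1.0                 # prior over schools of thought
alpha_b = (1.0, 1.0)          # Beta prior for the background indicator
alpha_o = (1.0, 1.0)          # Beta prior for the originality indicator
gamma, beta = 0.1, 0.01       # topic and word-distribution priors

theta_g = rng.dirichlet([gamma] * T, size=C)  # generality topics per school
theta_o = rng.dirichlet([gamma] * T, size=C)  # originality topics per school
phi_tp = rng.dirichlet([beta] * V, size=T)    # topical word emissions
phi_bg = rng.dirichlet([beta] * V)            # background word distribution

def generate_paper(n_sents=5, n_words=15):
    """One-sot-per-sentence generative story for a single paper d."""
    pi_c = rng.dirichlet([alpha_c] * C)       # paper's school-of-thought mix
    pi_b = rng.beta(*alpha_b)                 # P(token is a background word)
    pi_o = rng.beta(*alpha_o)                 # P(token is an originality word)
    paper = []
    for _ in range(n_sents):
        c = rng.choice(C, p=pi_c)             # school of thought of the sentence
        words = []
        for _ in range(n_words):
            if rng.random() < pi_b:           # b = 1: domain-background word
                w = rng.choice(V, p=phi_bg)
            else:                             # b = 0: school-of-thought word
                originality = rng.random() < pi_o
                theta = theta_o[c] if originality else theta_g[c]
                t = rng.choice(T, p=theta)    # topic assignment for the token
                w = rng.choice(V, p=phi_tp[t])
            words.append(int(w))
        paper.append((int(c), words))
    return paper
```

Note how the one-sot-per-sentence assumption shows up as a single draw of c per sentence, while the background and originality indicators are drawn per token.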
58 Each data set consists of several dozens of papers of the same domain. [sent-111, score-0.08]
59 We extracted texts from the collected papers and removed tables, figures and sentences full of math equations or unrecognizable symbols. [sent-115, score-0.08]
60 The gold-standard number and the classification of schools of thought reflect not only the viewpoints of the survey authors but also the consensus of the corresponding research communities. [sent-117, score-0.312]
61 2 Qualitative Results This section looks at the capabilities of SOT in learning background and school-of-thought words using the RE data set as an example. [sent-119, score-0.061]
62 Given the estimated model parameters, the distributions of the school-of-thought words of SOT can be calculated as weighted sums of topical word emission probabilities (ϕ^{tp}_{t,w} for each word w) over all the topics (t) and papers (d), as in Eq. [sent-120, score-0.131]
63 p(w | c, o = 0/1) = Σ_d (N_d(w) / N_d) · π_d^{o,0/1} · Σ_t θ^{g/o}_{c,t} · ϕ^{tp}_{t,w} (1) The first row of Table 2 lists the top-60 background and school-of-thought words learned by SOT for the RE data set, sorted in descending order of their probabilities column by column. [sent-122, score-0.141]
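The weighted sum in Eq. (1) can be computed directly from the estimated parameters. The sketch below assumes the equation reconstruction given above, with made-up array shapes; it is not the authors' code, and `vocab` in the usage comment is a hypothetical word list.

```python
import numpy as np

def sot_word_distribution(freq, pi_o, theta, phi, c, o):
    """p(w | c, o) as a weighted sum of topical word emissions (Eq. 1).

    freq  : (D, V) relative frequencies N_d(w) / N_d per paper
    pi_o  : (D, 2) per-paper generality (col 0) / originality (col 1) weights
    theta : (C, 2, T) school-to-topic distributions (theta^g, theta^o)
    phi   : (T, V) topical word emission probabilities phi^tp
    """
    topic_part = theta[c, o] @ phi                  # sum_t theta_{c,t} phi_{t,w}
    doc_part = (pi_o[:, o][:, None] * freq).sum(0)  # sum_d pi_d^o N_d(w)/N_d
    p = doc_part * topic_part
    return p / p.sum()

# Top-60 school-of-thought words for school c, as in Table 2:
#   [vocab[i] for i in np.argsort(-p)[:60]]
```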
64 As the data sets are relatively small, it is not appropriate to set T too large, otherwise most of the topics are meaningless or duplicated. [sent-125, score-0.09]
65 Either case will impose additional negative influences on the usefulness of the model, for example when applied to school-of-thought clustering in the next section. [sent-126, score-0.736]
66 C is set to the gold-standard number of schools of thought as in this study we are mainly interested in whether school-of-thought words are characterizable. [sent-127, score-0.706]
67 The problems of identifying the existence and number of schools of thought are left to future work. [sent-128, score-0.706]
68 For domain backgrounds, reachability indexing is a classical problem of the graph database “domain” which talks about the reachability between the “source” and “destination nodes” on a “graph”. [sent-132, score-0.689]
69 Cover-based ones conform well to the assumptions in Sect. [sent-136, score-0.033]
70 To accomplish this, hop-based methods define a “densest subgraph problem” on a “bipartite” graph, transform it to an equivalent “set cover” problem, and then apply “greedy” algorithms based on several “heuristics” to find “approximate” solutions. [sent-140, score-0.055]
71 “contour” is used by hop-based methods as a concise representation of the remaining to-be-encoded reachability information. [sent-142, score-0.272]
72 3 Quantitative Results To see the usefulness of school-of-thought words, we use the SOT model as a way to feature space reduction for a more precise text representation in the school-of-thought clustering task. [sent-147, score-0.062]
73 (2009), output the best clustering by the minimum residual squared sum principle. [sent-152, score-0.03]
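A sketch of this pipeline, assuming the school-of-thought word probabilities from Eq. (1) are thresholded to prune the feature space before k-means; the threshold value, restart count, and scikit-learn usage are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_by_sot_words(X, word_scores, n_schools, threshold=1e-4, restarts=20):
    """Feature-space reduction by school-of-thought word scores, then k-means.

    X           : (D, V) bag-of-words matrix of the papers
    word_scores : (V,) e.g. max over (c, o) of p(w | c, o) from Eq. (1)
    The best run is kept by the minimum residual sum of squares (inertia).
    """
    X_red = X[:, word_scores > threshold]       # keep only SOT-indicative words
    runs = [KMeans(n_clusters=n_schools, n_init=1, random_state=r).fit(X_red)
            for r in range(restarts)]
    best = min(runs, key=lambda km: km.inertia_)
    return best.labels_
```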
74 Two baselines are the “RAW” method without dimension reduction and LDA-based (Blei et al. [sent-153, score-0.032]
75 In the parentheses are the corresponding threshold values under which the reported clustering result is obtained. [sent-156, score-0.03]
76 Compared to the baselines, SOT consistently achieves the best clustering quality. [sent-158, score-0.03]
77 80 typically means LDA is much less efficient in feature reduction than SOT on these two data sets. [sent-165, score-0.032]
78 School-of-thought clustering results. Related Work: An early work in semantic analysis of scientific articles is Griffiths and Steyvers (2004), which focused on efficient browsing of large literature collections based on scientific topics. [sent-203, score-0.434]
79 Other related research includes topic-based reviewer assignment (Mimno and McCallum, 2007), citation influence estimation (Dietz et al. [sent-204, score-0.165]
80 Another line of research is the joint modeling of topics and other types of semantic units such as perspectives (Lin et al. [sent-208, score-0.059]
81 TAM simultaneously models aspects and topics with different assumptions from SOT, and it models purely at the word level. [sent-214, score-0.092]
82 Studies that introduce an explicit background distribution include Chemudugunta et al. [sent-215, score-0.061]
83 Different from these works, SOT assumes that not only some “meaningless” general-purpose words but also more meaningful words about the specific domain backgrounds can be learned. [sent-218, score-0.093]
84 However, it is very useful to regard the sentence as the basic processing unit, for example in the text scanning approach simulating the human reading process by Xu and Zhuge (2013). [sent-220, score-0.07]
85 Indeed, sentence-level school of thought assignment is crucial to SOT as it allows SOT to model the scientific authoring process. [sent-221, score-1.012]
86 There are also other works that model text semantics at levels other than words or tokens, such as Wallach (2006) on n-grams and Titov and McDonald (2008) on words within multinomially sampled sliding windows. [sent-222, score-0.07]
87 The latter also distinguishes between different levels of topics, say global versus local topics, while in SOT the distinction is between generality and originality topics. [sent-223, score-0.231]
88 In SOT, a school of thought is modeled as a distribution of topics, with the latter defined as a distribution of topical words. [sent-225, score-0.708]
89 School of thought assignment to each sentence is vital as it allows SOT to simulate the scientific authoring process in which each sentence conveys a piece of idea contributed to a certain school of thought as well as the domain backgrounds. [sent-226, score-1.505]
90 Modeling general and specific aspects of documents with a probabilistic topic model. [sent-237, score-0.049]
91 Positioning knowledge: schools of thought and new knowledge creation. [sent-424, score-0.706]
92 A text scanning mechanism simulating human reading process, In Proc. [sent-434, score-0.07]
93 B Gibbs Sampling of the SOT Model Using collapsed Gibbs sampling (Griffiths and Steyvers, 2004), the latent variables are inferred as in Eq. [sent-492, score-0.058]
94 is the number of words of topic t describing the common ideas (o = 0) or original ideas (o = 1) of school of thought c. [sent-497, score-1.073]
95 N_{¬(d,s)}(d, c) counts the number of sentences in paper d describing school of thought c, with sentence s removed from consideration. [sent-499, score-0.692]
96 For example, N_{c,b,o,t}(c, 0, o, Σ) = Σ_{t=1..T} N_{c,b,o,t}(c, 0, o, t). Latent variables b, o and t are jointly sampled, without counting the n-th token in sentence s of paper d. [sent-502, score-0.069]
97 N_{b,¬(d,s,n)}(0, t, v) is the number of school-of-thought words of topic t which are instantiated by vocabulary item v in the literature collection, without counting the n-th token in sentence s of paper d. [sent-503, score-0.118]
98 N_{¬(d,s,n)}(d, b) counts the number of background (b = 0) or school-of-thought (b = 1) words in document d, without counting the n-th token in sentence s. [sent-506, score-0.333]
99 N_{b,¬(d,s,n)}(1, v) is the number of times vocabulary item v occurs as a background word in the literature collection, without counting the n-th token in sentence s of paper d. [sent-507, score-0.13]
100 N_{¬(d,s,n)}(d, 0, o) is the number of words describing either the common ideas (o = 0) or original ideas (o = 1) of some school of thought, without considering the n-th token in sentence s of paper d. [sent-508, score-1.261]
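The bookkeeping behind these counters can be made explicit. The fragment below is a schematic of one collapsed Gibbs step for a single token, with hypothetical `state`/`counts` containers and a placeholder sampler standing in for the full conditional p(b, o, t | rest), which the appendix defines via the smoothed ratios of the counts above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bot(d, c, v, counts, T=10):
    # Placeholder: the real full conditional p(b, o, t | rest) is a product
    # of smoothed count ratios as described above; a uniform draw keeps this
    # sketch self-contained.
    return int(rng.integers(2)), int(rng.integers(2)), int(rng.integers(T))

def resample_token(d, s, n, state, counts):
    """One collapsed Gibbs update for the n-th token of sentence s in paper d."""
    c = state.sot[d][s]                  # sentence-level school of thought
    b, o, t = state.bot[d][s][n]         # current indicator/topic assignment
    v = state.word[d][s][n]              # vocabulary item of the token

    # Decrement: the "without counting the n-th token" convention above.
    counts.n_d_b[d, b] -= 1              # N(d, b): background vs. SOT words in d
    if b == 1:
        counts.n_bg_v[v] -= 1            # N(1, v): background emissions of v
    else:
        counts.n_t_v[t, v] -= 1          # N(0, t, v): topical emissions of v
        counts.n_d_o[d, o] -= 1          # N(d, 0, o): generality vs. originality
        counts.n_c_o_t[c, o, t] -= 1     # N(c, 0, o, t): topic use per school

    # Jointly resample (b, o, t), then add the new assignment back.
    b, o, t = sample_bot(d, c, v, counts)
    state.bot[d][s][n] = (b, o, t)
    counts.n_d_b[d, b] += 1
    if b == 1:
        counts.n_bg_v[v] += 1
    else:
        counts.n_t_v[t, v] += 1
        counts.n_d_o[d, o] += 1
        counts.n_c_o_t[c, o, t] += 1
```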
wordName wordTfidf (topN-words)
[('sot', 0.443), ('thought', 0.428), ('schools', 0.278), ('reachability', 0.272), ('school', 0.234), ('bd', 0.203), ('scientific', 0.202), ('zhuge', 0.167), ('ideas', 0.166), ('originality', 0.146), ('citation', 0.102), ('nd', 0.087), ('authoring', 0.085), ('fsthr', 0.084), ('hops', 0.084), ('indexing', 0.081), ('papers', 0.08), ('steyvers', 0.065), ('assignment', 0.063), ('backgrounds', 0.063), ('contour', 0.063), ('background', 0.061), ('topics', 0.059), ('densest', 0.055), ('lr', 0.054), ('nanjing', 0.051), ('topic', 0.049), ('deliver', 0.049), ('griffiths', 0.047), ('cg', 0.046), ('topical', 0.046), ('characterizable', 0.042), ('dietz', 0.042), ('goth', 0.042), ('heinrich', 0.042), ('herrera', 0.042), ('joang', 0.042), ('multinomially', 0.042), ('nbn', 0.042), ('scanning', 0.042), ('upham', 0.042), ('xiaorui', 0.042), ('wallach', 0.041), ('dp', 0.04), ('closure', 0.038), ('cover', 0.037), ('ncn', 0.037), ('wa', 0.036), ('counting', 0.035), ('simulate', 0.035), ('token', 0.034), ('authors', 0.034), ('talks', 0.034), ('assumptions', 0.033), ('evolution', 0.033), ('re', 0.033), ('transitive', 0.032), ('assumption', 0.032), ('tam', 0.032), ('bishop', 0.032), ('chemudugunta', 0.032), ('cd', 0.032), ('reduction', 0.032), ('ns', 0.031), ('nc', 0.031), ('girju', 0.031), ('meaningless', 0.031), ('terminologies', 0.031), ('pp', 0.03), ('sd', 0.03), ('clustering', 0.03), ('domain', 0.03), ('dc', 0.03), ('describing', 0.03), ('accomplish', 0.029), ('variable', 0.029), ('versus', 0.029), ('discovery', 0.029), ('latent', 0.029), ('works', 0.028), ('simulating', 0.028), ('qazvinian', 0.028), ('intermediate', 0.027), ('mimno', 0.027), ('generality', 0.027), ('characteristic', 0.027), ('vanderwende', 0.026), ('subgraph', 0.026), ('teufel', 0.026), ('circle', 0.026), ('quantitative', 0.026), ('te', 0.026), ('dir', 0.026), ('emission', 0.026), ('streams', 0.026), ('dirichlet', 0.026), ('indicator', 0.026), ('np', 0.025), ('mei', 0.025), ('titov', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 54 acl-2013-Are School-of-thought Words Characterizable?
Author: Xiaorui Jiang ; Xiaoping Sun ; Hai Zhuge
Abstract: School of thought analysis is an important yet not-well-elaborated scientific knowledge discovery task. This paper makes the first attempt at this problem. We focus on one aspect of the problem: do characteristic school-of-thought words exist and whether they are characterizable? To answer these questions, we propose a probabilistic generative School-Of-Thought (SOT) model to simulate the scientific authoring process based on several assumptions. SOT defines a school of thought as a distribution of topics and assumes that authors determine the school of thought for each sentence before choosing words to deliver scientific ideas. SOT distinguishes between two types of school-of-thought words for either the general background of a school of thought or the original ideas each paper contributes to its school of thought. Narrative and quantitative experiments show positive and promising results to the questions raised above.
2 0.11130255 23 acl-2013-A System for Summarizing Scientific Topics Starting from Keywords
Author: Rahul Jha ; Amjad Abu-Jbara ; Dragomir Radev
Abstract: In this paper, we investigate the problem of automatic generation of scientific surveys starting from keywords provided by a user. We present a system that can take a topic query as input and generate a survey of the topic by first selecting a set of relevant documents, and then selecting relevant sentences from those documents. We discuss the issues of robust evaluation of such systems and describe an evaluation corpus we generated by manually extracting factoids, or information units, from 47 gold standard documents (surveys and tutorials) on seven topics in Natural Language Processing. We have manually annotated 2,625 sentences with these factoids (around 375 sentences per topic) to build an evaluation corpus for this task. We present evaluation results for the performance of our system using this annotated data.
3 0.07904204 121 acl-2013-Discovering User Interactions in Ideological Discussions
Author: Arjun Mukherjee ; Bing Liu
Abstract: Online discussion forums are a popular platform for people to voice their opinions on any subject matter and to discuss or debate any issue of interest. In forums where users discuss social, political, or religious issues, there are often heated debates among users or participants. Existing research has studied mining of user stances or camps on certain issues, opposing perspectives, and contention points. In this paper, we focus on identifying the nature of interactions among user pairs. The central questions are: How does each pair of users interact with each other? Does the pair of users mostly agree or disagree? What is the lexicon that people often use to express agreement and disagreement? We present a topic model based approach to answer these questions. Since agreement and disagreement expressions are usually multiword phrases, we propose to employ a ranking method to identify highly relevant phrases prior to topic modeling. After modeling, we use the modeling results to classify the nature of interaction of each user pair. Our evaluation results using real-life discussion/debate posts demonstrate the effectiveness of the proposed techniques.
4 0.075717844 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?
Author: Romain Deveaud ; Eric SanJuan ; Patrice Bellot
Abstract: The current topic modeling approaches for Information Retrieval do not allow to explicitly model query-oriented latent topics. Moreover, the semantic coherence of the topics has never been considered in this field. We propose a model-based feedback approach that learns Latent Dirichlet Allocation topic models on the top-ranked pseudo-relevant feedback, and we measure the semantic coherence of those topics. We perform a first experimental evaluation using two major TREC test collections. Results show that retrieval performances tend to be better when using topics with higher semantic coherence.
5 0.071692184 351 acl-2013-Topic Modeling Based Classification of Clinical Reports
Author: Efsun Sarioglu ; Kabir Yadav ; Hyeong-Ah Choi
Abstract: Electronic health records (EHRs) contain important clinical information about patients. Some of these data are in the form of free text and require preprocessing to be able to be used in automated systems. Efficient and effective use of this data could be vital to the speed and quality of health care, such as recommending the need for a certain medical test while avoiding intrusive tests or medical procedures. As a case study, we analyzed classification of CT imaging reports into binary categories. In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways. Topic modeling of the corpora provides interpretable themes that exist in these reports. Representing reports according to their topic distributions is more compact than bag-of-words representation and can be processed faster than raw text in subsequent automated processes. A binary topic model was also built as an unsupervised classification approach with the assumption that each topic corresponds to a class. And, finally an aggregate topic classifier was built where reports are classified based on a single discriminative topic that is determined from the training dataset. Our proposed topic based classifier system is shown to be competitive with existing text classification techniques and provides a more efficient and interpretable representation.
6 0.065000832 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model
7 0.060982015 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm
8 0.053734399 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction
9 0.046381474 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching
10 0.044715203 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation
11 0.043771446 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation
12 0.043758459 224 acl-2013-Learning to Extract International Relations from Political Context
13 0.042659637 350 acl-2013-TopicSpam: a Topic-Model based approach for spam detection
14 0.04228903 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain
15 0.042242125 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations
16 0.042216122 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information
17 0.042164017 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions
18 0.041825201 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
19 0.041271199 126 acl-2013-Diverse Keyword Extraction from Conversations
20 0.039536219 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors
topicId topicWeight
[(0, 0.127), (1, 0.044), (2, 0.005), (3, -0.023), (4, 0.013), (5, -0.031), (6, 0.05), (7, -0.031), (8, -0.093), (9, -0.039), (10, 0.027), (11, 0.028), (12, 0.011), (13, 0.038), (14, 0.011), (15, -0.027), (16, -0.03), (17, 0.045), (18, -0.035), (19, 0.001), (20, -0.012), (21, 0.006), (22, -0.007), (23, -0.037), (24, -0.002), (25, -0.0), (26, -0.052), (27, -0.018), (28, 0.01), (29, -0.021), (30, 0.004), (31, -0.018), (32, -0.025), (33, -0.012), (34, 0.006), (35, -0.023), (36, 0.005), (37, 0.004), (38, -0.0), (39, -0.042), (40, 0.06), (41, 0.013), (42, 0.046), (43, -0.005), (44, 0.038), (45, 0.036), (46, -0.029), (47, 0.016), (48, 0.018), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.91793013 54 acl-2013-Are School-of-thought Words Characterizable?
Author: Xiaorui Jiang ; Xiaoping Sun ; Hai Zhuge
Abstract: School of thought analysis is an important yet not-well-elaborated scientific knowledge discovery task. This paper makes the first attempt at this problem. We focus on one aspect of the problem: do characteristic school-of-thought words exist and whether they are characterizable? To answer these questions, we propose a probabilistic generative School-Of-Thought (SOT) model to simulate the scientific authoring process based on several assumptions. SOT defines a school of thought as a distribution of topics and assumes that authors determine the school of thought for each sentence before choosing words to deliver scientific ideas. SOT distinguishes between two types of school-of-thought words for either the general background of a school of thought or the original ideas each paper contributes to its school of thought. Narrative and quantitative experiments show positive and promising results to the questions raised above.
2 0.79789847 126 acl-2013-Diverse Keyword Extraction from Conversations
Author: Maryam Habibi ; Andrei Popescu-Belis
Abstract: A new method for keyword extraction from conversations is introduced, which preserves the diversity of topics that are mentioned. Inspired from summarization, the method maximizes the coverage of topics that are recognized automatically in transcripts of conversation fragments. The method is evaluated on excerpts of the Fisher and AMI corpora, using a crowdsourcing platform to elicit comparative relevance judgments. The results demonstrate that the method outperforms two competitive baselines.
Author: Yukari Ogura ; Ichiro Kobayashi
Abstract: In this paper, we propose a method to raise the accuracy of text classification based on latent topics, reconsidering the techniques necessary for good classification: for example, to decide important sentences in a document, the sentences with important words are usually regarded as important sentences. In this case, tf.idf is often used to decide important words. On the other hand, we apply the PageRank algorithm to rank important words in each document. Furthermore, before clustering documents, we refine the target documents by representing them as a collection of important sentences in each document. We then classify the documents based on latent information in the documents. As a clustering method, we employ the k-means algorithm and investigate how our proposed method works for good clustering. We conduct experiments with Reuters-21578 corpus under various conditions of important sentence extraction, using latent and surface information for clustering, and have confirmed that our proposed method provides better result among various conditions for clustering.
4 0.74810535 23 acl-2013-A System for Summarizing Scientific Topics Starting from Keywords
Author: Rahul Jha ; Amjad Abu-Jbara ; Dragomir Radev
Abstract: In this paper, we investigate the problem of automatic generation of scientific surveys starting from keywords provided by a user. We present a system that can take a topic query as input and generate a survey of the topic by first selecting a set of relevant documents, and then selecting relevant sentences from those documents. We discuss the issues of robust evaluation of such systems and describe an evaluation corpus we generated by manually extracting factoids, or information units, from 47 gold standard documents (surveys and tutorials) on seven topics in Natural Language Processing. We have manually annotated 2,625 sentences with these factoids (around 375 sentences per topic) to build an evaluation corpus for this task. We present evaluation results for the performance of our system using this annotated data.
5 0.73240197 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?
Author: Romain Deveaud ; Eric SanJuan ; Patrice Bellot
Abstract: The current topic modeling approaches for Information Retrieval do not allow to explicitly model query-oriented latent topics. Moreover, the semantic coherence of the topics has never been considered in this field. We propose a model-based feedback approach that learns Latent Dirichlet Allocation topic models on the top-ranked pseudo-relevant feedback, and we measure the semantic coherence of those topics. We perform a first experimental evaluation using two major TREC test collections. Results show that retrieval performances tend to be better when using topics with higher semantic coherence.
6 0.71906441 351 acl-2013-Topic Modeling Based Classification of Clinical Reports
7 0.71656179 121 acl-2013-Discovering User Interactions in Ideological Discussions
8 0.66084212 142 acl-2013-Evolutionary Hierarchical Dirichlet Process for Timeline Summarization
9 0.6517933 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model
10 0.6516304 293 acl-2013-Random Walk Factoid Annotation for Collective Discourse
11 0.64649481 182 acl-2013-High-quality Training Data Selection using Latent Topics for Graph-based Semi-supervised Learning
12 0.63859296 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions
13 0.63710356 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information
14 0.61347884 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation
15 0.60434747 287 acl-2013-Public Dialogue: Analysis of Tolerance in Online Discussions
16 0.56541306 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation
17 0.55520648 257 acl-2013-Natural Language Models for Predicting Programming Comments
18 0.55486161 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction
19 0.55130512 370 acl-2013-Unsupervised Transcription of Historical Documents
20 0.53707004 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression
topicId topicWeight
[(0, 0.048), (4, 0.015), (6, 0.024), (11, 0.051), (15, 0.013), (24, 0.047), (26, 0.045), (35, 0.068), (42, 0.035), (48, 0.449), (70, 0.028), (88, 0.023), (90, 0.017), (95, 0.043)]
simIndex simValue paperId paperTitle
1 0.973032 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint
Author: Jun Suzuki ; Masaaki Nagata
Abstract: This paper proposes a framework of supervised model learning that realizes feature grouping to obtain lower complexity models. The main idea of our method is to integrate a discrete constraint into model learning with the help of the dual decomposition technique. Experiments on two well-studied NLP tasks, dependency parsing and NER, demonstrate that our method can provide state-of-the-art performance even if the degrees of freedom in trained models are surprisingly small, i.e., 8 or even 2. This significant benefit enables us to provide compact model representation, which is especially useful in actual use.
2 0.95684373 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors
Author: Volkan Cirik
Abstract: We study substitute vectors to solve the part-of-speech ambiguity problem in an unsupervised setting. Part-of-speech tagging is a crucial preliminary process in many natural language processing applications. Because many words in natural languages have more than one part-of-speech tag, resolving part-of-speech ambiguity is an important task. We claim that partof-speech ambiguity can be solved using substitute vectors. A substitute vector is constructed with possible substitutes of a target word. This study is built on previous work which has proven that word substitutes are very fruitful for part-ofspeech induction. Experiments show that our methodology works for words with high ambiguity.
same-paper 3 0.95634127 54 acl-2013-Are School-of-thought Words Characterizable?
Author: Xiaorui Jiang ; Xiaoping Sun ; Hai Zhuge
Abstract: School of thought analysis is an important yet not-well-elaborated scientific knowledge discovery task. This paper makes the first attempt at this problem. We focus on one aspect of the problem: do characteristic school-of-thought words exist and whether they are characterizable? To answer these questions, we propose a probabilistic generative School-Of-Thought (SOT) model to simulate the scientific authoring process based on several assumptions. SOT defines a school of thought as a distribution of topics and assumes that authors determine the school of thought for each sentence before choosing words to deliver scientific ideas. SOT distinguishes between two types of school-of-thought words for either the general background of a school of thought or the original ideas each paper contributes to its school of thought. Narrative and quantitative experiments show positive and promising results to the questions raised above.
4 0.90014994 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates
Author: Tiziano Flati ; Roberto Navigli
Abstract: We present SPred, a novel method for the creation of large repositories of semantic predicates. We start from existing collocations to form lexical predicates (e.g., break ∗) and learn the semantic classes that best fit the ∗ argument. To do this, we extract all the occurrences in Wikipedia which match the predicate and abstract its arguments to general semantic classes (e.g., break BODY PART, break AGREEMENT, etc.). Our experiments show that we are able to create a large collection of semantic predicates from the Oxford Advanced Learner’s Dictionary with high precision and recall, and perform well against the most similar approach.
5 0.89071321 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics
Author: Angeliki Lazaridou ; Marco Marelli ; Roberto Zamparelli ; Marco Baroni
Abstract: Speakers of a language can construct an unlimited number of new words through morphological derivation. This is a major cause of data sparseness for corpus-based approaches to lexical semantics, such as distributional semantic models of word meaning. We adapt compositional methods originally developed for phrases to the task of deriving the distributional meaning of morphologically complex words from their parts. Semantic representations constructed in this way beat a strong baseline and can be of higher quality than representations directly constructed from corpus data. Our results constitute a novel evaluation of the proposed composition methods, in which the full additive model achieves the best performance, and demonstrate the usefulness of a compositional morphology component in distributional semantics.
6 0.88044792 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment
7 0.66757381 188 acl-2013-Identifying Sentiment Words Using an Optimization-based Model without Seed Words
8 0.65818477 237 acl-2013-Margin-based Decomposed Amortized Inference
9 0.65709782 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit
10 0.64837992 294 acl-2013-Re-embedding words
11 0.64351451 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
12 0.62233359 62 acl-2013-Automatic Term Ambiguity Detection
13 0.61783886 275 acl-2013-Parsing with Compositional Vector Grammars
14 0.61779994 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models
15 0.60635549 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis
16 0.60313684 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers
17 0.60103327 91 acl-2013-Connotation Lexicon: A Dash of Sentiment Beneath the Surface Meaning
18 0.59793943 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics
19 0.5941003 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation
20 0.59199381 175 acl-2013-Grounded Language Learning from Video Described with Sentences