54 acl-2013-Are School-of-thought Words Characterizable?

Source: pdf

Author: Xiaorui Jiang ; Xiaoping Sun ; Hai Zhuge

Abstract: School-of-thought analysis is an important yet under-explored scientific knowledge discovery task. This paper makes the first attempt at this problem. We focus on one aspect of the problem: do characteristic school-of-thought words exist, and are they characterizable? To answer these questions, we propose a probabilistic generative School-Of-Thought (SOT) model to simulate the scientific authoring process based on several assumptions. SOT defines a school of thought as a distribution of topics and assumes that authors determine the school of thought for each sentence before choosing words to deliver scientific ideas. SOT distinguishes between two types of school-of-thought words: those that express the general background of a school of thought and those that express the original ideas each paper contributes to its school of thought. Narrative and quantitative experiments give positive and promising answers to the questions raised above.
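The authoring process described in the abstract can be illustrated with a minimal sketch. It assumes the indicator names used in the paper's Appendix B (c for the school of thought of a sentence, b for background versus school-of-thought words, o for common versus original ideas, t for the topic). The function names, the dictionary layout, the symmetric Dirichlet priors and every hyperparameter value are hypothetical choices made for illustration only, not the authors' implementation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def draw_corpus_parameters(C, T, V, rng):
    """Corpus-level distributions (all prior values are assumed, not from the paper)."""
    return {
        # theta[c][o]: topic distribution of school c for common (o=0) or original (o=1) ideas
        "theta": [[sample_dirichlet([0.5] * T, rng) for _ in range(2)] for _ in range(C)],
        # phi_topic[t]: word distribution of topic t
        "phi_topic": [sample_dirichlet([0.5] * V, rng) for _ in range(T)],
        # phi_bg: word distribution of background words
        "phi_bg": sample_dirichlet([0.5] * V, rng),
    }


def generate_paper(params, num_sentences, words_per_sentence, rng):
    """Generate one paper: pick a school of thought per sentence, then the words."""
    C = len(params["theta"])
    pi_c = sample_dirichlet([1.0] * C, rng)   # paper's mix of schools of thought
    pi_b = sample_dirichlet([1.0, 1.0], rng)  # b=0: school-of-thought word, b=1: background word
    pi_o = sample_dirichlet([1.0, 1.0], rng)  # o=0: common idea, o=1: original idea
    paper = []
    for _ in range(num_sentences):
        c = sample_index(pi_c, rng)           # decided once, before any word is chosen
        tokens = []
        for _ in range(words_per_sentence):
            b = sample_index(pi_b, rng)
            if b == 1:
                w = sample_index(params["phi_bg"], rng)
                tokens.append({"w": w, "b": 1, "o": None, "t": None})
            else:
                o = sample_index(pi_o, rng)
                t = sample_index(params["theta"][c][o], rng)
                w = sample_index(params["phi_topic"][t], rng)
                tokens.append({"w": w, "b": 0, "o": o, "t": t})
        paper.append({"c": c, "tokens": tokens})
    return paper


rng = random.Random(0)
params = draw_corpus_parameters(C=3, T=4, V=20, rng=rng)
paper = generate_paper(params, num_sentences=5, words_per_sentence=6, rng=rng)
```

Each sentence carries a single school-of-thought label, while every token carries its own b, o and t indicators, which mirrors the sentence-level and token-level updates of the appendix.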

reference text

Chemudugunta, C., Smyth, P., and Steyvers, M. 2006. Modeling general and specific aspects of documents with a probabilistic topic model. In Proc. NIPS ’06.
Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Ch. 8 Graphical Models. Springer.
Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3: 993–1022.
Chen, C. 2004. Searching for intellectual turning points: Progressive knowledge domain visualization. Proc. Natl. Acad. Sci., 101(suppl. 1): 5303–5310.
Dietz, L., Bickel, S., and Scheffer, T. 2007. Unsupervised prediction of citation influence. In Proc. ICML ’07, 233–240.
Goth, G. 2012. The science of better science. Commun. ACM, 55(2): 13–15.
Griffiths, T., and Steyvers, M. 2004. Finding scientific topics. Proc. Natl. Acad. Sci., 101(suppl. 1): 5228–5235.
Griffiths, T., Steyvers, M., Blei, D. M., and Tenenbaum, J. B. 2004. Integrating topics and syntax. In Proc. NIPS ’04.
Haghighi, A., and Vanderwende, L. 2009. Exploring content models for multi-document summarization. In Proc. HLT-NAACL ’09, 362–370.
Hall, D., Jurafsky, D., and Manning, C. D. 2008. Studying the history of ideas using topic models. In Proc. EMNLP ’08, 363–371.
Heinrich, G. 2008. Parameter estimation for text analysis. Available at www.arbylon.net/publications/text-est.pdf.
Herrera, M., Roberts, D. C., and Gulbahce, N. 2010. Mapping the evolution of scientific fields. PLoS ONE, 5(5): e10355.
Hoang, C. D. V., and Kan, M.-Y. 2010. Towards automatic related work summarization. In Proc. COLING ’10.
Li, P., Jiang, J., and Wang, Y. 2010. Generating templates of entity summaries with an entity-aspect model and pattern mining. In Proc. ACL ’10, 640–649.
Lin, W., Wilson, T., Wiebe, J., and Hauptmann, A. 2006. Which side are you on? Identifying perspectives at the document and sentence levels. In Proc. CoNLL ’06, 109–116.
Manning, C. D., Raghavan, P., and Schütze, H. 2009. Introduction to Information Retrieval. Ch. 16. Flat Clustering. Cambridge University Press.
Mei, Q., Ling, X., Wondra, M., Su, H., and Zhai, C. 2007. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proc. WWW ’07, 171–180.
Mimno, D., and McCallum, A. 2007. Expertise modeling for matching papers with reviewers. In Proc. SIGKDD ’07, 500–509.
Paul, M., and Girju, R. 2010. A two-dimensional topic-aspect model for discovering multi-faceted topics. In Proc. AAAI ’10, 545–550.
Qazvinian, V., Radev, D. R., Mohammad, S. M., Dorr, B., Zajic, D., Whidby, M., and Moon, T. 2013. Generating extractive summaries of scientific paradigms. J. Artif. Intell. Res., 46: 165–201.
Teufel, S. 2010. The Structure of Scientific Articles. CSLI Publications, Stanford, CA, USA.
Titov, I., and McDonald, R. 2008. Modeling online reviews with multi-grain topic models. In Proc. WWW ’08, 111–120.
Tu, Y., Johri, N., Roth, D., and Hockenmaier, J. 2010. Citation author topic model in expert search. In Proc. COLING ’10, 1265–1273.
Upham, S. P., Rosenkopf, L., and Ungar, L. H. 2010. Positioning knowledge: schools of thought and new knowledge creation. Scientometrics, 83(2): 555–581.
Wallach, H. 2006. Topic modeling: beyond bag-of-words. In Proc. ICML ’06, 977–984.
Xu, B., and Zhuge, H. 2013. A text scanning mechanism simulating human reading process. In Proc. IJCAI ’13.
Zhao, X., Jiang, J., Yan, H., and Li, X. 2010. Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. In Proc. EMNLP ’10, 56–65.
Zhuge, H. 2006. Discovery of knowledge flow in science. Commun. ACM, 49(5): 101–107.
Zhuge, H. 2012. The Knowledge Grid: Toward Cyber-Physical Society (2nd edition). World Scientific Publishing Company, Singapore.

Appendices

A Survey Papers for Building Data Sets

[RE] Yu, P. X., and Cheng, J. 2010. Managing and Mining Graph Data, Ch. 6, 181–215. Springer.
[NP] Ng, V. 2010. Supervised noun phrase coreference research: The first fifteen years. In Proc. ACL ’10, 1396–1411.
[PP] Madnani, N., and Dorr, B. J. 2010.
Generating phrasal and sentential paraphrases: A survey of data-driven methods. Comput. Linguist., 36(3): 341–387.
[TE/WA] Lopez, A. 2008. Statistical machine translation. ACM Comput. Surv., 40(3), Article 8, 49 pages.
[DP] Kübler, S., McDonald, R., and Nivre, J. 2009. Dependency parsing, Ch. 3–5, 21–78. Morgan & Claypool Publishers.
[LR] Liu, T. Y. 2011. Learning to rank for information retrieval, Ch. 2–4, 33–88. Springer.

B Gibbs Sampling of the SOT Model

Using collapsed Gibbs sampling (Griffiths and Steyvers, 2004), the school of thought $c_{d,s}$ of each sentence is inferred by Eq. (B1), and the latent variables b, o and t of each token are jointly sampled by Eqs. (B2)–(B4).

$$
p(c_{d,s}=c \mid \mathbf{c}^{\neg(d,s)}, \mathbf{b}, \mathbf{o}, \mathbf{t}) \propto
\frac{N_{d,c}^{\neg(d,s)}(d,c)+\alpha^{c}}{N_{d,c}^{\neg(d,s)}(d,\Sigma)+C\cdot\alpha^{c}}
\times \frac{\prod_{t=1}^{T}\Gamma\big(N_{c,b,o,t}(c,0,0,t)+\gamma^{g}\big)}{\prod_{t=1}^{T}\Gamma\big(N_{c,b,o,t}^{\neg(d,s)}(c,0,0,t)+\gamma^{g}\big)}
\times \frac{\Gamma\big(N_{c,b,o,t}^{\neg(d,s)}(c,0,0,\Sigma)+T\cdot\gamma^{g}\big)}{\Gamma\big(N_{c,b,o,t}(c,0,0,\Sigma)+T\cdot\gamma^{g}\big)}
\times \frac{\prod_{t=1}^{T}\Gamma\big(N_{c,b,o,t}(c,0,1,t)+\gamma^{o}\big)}{\prod_{t=1}^{T}\Gamma\big(N_{c,b,o,t}^{\neg(d,s)}(c,0,1,t)+\gamma^{o}\big)}
\times \frac{\Gamma\big(N_{c,b,o,t}^{\neg(d,s)}(c,0,1,\Sigma)+T\cdot\gamma^{o}\big)}{\Gamma\big(N_{c,b,o,t}(c,0,1,\Sigma)+T\cdot\gamma^{o}\big)}
\tag{B1}
$$

$$
p(b_{d,s,n}=1 \mid w_{d,s,n}=v, \cdot) \propto
\frac{N_{d,b}^{\neg(d,s,n)}(d,1)+\alpha^{b}_{1}}{N_{d,b}^{\neg(d,s,n)}(d,\Sigma)+\alpha^{b}_{0}+\alpha^{b}_{1}}
\times \frac{N_{b,v}^{\neg(d,s,n)}(1,v)+\beta^{bg}}{N_{b,v}^{\neg(d,s,n)}(1,\Sigma)+V\cdot\beta^{bg}}
\tag{B2}
$$

$$
p(b_{d,s,n}=0, o_{d,s,n}=0, t_{d,s,n}=t \mid c_{d,s}=c, \mathbf{b}^{\neg(d,s,n)}, \mathbf{o}^{\neg(d,s,n)}, \mathbf{t}^{\neg(d,s,n)}, w_{d,s,n}=v, \cdot) \propto
\frac{N_{d,b}^{\neg(d,s,n)}(d,0)+\alpha^{b}_{0}}{N_{d,b}^{\neg(d,s,n)}(d,\Sigma)+\alpha^{b}_{0}+\alpha^{b}_{1}}
\times \frac{N_{d,b,o}^{\neg(d,s,n)}(d,0,0)+\alpha^{o}_{0}}{N_{d,b,o}^{\neg(d,s,n)}(d,0,\Sigma)+\alpha^{o}_{0}+\alpha^{o}_{1}}
\times \frac{N_{c,b,o,t}^{\neg(d,s,n)}(c,0,0,t)+\gamma^{g}}{N_{c,b,o,t}^{\neg(d,s,n)}(c,0,0,\Sigma)+T\cdot\gamma^{g}}
\times \frac{N_{b,t,v}^{\neg(d,s,n)}(0,t,v)+\beta^{tp}}{N_{b,t,v}^{\neg(d,s,n)}(0,t,\Sigma)+V\cdot\beta^{tp}}
\tag{B3}
$$

$$
p(b_{d,s,n}=0, o_{d,s,n}=1, t_{d,s,n}=t \mid c_{d,s}=c, \mathbf{b}^{\neg(d,s,n)}, \mathbf{o}^{\neg(d,s,n)}, \mathbf{t}^{\neg(d,s,n)}, w_{d,s,n}=v, \cdot) \propto
\frac{N_{d,b}^{\neg(d,s,n)}(d,0)+\alpha^{b}_{0}}{N_{d,b}^{\neg(d,s,n)}(d,\Sigma)+\alpha^{b}_{0}+\alpha^{b}_{1}}
\times \frac{N_{d,b,o}^{\neg(d,s,n)}(d,0,1)+\alpha^{o}_{1}}{N_{d,b,o}^{\neg(d,s,n)}(d,0,\Sigma)+\alpha^{o}_{0}+\alpha^{o}_{1}}
\times \frac{N_{c,b,o,t}^{\neg(d,s,n)}(c,0,1,t)+\gamma^{o}}{N_{c,b,o,t}^{\neg(d,s,n)}(c,0,1,\Sigma)+T\cdot\gamma^{o}}
\times \frac{N_{b,t,v}^{\neg(d,s,n)}(0,t,v)+\beta^{tp}}{N_{b,t,v}^{\neg(d,s,n)}(0,t,\Sigma)+V\cdot\beta^{tp}}
\tag{B4}
$$

Figure B1. The SOT model inference.

In Eq. (B1), $N_{c,b,o,t}(c,0,o,t)$ is the number of words of topic t describing the common ideas (o = 0) or original ideas (o = 1) of school of thought c. The superscript $\neg(d,s)$ means that words in sentence s of paper d are not counted. $N_{d,c}^{\neg(d,s)}(d,c)$ counts the number of sentences in paper d describing school of thought c with sentence s removed from consideration.

In Eqs. (B1)–(B4), the symbol Σ means summation over the corresponding variable. For example,

$$
N_{c,b,o,t}(c,0,o,\Sigma) = \sum_{t=1}^{T} N_{c,b,o,t}(c,0,o,t)
\tag{B5}
$$

Latent variables b, o and t are jointly sampled in Eqs. (B2)–(B4). $N_{d,b}^{\neg(d,s,n)}(d,b)$ counts the number of school-of-thought (b = 0) or background (b = 1) words in document d without counting the n-th token in sentence s. $N_{b,v}^{\neg(d,s,n)}(1,v)$ is the number of times vocabulary item v occurs as a background word in the literature collection without counting the n-th token in sentence s of paper d. $N_{d,b,o}^{\neg(d,s,n)}(d,0,o)$ is the number of words in paper d describing either common ideas (o = 0) or original ideas (o = 1) of some school of thought without considering the n-th token in sentence s of paper d. $N_{c,b,o,t}^{\neg(d,s,n)}(c,0,o,t)$ is the number of words of topic t in the literature collection describing either common ideas (o = 0) or original ideas (o = 1) of school of thought c without counting the n-th token in sentence s of paper d. $N_{b,t,v}^{\neg(d,s,n)}(0,t,v)$ is the number of school-of-thought words of topic t which are instantiated by vocabulary item v in the literature collection without counting the n-th token in sentence s of paper d.
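
The token-level updates of Eqs. (B2)–(B4) can be illustrated with a short sketch that computes the unnormalized weight of every joint outcome (b, o, t) for one token and then samples from those weights. It treats b = 1 as a background word, in line with the count of background words being indexed by (1, v). The count-table layout, the key names, the function names and the hyperparameter values are all hypothetical, and the caller is assumed to have removed the current token from the counts beforehand.

```python
import random


def token_weights(counts, d, c, v, T, V, hyper):
    """Unnormalized weights of (b, o, t) for one token of word v in paper d,
    whose sentence is assigned to school of thought c.

    `counts` must already exclude the token being resampled. Assumed layout:
      counts["d_b"][d]      -> [n(b=0), n(b=1)] in paper d
      counts["d_o"][d]      -> [n(o=0), n(o=1)] among b=0 words in paper d
      counts["bg_v"]        -> background counts per vocabulary item
      counts["c_o_t"][c][o] -> topic counts for school c and idea type o
      counts["t_v"][t]      -> vocabulary counts for topic t
    """
    a_b, a_o = hyper["alpha_b"], hyper["alpha_o"]
    gamma = (hyper["gamma_g"], hyper["gamma_o"])  # o=0 uses gamma_g, o=1 uses gamma_o
    n_db = counts["d_b"][d]
    n_do = counts["d_o"][d]
    denom_b = sum(n_db) + a_b[0] + a_b[1]
    denom_o = sum(n_do) + a_o[0] + a_o[1]

    weights = {}
    # Background word (b = 1), following the shape of Eq. (B2).
    bg = counts["bg_v"]
    weights[(1, None, None)] = (
        (n_db[1] + a_b[1]) / denom_b
        * (bg[v] + hyper["beta_bg"]) / (sum(bg) + V * hyper["beta_bg"])
    )
    # School-of-thought word (b = 0), following the shape of Eqs. (B3) and (B4).
    for o in (0, 1):
        n_ct = counts["c_o_t"][c][o]
        for t in range(T):
            n_tv = counts["t_v"][t]
            weights[(0, o, t)] = (
                (n_db[0] + a_b[0]) / denom_b
                * (n_do[o] + a_o[o]) / denom_o
                * (n_ct[t] + gamma[o]) / (sum(n_ct) + T * gamma[o])
                * (n_tv[v] + hyper["beta_tp"]) / (sum(n_tv) + V * hyper["beta_tp"])
            )
    return weights


def sample_token(weights, rng):
    """Sample one (b, o, t) outcome in proportion to its weight."""
    outcomes = list(weights)
    return rng.choices(outcomes, weights=[weights[k] for k in outcomes], k=1)[0]
```

A full sampler would decrement the counts for the token's old assignment, call `token_weights` and `sample_token`, and then increment the counts for the new assignment. That bookkeeping is left out here to keep the sketch focused on the conditional itself.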