acl acl2010 acl2010-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Duo Zhang ; Qiaozhu Mei ; ChengXiang Zhai
Abstract: Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. [sent-4, score-1.162]
2 One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. [sent-5, score-1.103]
3 In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. [sent-6, score-1.587]
4 Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. [sent-7, score-0.617]
5 Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data. [sent-8, score-0.842]
6 1 Introduction As a robust unsupervised way to perform shallow latent semantic analysis of topics in text, probabilistic topic models (Hofmann, 1999a; Blei et al. [sent-9, score-0.989]
7 A topic is represented by a multinomial word distribution so that words characterizing a topic generally have higher probabilities than other words. [sent-12, score-0.694]
8 We can then hypothesize the existence of multiple topics in text and define a generative model based on the hypothesized topics. [sent-13, score-0.51]
9 to the latent topics as well as the topic distributions in text. [sent-16, score-0.964]
10 topics shared in text data in two different natural languages. [sent-20, score-0.519]
11 Thus with the existing models, we can only extract topics from text in each language, but cannot extract common topics shared in multiple languages. [sent-22, score-1.086]
12 In this paper, we propose a novel topic model, called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) model, which can be used to mine shared latent topics from unaligned text data in different languages. [sent-23, score-1.067]
13 PCLSA extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. [sent-24, score-0.285]
14 As a topic extraction algorithm, PCLSA would take a pair of unaligned document sets in different languages and a bilingual dictionary as input, and output a set of aligned word distributions in both languages that can characterize the shared topics in the two languages. [sent-27, score-1.263]
15 In addition, it also outputs a topic coverage [sent-28, score-0.310]
16 distribution for each language to indicate the relative coverage of different shared topics in each language. [sent-30, score-0.505]
17 To the best of our knowledge, no previous work has attempted to solve this topic extraction problem and generate the same output. [sent-31, score-0.335]
18 Both used a bilingual dictionary to bridge the language gap in a topic model. [sent-33, score-0.499]
19 However, the goals of their work are different from ours in that their models mainly focus on mining cross-lingual topics of matching word pairs and discovering the correspondence at the vocabulary level. [sent-34, score-0.503]
20 Therefore, the topics extracted using their model cannot indicate how a common topic is covered differently in the two languages, because the words in each word pair share the same probability in a common topic. [sent-35, score-0.926]
21 Our work focuses on discovering correspondence at the topic level. [sent-36, score-0.339]
22 In our model, since we only add a soft constraint on word pairs in the dictionary, their probabilities in common topics are generally different, which naturally captures the different variations of a common topic in different languages. [sent-37, score-0.916]
23 Experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data, and it outperforms a baseline approach using the standard PLSA on text data in each language. [sent-40, score-0.834]
24 2 Related Work Many topic models have been proposed, and the two basic models are the Probabilistic Latent Semantic Analysis (PLSA) model (Hofmann, 1999a) and the Latent Dirichlet Allocation (LDA) model (Blei et al. [sent-41, score-0.354]
25 They and their extensions have been successfully applied to many problems, including hierarchical topic extraction (Hofmann, 1999b; Blei et al. [sent-43, score-0.335]
26 , 2004), contextual topic analysis (Mei and Zhai, 2006), dynamic and correlated topic models (Blei and Lafferty, 2005; Blei and Lafferty, 2006), and opinion analysis (Mei et al. [sent-45, score-0.649]
27 Our work extends PLSA by incorporating the knowledge of a bilingual dictionary as soft constraints. [sent-48, score-0.226]
28 Some previous work on multilingual topic models assumes that documents in multiple languages are aligned either at the document level, at the sentence level, or by time stamps (Mimno et al. [sent-51, score-0.541]
29 However, in many applications, we need to mine topics from an unaligned text corpus. [sent-55, score-0.563]
30 For example, mining topics from search results in different languages can facilitate summarization of multilingual search results. [sent-56, score-0.623]
31 Besides all the multilingual topic modeling work discussed above, comparable corpora have also been studied extensively (e. [sent-57, score-0.422]
32 Our work differs from this line of previous work in that our goal is to discover shared latent topics from multi-lingual text data that are weakly comparable (e. [sent-63, score-0.707]
33 3 Problem Formulation In general, the problem of cross-lingual topic extraction can be defined as extracting a set of common cross-lingual latent topics covered in text collections in different natural languages. [sent-66, score-1.090]
34 A cross-lingual latent topic will be represented as a multinomial word distribution over the words in all the languages, i.e. [sent-67, score-0.613]
35 For example, given two collections of news articles in English and Chinese, respectively, we would like to extract common topics simultaneously from the two collections. [sent-70, score-0.628]
36 As a computational problem, our input is a multi-lingual text corpus, and output is a set of cross-lingual latent topics. [sent-74, score-0.202]
37 Definition 1 (Multi-Lingual Corpus) A multilingual corpus C is a set of text collections {C1,C2, . [sent-76, score-0.181]
38 Definition 2 (Cross-Lingual Topic) A cross-lingual topic θ is a semantically coherent multinomial distribution over all the words in the vocabularies of languages L1, . [sent-94, score-0.607]
39 Definition 3 (Cross-Lingual Topic Extraction) Given a multi-lingual corpus C, the task of cross-lingual topic extraction is to model C and extract k major cross-lingual topics {θ1 , θ2, . [sent-102, score-0.828]
40 The extracted cross-lingual topics can be directly used as a summary of the common content of the multi-lingual data set. [sent-106, score-0.523]
41 Note that once a cross-lingual topic is extracted, we can easily obtain its representation in each language Li by “splitting” the cross-lingual topic into multiple word distributions in different languages. [sent-107, score-0.662]
42 Formally, the word distribution of a cross-lingual topic θ in language Li is given by pi(wi|θ) = p(wi|θ) / ∑w∈Vi p(w|θ). These aligned language-specific word distributions can directly reveal the variations of topics in different languages. [sent-108, score-0.849]
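The splitting formula above is simple to implement. Below is a minimal Python sketch (function and variable names are illustrative, not the authors' code) that renormalizes a cross-lingual topic's probability mass within each language's vocabulary, exactly as the formula prescribes.

```python
def split_topic(p_w_given_theta, vocab_by_language):
    """Split one cross-lingual topic p(w|theta) into per-language word
    distributions: p_i(w_i|theta) = p(w_i|theta) / sum_{w in V_i} p(w|theta)."""
    per_language = {}
    for lang, vocab in vocab_by_language.items():
        mass = sum(p_w_given_theta.get(w, 0.0) for w in vocab)
        if mass > 0.0:
            per_language[lang] = {w: p_w_given_theta.get(w, 0.0) / mass
                                  for w in vocab}
    return per_language

# Toy usage: a cross-lingual topic over an English and a Chinese vocabulary.
topic = {"election": 0.30, "vote": 0.20, "选举": 0.35, "投票": 0.15}
vocabs = {"en": {"election", "vote"}, "zh": {"选举", "投票"}}
print(split_topic(topic, vocabs))  # each per-language distribution sums to 1
```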
43 They can also be used to analyze the difference of the coverage of the same topic in different languages. [sent-109, score-0.31]
44 Moreover, they are also useful for retrieving relevant articles or passages in each language and aligning them to the same common topic, thus essentially also allowing us to integrate and align articles in multiple languages. [sent-110, score-0.146]
45 4 Probabilistic Cross-Lingual Latent Semantic Analysis In this section, we present our probabilistic cross-lingual latent semantic analysis (PCLSA) model and discuss how it can be used to extract cross-lingual topics from multi-lingual text data. [sent-111, score-0.955]
46 The main reason existing topic models cannot be used for cross-lingual topic extraction is that they cannot cross the language barrier. [sent-112, score-0.645]
47 Intuitively, in order to cross the language barrier and extract a common topic shared in articles in different languages, we must rely on some kind of linguistic knowledge. [sent-113, score-0.473]
48 , Ls, if we represent each language as a node in a graph and connect those language pairs for which we have a bilingual dictionary, the minimum requirement is that the whole graph is connected. [sent-118, score-0.233]
49 A bilingual dictionary for languages Li and Lj generally would give us a many-to-many mapping between the vocabularies of the two languages. [sent-124, score-0.31]
50 Figure 1: A Dictionary-Based Word Graph. With multiple bilingual dictionaries, we can merge the graphs to generate a multi-partite graph G = (V, E). [sent-128, score-0.158]
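As a rough illustration of the dictionary-based word graph described here, the following sketch (hypothetical data structures, not taken from the paper) merges bilingual dictionaries into an undirected graph over language-tagged words; the node degrees are what a graph-based regularizer would normalize by.

```python
from collections import defaultdict

def build_word_graph(dictionaries):
    """Merge bilingual dictionaries into one undirected, multi-partite graph
    G = (V, E): nodes are (language, word) pairs, edges are dictionary entries.

    dictionaries: {(lang_a, lang_b): [(word_a, word_b), ...], ...}
    """
    graph = defaultdict(set)
    for (lang_a, lang_b), pairs in dictionaries.items():
        for word_a, word_b in pairs:
            u, v = (lang_a, word_a), (lang_b, word_b)
            graph[u].add(v)  # many-to-many mappings simply add more edges
            graph[v].add(u)
    return graph

graph = build_word_graph({("en", "zh"): [("election", "选举"), ("vote", "投票")]})
degree = {node: len(neighbors) for node, neighbors in graph.items()}
```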
51 Based on this graph, the PCLSA model extends the standard PLSA by adding a constraint to the likelihood function to “smooth” the word distributions of topics in PLSA on the multi-partite graph so that we would encourage the words that are connected in the graph (i. [sent-129, score-0.708]
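This summary does not reproduce the exact form of the constraint R(C), so the sketch below only illustrates the general idea: a penalty that is small when dictionary-linked words receive similar (degree-normalized) probability under the same topic. The squared-difference form and the degree normalization are assumptions, not necessarily the paper's exact formula.

```python
def graph_regularizer(topics, graph):
    """Assumed stand-in for the constraint R(C): for every topic and every
    dictionary edge (u, v), penalize the squared difference between the
    degree-normalized probabilities of the two linked words.

    topics: list of dicts mapping (language, word) -> probability
    graph:  symmetric adjacency map, (language, word) -> set of neighbors
    """
    penalty = 0.0
    for p_w in topics:
        for u, neighbors in graph.items():
            for v in neighbors:
                if u < v:  # count each undirected edge once
                    pu = p_w.get(u, 0.0) / len(graph[u])
                    pv = p_w.get(v, 0.0) / len(graph[v])
                    penalty += (pu - pv) ** 2
    return penalty

# The regularized objective is then the data log-likelihood minus a weighted
# penalty, e.g. log L(C) - lambda * graph_regularizer(topics, graph).
```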
52 , k) be a set of k cross-lingual topic models {θj} (j = 1, . [sent-136, score-0.310]
53 , k) to be discovered from a multilingual text data set with s languages such that p(w|θi) is the probability of word w according to the topic model θi. [sent-139, score-0.520]
54 Clearly, we would like the extracted topics to have a small R(C). [sent-147, score-0.475]
55 Our parameters include all the cross-lingual topics and the coverage distributions of the topics in all documents, which we denote by Ψ = {p(w|θj), p(θj|d)}d,w,j where j = 1, . [sent-155, score-0.940]
56 , k, w varies over the entire vocabularies of all the languages, and d varies over all the documents in our collection. [sent-158, score-0.157]
57 Obviously, after each smoothing step, the sum of the probabilities of all the words in one topic is still equal to 1. [sent-172, score-0.334]
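The normalization claim above is easy to verify in code. In this sketch the propagation weight λ and the even split of mass over neighbors are assumptions; only the idea of smoothing topic word distributions on the dictionary graph comes from the text. Probability is spread along dictionary edges and then renormalized, so the smoothed topic still sums to one.

```python
def smooth_topic(p_w, graph, lam=0.5):
    """One smoothing step on the dictionary graph: each word keeps a (1 - lam)
    share of its probability and spreads a lam share evenly over its dictionary
    neighbors; words without translations keep all of their mass.  The result
    is renormalized, so the topic still sums to 1."""
    smoothed = {w: (1.0 - lam) * p for w, p in p_w.items()}
    for w, p in p_w.items():
        neighbors = graph.get(w, ())
        if neighbors:
            share = lam * p / len(neighbors)
            for v in neighbors:
                smoothed[v] = smoothed.get(v, 0.0) + share
        else:
            smoothed[w] += lam * p  # no translation: keep the mass
    total = sum(smoothed.values())
    return {w: p / total for w, p in smoothed.items()}
```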
58 These monolingual topics can then be aligned based on a bilingual dictionary to suggest a possible cross-lingual topic. [sent-193, score-0.661]
59 1 Qualitative Comparison To qualitatively compare PCLSA with the baseline method, we compare the word distributions of topics extracted by them. [sent-195, score-0.517]
60 The number of topics to be extracted is set to 10 for both methods. [sent-200, score-0.475]
61 The first ten rows show sample topics produced by the traditional PLSA model. [sent-203, score-0.471]
62 Compared with the baseline method, PCLSA can not only find coherent topics from the cross-lingual corpus, but it can also show the content of a topic in both language corpora. [sent-208, score-0.795]
63 Similarly, ’Topic 9’ is related to the Philippines: the Chinese corpus mentions the environmental situation in the Philippines, while the English corpus mentions ’Abu Sayyaf’ frequently. [sent-216, score-0.420]
64 2 Discovering Common Topics To demonstrate the ability of PCLSA to find common topics in a cross-lingual corpus, we use some event names, e. [sent-218, score-0.497]
65 In both the English corpus and the Chinese corpus, we select a smaller number of documents about the topic ’Championship’ and combine them with the other two topics in the same corpus. [sent-222, score-0.829]
66 In this way, when we want to extract two topics from either the English or the Chinese corpus, the ’Championship’ topic may not be easy to extract, because the other two topics have more documents in the corpus. [sent-223, score-1.279]
67 However, when we use PCLSA to extract four topics from the two corpora together, we expect that the topic ’Championship’ will be found, because now the combined number of English and Chinese documents related to ’Championship’ is larger than that of the other topics. [sent-224, score-0.830]
68 The first two columns are the two topics extracted from the English corpus, the third and the fourth columns are two topics from the Chinese corpus, and the other four columns are the results from the cross-lingual corpus. [sent-226, score-0.924]
69 We can see that in either the Chinese sub-collection or the English sub-collection, the topic ’Championship’ is not extracted as a significant topic. [sent-227, score-0.336]
70 But, as expected, the topic ’Championship’ is extracted from the cross-lingual corpus, while the topic ’Olympic’ and topic ’Shrine’ are merged together. [sent-228, score-0.956]
71 This demonstrates that PCLSA is capable of extracting common topics from a cross-lingual corpus. [sent-229, score-0.535]
72 3 Quantitative Evaluation We also quantitatively evaluate how well our PCLSA model can discover common topics among corpora in different languages. [sent-231, score-0.553]
73 The basic idea is: suppose we obtain k cross-lingual topics from the whole corpus; then for each topic, we split it into two separate sets of topics, English topics and Chinese topics, using the splitting formula described before, i. [sent-233, score-1.231]
74 Then, we use the word distribution of the Chinese topics (translating the words into English) to fit the English corpus and use the word distribution of the English topics (translating the words into Chinese) to fit the Chinese corpus. [sent-236, score-1.028]
75 If the topics mined are common topics in the whole corpus, then such a “cross-collection” likelihood should be larger than for topics which are not commonly shared by the English and the Chinese corpus. [sent-237, score-1.551]
76 To translate topics from one language to another, e. [sent-239, score-0.449]
77 Chinese to English, we look up the bilingual dictionary and do word-to-word translation. [sent-241, score-0.189]
78 Basically, suppose PLSA mined k semantic topics in the Chinese corpus and k semantic topics in the English corpus. [sent-244, score-1.021]
79 Then, we also use the “cross-collection” likelihood measure to see how well those k semantic Chinese topics fit the English corpus and those k semantic English topics fit the Chinese corpus. [sent-245, score-1.137]
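A minimal sketch of the “cross-collection” likelihood protocol described above, under the assumptions that documents are bags of words, that translation is a plain dictionary lookup with mass split evenly over multiple translations, and that the translated topics are mixed uniformly; all of these simplifications are illustrative choices rather than the paper's exact procedure.

```python
import math

def translate_topic(p_w, dictionary):
    """Word-to-word translation of a topic's word distribution: spread each
    source word's probability evenly over its dictionary translations
    (an illustrative choice), then renormalize."""
    translated = {}
    for w, p in p_w.items():
        targets = dictionary.get(w, [])
        for t in targets:
            translated[t] = translated.get(t, 0.0) + p / len(targets)
    total = sum(translated.values()) or 1.0
    return {w: p / total for w, p in translated.items()}

def cross_collection_log_likelihood(topics, documents, dictionary, eps=1e-12):
    """Log-likelihood of target-language documents under the translated topics.
    The topics are mixed uniformly here for simplicity; the paper's exact
    fitting procedure is not reproduced in this summary."""
    translated = [translate_topic(t, dictionary) for t in topics]
    ll = 0.0
    for doc in documents:  # doc: list of target-language tokens
        for w in doc:
            p = sum(t.get(w, 0.0) for t in translated) / len(translated)
            ll += math.log(p + eps)
    return ll
```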
80 In the second data set, the English and Chinese corpora share some common topics during the overlap period. [sent-279, score-0.52]
81 The purpose of using these three different data sets for evaluation is to test how well PCLSA can mine common topics from either a data set where the English corpus and the Chinese corpus are comparable or a data set where the English corpus and the Chinese corpus rarely share common topics. [sent-281, score-0.757]
82 Each row shows the “cross-collection” likelihood of using the “cross-collection” topics to fit the data set named in the first column. [sent-283, score-0.558]
83 For example, in the first row, the values are the “cross-collection” likelihood of using Chinese topics found by different methods from the first data set to fit English 1. [sent-284, score-0.558]
84 From the results, we can see that in all the data sets, our PCLSA has a higher “cross-collection” likelihood value, which means it can find better common topics than the baseline method. [sent-286, score-0.566]
85 less topic overlapping, but the im- Table 4: Quantitative Evaluation of Common Topic Finding (“cross-collection” log-likelihood). [sent-290, score-0.310]
86 4 Extracting from Multi-Language Corpus In the previous experiments, we have shown the capability and effectiveness of the PCLSA model in latent topic extraction from two language corpora. [sent-297, score-0.520]
87 In fact, the proposed model is general and capable of extracting latent topics from a multi-language corpus. [sent-298, score-0.672]
88 The experimental result is shown in Table 5, in which we try to extract 8 topics from the cross-lingual corpus. [sent-305, score-0.574]
89 We can see that the extracted topics are mainly written in a single language. [sent-308, score-0.475]
90 As we increase the value of the parameter λ, the extracted topics become multi-lingual, as shown in the next ten rows. [sent-309, score-0.497]
91 In addition, if we set λ even larger, we will get topics that are mostly made of the same words from the three different brands, which means the extracted topics are now very smooth on the dictionary graph. [sent-311, score-1.091]
92 7 Conclusion In this paper, we study the problem of cross-lingual latent topic extraction, where the task is to extract a set of common latent topics from multilingual text data. [sent-312, score-1.409]
93 the Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) model) that can incorporate translation knowledge in bilingual dictionaries as a regularizer to constrain the parameter estimation so that the learned topic models would be synchronized in multiple languages. [sent-315, score-0.515]
94 The experimental results show that PCLSA is effective in extracting common latent topics from multilingual text data, and it outperforms the baseline method which uses the standard PLSA to fit each monolingual text data set. [sent-317, score-0.903]
95 Second, it would also be interesting to further extend PCLSA to accommodate the discovery of topics in each language that aren’t well aligned with other languages. [sent-321, score-0.478]
96 Hierarchical topic models and the nested Chinese restaurant process. [sent-340, score-0.509]
97 Exploiting comparable corpora and bilingual dictionaries for cross-language text categorization. [sent-373, score-0.207]
98 The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data. [sent-383, score-0.349]
99 Lexical triggers and latent semantic analysis for crosslingual language model adaptation. [sent-397, score-0.303]
100 Mining correlated bursty topic patterns from coordinated text streams. [sent-454, score-0.378]
wordName wordTfidf (topN-words)
[('pclsa', 0.549), ('topics', 0.449), ('plsa', 0.316), ('topic', 0.31), ('chinese', 0.199), ('latent', 0.163), ('mei', 0.108), ('bilingual', 0.106), ('championship', 0.1), ('crosslingual', 0.09), ('blei', 0.09), ('multilingual', 0.087), ('dictionary', 0.083), ('likelihood', 0.069), ('qiaozhu', 0.063), ('languages', 0.062), ('english', 0.062), ('chengxiang', 0.061), ('vocabularies', 0.059), ('graph', 0.052), ('hofmann', 0.05), ('dell', 0.05), ('dji', 0.05), ('articles', 0.049), ('common', 0.048), ('unaligned', 0.047), ('gem', 0.044), ('distributions', 0.042), ('apple', 0.04), ('fit', 0.04), ('brands', 0.04), ('text', 0.039), ('zhai', 0.039), ('probabilistic', 0.039), ('extracting', 0.038), ('duo', 0.038), ('soft', 0.037), ('dictionaries', 0.037), ('xinhua', 0.037), ('documents', 0.036), ('coherent', 0.036), ('deg', 0.036), ('extract', 0.035), ('corpus', 0.034), ('jagaralamudi', 0.033), ('olympic', 0.033), ('philippine', 0.033), ('synchronized', 0.033), ('wki', 0.033), ('deficiency', 0.033), ('mined', 0.033), ('smooth', 0.032), ('shared', 0.031), ('translations', 0.031), ('correlated', 0.029), ('discovering', 0.029), ('regularizer', 0.029), ('wc', 0.029), ('palestinian', 0.029), ('regularizing', 0.029), ('shrine', 0.029), ('semantic', 0.028), ('ibm', 0.028), ('mine', 0.028), ('mimno', 0.027), ('oj', 0.027), ('gliozzo', 0.027), ('sadat', 0.027), ('wv', 0.027), ('zw', 0.027), ('extracted', 0.026), ('qualitative', 0.026), ('news', 0.026), ('distribution', 0.025), ('mining', 0.025), ('comparable', 0.025), ('kdd', 0.025), ('dj', 0.025), ('extraction', 0.025), ('multinomial', 0.025), ('edge', 0.025), ('probabilities', 0.024), ('li', 0.024), ('branavan', 0.024), ('synthetic', 0.024), ('wp', 0.024), ('masuichi', 0.024), ('whole', 0.023), ('aligned', 0.023), ('document', 0.023), ('share', 0.023), ('ten', 0.022), ('ci', 0.022), ('mixture', 0.022), ('extends', 0.022), ('model', 0.022), ('mentions', 0.021), ('quantitative', 0.021), ('www', 0.021), ('collections', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 79 acl-2010-Cross-Lingual Latent Topic Extraction
Author: Duo Zhang ; Qiaozhu Mei ; ChengXiang Zhai
Abstract: Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.
2 0.21307047 10 acl-2010-A Latent Dirichlet Allocation Method for Selectional Preferences
Author: Alan Ritter ; Mausam Mausam ; Oren Etzioni
Abstract: The computation of selectional preferences, the admissible argument values for a relation, is a well-known NLP task with broad applicability. We present LDA-SP, which utilizes LinkLDA (Erosheva et al., 2004) to model selectional preferences. By simultaneously inferring latent topics and topic distributions over relations, LDA-SP combines the benefits of previous approaches: like traditional classbased approaches, it produces humaninterpretable classes describing each relation’s preferences, but it is competitive with non-class-based methods in predictive power. We compare LDA-SP to several state-ofthe-art methods achieving an 85% increase in recall at 0.9 precision over mutual information (Erk, 2007). We also evaluate LDA-SP’s effectiveness at filtering improper applications of inference rules, where we show substantial improvement over Pantel et al. ’s system (Pantel et al., 2007).
3 0.17096226 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization
Author: Asli Celikyilmaz ; Dilek Hakkani-Tur
Abstract: Scoring sentences in documents given abstract summaries created by humans is important in extractive multi-document summarization. In this paper, we formulate extractive summarization as a two step learning problem building a generative model for pattern discovery and a regression model for inference. We calculate scores for sentences in document clusters based on their latent characteristics using a hierarchical topic model. Then, using these scores, we train a regression model based on the lexical and structural characteristics of the sentences, and use the model to score sentences of new documents to form a summary. Our system advances current state-of-the-art improving ROUGE scores by ∼7%. Generated summaries are less redundant and more coherent based upon manual quality evaluations.
Author: Mark Johnson
Abstract: This paper establishes a connection between two apparently very different kinds of probabilistic models. Latent Dirichlet Allocation (LDA) models are used as “topic models” to produce a low-dimensional representation of documents, while Probabilistic Context-Free Grammars (PCFGs) define distributions over trees. The paper begins by showing that LDA topic models can be viewed as a special kind of PCFG, so Bayesian inference for PCFGs can be used to infer Topic Models as well. Adaptor Grammars (AGs) are a hierarchical, non-parametric Bayesian extension of PCFGs. Exploiting the close relationship between LDA and PCFGs just described, we propose two novel probabilistic models that combine insights from LDA and AG models. The first replaces the unigram component of LDA topic models with multi-word sequences or collocations generated by an AG. The second extension builds on the first one to learn aspects of the internal structure of proper names.
5 0.13173755 158 acl-2010-Latent Variable Models of Selectional Preference
Author: Diarmuid O Seaghdha
Abstract: This paper describes the application of so-called topic models to selectional preference induction. Three models related to Latent Dirichlet Allocation, a proven method for modelling document-word cooccurrences, are presented and evaluated on datasets of human plausibility judgements. Compared to previously proposed techniques, these models perform very competitively, especially for infrequent predicate-argument combinations where they exceed the quality of Web-scale predictions while using relatively little data.
6 0.13055284 237 acl-2010-Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
7 0.1194073 262 acl-2010-Word Alignment with Synonym Regularization
8 0.097997889 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
9 0.092293836 123 acl-2010-Generating Focused Topic-Specific Sentiment Lexicons
10 0.08884719 22 acl-2010-A Unified Graph Model for Sentence-Based Opinion Retrieval
11 0.08236447 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure
12 0.077487193 195 acl-2010-Phylogenetic Grammar Induction
13 0.074337624 204 acl-2010-Recommendation in Internet Forums and Blogs
14 0.073566929 215 acl-2010-Speech-Driven Access to the Deep Web on Mobile Devices
15 0.073385 220 acl-2010-Syntactic and Semantic Factors in Processing Difficulty: An Integrated Measure
16 0.068754978 146 acl-2010-Improving Chinese Semantic Role Labeling with Rich Syntactic Features
17 0.067413159 80 acl-2010-Cross Lingual Adaptation: An Experiment on Sentiment Classifications
18 0.06609872 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints
19 0.064217977 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems
20 0.063480213 136 acl-2010-How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
topicId topicWeight
[(0, -0.18), (1, 0.043), (2, -0.077), (3, 0.082), (4, 0.037), (5, 0.006), (6, -0.004), (7, -0.083), (8, 0.081), (9, -0.104), (10, -0.01), (11, -0.067), (12, 0.124), (13, 0.037), (14, 0.122), (15, -0.045), (16, -0.077), (17, -0.073), (18, -0.025), (19, -0.209), (20, -0.127), (21, -0.238), (22, 0.022), (23, 0.036), (24, -0.118), (25, -0.031), (26, -0.063), (27, 0.031), (28, -0.164), (29, -0.087), (30, -0.073), (31, 0.003), (32, -0.015), (33, -0.021), (34, -0.029), (35, -0.01), (36, 0.033), (37, -0.015), (38, 0.053), (39, -0.059), (40, 0.128), (41, -0.078), (42, 0.092), (43, -0.018), (44, 0.127), (45, -0.046), (46, 0.054), (47, -0.012), (48, 0.009), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.96446764 79 acl-2010-Cross-Lingual Latent Topic Extraction
Author: Duo Zhang ; Qiaozhu Mei ; ChengXiang Zhai
Abstract: Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.
Author: Mark Johnson
Abstract: This paper establishes a connection between two apparently very different kinds of probabilistic models. Latent Dirichlet Allocation (LDA) models are used as “topic models” to produce a low-dimensional representation of documents, while Probabilistic Context-Free Grammars (PCFGs) define distributions over trees. The paper begins by showing that LDA topic models can be viewed as a special kind of PCFG, so Bayesian inference for PCFGs can be used to infer Topic Models as well. Adaptor Grammars (AGs) are a hierarchical, non-parametric Bayesian extension of PCFGs. Exploiting the close relationship between LDA and PCFGs just described, we propose two novel probabilistic models that combine insights from LDA and AG models. The first replaces the unigram component of LDA topic models with multi-word sequences or collocations generated by an AG. The second extension builds on the first one to learn aspects of the internal structure of proper names.
3 0.62722141 10 acl-2010-A Latent Dirichlet Allocation Method for Selectional Preferences
Author: Alan Ritter ; Mausam Mausam ; Oren Etzioni
Abstract: The computation of selectional preferences, the admissible argument values for a relation, is a well-known NLP task with broad applicability. We present LDA-SP, which utilizes LinkLDA (Erosheva et al., 2004) to model selectional preferences. By simultaneously inferring latent topics and topic distributions over relations, LDA-SP combines the benefits of previous approaches: like traditional classbased approaches, it produces humaninterpretable classes describing each relation’s preferences, but it is competitive with non-class-based methods in predictive power. We compare LDA-SP to several state-ofthe-art methods achieving an 85% increase in recall at 0.9 precision over mutual information (Erk, 2007). We also evaluate LDA-SP’s effectiveness at filtering improper applications of inference rules, where we show substantial improvement over Pantel et al. ’s system (Pantel et al., 2007).
4 0.52637661 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization
Author: Asli Celikyilmaz ; Dilek Hakkani-Tur
Abstract: Scoring sentences in documents given abstract summaries created by humans is important in extractive multi-document summarization. In this paper, we formulate extractive summarization as a two step learning problem building a generative model for pattern discovery and a regression model for inference. We calculate scores for sentences in document clusters based on their latent characteristics using a hierarchical topic model. Then, using these scores, we train a regression model based on the lexical and structural characteristics of the sentences, and use the model to score sentences of new documents to form a summary. Our system advances current state-of-the-art improving ROUGE scores by ∼7%. Generated summaries are less redundant and more coherent based upon manual quality evaluations.
5 0.51569206 158 acl-2010-Latent Variable Models of Selectional Preference
Author: Diarmuid O Seaghdha
Abstract: This paper describes the application of so-called topic models to selectional preference induction. Three models related to Latent Dirichlet Allocation, a proven method for modelling document-word cooccurrences, are presented and evaluated on datasets of human plausibility judgements. Compared to previously proposed techniques, these models perform very competitively, especially for infrequent predicate-argument combinations where they exceed the quality of Web-scale predictions while using relatively little data.
6 0.4827306 262 acl-2010-Word Alignment with Synonym Regularization
7 0.48075914 34 acl-2010-Authorship Attribution Using Probabilistic Context-Free Grammars
8 0.4461208 204 acl-2010-Recommendation in Internet Forums and Blogs
9 0.4199591 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure
10 0.419581 177 acl-2010-Multilingual Pseudo-Relevance Feedback: Performance Study of Assisting Languages
11 0.3970902 162 acl-2010-Learning Common Grammar from Multilingual Corpus
12 0.39561763 195 acl-2010-Phylogenetic Grammar Induction
13 0.39538467 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems
14 0.39041269 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints
15 0.38895729 237 acl-2010-Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
16 0.3835617 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation
17 0.3479206 123 acl-2010-Generating Focused Topic-Specific Sentiment Lexicons
18 0.343573 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities
19 0.34298563 16 acl-2010-A Statistical Model for Lost Language Decipherment
20 0.33043948 136 acl-2010-How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
topicId topicWeight
[(8, 0.01), (14, 0.016), (25, 0.053), (28, 0.011), (33, 0.013), (39, 0.011), (42, 0.018), (44, 0.014), (49, 0.175), (59, 0.092), (72, 0.02), (73, 0.046), (76, 0.018), (78, 0.038), (83, 0.082), (84, 0.034), (98, 0.228)]
simIndex simValue paperId paperTitle
1 0.91435915 80 acl-2010-Cross Lingual Adaptation: An Experiment on Sentiment Classifications
Author: Bin Wei ; Christopher Pal
Abstract: In this paper, we study the problem of using an annotated corpus in English for the same natural language processing task in another language. While various machine translation systems are available, automated translation is still far from perfect. To minimize the noise introduced by translations, we propose to use only key “reliable” parts from the translations and apply structural correspondence learning (SCL) to find a low dimensional representation shared by the two languages. We perform experiments on an English-Chinese sentiment classification task and compare our results with a previous co-training approach. To alleviate the problem of data sparseness, we create extra pseudo-examples for SCL by making queries to a search engine. Experiments on real-world on-line review data demonstrate the two techniques can effectively improve the performance compared to previous work.
2 0.89158982 253 acl-2010-Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing
Author: Manabu Sassano ; Sadao Kurohashi
Abstract: We investigate active learning methods for Japanese dependency parsing. We propose active learning methods of using partial dependency relations in a given sentence for parsing and evaluate their effectiveness empirically. Furthermore, we utilize syntactic constraints of Japanese to obtain more labeled examples from precious labeled ones that annotators give. Experimental results show that our proposed methods improve considerably the learning curve of Japanese dependency parsing. In order to achieve an accuracy of over 88.3%, one of our methods requires only 34.4% of labeled examples as compared to passive learning.
same-paper 3 0.88134897 79 acl-2010-Cross-Lingual Latent Topic Extraction
Author: Duo Zhang ; Qiaozhu Mei ; ChengXiang Zhai
Abstract: Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.
4 0.82387239 232 acl-2010-The S-Space Package: An Open Source Package for Word Space Models
Author: David Jurgens ; Keith Stevens
Abstract: We present the S-Space Package, an open source framework for developing and evaluating word space algorithms. The package implements well-known word space algorithms, such as LSA, and provides a comprehensive set of matrix utilities and data structures for extending new or existing models. The package also includes word space benchmarks for evaluation. Both algorithms and libraries are designed for high concurrency and scalability. We demonstrate the efficiency of the reference implementations and also provide their results on six benchmarks.
5 0.82155055 133 acl-2010-Hierarchical Search for Word Alignment
Author: Jason Riesa ; Daniel Marcu
Abstract: We present a simple yet powerful hierarchical search algorithm for automatic word alignment. Our algorithm induces a forest of alignments from which we can efficiently extract a ranked k-best list. We score a given alignment within the forest with a flexible, linear discriminative model incorporating hundreds of features, and trained on a relatively small amount of annotated data. We report results on Arabic-English word alignment and translation tasks. Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.
6 0.81968188 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints
7 0.81876546 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification
8 0.81587625 146 acl-2010-Improving Chinese Semantic Role Labeling with Rich Syntactic Features
9 0.815593 20 acl-2010-A Transition-Based Parser for 2-Planar Dependency Structures
10 0.81389999 188 acl-2010-Optimizing Informativeness and Readability for Sentiment Summarization
11 0.81328475 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
12 0.8127799 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages
13 0.81221753 93 acl-2010-Dynamic Programming for Linear-Time Incremental Parsing
14 0.8070845 170 acl-2010-Letter-Phoneme Alignment: An Exploration
15 0.80556655 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities
16 0.8027364 37 acl-2010-Automatic Evaluation Method for Machine Translation Using Noun-Phrase Chunking
17 0.80252755 262 acl-2010-Word Alignment with Synonym Regularization
18 0.80249727 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning
19 0.80176365 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization
20 0.80122519 116 acl-2010-Finding Cognate Groups Using Phylogenies