jmlr jmlr2011 jmlr2011-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, models the training data set accurately, and is able to generalize to held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
Reference: text
sentIndex sentText sentNum sentScore
1 The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. [sent-8, score-0.635]
2 hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. [sent-9, score-0.642]
3 Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy 1. [sent-11, score-1.298]
4 Introduction. In this paper we address the problem of modeling the content of a given document collection as a hierarchy of latent topics, given no prior knowledge. [sent-12, score-0.975]
5 These topics represent and capture facets of content meaning, by means of multinomial probability distributions over the words of the term space of the documents. [sent-13, score-0.644]
6 The assignment of documents to latent topics without any pre-classification is a powerful text mining technique, useful, among other tasks, for ontology learning from text and document indexing. [sent-14, score-1.163]
7 Much work on PTMs focuses on a flat clustering of the term space into topics, while the creation of a hierarchical structure of topics without user involvement or pre-defined parameters still remains a challenging task. [sent-22, score-0.624]
8 The goal of discovering a topic hierarchy that comprises levels of topic abstractions is different from conventional hierarchical clustering. [sent-23, score-0.906]
9 The hierarchical modeling of topics allows the construction of more accurate and predictive models than the ones constructed by flat models. [sent-33, score-0.622]
10 Assuming that the higher levels of the hierarchy capture generic topics of a particular domain, while lower-level ones focus on particular aspects of that domain, it is expected that a hierarchical probabilistic topic model would be able to “explain” or could have generated the data set. [sent-37, score-1.144]
11 In particular, we aim to infer a hierarchy of topics and subtopics, such that each topic is more general than its subtopics, in the sense that if a document can be indexed by any of the subtopics it should also be indexed by the topic itself. [sent-43, score-1.552]
12 Moreover, we aim to infer the hierarchy without making any assumption either about the number of topics at any level of the hierarchy, or about the height of the hierarchy. [sent-44, score-0.869]
13 In addition to the basic model, we also present a variant that produces a topic hierarchy, by modeling the vocabulary only at the leaf level and considering topics in the inner levels to be multinomial distributions over subtopics. [sent-48, score-1.035]
14 Documents are assumed to be mixtures of topics and topics are probability distributions over the words of some vocabulary. [sent-55, score-1.151]
15 Based on this assumption of exchangeability, the meaning of documents does not depend on the specific sequence of the words, that is, the syntax, but rather on their “ability” to express specific topics either in isolation or in mixture. [sent-60, score-0.71]
16 We will refer to the distribution of topics as θK , indicating the dimensionality K of the distribution, and finally, φV will stand for the distribution of the words of the vocabulary V . [sent-77, score-0.623]
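To make the notation concrete, here is a minimal sketch (not from the paper) of how θK and φV can be represented numerically; the toy vocabulary, the number of topics, and all probability values are assumptions chosen purely for illustration.

```python
import numpy as np

# Toy vocabulary and K = 2 latent topics; all numbers are illustrative.
vocabulary = ["gene", "protein", "travel", "hotel"]

# phi_V: each topic is a multinomial distribution over the words of the vocabulary.
phi = np.array([
    [0.45, 0.45, 0.05, 0.05],   # a "biology"-like topic
    [0.05, 0.05, 0.50, 0.40],   # a "tourism"-like topic
])

# theta_K: a document-specific distribution over the K topics.
theta = np.array([0.7, 0.3])

# The word distribution implied for this document is a mixture of its topics.
p_word_given_doc = theta @ phi
print(dict(zip(vocabulary, p_word_given_doc.round(3))))
```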
17 Following the principles of PTMs, the generative model of probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) specifies a simple generative rule for the words in a document di , according to which, each word of a training document di comes from a randomly chosen topic ti . [sent-78, score-0.731]
18 The topics are drawn from a document-specific distribution over topics θK, and there exists one such distribution for each di. [sent-79, score-1.12]
19 The generative model of LDA, being a probabilistic model of a corpus, represents each di as a random mixture over latent topics T. [sent-85, score-0.658]
20 The estimated topics are represented as multinomial probability distributions over the terms of the documents, while each di is represented as a Dirichlet random variable θ, the dimensionality of which is predefined and equal to the number of estimated latent topics. [sent-87, score-0.742]
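The LDA generative story summarized above can be sketched in a few lines; the sizes and the symmetric Dirichlet hyper-parameters below are arbitrary illustrative choices, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, doc_length = 3, 8, 20       # number of topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1            # symmetric Dirichlet hyper-parameters (assumed values)

# Topics: K multinomial distributions over the V vocabulary terms.
phi = rng.dirichlet(np.full(V, beta), size=K)

# For a document d_i: a Dirichlet random variable theta of fixed dimensionality K.
theta = rng.dirichlet(np.full(K, alpha))

# Each word: choose a topic z from theta, then a word from the chosen topic.
words = []
for _ in range(doc_length):
    z = rng.choice(K, p=theta)
    words.append(rng.choice(V, p=phi[z]))
print(words)
```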
21 A question that usually arises when using models like LDA is how many topics the estimated model should have, given the document collection. [sent-90, score-0.699]
22 In such a Bayesian hierarchy, the root DP uses the Dirichlet distribution of the topics as a base distribution and each document samples from it. [sent-94, score-0.677]
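A minimal way to see how a Dirichlet process sidesteps fixing the number of topics is the Chinese restaurant process view of DP sampling; the sketch below is generic (the concentration value is an assumption) and is not the paper's full HDP construction.

```python
import numpy as np

def chinese_restaurant_process(n_customers, gamma, rng):
    """Sample table (topic) assignments; the number of tables is not fixed in advance."""
    tables = []                       # number of customers seated at each table
    assignments = []
    for _ in range(n_customers):
        probs = np.array(tables + [gamma], dtype=float)
        probs /= probs.sum()
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):     # open a new table, i.e., create a new topic
            tables.append(1)
        else:
            tables[choice] += 1
        assignments.append(choice)
    return assignments, len(tables)

rng = np.random.default_rng(1)
_, n_topics = chinese_restaurant_process(n_customers=200, gamma=2.0, rng=rng)
print("number of topics created by the process:", n_topics)
```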
23 Although LDA is a true generative probabilistic model for documents and HDP is a convenient mechanism for inferring the number of topics, relations of any type or correlations between the estimated topics are not taken into account. [sent-95, score-0.758]
24 In fact, a flat and soft clustering of the term space of the documents into topics is provided. [sent-96, score-0.731]
25 Thus, there is a need for hierarchical models that are able to capture relations between the latent topics in order to represent common shared structure, as explained in Section 1. [sent-97, score-0.708]
26 In a typical hierarchy, documents are assigned to classes at the leaves of the hierarchy, while words are sampled from topics which occupy non-leaf nodes of the hierarchy. [sent-107, score-0.752]
27 Another approach to capturing relations between topics is the correlated topic model (CTM) (Blei and Lafferty, 2006), an extension of LDA. [sent-112, score-0.842]
28 Correlations are introduced by topics that appear in the same context, in the sense that they appear together in documents (or parts of documents). [sent-115, score-0.71]
29 PAM connects the words of the vocabulary V and topics T on a DAG, where topics occupy the interior nodes and the leaves are words. [sent-121, score-1.165]
30 The four-level PAM, which is presented in Li and McCallum (2006), is able to model a text collection through a three-level hierarchy of topics with arbitrary connections between them. [sent-123, score-0.83]
31 Hierarchical LDA (hLDA) (Blei et al., 2004) was the first attempt to represent the distribution of topics as a tree structure, while at the same time providing uncertainty over the branching factor at each level of the tree. [sent-126, score-0.611]
32 In hLDA, each document is modeled as a mixture of L topics defined by θL proportions along a path from the root topic to a leaf. [sent-127, score-0.952]
33 Therefore, each document di is generated by the topics along a single path of this tree. [sent-128, score-0.694]
34 Thus, both internal and leaf topics generate words for new documents. [sent-132, score-0.702]
35 In the second one, this distribution does not exist, but the Dirichlet distribution of each internal node has one extra “exit” dimension, which corresponds to the event that a word is produced directly by the internal node, without reaching the leaf topics of the DAG. [sent-140, score-0.753]
36 While the models belonging to the PAM family provide a powerful means to describe inter-topic correlations, they have the same practical difficulty as many other topic models in determining the number of topics at the internal levels. [sent-146, score-0.879]
37 Each topic ti is modeled by a Dirichlet process and the Dirichlet processes at each level are further organized into a hierarchical Dirichlet process (HDP), which is used to estimate the number of topics at this level. [sent-150, score-0.894]
38 During the generation of a document, after sampling the multinomial distributions over topics from the corresponding HDPs, a topic path is sampled repeatedly according to the multinomials for each word in the document di . [sent-152, score-1.088]
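A hedged sketch of only this path-sampling step is shown below; the hierarchy sizes and the multinomials over subtopics are made-up placeholders standing in for what the HDPs would actually provide.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed hierarchy sizes: 2 root topics, 4 mid-level topics, 6 leaf topics.
level_sizes = [2, 4, 6]

# Multinomials over subtopics at each level (each row sums to 1); illustrative only.
root_probs = rng.dirichlet(np.ones(level_sizes[0]))
child_probs = [rng.dirichlet(np.ones(level_sizes[l + 1]), size=level_sizes[l])
               for l in range(len(level_sizes) - 1)]

def sample_topic_path():
    """Sample one root-to-leaf path of topics, one step per level."""
    path = [rng.choice(level_sizes[0], p=root_probs)]
    for level, probs in enumerate(child_probs):
        path.append(rng.choice(level_sizes[level + 1], p=probs[path[-1]]))
    return [int(t) for t in path]

# One path is drawn for every word of the document.
doc_length = 5
print([sample_topic_path() for _ in range(doc_length)])
```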
39 Representing all topics as multinomial distributions over words is more appealing than representing only the leaf topics. [sent-154, score-0.718]
40 The approach of Zavitsanos et al. (2008) uses the LDA model iteratively to produce layers of topics and then establishes hierarchical relations between them, based on conditional independences, given candidate parent topics. [sent-156, score-0.629]
41 The branching factor at each level is decided by the number of discovered relations, since topics that are not connected to others are disregarded. [sent-157, score-0.634]
42 The issue of the depth of the hierarchy is addressed in that work by measuring the similarity of the newly generated topics to the existing ones. [sent-158, score-0.837]
43 In summary, some topic models support a latent hierarchy of topics, but allow the generation of words only at the leaf level. [sent-160, score-0.707]
44 In addition, in contrast to the simple LDA, in the case of hLDA, documents can only access the topics that lie across a single path in the learned tree. [sent-163, score-0.733]
45 Finally, parameters such as the number of topics or the number of levels need to be estimated using cross-validation, which is not efficient even for non-hierarchical topic models like LDA. [sent-169, score-0.867]
46 The majority of the work reviewed in this section assesses the inferred hierarchy on the basis of how “meaningful” the latent topics are to humans. [sent-172, score-0.905]
47 The first variant results in a hierarchy whose internal nodes are represented as probability distributions over topics and over words. [sent-191, score-0.868]
48 There are as many DPs (Gj) as there are documents at each level, each connected to all topics of that level. [sent-205, score-0.71]
49 In addition, it allows the topics at each level to be shared among the documents of the collection. [sent-210, score-0.746]
50 hvHDP consists of topics that are both distributions over subtopics and over words. [sent-215, score-0.66]
51 htHDP represents only leaf topics as distributions over words. [sent-216, score-0.641]
52 On the other hand, while hPAM needs the number of internal topics to be fixed a priori, hLDA and hHDP are able to infer the number of topics at each level of the hierarchy, due to their non-parametric Bayesian nature. [sent-222, score-1.198]
53 Moreover, while the model of hLDA requires that each document be composed of topics along a single path of the hierarchy, hPAM and hHDP provide much more flexibility, since topics can be shared among super-topics. [sent-223, score-1.2]
54 The topics of the PAM models generate words at the leaf level and the models are based on a fixed three-level hierarchy. [sent-226, score-0.732]
55 The sampling scheme of hHDP estimates both the number of topics at each level and the number of levels of the learned hierarchy. [sent-236, score-0.66]
56 As shown in Figure 5, starting at the leaf level, we use HDP to infer the number of leaf topics as if no hierarchy is to be built. [sent-237, score-0.981]
57 Figure 5: Bottom-up probabilistic estimation of the topic hierarchy: Starting with a corpus of M documents, the leaf topics are inferred first. [sent-240, score-0.956]
58 In hvHDP, where topics are both distributions over subtopics and over words, the inference of the non-leaf levels treats topics, instead of documents, as restaurants. [sent-248, score-0.689]
59 Having inferred the topics at the leaf level, we know the mixture proportions that the documents of the collection follow. [sent-251, score-0.85]
60 In other words, at the leaf level we allocate documents to leaf topics, while at the intermediate levels we allocate topics to super-topics. [sent-255, score-0.923]
61 Therefore, the main contribution of this sampling scheme is the estimation of the non-leaf topics from “artificial” documents that correspond to estimated topics of lower levels. [sent-257, score-1.304]
62 Together with the use of the HDP for the estimation of the number of topics at each level, it makes the estimation of the topic hierarchy completely non-parametric. [sent-260, score-1.054]
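The following sketch illustrates one way such "artificial" documents could be built from estimated topic-term counts before re-running HDP one level up; the counts and the integer-rounding convention are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

# Assumed output of the leaf-level inference: topic-term counts
# (rows = leaf topics, columns = vocabulary terms); values are illustrative.
topic_term_counts = np.array([
    [12, 30,  3,  0,  5],
    [ 1,  2, 25, 18,  4],
    [ 0,  6,  7, 22, 15],
])

def topics_to_artificial_documents(counts):
    """Treat each estimated topic as a bag-of-words 'document' for the next level up."""
    docs = []
    for topic_row in counts:
        words = []
        for term_id, freq in enumerate(topic_row):
            words.extend([term_id] * int(freq))
        docs.append(words)
    return docs

artificial_docs = topics_to_artificial_documents(topic_term_counts)
# These pseudo-documents would then be fed to HDP again to infer the super-topics.
print([len(d) for d in artificial_docs])
```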
63 Regarding the second variant of the model (htHDP), where the internal topics are distributions only over subtopics and not words, the inference procedure differs in the modeling of non-leaf topics. [sent-261, score-0.704]
64 As observations for the inference of the next level up, we use the distributions of topics at the lower level over the original documents. [sent-263, score-0.639]
65 Therefore, while in the first variant of hHDP we had a topic-term matrix of frequencies as input for the estimation of an intermediate level of the hierarchy, in htHDP we have a document-topic matrix of frequencies for the sampling procedure. [sent-264, score-0.73]
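As a small illustration of the difference between the two variants' inputs, the toy matrices below contrast a topic-term frequency matrix (hvHDP) with a document-topic frequency matrix (htHDP); all values are invented.

```python
import numpy as np

# hvHDP: input for an intermediate level is a topic-term frequency matrix
# (leaf topics described by the words assigned to them); toy values.
topic_term = np.array([
    [12, 30,  3,  0],
    [ 1,  2, 25, 18],
])

# htHDP: input is instead a document-topic frequency matrix
# (how often each leaf topic was used in each original document); toy values.
document_topic = np.array([
    [9, 1],
    [2, 8],
    [5, 5],
])

# In both cases the rows play the role of "documents" for the next HDP round;
# only the meaning of the columns (terms vs. leaf topics) differs.
print(topic_term.shape, document_topic.shape)
```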
66 A topic hierarchy is derived from the corpus and a non-parametric Bayesian hierarchy is used at each level of the topic hierarchy. [sent-270, score-1.099]
67 The first hHDP variant satisfies the criteria that we set in Section 2: internal topics are represented as distributions over words and over subtopics, topics can share subtopics at the lower level in the hierarchy, and topics across any level of the hierarchy are shared among documents. [sent-271, score-2.159]
68 The degree of sharing topics across documents is expressed through the inferred parameters of the model, and this sharing of topics reflects the sharing of common terminology between documents. [sent-272, score-1.37]
69 In addition, n·iz is the number of occurrences of word i in topic z, n··z is the total number of words assigned to topic z, and finally, H and V are the prior DP hyper-parameter for word distributions and the total number of words, respectively. [sent-290, score-0.717]
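A sketch of the word-likelihood factor implied by these counts is given below, in the standard collapsed-Gibbs form (n·iz + H)/(n··z + V·H); the topic-side factor of the full conditional is omitted, and all counts are toy values.

```python
import numpy as np

def word_likelihood(word_id, topic_id, n_word_topic, n_topic_total, H, V):
    """P(word | topic) factor of a collapsed Gibbs conditional:
    (n_{.iz} + H) / (n_{..z} + V*H), using the counts defined in the text."""
    return (n_word_topic[word_id, topic_id] + H) / (n_topic_total[topic_id] + V * H)

# Toy counts: 4 words, 2 topics (illustrative values only).
V, K, H = 4, 2, 0.5
n_word_topic = np.array([[5, 0],
                         [3, 1],
                         [0, 6],
                         [1, 2]], dtype=float)    # n_{.iz}
n_topic_total = n_word_topic.sum(axis=0)          # n_{..z}

print(word_likelihood(word_id=2, topic_id=1, n_word_topic=n_word_topic,
                      n_topic_total=n_topic_total, H=H, V=V))
```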
70 In addition, in hvHDP, at each level of the hierarchy we transform the inferred topics to documents. [sent-302, score-0.881]
71 1 Document Modeling. Given a document collection, the task is to retrieve the latent hierarchy of topics that represents the data set and fits it well, in terms of perplexity. [sent-324, score-0.975]
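For reference, held-out perplexity is typically computed as the exponential of the negative average per-word log-likelihood (lower is better); the sketch below uses invented per-word probabilities rather than anything reported in the paper.

```python
import numpy as np

def perplexity(per_word_log_probs):
    """exp(-average log-likelihood per held-out word)."""
    return float(np.exp(-np.mean(per_word_log_probs)))

# Toy log-probabilities of held-out words under a fitted model (illustrative).
log_probs = np.log([0.01, 0.002, 0.05, 0.008, 0.02])
print(round(perplexity(log_probs), 1))
```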
72 Taking into account the context of the NIPS conferences, we believe that we have discovered a rather realistic hierarchical structure of 54 topics that fits the field in question well. [sent-355, score-0.626]
73 On the Genia data set (Figure 8a), the lowest perplexity is achieved by hvHDP, while hLDA approaches the same perplexity when the number of topics is around 60. [sent-431, score-0.78]
74 For this data set, the statistical test showed that hvHDP is better than hLDA in the ranges of 10-30 and 100-120 topics and better than LDA over the whole range of topics, although for a certain range both models achieve similar perplexity values. [sent-449, score-0.68]
75 Perhaps this can be attributed to the difficulty of identifying sufficiently different topics at various levels of abstraction when requesting a large depth for the hierarchy. [sent-461, score-0.609]
76 The Wilcoxon statistical tests have indicated that in the Elegance data set and for a number of topics around 80, hLDA does not perform significantly better than LDA, while in the NIPS data set, the same situation holds for a number of topics between 50 and 100. [sent-464, score-1.084]
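A paired Wilcoxon signed-rank test of the kind referred to here can be run with SciPy as sketched below; the per-run perplexity values are made up solely to show the call.

```python
from scipy.stats import wilcoxon

# Made-up per-run perplexities of two models on the same splits (illustrative only).
perplexity_model_a = [612, 598, 640, 605, 623, 610, 631, 602]
perplexity_model_b = [650, 641, 655, 638, 660, 645, 652, 640]

# Paired signed-rank test on the per-run differences.
statistic, p_value = wilcoxon(perplexity_model_a, perplexity_model_b)
print(f"W = {statistic}, p = {p_value:.4f}")
```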
77 Therefore, the experiment has confirmed the value of estimating the number of topics and the depth of the hierarchy in a completely non-parametric way. [sent-476, score-0.837]
78 From this comparison we concluded that all the topics of the estimated hierarchy have been correctly inferred. [sent-483, score-0.821]
79 The vocabulary clustering version of hHDP (hvHDP) estimates topics that are defined as distributions over words. [sent-487, score-0.627]
80 Thus, we merge the aforementioned two steps into one, and we assume that the estimated latent topics correspond to ontology concepts. [sent-497, score-0.839]
81 Therefore, in this task, we construct a topic ontology from scratch that comprises only hierarchical relations, given a collection of text documents, and we compare it to a given gold-standard ontology. [sent-498, score-0.895]
82 In particular, the evaluation method first transforms the concepts of the gold ontology into probability distributions over the terms of the data set, taking into account the context of each ontology concept. [sent-510, score-0.632]
83 In a second step, the gold ontology is matched to the learned hierarchy, based on how “close” the gold concepts and the learned topics are. [sent-511, score-1.075]
84 Precision, recall, and the F-measure are defined as
P = \frac{1}{M} \sum_{i=1}^{M} (1 - SD_i)\, PCP_i   (7)
R = \frac{1}{M} \sum_{i=1}^{M} (1 - SD_i)\, PCR_i   (8)
F = \frac{(\beta^2 + 1)\, P \cdot R}{\beta^2 R + P}   (9)
In Equations (7)-(9), M is the number of matchings between learned topics and gold concepts, and SD is a distance measure between concepts, ranging in [0, 1]. [sent-514, score-0.721]
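Equations (7)-(9) translate directly into code; the matchings below (distance SD, cotopy precision PCP, cotopy recall PCR) are toy values.

```python
def precision_recall_f(matchings, beta=1.0):
    """matchings: list of (SD_i, PCP_i, PCR_i) tuples, one per matched topic/concept pair."""
    M = len(matchings)
    P = sum((1 - sd) * pcp for sd, pcp, _ in matchings) / M      # Equation (7)
    R = sum((1 - sd) * pcr for sd, _, pcr in matchings) / M      # Equation (8)
    F = (beta ** 2 + 1) * P * R / (beta ** 2 * R + P)            # Equation (9)
    return P, R, F

# Toy matchings (SD, PCP, PCR); illustrative values only.
toy_matchings = [(0.1, 0.8, 0.7), (0.3, 0.6, 0.9), (0.2, 0.9, 0.8)]
print(precision_recall_f(toy_matchings))
```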
85 Specifically, the total variational distance (TVD) (Gibbs and Su, 2002) of Equation (10) was used to assess the similarity between topics and gold concepts. [sent-515, score-0.658]
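Equation (10) itself is not reproduced in this excerpt, but the total variational distance between two discrete distributions is standardly defined as half the L1 distance, which the sketch below assumes.

```python
import numpy as np

def total_variational_distance(p, q):
    """TVD between two multinomial distributions over the same term space
    (standard definition; Equation (10) is not reproduced in this excerpt)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())

topic   = [0.5, 0.3, 0.1, 0.1]     # estimated topic (toy values)
concept = [0.4, 0.4, 0.1, 0.1]     # gold concept turned into a word distribution
print(total_variational_distance(topic, concept))   # 0.0 = identical, 1.0 = disjoint
```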
86 The estimated topics are already represented as multinomial probability distributions over the term space of the data set, while the concepts of the gold ontology are also transformed into multinomial probability distributions over the same term space. [sent-517, score-1.055]
87 Thus, the comparison between topics and gold concepts becomes straightforward. [sent-518, score-0.698]
88 The matching scheme compares the distributional representations of topics and gold concepts and finds the best matching, in the sense that the most similar word distributions between the two hierarchies will be matched. [sent-519, score-0.801]
89 Thus, for a matching i, of a topic T in the learned ontology and a concept C in the gold ontology, PCPi is defined as the number of topics in the cotopy set of T matched to concepts in the cotopy set of C, divided by the number of topics participating in the cotopy set of T . [sent-524, score-1.913]
90 For the same matching i, PCRi is defined as the number of topics in the cotopy set of T matched to concepts in the cotopy set of C, divided by the number of topics participating in the cotopy set of C. [sent-525, score-1.304]
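A hedged sketch of how PCP and PCR could be computed for one matching is given below; the cotopy sets and the topic-to-concept matching are toy inputs, and the helper name is hypothetical.

```python
def pcp_pcr(cotopy_topic, cotopy_concept, matched_to):
    """PCP/PCR for one matching: matched_to maps learned topics to the gold
    concepts they were matched with (cotopy sets as defined in the text)."""
    hits = sum(1 for t in cotopy_topic if matched_to.get(t) in cotopy_concept)
    pcp = hits / len(cotopy_topic)      # normalised by the size of the topic's cotopy set
    pcr = hits / len(cotopy_concept)    # normalised by the size of the concept's cotopy set
    return pcp, pcr

# Toy example: topic T's cotopy has 3 topics, concept C's cotopy has 4 concepts.
cotopy_T = {"t_root", "t_mid", "t_leaf"}
cotopy_C = {"c_root", "c_mid", "c_leaf", "c_extra"}
matched_to = {"t_root": "c_root", "t_mid": "c_mid", "t_leaf": "c_other"}
print(pcp_pcr(cotopy_T, cotopy_C, matched_to))
```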
91 Figure 10 depicts a part of the gold ontology on the left and a part of the estimated hierarchy on the right. [sent-528, score-0.61]
92 The labels on the latent topics of the learned hierarchy correspond to the best TVD match of each topic with a gold concept. [sent-529, score-1.253]
93 The labels on the topics of the learned hierarchy correspond to the best match of each topic to a gold concept, according to TVD. [sent-533, score-1.193]
94 Regarding the estimated hierarchy, it comprises 38 topics in total, while the gold ontology comprises 43. [sent-534, score-0.993]
95 Recall from Section 3 that the method estimates a probability distribution for each topic over all topics of the next level. [sent-535, score-0.797]
96 In addition, the way the hierarchy is estimated, through Gibbs sampling, infers the probability distributions, based on the assignments of words to topics and topics to subtopics. [sent-538, score-1.383]
97 In the general case though, the learned hierarchy is expected to have more edges than the gold ontology has. [sent-543, score-0.611]
98 An important contribution of this paper is the inference of the correct number of topics at each level of the hierarchy, as well as the depth of the hierarchy. [sent-593, score-0.616]
99 Moreover, hHDP does not impose restrictions or constraints on the usage of topics, allowing multiple inheritance between topics of different layers and modeling the internal nodes as distributions over both subtopics and words. [sent-596, score-0.724]
100 Finally, we have concluded that such methods are suitable for the task of ontology learning, since they are able to discover topics and arrange them hierarchically, in a way that is independent of the language and the domain of the data set, and without requiring any prior knowledge of the domain. [sent-600, score-0.757]
wordName wordTfidf (topN-words)
[('topics', 0.542), ('hhdp', 0.273), ('hlda', 0.273), ('hierarchy', 0.257), ('topic', 0.255), ('hvhdp', 0.219), ('ontology', 0.215), ('documents', 0.168), ('hdp', 0.158), ('lda', 0.12), ('perplexity', 0.119), ('gold', 0.116), ('document', 0.116), ('paliouras', 0.113), ('pam', 0.096), ('genia', 0.093), ('hthdp', 0.093), ('subtopics', 0.093), ('vouros', 0.093), ('avitsanos', 0.086), ('ierarchies', 0.086), ('opic', 0.086), ('seafood', 0.08), ('tz', 0.08), ('leaf', 0.074), ('dirichlet', 0.074), ('stimation', 0.061), ('hierarchical', 0.061), ('cotopy', 0.06), ('elegance', 0.06), ('latent', 0.06), ('planet', 0.053), ('plsa', 0.053), ('ptms', 0.053), ('ji', 0.053), ('word', 0.049), ('comprises', 0.049), ('hpam', 0.047), ('zavitsanos', 0.047), ('inferred', 0.046), ('lonely', 0.045), ('dish', 0.045), ('restaurant', 0.045), ('internal', 0.044), ('words', 0.042), ('concepts', 0.04), ('mimno', 0.04), ('tvd', 0.04), ('vocabulary', 0.039), ('corpus', 0.039), ('blei', 0.039), ('depth', 0.038), ('level', 0.036), ('di', 0.036), ('multinomial', 0.035), ('infer', 0.034), ('tourism', 0.033), ('branching', 0.033), ('text', 0.031), ('sampling', 0.03), ('dp', 0.03), ('levels', 0.029), ('hierarchies', 0.029), ('regarding', 0.028), ('pachinko', 0.028), ('allocation', 0.027), ('lonelyplanet', 0.027), ('maedche', 0.027), ('quantitative', 0.026), ('tables', 0.026), ('relations', 0.026), ('distributions', 0.025), ('sharing', 0.024), ('learned', 0.023), ('nips', 0.023), ('gibbs', 0.023), ('conferences', 0.023), ('ontologies', 0.023), ('discovered', 0.023), ('mm', 0.022), ('estimated', 0.022), ('evaluation', 0.021), ('semantic', 0.021), ('clustering', 0.021), ('mixture', 0.02), ('inheritance', 0.02), ('restaurants', 0.02), ('aegean', 0.02), ('andrieu', 0.02), ('ctm', 0.02), ('demokritos', 0.02), ('emulates', 0.02), ('hdps', 0.02), ('znew', 0.02), ('models', 0.019), ('root', 0.019), ('serve', 0.019), ('dps', 0.019), ('frequencies', 0.019), ('mccallum', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
2 0.098160103 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
3 0.083535589 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
4 0.06551037 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
5 0.058407199 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama
Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel
6 0.054153483 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
7 0.052456122 54 jmlr-2011-Learning Latent Tree Graphical Models
8 0.045638502 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
9 0.042737979 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
10 0.040465228 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
11 0.038514648 61 jmlr-2011-Logistic Stick-Breaking Process
12 0.034671575 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
13 0.032061353 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
14 0.030799676 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
15 0.022611693 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
16 0.020852722 16 jmlr-2011-Clustering Algorithms for Chains
17 0.020618135 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
18 0.020403286 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
19 0.020252332 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
20 0.019994119 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
topicId topicWeight
[(0, 0.127), (1, -0.097), (2, -0.015), (3, 0.033), (4, -0.149), (5, 0.013), (6, 0.062), (7, 0.17), (8, 0.039), (9, -0.014), (10, -0.026), (11, 0.136), (12, -0.065), (13, -0.016), (14, -0.051), (15, 0.009), (16, -0.071), (17, -0.046), (18, 0.063), (19, -0.052), (20, -0.089), (21, 0.107), (22, -0.071), (23, 0.041), (24, -0.087), (25, 0.143), (26, -0.041), (27, 0.122), (28, 0.096), (29, 0.092), (30, -0.113), (31, 0.075), (32, -0.021), (33, -0.21), (34, 0.122), (35, -0.074), (36, 0.076), (37, -0.1), (38, 0.045), (39, -0.185), (40, -0.043), (41, 0.002), (42, -0.024), (43, 0.139), (44, -0.137), (45, 0.194), (46, -0.151), (47, -0.07), (48, -0.119), (49, -0.113)]
simIndex simValue paperId paperTitle
same-paper 1 0.97145861 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
2 0.50849277 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
Author: Huixin Wang, Xiaotong Shen, Wei Pan
Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning
3 0.36775452 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
4 0.35478878 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama
Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel
5 0.28590482 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
6 0.2594662 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
7 0.24131879 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
8 0.22204539 54 jmlr-2011-Learning Latent Tree Graphical Models
9 0.21416885 61 jmlr-2011-Logistic Stick-Breaking Process
10 0.21053998 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
11 0.1937529 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
12 0.17882822 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
13 0.17836326 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
14 0.17191881 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
15 0.15059085 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
16 0.14752275 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
17 0.14364545 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms
18 0.14147009 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
19 0.13406929 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
20 0.13376792 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
topicId topicWeight
[(4, 0.041), (9, 0.028), (10, 0.028), (24, 0.056), (31, 0.07), (32, 0.022), (41, 0.018), (60, 0.013), (73, 0.024), (78, 0.04), (90, 0.556)]
simIndex simValue paperId paperTitle
1 0.90413022 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt
Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm c 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt. S HERVASHIDZE , S CHWEITZER , VAN L EEUWEN , M EHLHORN AND B ORGWARDT
2 0.89018667 65 jmlr-2011-Models of Cooperative Teaching and Learning
Author: Sandra Zilles, Steffen Lange, Robert Holte, Martin Zinkevich
Abstract: While most supervised machine learning models assume that training examples are sampled at random or adversarially, this article is concerned with models of learning from a cooperative teacher that selects “helpful” training examples. The number of training examples a learner needs for identifying a concept in a given class C of possible target concepts (sample complexity of C) is lower in models assuming such teachers, that is, “helpful” examples can speed up the learning process. The problem of how a teacher and a learner can cooperate in order to reduce the sample complexity, yet without using “coding tricks”, has been widely addressed. Nevertheless, the resulting teaching and learning protocols do not seem to make the teacher select intuitively “helpful” examples. The two models introduced in this paper are built on what we call subset teaching sets and recursive teaching sets. They extend previous models of teaching by letting both the teacher and the learner exploit knowing that the partner is cooperative. For this purpose, we introduce a new notion of “coding trick”/“collusion”. We show how both resulting sample complexity measures (the subset teaching dimension and the recursive teaching dimension) can be arbitrarily lower than the classic teaching dimension and known variants thereof, without using coding tricks. For instance, monomials can be taught with only two examples independent of the number of variables. The subset teaching dimension turns out to be nonmonotonic with respect to subclasses of concept classes. We discuss why this nonmonotonicity might be inherent in many interesting cooperative teaching and learning scenarios. Keywords: teaching dimension, learning Boolean functions, interactive learning, collusion c 2011 Sandra Zilles, Steffen Lange, Robert Holte and Martin Zinkevich. Z ILLES , L ANGE , H OLTE AND Z INKEVICH
3 0.87720019 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
Author: Yizhao Ni, Craig Saunders, Sandor Szedmak, Mahesan Niranjan
Abstract: We propose a distance phrase reordering model (DPR) for statistical machine translation (SMT), where the aim is to learn the grammatical rules and context dependent changes using a phrase reordering classification framework. We consider a variety of machine learning techniques, including state-of-the-art structured prediction methods. Techniques are compared and evaluated on a Chinese-English corpus, a language pair known for the high reordering characteristics which cannot be adequately captured with current models. In the reordering classification task, the method significantly outperforms the baseline against which it was tested, and further, when integrated as a component of the state-of-the-art machine translation system, MOSES, it achieves improvement in translation results. Keywords: statistical machine translation (SMT), phrase reordering, lexicalized reordering (LR), maximum entropy (ME), support vector machine (SVM), maximum margin regression (MMR) , max-margin structure learning (MMS)
same-paper 4 0.87377107 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
5 0.35487187 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
6 0.3486166 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
7 0.32462808 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
8 0.31791681 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
9 0.30477598 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
10 0.2912769 16 jmlr-2011-Clustering Algorithms for Chains
11 0.28873137 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.28852588 61 jmlr-2011-Logistic Stick-Breaking Process
13 0.28748932 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
14 0.28523001 48 jmlr-2011-Kernel Analysis of Deep Networks
15 0.28294328 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
16 0.28049237 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
17 0.27960175 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
18 0.2764667 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
20 0.27522597 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data