jmlr jmlr2011 jmlr2011-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, models the training data set accurately, and is able to generalize to held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
Reference: text
sentIndex sentText sentNum sentScore
1 The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. [sent-8, score-0.635]
2 hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. [sent-9, score-0.642]
3 Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy 1. [sent-11, score-1.298]
4 Introduction. In this paper we address the problem of modeling the content of a given document collection as a hierarchy of latent topics, given no prior knowledge. [sent-12, score-0.975]
5 These topics represent and capture facets of content meaning, by means of multinomial probability distributions over the words of the term space of the documents. [sent-13, score-0.644]
6 The assignment of documents to latent topics without any pre-classification is a powerful text mining technique, useful, among other tasks, for ontology learning from text and document indexing. [sent-14, score-1.163]
7 Much work on PTMs focuses on a flat clustering of the term space into topics, while the creation of a hierarchical structure of topics without user involvement or pre-defined parameters still remains a challenging task. [sent-22, score-0.624]
8 The goal of discovering a topic hierarchy that comprises levels of topic abstractions is different from conventional hierarchical clustering. [sent-23, score-0.906]
9 The hierarchical modeling of topics allows the construction of more accurate and predictive models than the ones constructed by flat models. [sent-33, score-0.622]
10 Assuming that the higher levels of the hierarchy capture generic topics of a particular domain, while lower-level ones focus on particular aspects of that domain, it is expected that a hierarchical probabilistic topic model would be able to “explain” or could have generated the data set. [sent-37, score-1.144]
11 In particular, we aim to infer a hierarchy of topics and subtopics, such that each topic is more general than its subtopics, in the sense that if a document can be indexed by any of the subtopics it should also be indexed by the topic itself. [sent-43, score-1.552]
12 Moreover, we aim to infer the hierarchy without making any assumption either about the number of topics at any level of the hierarchy, or about the height of the hierarchy. [sent-44, score-0.869]
13 In addition to the basic model, we also present a variant that produces a topic hierarchy, by modeling the vocabulary only at the leaf level and considering topics in the inner levels to be multinomial distributions over subtopics. [sent-48, score-1.035]
14 Documents are assumed to be mixtures of topics and topics are probability distributions over the words of some vocabulary. [sent-55, score-1.151]
15 Based on this assumption of exchangeability, the meaning of documents does not depend on the specific sequence of the words, that is, the syntax, but rather on their “ability” to express specific topics either in isolation or in mixture. [sent-60, score-0.71]
16 We will refer to the distribution of topics as θK , indicating the dimensionality K of the distribution, and finally, φV will stand for the distribution of the words of the vocabulary V . [sent-77, score-0.623]
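To make the notation concrete, here is a minimal sketch (not from the paper) of how θK and φV can be represented numerically; the toy vocabulary, the number of topics, and all probability values are assumptions chosen purely for illustration.

```python
import numpy as np

# Toy vocabulary and K = 2 latent topics; all numbers are illustrative.
vocabulary = ["gene", "protein", "travel", "hotel"]

# phi_V: each topic is a multinomial distribution over the words of the vocabulary.
phi = np.array([
    [0.45, 0.45, 0.05, 0.05],   # a "biology"-like topic
    [0.05, 0.05, 0.50, 0.40],   # a "tourism"-like topic
])

# theta_K: a document-specific distribution over the K topics.
theta = np.array([0.7, 0.3])

# The word distribution implied for this document is a mixture of its topics.
p_word_given_doc = theta @ phi
print(dict(zip(vocabulary, p_word_given_doc.round(3))))
```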
17 Following the principles of PTMs, the generative model of probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) specifies a simple generative rule for the words in a document di , according to which, each word of a training document di comes from a randomly chosen topic ti . [sent-78, score-0.731]
18 The topics are drawn from a document-specific distribution over topics θK, and there exists one such distribution for each di. [sent-79, score-1.12]
19 The generative model of LDA, being a probabilistic model of a corpus, represents each di as a random mixture over latent topics T. [sent-85, score-0.658]
20 The estimated topics are represented as multinomial probability distributions over the terms of the documents, while each di is represented as a Dirichlet random variable θ, the dimensionality of which is predefined and equal to the number of estimated latent topics. [sent-87, score-0.742]
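The LDA generative story summarized above can be sketched in a few lines; the sizes and the symmetric Dirichlet hyper-parameters below are arbitrary illustrative choices, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, doc_length = 3, 8, 20       # number of topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1            # symmetric Dirichlet hyper-parameters (assumed values)

# Topics: K multinomial distributions over the V vocabulary terms.
phi = rng.dirichlet(np.full(V, beta), size=K)

# For a document d_i: a Dirichlet random variable theta of fixed dimensionality K.
theta = rng.dirichlet(np.full(K, alpha))

# Each word: choose a topic z from theta, then a word from the chosen topic.
words = []
for _ in range(doc_length):
    z = rng.choice(K, p=theta)
    words.append(rng.choice(V, p=phi[z]))
print(words)
```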
21 A question that usually arises when using models like LDA is how many topics the estimated model should have, given the document collection. [sent-90, score-0.699]
22 In such a Bayesian hierarchy, the root DP uses the Dirichlet distribution of the topics as a base distribution and each document samples from it. [sent-94, score-0.677]
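A minimal way to see how a Dirichlet process sidesteps fixing the number of topics is the Chinese restaurant process view of DP sampling; the sketch below is generic (the concentration value is an assumption) and is not the paper's full HDP construction.

```python
import numpy as np

def chinese_restaurant_process(n_customers, gamma, rng):
    """Sample table (topic) assignments; the number of tables is not fixed in advance."""
    tables = []                       # number of customers seated at each table
    assignments = []
    for _ in range(n_customers):
        probs = np.array(tables + [gamma], dtype=float)
        probs /= probs.sum()
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):     # open a new table, i.e., create a new topic
            tables.append(1)
        else:
            tables[choice] += 1
        assignments.append(choice)
    return assignments, len(tables)

rng = np.random.default_rng(1)
_, n_topics = chinese_restaurant_process(n_customers=200, gamma=2.0, rng=rng)
print("number of topics created by the process:", n_topics)
```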
23 Although LDA is a true generative probabilistic model for documents and HDP is a convenient mechanism for inferring the number of topics, relations of any type or correlations between the estimated topics are not taken into account. [sent-95, score-0.758]
24 In fact, a flat and soft clustering of the term space of the documents into topics is provided. [sent-96, score-0.731]
25 Thus, there is a need for hierarchical models that are able to capture relations between the latent topics in order to represent common shared structure, as explained in Section 1. [sent-97, score-0.708]
26 In a typical hierarchy, documents are assigned to classes at the leaves of the hierarchy, while words are sampled from topics which occupy non-leaf nodes of the hierarchy. [sent-107, score-0.752]
27 Another approach to capturing relations between topics is the correlated topic model (CTM) (Blei and Lafferty, 2006), an extension of LDA. [sent-112, score-0.842]
28 Correlations are introduced by topics that appear in the same context, in the sense that they appear together in documents (or parts of documents). [sent-115, score-0.71]
29 PAM connects the words of the vocabulary V and topics T on a DAG, where topics occupy the interior nodes and the leaves are words. [sent-121, score-1.165]
30 The four-level PAM, which is presented in Li and McCallum (2006), is able to model a text collection through a three-level hierarchy of topics with arbitrary connections between them. [sent-123, score-0.83]
31 Hierarchical LDA (hLDA) (Blei et al., 2004) was the first attempt to represent the distribution of topics as a tree structure, while at the same time providing uncertainty over the branching factor at each level of the tree. [sent-126, score-0.611]
32 In hLDA, each document is modeled as a mixture of L topics defined by θL proportions along a path from the root topic to a leaf. [sent-127, score-0.952]
33 Therefore, each document di is generated by the topics along a single path of this tree. [sent-128, score-0.694]
34 Thus, both internal and leaf topics generate words for new documents. [sent-132, score-0.702]
35 In the second one, this distribution does not exist, but the Dirichlet distribution of each internal node has one extra “exit” dimension, which corresponds to the event that a word is produced directly by the internal node, without reaching the leaf topics of the DAG. [sent-140, score-0.753]
36 While the models belonging to the PAM family provide a powerful means to describe inter-topic correlations, they have the same practical difficulty as many other topic models in determining the number of topics at the internal levels. [sent-146, score-0.879]
37 Each topic ti is modeled by a Dirichlet process and the Dirichlet processes at each level are further organized into a hierarchical Dirichlet process (HDP), which is used to estimate the number of topics at this level. [sent-150, score-0.894]
38 During the generation of a document, after sampling the multinomial distributions over topics from the corresponding HDPs, a topic path is sampled repeatedly according to the multinomials for each word in the document di . [sent-152, score-1.088]
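A hedged sketch of only this path-sampling step is shown below; the hierarchy sizes and the multinomials over subtopics are made-up placeholders standing in for what the HDPs would actually provide.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed hierarchy sizes: 2 root topics, 4 mid-level topics, 6 leaf topics.
level_sizes = [2, 4, 6]

# Multinomials over subtopics at each level (each row sums to 1); illustrative only.
root_probs = rng.dirichlet(np.ones(level_sizes[0]))
child_probs = [rng.dirichlet(np.ones(level_sizes[l + 1]), size=level_sizes[l])
               for l in range(len(level_sizes) - 1)]

def sample_topic_path():
    """Sample one root-to-leaf path of topics, one step per level."""
    path = [rng.choice(level_sizes[0], p=root_probs)]
    for level, probs in enumerate(child_probs):
        path.append(rng.choice(level_sizes[level + 1], p=probs[path[-1]]))
    return [int(t) for t in path]

# One path is drawn for every word of the document.
doc_length = 5
print([sample_topic_path() for _ in range(doc_length)])
```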
39 Representing all topics as multinomial distributions over words is more appealing than representing only the leaf topics. [sent-154, score-0.718]
40 The approach of Zavitsanos et al. (2008) uses the LDA model iteratively to produce layers of topics and then establishes hierarchical relations between them, based on conditional independences, given candidate parent topics. [sent-156, score-0.629]
41 The branching factor at each level is decided by the number of discovered relations, since topics that are not connected to others are disregarded. [sent-157, score-0.634]
42 The issue of the depth of the hierarchy is addressed in that work by measuring the similarity of the newly generated topics to the existing ones. [sent-158, score-0.837]
43 In summary, some topic models support a latent hierarchy of topics, but allow the generation of words only at the leaf level. [sent-160, score-0.707]
44 In addition, in contrast to the simple LDA, in the case of hLDA, documents can only access the topics that lie across a single path in the learned tree. [sent-163, score-0.733]
45 Finally, parameters such as the number of topics or the number of levels need to be estimated using cross-validation, which is not efficient even for non-hierarchical topic models like LDA. [sent-169, score-0.867]
46 The majority of the work reviewed in this section assesses the inferred hierarchy on the basis of how “meaningful” the latent topics are to humans. [sent-172, score-0.905]
47 The first variant results in a hierarchy whose internal nodes are represented as probability distributions over topics and over words. [sent-191, score-0.868]
48 There are as many DPs (Gj) as there are documents at each level, each connected to all topics of that level. [sent-205, score-0.71]
49 In addition, it allows the topics at each level to be shared among the documents of the collection. [sent-210, score-0.746]
50 hvHDP consists of topics that are both distributions over subtopics and over words. [sent-215, score-0.66]
51 htHDP represents only leaf topics as distributions over words. [sent-216, score-0.641]
52 On the other hand, while hPAM needs the number of internal topics to be fixed a priori, hLDA and hHDP are able to infer the number of topics at each level of the hierarchy, due to their non-parametric Bayesian nature. [sent-222, score-1.198]
53 Moreover, while the model of hLDA requires that each document be composed of topics along a single path of the hierarchy, hPAM and hHDP provide much more flexibility, since topics can be shared among super-topics. [sent-223, score-1.2]
54 The topics of the PAM models generate words at the leaf level and the models are based on a fixed three-level hierarchy. [sent-226, score-0.732]
55 The sampling scheme of hHDP estimates both the number of topics at each level and the number of levels of the learned hierarchy. [sent-236, score-0.66]
56 As shown in Figure 5, starting at the leaf level, we use HDP to infer the number of leaf topics as if no hierarchy is to be built. [sent-237, score-0.981]
57 Figure 5: Bottom-up probabilistic estimation of the topic hierarchy: Starting with a corpus of M documents, the leaf topics are inferred first. [sent-240, score-0.956]
58 In hvHDP, where topics are both distributions over subtopics and over words, the inference of the non-leaf levels treats topics, instead of documents, as restaurants. [sent-248, score-0.689]
59 Having inferred the topics at the leaf level, we know the mixture proportions that the documents of the collection follow. [sent-251, score-0.85]
60 In other words, at the leaf level we allocate documents to leaf topics, while at the intermediate levels we allocate topics to super-topics. [sent-255, score-0.923]
61 Therefore, the main contribution of this sampling scheme is the estimation of the non-leaf topics from “artificial” documents that correspond to estimated topics of lower levels. [sent-257, score-1.304]
62 Together with the use of the HDP for the estimation of the number of topics at each level, it makes the estimation of the topic hierarchy completely non-parametric. [sent-260, score-1.054]
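The following sketch illustrates one way such "artificial" documents could be built from estimated topic-term counts before re-running HDP one level up; the counts and the integer-rounding convention are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

# Assumed output of the leaf-level inference: topic-term counts
# (rows = leaf topics, columns = vocabulary terms); values are illustrative.
topic_term_counts = np.array([
    [12, 30,  3,  0,  5],
    [ 1,  2, 25, 18,  4],
    [ 0,  6,  7, 22, 15],
])

def topics_to_artificial_documents(counts):
    """Treat each estimated topic as a bag-of-words 'document' for the next level up."""
    docs = []
    for topic_row in counts:
        words = []
        for term_id, freq in enumerate(topic_row):
            words.extend([term_id] * int(freq))
        docs.append(words)
    return docs

artificial_docs = topics_to_artificial_documents(topic_term_counts)
# These pseudo-documents would then be fed to HDP again to infer the super-topics.
print([len(d) for d in artificial_docs])
```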
63 Regarding the second variant of the model (htHDP), where the internal topics are distributions only over subtopics and not words, the inference procedure differs in the modeling of non-leaf topics. [sent-261, score-0.704]
64 As observations for the inference of the next level up, we use the distributions of topics at the lower level over the original documents. [sent-263, score-0.639]
65 Therefore, while in the first variant of hHDP we had a topic-term matrix of frequencies as input for the estimation of an intermediate level of the hierarchy, in htHDP we have a document-topic matrix of frequencies for the sampling procedure. [sent-264, score-0.73]
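As a small illustration of the difference between the two variants' inputs, the toy matrices below contrast a topic-term frequency matrix (hvHDP) with a document-topic frequency matrix (htHDP); all values are invented.

```python
import numpy as np

# hvHDP: input for an intermediate level is a topic-term frequency matrix
# (leaf topics described by the words assigned to them); toy values.
topic_term = np.array([
    [12, 30,  3,  0],
    [ 1,  2, 25, 18],
])

# htHDP: input is instead a document-topic frequency matrix
# (how often each leaf topic was used in each original document); toy values.
document_topic = np.array([
    [9, 1],
    [2, 8],
    [5, 5],
])

# In both cases the rows play the role of "documents" for the next HDP round;
# only the meaning of the columns (terms vs. leaf topics) differs.
print(topic_term.shape, document_topic.shape)
```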
66 A topic hierarchy is derived from the corpus and a non-parametric Bayesian hierarchy is used at each level of the topic hierarchy. [sent-270, score-1.099]
67 The first hHDP variant satisfies the criteria that we set in Section 2: internal topics are represented as distributions over words and over subtopics, topics can share subtopics at the lower level in the hierarchy, and topics across any level of the hierarchy are shared among documents. [sent-271, score-2.159]
68 The degree of sharing topics across documents is expressed through the inferred parameters of the model, and this sharing of topics reflects the sharing of common terminology between documents. [sent-272, score-1.37]
69 In addition, n·iz is the number of occurrences of word i in topic z, n··z is the total number of words assigned to topic z, and finally, H and V are the prior DP hyper-parameter for word distributions and the total number of words, respectively. [sent-290, score-0.717]
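A sketch of the word-likelihood factor implied by these counts is given below, in the standard collapsed-Gibbs form (n·iz + H)/(n··z + V·H); the topic-side factor of the full conditional is omitted, and all counts are toy values.

```python
import numpy as np

def word_likelihood(word_id, topic_id, n_word_topic, n_topic_total, H, V):
    """P(word | topic) factor of a collapsed Gibbs conditional:
    (n_{.iz} + H) / (n_{..z} + V*H), using the counts defined in the text."""
    return (n_word_topic[word_id, topic_id] + H) / (n_topic_total[topic_id] + V * H)

# Toy counts: 4 words, 2 topics (illustrative values only).
V, K, H = 4, 2, 0.5
n_word_topic = np.array([[5, 0],
                         [3, 1],
                         [0, 6],
                         [1, 2]], dtype=float)    # n_{.iz}
n_topic_total = n_word_topic.sum(axis=0)          # n_{..z}

print(word_likelihood(word_id=2, topic_id=1, n_word_topic=n_word_topic,
                      n_topic_total=n_topic_total, H=H, V=V))
```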
70 In addition, in hvHDP, at each level of the hierarchy we transform the inferred topics to documents. [sent-302, score-0.881]
71 1 Document Modeling. Given a document collection, the task is to retrieve the latent hierarchy of topics that represents the data set and fits it well, in terms of perplexity. [sent-324, score-0.975]
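For reference, held-out perplexity is typically computed as the exponential of the negative average per-word log-likelihood (lower is better); the sketch below uses invented per-word probabilities rather than anything reported in the paper.

```python
import numpy as np

def perplexity(per_word_log_probs):
    """exp(-average log-likelihood per held-out word)."""
    return float(np.exp(-np.mean(per_word_log_probs)))

# Toy log-probabilities of held-out words under a fitted model (illustrative).
log_probs = np.log([0.01, 0.002, 0.05, 0.008, 0.02])
print(round(perplexity(log_probs), 1))
```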
72 Taking into account the context of the NIPS conferences, we believe that we have discovered a rather realistic hierarchical structure of 54 topics that fits the field in question well. [sent-355, score-0.626]
73 On the Genia data set (Figure 8a), the lowest perplexity is achieved by hvHDP, while hLDA approaches the same perplexity when the number of topics is around 60. [sent-431, score-0.78]
74 For this data set, the statistical test showed that hvHDP is better than hLDA in the ranges of 10-30 and 100-120 topics and better than LDA over the whole range of topics, although for a certain range both models achieve similar perplexity values. [sent-449, score-0.68]
75 Perhaps this can be attributed to the difficulty of identifying sufficiently different topics at various levels of abstraction when requesting a large depth for the hierarchy. [sent-461, score-0.609]
76 The Wilcoxon statistical tests have indicated that in the Elegance data set and for a number of topics around 80, hLDA does not perform significantly better than LDA, while in the NIPS data set, the same situation holds for a number of topics between 50 and 100. [sent-464, score-1.084]
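A paired Wilcoxon signed-rank test of the kind referred to here can be run with SciPy as sketched below; the per-run perplexity values are made up solely to show the call.

```python
from scipy.stats import wilcoxon

# Made-up per-run perplexities of two models on the same splits (illustrative only).
perplexity_model_a = [612, 598, 640, 605, 623, 610, 631, 602]
perplexity_model_b = [650, 641, 655, 638, 660, 645, 652, 640]

# Paired signed-rank test on the per-run differences.
statistic, p_value = wilcoxon(perplexity_model_a, perplexity_model_b)
print(f"W = {statistic}, p = {p_value:.4f}")
```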
77 Therefore, the experiment has confirmed the value of estimating the number of topics and the depth of the hierarchy in a completely non-parametric way. [sent-476, score-0.837]
78 From this comparison we concluded that all the topics of the estimated hierarchy have been correctly inferred. [sent-483, score-0.821]
79 The vocabulary clustering version of hHDP (hvHDP) estimates topics that are defined as distributions over words. [sent-487, score-0.627]
80 Thus, we merge the aforementioned two steps into one, and we assume that the estimated latent topics correspond to ontology concepts. [sent-497, score-0.839]
81 Therefore, in this task, we construct a topic ontology from scratch that comprises only hierarchical relations, given a collection of text documents, and we compare it to a given gold-standard ontology. [sent-498, score-0.895]
82 In particular, the evaluation method first transforms the concepts of the gold ontology into probability distributions over the terms of the data set, taking into account the context of each ontology concept. [sent-510, score-0.632]
83 In a second step, the gold ontology is matched to the learned hierarchy, based on how “close” the gold concepts and the learned topics are. [sent-511, score-1.075]
84 Precision, recall, and the F-measure are defined as
P = \frac{1}{M} \sum_{i=1}^{M} (1 - SD_i)\, PCP_i   (7)
R = \frac{1}{M} \sum_{i=1}^{M} (1 - SD_i)\, PCR_i   (8)
F = \frac{(\beta^2 + 1)\, P \cdot R}{\beta^2 R + P}   (9)
In Equations (7)-(9), M is the number of matchings between learned topics and gold concepts, and SD is a distance measure between concepts, ranging in [0, 1]. [sent-514, score-0.721]
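Equations (7)-(9) translate directly into code; the matchings below (distance SD, cotopy precision PCP, cotopy recall PCR) are toy values.

```python
def precision_recall_f(matchings, beta=1.0):
    """matchings: list of (SD_i, PCP_i, PCR_i) tuples, one per matched topic/concept pair."""
    M = len(matchings)
    P = sum((1 - sd) * pcp for sd, pcp, _ in matchings) / M      # Equation (7)
    R = sum((1 - sd) * pcr for sd, _, pcr in matchings) / M      # Equation (8)
    F = (beta ** 2 + 1) * P * R / (beta ** 2 * R + P)            # Equation (9)
    return P, R, F

# Toy matchings (SD, PCP, PCR); illustrative values only.
toy_matchings = [(0.1, 0.8, 0.7), (0.3, 0.6, 0.9), (0.2, 0.9, 0.8)]
print(precision_recall_f(toy_matchings))
```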
85 Specifically, the total variational distance (TVD) (Gibbs and Su, 2002) of Equation (10) was used to assess the similarity between topics and gold concepts. [sent-515, score-0.658]
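Equation (10) itself is not reproduced in this excerpt, but the total variational distance between two discrete distributions is standardly defined as half the L1 distance, which the sketch below assumes.

```python
import numpy as np

def total_variational_distance(p, q):
    """TVD between two multinomial distributions over the same term space
    (standard definition; Equation (10) is not reproduced in this excerpt)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())

topic   = [0.5, 0.3, 0.1, 0.1]     # estimated topic (toy values)
concept = [0.4, 0.4, 0.1, 0.1]     # gold concept turned into a word distribution
print(total_variational_distance(topic, concept))   # 0.0 = identical, 1.0 = disjoint
```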
86 The estimated topics are already represented as multinomial probability distributions over the term space of the data set, while the concepts of the gold ontology are also transformed into multinomial probability distributions over the same term space. [sent-517, score-1.055]
87 Thus, the comparison between topics and gold concepts becomes straightforward. [sent-518, score-0.698]
88 The matching scheme compares the distributional representations of topics and gold concepts and finds the best matching, in the sense that the most similar word distributions between the two hierarchies will be matched. [sent-519, score-0.801]
89 Thus, for a matching i, of a topic T in the learned ontology and a concept C in the gold ontology, PCPi is defined as the number of topics in the cotopy set of T matched to concepts in the cotopy set of C, divided by the number of topics participating in the cotopy set of T . [sent-524, score-1.913]
90 For the same matching i, PCRi is defined as the number of topics in the cotopy set of T matched to concepts in the cotopy set of C, divided by the number of topics participating in the cotopy set of C. [sent-525, score-1.304]
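A hedged sketch of how PCP and PCR could be computed for one matching is given below; the cotopy sets and the topic-to-concept matching are toy inputs, and the helper name is hypothetical.

```python
def pcp_pcr(cotopy_topic, cotopy_concept, matched_to):
    """PCP/PCR for one matching: matched_to maps learned topics to the gold
    concepts they were matched with (cotopy sets as defined in the text)."""
    hits = sum(1 for t in cotopy_topic if matched_to.get(t) in cotopy_concept)
    pcp = hits / len(cotopy_topic)      # normalised by the size of the topic's cotopy set
    pcr = hits / len(cotopy_concept)    # normalised by the size of the concept's cotopy set
    return pcp, pcr

# Toy example: topic T's cotopy has 3 topics, concept C's cotopy has 4 concepts.
cotopy_T = {"t_root", "t_mid", "t_leaf"}
cotopy_C = {"c_root", "c_mid", "c_leaf", "c_extra"}
matched_to = {"t_root": "c_root", "t_mid": "c_mid", "t_leaf": "c_other"}
print(pcp_pcr(cotopy_T, cotopy_C, matched_to))
```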
91 Figure 10 depicts a part of the gold ontology on the left and a part of the estimated hierarchy on the right. [sent-528, score-0.61]
92 The labels on the latent topics of the learned hierarchy correspond to the best TVD match of each topic with a gold concept. [sent-529, score-1.253]
93 The labels on the topics of the learned hierarchy correspond to the best match of each topic to a gold concept, according to TVD. [sent-533, score-1.193]
94 Regarding the estimated hierarchy, it comprises 38 topics in total, while the gold ontology comprises 43. [sent-534, score-0.993]
95 Recall from Section 3 that the method estimates a probability distribution for each topic over all topics of the next level. [sent-535, score-0.797]
96 In addition, the way the hierarchy is estimated, through Gibbs sampling, infers the probability distributions, based on the assignments of words to topics and topics to subtopics. [sent-538, score-1.383]
97 In the general case though, the learned hierarchy is expected to have more edges than the gold ontology has. [sent-543, score-0.611]
98 An important contribution of this paper is the inference of the correct number of topics at each level of the hierarchy, as well as the depth of the hierarchy. [sent-593, score-0.616]
99 Moreover, hHDP does not impose restrictions or constraints on the usage of topics, allowing multiple inheritance between topics of different layers and modeling the internal nodes as distributions over both subtopics and words. [sent-596, score-0.724]
100 Finally, we have concluded that such methods are suitable for the task of ontology learning, since they are able to discover topics and arrange them hierarchically, in a way that is independent of the language and the domain of the data set, and without requiring any prior knowledge of the domain. [sent-600, score-0.757]
wordName wordTfidf (topN-words)
[('topics', 0.542), ('hhdp', 0.273), ('hlda', 0.273), ('hierarchy', 0.257), ('topic', 0.255), ('hvhdp', 0.219), ('ontology', 0.215), ('documents', 0.168), ('hdp', 0.158), ('lda', 0.12), ('perplexity', 0.119), ('gold', 0.116), ('document', 0.116), ('paliouras', 0.113), ('pam', 0.096), ('genia', 0.093), ('hthdp', 0.093), ('subtopics', 0.093), ('vouros', 0.093), ('avitsanos', 0.086), ('ierarchies', 0.086), ('opic', 0.086), ('seafood', 0.08), ('tz', 0.08), ('leaf', 0.074), ('dirichlet', 0.074), ('stimation', 0.061), ('hierarchical', 0.061), ('cotopy', 0.06), ('elegance', 0.06), ('latent', 0.06), ('planet', 0.053), ('plsa', 0.053), ('ptms', 0.053), ('ji', 0.053), ('word', 0.049), ('comprises', 0.049), ('hpam', 0.047), ('zavitsanos', 0.047), ('inferred', 0.046), ('lonely', 0.045), ('dish', 0.045), ('restaurant', 0.045), ('internal', 0.044), ('words', 0.042), ('concepts', 0.04), ('mimno', 0.04), ('tvd', 0.04), ('vocabulary', 0.039), ('corpus', 0.039), ('blei', 0.039), ('depth', 0.038), ('level', 0.036), ('di', 0.036), ('multinomial', 0.035), ('infer', 0.034), ('tourism', 0.033), ('branching', 0.033), ('text', 0.031), ('sampling', 0.03), ('dp', 0.03), ('levels', 0.029), ('hierarchies', 0.029), ('regarding', 0.028), ('pachinko', 0.028), ('allocation', 0.027), ('lonelyplanet', 0.027), ('maedche', 0.027), ('quantitative', 0.026), ('tables', 0.026), ('relations', 0.026), ('distributions', 0.025), ('sharing', 0.024), ('learned', 0.023), ('nips', 0.023), ('gibbs', 0.023), ('conferences', 0.023), ('ontologies', 0.023), ('discovered', 0.023), ('mm', 0.022), ('estimated', 0.022), ('evaluation', 0.021), ('semantic', 0.021), ('clustering', 0.021), ('mixture', 0.02), ('inheritance', 0.02), ('restaurants', 0.02), ('aegean', 0.02), ('andrieu', 0.02), ('ctm', 0.02), ('demokritos', 0.02), ('emulates', 0.02), ('hdps', 0.02), ('znew', 0.02), ('models', 0.019), ('root', 0.019), ('serve', 0.019), ('dps', 0.019), ('frequencies', 0.019), ('mccallum', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
2 0.098160103 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
3 0.083535589 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
4 0.06551037 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
5 0.058407199 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama
Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel
6 0.054153483 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
7 0.052456122 54 jmlr-2011-Learning Latent Tree Graphical Models
8 0.045638502 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
9 0.042737979 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
10 0.040465228 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
11 0.038514648 61 jmlr-2011-Logistic Stick-Breaking Process
12 0.034671575 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
13 0.032061353 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
14 0.030799676 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
15 0.022611693 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
16 0.020852722 16 jmlr-2011-Clustering Algorithms for Chains
17 0.020618135 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
18 0.020403286 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
19 0.020252332 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
20 0.019994119 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
topicId topicWeight
[(0, 0.127), (1, -0.097), (2, -0.015), (3, 0.033), (4, -0.149), (5, 0.013), (6, 0.062), (7, 0.17), (8, 0.039), (9, -0.014), (10, -0.026), (11, 0.136), (12, -0.065), (13, -0.016), (14, -0.051), (15, 0.009), (16, -0.071), (17, -0.046), (18, 0.063), (19, -0.052), (20, -0.089), (21, 0.107), (22, -0.071), (23, 0.041), (24, -0.087), (25, 0.143), (26, -0.041), (27, 0.122), (28, 0.096), (29, 0.092), (30, -0.113), (31, 0.075), (32, -0.021), (33, -0.21), (34, 0.122), (35, -0.074), (36, 0.076), (37, -0.1), (38, 0.045), (39, -0.185), (40, -0.043), (41, 0.002), (42, -0.024), (43, 0.139), (44, -0.137), (45, 0.194), (46, -0.151), (47, -0.07), (48, -0.119), (49, -0.113)]
simIndex simValue paperId paperTitle
same-paper 1 0.97145861 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
2 0.50849277 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
Author: Huixin Wang, Xiaotong Shen, Wei Pan
Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning
3 0.36775452 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
4 0.35478878 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama
Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel
5 0.28590482 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
6 0.2594662 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
7 0.24131879 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
8 0.22204539 54 jmlr-2011-Learning Latent Tree Graphical Models
9 0.21416885 61 jmlr-2011-Logistic Stick-Breaking Process
10 0.21053998 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
11 0.1937529 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
12 0.17882822 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
13 0.17836326 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
14 0.17191881 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
15 0.15059085 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
16 0.14752275 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
17 0.14364545 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms
18 0.14147009 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
19 0.13406929 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
20 0.13376792 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
topicId topicWeight
[(4, 0.041), (9, 0.028), (10, 0.028), (24, 0.056), (31, 0.07), (32, 0.022), (41, 0.018), (60, 0.013), (73, 0.024), (78, 0.04), (90, 0.556)]
simIndex simValue paperId paperTitle
1 0.90413022 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt
Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm c 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt. S HERVASHIDZE , S CHWEITZER , VAN L EEUWEN , M EHLHORN AND B ORGWARDT
2 0.89018667 65 jmlr-2011-Models of Cooperative Teaching and Learning
Author: Sandra Zilles, Steffen Lange, Robert Holte, Martin Zinkevich
Abstract: While most supervised machine learning models assume that training examples are sampled at random or adversarially, this article is concerned with models of learning from a cooperative teacher that selects “helpful” training examples. The number of training examples a learner needs for identifying a concept in a given class C of possible target concepts (sample complexity of C) is lower in models assuming such teachers, that is, “helpful” examples can speed up the learning process. The problem of how a teacher and a learner can cooperate in order to reduce the sample complexity, yet without using “coding tricks”, has been widely addressed. Nevertheless, the resulting teaching and learning protocols do not seem to make the teacher select intuitively “helpful” examples. The two models introduced in this paper are built on what we call subset teaching sets and recursive teaching sets. They extend previous models of teaching by letting both the teacher and the learner exploit knowing that the partner is cooperative. For this purpose, we introduce a new notion of “coding trick”/“collusion”. We show how both resulting sample complexity measures (the subset teaching dimension and the recursive teaching dimension) can be arbitrarily lower than the classic teaching dimension and known variants thereof, without using coding tricks. For instance, monomials can be taught with only two examples independent of the number of variables. The subset teaching dimension turns out to be nonmonotonic with respect to subclasses of concept classes. We discuss why this nonmonotonicity might be inherent in many interesting cooperative teaching and learning scenarios. Keywords: teaching dimension, learning Boolean functions, interactive learning, collusion c 2011 Sandra Zilles, Steffen Lange, Robert Holte and Martin Zinkevich. Z ILLES , L ANGE , H OLTE AND Z INKEVICH
3 0.87720019 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
Author: Yizhao Ni, Craig Saunders, Sandor Szedmak, Mahesan Niranjan
Abstract: We propose a distance phrase reordering model (DPR) for statistical machine translation (SMT), where the aim is to learn the grammatical rules and context dependent changes using a phrase reordering classification framework. We consider a variety of machine learning techniques, including state-of-the-art structured prediction methods. Techniques are compared and evaluated on a Chinese-English corpus, a language pair known for the high reordering characteristics which cannot be adequately captured with current models. In the reordering classification task, the method significantly outperforms the baseline against which it was tested, and further, when integrated as a component of the state-of-the-art machine translation system, MOSES, it achieves improvement in translation results. Keywords: statistical machine translation (SMT), phrase reordering, lexicalized reordering (LR), maximum entropy (ME), support vector machine (SVM), maximum margin regression (MMR) , max-margin structure learning (MMS)
same-paper 4 0.87377107 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
5 0.35487187 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
6 0.3486166 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
7 0.32462808 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
8 0.31791681 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
9 0.30477598 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
10 0.2912769 16 jmlr-2011-Clustering Algorithms for Chains
11 0.28873137 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.28852588 61 jmlr-2011-Logistic Stick-Breaking Process
13 0.28748932 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
14 0.28523001 48 jmlr-2011-Kernel Analysis of Deep Networks
15 0.28294328 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
16 0.28049237 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
17 0.27960175 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
18 0.2764667 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
20 0.27522597 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data