jmlr jmlr2011 jmlr2011-57 knowledge-graph by maker-knowledge-mining

57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods


Source: pdf

Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama

Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. [sent-16, score-0.615]

2 Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. [sent-22, score-0.445]

3 A search system retrieves documents relevant to the query and ranks the documents based on the degree of relevance. [sent-29, score-0.464]

4 The relevance of a document with respect to a query can be viewed as a kind of similarity, and the search task is essentially one based on a similarity function between query and document pairs, where query and document are instances from two spaces: query space and document space. [sent-30, score-2.653]

5 We show that we can define a new type of positive semi-definite kernel function called S-kernel and learn a robust relevance model using a kernel method based on S-kernel. [sent-38, score-0.556]

6 Our work is novel and unique in that it learns a similarity function for search using a kernel method. [sent-42, score-0.329]

7 That is, they look at the matched words in a query and document, and calculate the similarity (relevance) based on the degree of matching. [sent-44, score-0.5]

8 For example, if the query is “NY” and the document only contains “New York”, then the BM25 score of the query-document pair will be low. [sent-46, score-1.162]

9 That is, the two will be viewed as less relevant, although the query and document are in fact relevant. [sent-48, score-0.561]

10 This is the so-called term mismatch problem, which all existing relevance models suffer from. [sent-50, score-0.371]

11 In other words, the scores from the relevance models may not be reliable and the question of how to learn a more robust similarity function for search arises; this is exactly the problem we want to address in this paper. [sent-51, score-0.537]

12 Intuitively, we calculate a more reliable score for a query-document pair by using the scores of pairs of similar queries and similar documents. [sent-53, score-0.96]

13 An S-kernel is formally defined as a positive semi-definite kernel such that the reproducing kernel Hilbert space (RKHS) generated by the kernel is also a space of S-functions. [sent-55, score-0.348]

14 The learned Robust BM25 model determines the relevance score of a query-document pair on the basis of not only the BM25 score of that pair, but also the BM25 scores of pairs of similar queries and similar documents. [sent-62, score-1.959]

15 This paper has the following contributions: 1) proposal of a kernel method for dealing with term mismatch in search, 2) proposal of a unified view of search using the S-function, and 3) proposal of a family of kernel functions, the S-kernel. [sent-65, score-0.479]

16 Section 4 first introduces the term mismatch problem in search, and then proposes a kernel method for learning a robust relevance model, such as Robust BM25, to deal with the problem. [sent-68, score-0.615]

17 Ong et al. (2005) have proposed learning a kernel function (similarity function) by using kernel methods, in which the optimal kernel is chosen from the RKHS generated by the ‘hyperkernel’. [sent-97, score-0.348]

18 Term mismatch is one of the major challenges for search, because most of the traditional relevance models, including VSM (Salton and McGill, 1986) and BM25 (Robertson et al.), rely on term matching. [sent-104, score-0.371]

19 To solve the problem, heuristic methods of query expansion or (pseudo) relevance feedback have been employed. [sent-106, score-0.605]

20 In this paper, we demonstrate that we can learn a relevance model Robust BM25 to address the term mismatch challenge at the term level. [sent-112, score-0.371]

21 Click-through data, which records the URLs clicked by users after their query submissions at a search engine, has been widely used in web search (Agichtein et al.). [sent-114, score-0.63]

22 Click-through data has also been used for calculating query similarity, because queries which link to the same URLs in click-through data may represent the same search intent (Beeferman and Berger, 2000; Cui et al.). [sent-117, score-0.582]

23 In this paper, we use click-through data for training a Robust BM25 as well as calculating query similarity. [sent-120, score-0.359]

24 A positive semi-definite kernel measures the similarity of pairs of objects in a single space by using the dot product of their images in a Hilbert space. [sent-147, score-0.342]

25 In fact, all these models measure the similarity of a query and a document from the query space and the document space. [sent-173, score-1.263]

26 In VSM, query space and document space are treated as the same space, while in the other two models, query space and document space are two different spaces. [sent-174, score-1.122]

27 (VSM) Let Q and D denote the query and document spaces. [sent-177, score-0.464]

28 Each dimension in the two spaces corresponds to a term, and query and document are respectively represented as vectors in the two spaces. [sent-178, score-0.595]

29 Given query q ∈ Q and document d ∈ D, VSM is calculated as VSM(q, d) = ⟨ϕ_Q^VSM(q), ϕ_D^VSM(d)⟩, where ϕ_Q^VSM(q) and ϕ_D^VSM(d) are mappings to H from Q and D, respectively. [sent-180, score-0.561]

30 ϕ_Q^VSM(q)_t = idf(t) · tf(t, q) and ϕ_D^VSM(d)_t = idf(t) · tf(t, d), where t is a term, tf(t, q) is the frequency of term t in query q, tf(t, d) is the frequency of term t in document d, and idf(t) is the inverse document frequency of term t. [sent-181, score-0.763]
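
As a concrete illustration of the VSM mapping above, here is a minimal Python sketch; the tokenizer, the smoothed idf variant, and the corpus statistics are assumptions made for illustration, not the paper's implementation.

```python
import math
from collections import Counter

def idf(term, doc_freq, n_docs):
    # Inverse document frequency; a smoothed variant (an assumption for this sketch).
    return math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1))

def phi_vsm(text, doc_freq, n_docs):
    # phi_VSM: the coordinate for term t is idf(t) * tf(t, text).
    tf = Counter(text.lower().split())
    return {t: idf(t, doc_freq, n_docs) * f for t, f in tf.items()}

def vsm(query, document, doc_freq, n_docs):
    # VSM(q, d) = <phi_VSM(q), phi_VSM(d)>: a dot product in the shared term space.
    q_vec = phi_vsm(query, doc_freq, n_docs)
    d_vec = phi_vsm(document, doc_freq, n_docs)
    return sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())

# Toy corpus statistics (assumed values for illustration).
doc_freq = {"new": 50, "york": 40, "hotels": 30}
print(vsm("new york hotels", "cheap hotels in new york", doc_freq, n_docs=1000))
```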

31 (BM25) Given query q ∈ Q and document d ∈ D, BM25 is calculated as BM25(q, d) = ⟨ϕ_Q^BM25(q), ϕ_D^BM25(d)⟩, (1) where ϕ_Q^BM25(q) and ϕ_D^BM25(d) are mappings to H from Q and D, respectively. [sent-185, score-0.561]
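
Equation (1) views BM25 itself as a dot product of query and document images. Since the per-term components of the mapping are not reproduced in this summary, the sketch below simply computes a standard Okapi-style BM25 score with assumed parameters k1 = 1.2 and b = 0.75; it also makes the term-mismatch issue visible, because only shared terms contribute.

```python
import math
from collections import Counter

def bm25(query, document, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    # Standard Okapi BM25 (parameter values are common defaults, assumed here).
    d_tf = Counter(document.lower().split())
    d_len = sum(d_tf.values())
    score = 0.0
    for t in query.lower().split():
        tf = d_tf[t]
        if tf == 0:
            continue  # term mismatch: a non-shared term contributes nothing
        df = doc_freq.get(t, 0)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * d_len / avg_len))
    return score

doc_freq = {"ny": 5, "new": 50, "york": 40}
# The "NY" vs. "New York" example: no shared term, so the score is 0.
print(bm25("ny", "new york hotels and attractions", doc_freq, n_docs=1000, avg_len=12.0))
```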

32 The matching function between a query and a document should be defined as an asymmetric function. [sent-189, score-0.593]

33 Given query q ∈ Q and document d ∈ D, the LMIR with Dirichlet smoothing is calculated as LMIR(q, d) = ⟨ϕ_Q^LMIR(q), ϕ_D^LMIR(d)⟩, where ϕ_Q^LMIR(q) and ϕ_D^LMIR(d) are (n + 1)-dimensional mappings to H from Q and D, respectively. [sent-195, score-0.561]

34 The (n + 1)-th entries of ϕ_Q^LMIR(q) and ϕ_D^LMIR(d) are defined as ϕ_Q^LMIR(q)_{n+1} = len(q) and ϕ_D^LMIR(d)_{n+1} = log(µ / (len(d) + µ)), where len(q) and len(d) are the lengths of query q and document d, respectively. [sent-201, score-0.561]
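
A minimal sketch of LMIR with Dirichlet smoothing, written in the decomposed form that is consistent with the length-dependent (n + 1)-th components above; the collection language model and µ = 2000 are assumed toy values.

```python
import math
from collections import Counter

def lmir_dirichlet(query, document, coll_prob, mu=2000.0):
    # log P(q|d) with Dirichlet smoothing, decomposed as a sum over query terms
    # plus the length-dependent part len(q) * log(mu / (len(d) + mu)).
    q_tf = Counter(query.lower().split())
    d_tf = Counter(document.lower().split())
    len_q, len_d = sum(q_tf.values()), sum(d_tf.values())
    score = 0.0
    for t, f_q in q_tf.items():
        p_c = coll_prob.get(t, 1e-6)  # collection language model P(t|C); assumed values
        score += f_q * math.log(1.0 + d_tf[t] / (mu * p_c))
    return score + len_q * math.log(mu / (len_d + mu))

coll_prob = {"new": 0.010, "york": 0.008, "hotels": 0.002}
print(lmir_dirichlet("new york", "new york hotels new york", coll_prob))
```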

35 BM25 and LMIR are nothing but similarity functions representing query and document matching with different formulations. [sent-207, score-0.702]

36 The spaces are defined over query and document, user and item, and image and text, respectively. [sent-216, score-0.393]

37 For example, if the query is “soccer” and the term “soccer” occurs several times in the document, then the document is regarded as ‘relevant’. [sent-223, score-0.561]

38 The relevance models of VSM, BM25 and LMIR will give high scores to the document and the document will be ranked highly. [sent-224, score-0.644]

39 That is, even if the document and the query are relevant, they will not be viewed as relevant when they do not match at the term level, in other words, when they do not share a term. [sent-227, score-0.561]

40 For example, if the query is “New York” and the document contains “NY”, then the document will not be regarded as relevant. [sent-228, score-0.763]

41 Similarly, “aircraft” and “airplane” refer to the same concept; but if one of them is used in the query and the other in the document, then the document will be considered irrelevant. [sent-229, score-0.561]

42 For example, we have observed over 200 different forms for representing the same search intent “finding YouTube website” from the query log of a commercial web search engine. [sent-235, score-0.6]

43 If the query and document share a term t, then the term frequencies will be non-zero values and the relevance score between the query and document will become high. [sent-238, score-1.358]

44 That is, the value of the S-function between the query and document will be large. [sent-239, score-0.561]

45 When term mismatch occurs, either tf(t, d) or tf(t, q) will be zero, and the relevance score will be low, although it should not be so. [sent-240, score-0.411]

46 Without loss of generality, we use BM25 as the basic relevance model; one can easily extend the techniques here to other relevance models. [sent-249, score-0.392]

47 In fact, Robust BM25 is also an S-function that measures the similarity of query q and document d through a dot product in a Hilbert space, as will be explained in Section 5. [sent-254, score-0.749]

48 Suppose that the query space contains queries as elements and has the kernel function kQ as a similarity function. [sent-260, score-0.739]

49 Given query q, one can find its similar queries qi based on kQ (q, qi ) (its neighbors). [sent-261, score-0.684]

50 Similarly, the document space contains documents as elements and has the kernel function kD as a similarity function. [sent-262, score-0.564]

51 Given document d, one can find its similar documents di based on kD (d, di ) (its neighbors). [sent-263, score-0.539]

52 The relevance model BM25 is defined as an S-function between query and document over the two spaces. [sent-264, score-0.757]

53 One possible way to deal with the problem is to use the neighboring queries qi and documents di to smooth the BM25 score of q and d, as in the k-nearest neighbor algorithm (Cover and Hart, 1967; Dudani, 1976). [sent-266, score-0.485]

54 In other words, we employ the k-nearest neighbor method in both the query and document spaces to calculate the final relevance score. [sent-267, score-0.831]

55 More specifically, Robust BM25 determines the ranking score of query q and document d not only based on the relevance score between q and d themselves. [sent-270, score-0.906]

56 That is, the score is based not only on kBM25(q, d), but also on the relevance scores kBM25(qi, di) between similar queries qi and similar documents di. [sent-272, score-0.641]

57 In the kernel method, we use the following kernel on Q × D: k̄_HBM25((q, d), (q′, d′)) = kBM25(q, d) kQ(q, q′) kD(d, d′) kBM25(q′, d′), (3) where kBM25(q, d) is the BM25 model, and kQ and kD are the query and document similarity kernels. [sent-283, score-0.934]
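
To make Equation (3) and the neighborhood-smoothing idea concrete, here is a hedged sketch: the S-kernel on pairs, plus a score that combines BM25(q, d) with BM25 scores of similar queries and similar documents weighted by kQ and kD. The mixing weight alpha and the toy similarity functions are assumptions; the paper learns the actual combination with Ranking SVM rather than fixing it by hand.

```python
def k_hbm25(q, d, q2, d2, bm25, k_q, k_d):
    # Equation (3): k((q, d), (q', d')) = BM25(q, d) * kQ(q, q') * kD(d, d') * BM25(q', d').
    return bm25(q, d) * k_q(q, q2) * k_d(d, d2) * bm25(q2, d2)

def smoothed_score(q, d, neighbor_queries, neighbor_docs, bm25, k_q, k_d, alpha=0.5):
    # BM25(q, d) plus neighbor evidence: BM25 scores of similar (query, document)
    # pairs, weighted by the query and document similarity kernels (alpha is assumed).
    base = bm25(q, d)
    smooth = sum(k_q(q, qi) * k_d(d, di) * bm25(qi, di)
                 for qi in neighbor_queries for di in neighbor_docs)
    return base + alpha * smooth

# Toy stand-ins (assumptions): word-overlap "BM25" and crude similarities.
bm25 = lambda q, d: float(len(set(q.split()) & set(d.split())))
k_q = lambda a, b: 1.0 if a == b else 0.8
k_d = lambda a, b: 1.0 if a == b else 0.8
# Query "ny" shares no term with the document, but its neighbor query "new york" does.
print(smoothed_score("ny", "new york city guide", ["new york"], ["new york city guide"], bm25, k_q, k_d))
```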

58 Table 2: A record of click-through data for query “walmart”; the click column reads Yes, No, Yes, No, No for the five ranked URLs (the URL column is omitted here). [sent-325, score-0.387]

59 It is obvious that in the learning of Robust BM25 (4), we specify g(x, y) as BM25 (an S-function), and kX and kY as the query similarity kernel kQ and the document similarity kernel kD, respectively. [sent-335, score-1.075]

60 To learn Robust BM25, we need to decide the query similarity kQ(q, q′), the document similarity kD(d, d′), the training data, and the optimization technique. [sent-342, score-0.843]

61 In this case, the user submitted the query “walmart” and received the ranked list of URLs, and the user clicked on the URLs at ranks 1 and 3 but skipped the URLs at ranks 2, 4, and 5. [sent-349, score-0.461]
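
For instance, a record like this can be turned into training preferences with a simple heuristic, e.g., preferring every clicked URL over every skipped URL of the same result list. The heuristic and the placeholder URLs below are assumptions for illustration; the exact pair-extraction rule of the paper is not reproduced here.

```python
def preference_pairs(result_list):
    # result_list: ranked (url, clicked) tuples for one query.
    # Returns (preferred_url, less_preferred_url) pairs: clicked URLs over skipped ones.
    clicked = [u for u, c in result_list if c]
    skipped = [u for u, c in result_list if not c]
    return [(c, s) for c in clicked for s in skipped]

# Click pattern of the "walmart" record above: ranks 1 and 3 clicked, 2, 4, 5 skipped.
record = [("url_rank1", True), ("url_rank2", False), ("url_rank3", True),
          ("url_rank4", False), ("url_rank5", False)]
print(preference_pairs(record))
```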

62 To calculate query similarity, we represent query and URL click-through relationships in a bipartite graph in which queries and URLs are nodes in two sets and clicks are edges between nodes in the two sets. [sent-354, score-0.9]

63 A weight is associated with each edge representing the total number of times that the URL is clicked after the query is issued. [sent-355, score-0.417]

64 We specifically define query similarity using co-clicked URLs in the click-through bipartite graph. [sent-361, score-0.526]

65 Since queries with the same search intent tend to be linked to the same URLs, query similarity defined in this way actually represents the degree to which two queries share the same search intent. [sent-363, score-0.795]

66 Note that the query similarity kQ(q, q′) defined in Equation (9) is a positive semi-definite kernel, because it is defined as the dot product of two vectors. [sent-365, score-0.547]
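
A minimal sketch of this co-click query similarity: represent each query by its click-count vector over URLs from the bipartite graph and take the dot product of the L2-normalized vectors, which is positive semi-definite by construction. The click counts below are toy values (assumptions).

```python
import math
from collections import defaultdict

def click_vectors(click_edges):
    # click_edges: (query, url, count) triples taken from the click-through bipartite graph.
    vecs = defaultdict(dict)
    for q, url, n in click_edges:
        vecs[q][url] = vecs[q].get(url, 0) + n
    return vecs

def k_query(q1, q2, vecs):
    # Dot product of L2-normalized click vectors: a PSD similarity in [0, 1].
    v, w = vecs.get(q1, {}), vecs.get(q2, {})
    dot = sum(c * w.get(u, 0) for u, c in v.items())
    norm = math.sqrt(sum(c * c for c in v.values())) * math.sqrt(sum(c * c for c in w.values()))
    return dot / norm if norm else 0.0

edges = [("wallmart", "url_a", 3), ("walmart", "url_a", 40), ("walmart", "url_b", 10)]
vecs = click_vectors(edges)
print(k_query("wallmart", "walmart", vecs))  # high: the misspelling co-clicks the same URL
```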

67 In fact, with the use of the click-through bipartite graph and the query similarity measure, different types of similar queries can be found, including spelling errors. [sent-374, score-0.649]

68 Document similarity kD(d, d′) is simply defined as the cosine similarity between the titles and URLs of two documents, which is certainly a kernel (cosine similarity is a dot product in a Euclidean space). [sent-393, score-0.586]
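
And its document-side counterpart, a short sketch of the cosine similarity over title terms; the whitespace tokenization is an assumption, and URL tokens could be appended in the same way.

```python
import math
from collections import Counter

def k_document(title1, title2):
    # Cosine similarity between term-count vectors of two titles: a dot product, hence a kernel.
    a, b = Counter(title1.lower().split()), Counter(title2.lower().split())
    dot = sum(c * b[t] for t, c in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

print(k_document("walmart official site", "walmart stores official site"))
```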

69 More precisely, for each query qi we derive preference pairs (di+, di−), where di+ and di− mean that document di+ is preferred over document di− with respect to query qi. [sent-395, score-1.122]
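
Given such preference pairs, Ranking SVM minimizes a regularized pairwise hinge loss. The sketch below is a generic primal subgradient version on explicit feature vectors, shown only to convey the objective; it is an assumption for illustration, since the paper works in the RKHS of the S-kernel with Ranking SVM techniques rather than this toy optimizer.

```python
import numpy as np

def ranking_svm(pairs, dim, c=1.0, lr=0.01, epochs=200):
    # pairs: (x_preferred, x_less_preferred) feature vectors.
    # Objective: 0.5 * ||w||^2 + C * sum_i max(0, 1 - w . (x_i_plus - x_i_minus)).
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            diff = np.asarray(x_pos, dtype=float) - np.asarray(x_neg, dtype=float)
            grad = w.copy()              # gradient of the regularizer
            if 1.0 - w.dot(diff) > 0.0:
                grad -= c * diff         # subgradient of the active hinge term
            w -= lr * grad
    return w

# Toy feature vectors for (preferred, less preferred) documents of the same query (assumed values).
pairs = [([2.0, 0.5], [0.5, 0.4]), ([1.5, 0.9], [0.2, 0.8])]
print(ranking_svm(pairs, dim=2))
```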

70 Table 3 (Similar queries extracted from web search click-through data), excerpt: for the original query “wallmart”, similar queries include wall mart, walmart, wal mart, walmarts; for “ironman”, iron man, ironman movie, irnman. [sent-402, score-0.524]

71 Further rows of Table 3 pair “ucsd” with uc san diego, uscd, and university of california san diego; other original queries in the table include “knives” and “aircraft for sale”. [sent-406, score-0.347]

72 It takes on the order of seconds per query to train a model on a workstation with a Quad-Core Intel Xeon E5410 CPU. [sent-429, score-0.359]

73 (Experimental Data) In our experiments, we used two large-scale data sets, one from a commercial web search engine and one from an enterprise search engine running in an IT company. [sent-434, score-0.328]

74 The click-through data in both data sets was split into two parts, one for learning query similarity and the other for learning Robust BM25. [sent-439, score-0.5]

75 The key idea in query expansion is to add to the original query terms extracted from relevant queries or documents. [sent-450, score-0.891]

76 Thus, even though the original query and document do not share a term, after expansion, the query is enriched and it is likely to be matched with relevant documents. [sent-451, score-0.92]

77 On the other hand, query expansion may also suffer from the so-called topic drift problem. [sent-452, score-0.409]

78 Second, the final ranking of results is based on Robust BM25 which is trained specifically for the query using click-through data. [sent-458, score-0.428]

79 In our experiment, we tried several different ways to conduct query expansion and chose the one performing the best as the baseline. [sent-460, score-0.409]

80 If there is no such URL, we use the terms of the most similar query for expansion. [sent-462, score-0.359]

81 The difference between our method and the pairwise kernel is that the pairwise kernel does not use a traditional relevance model kBM25 (q, d). [sent-464, score-0.488]

82 NDCG is usually used to assess a ranking algorithm when documents have multiple relevance grades. [sent-473, score-0.37]

83 Given a query q, NDCG at position n is defined as NDCG@n(q) = DCG@n(q) / IDCG@n(q), where DCG@n(q) = ∑_{i=1}^{n} (2^{rel(i)} − 1) / log2(i + 1), and rel(i) is the relevance grade of the document ranked at position i. [sent-476, score-0.801]

84 The DCG@n(q) score is normalized by IDCG@n(q), the ideal DCG@n(q) score obtained when documents are ranked in decreasing order of their relevance grades. [sent-477, score-0.425]
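
A small sketch of the NDCG@n computation as defined above; the relevance grades in the example are toy values (assumptions).

```python
import math

def dcg_at_n(grades, n):
    # DCG@n = sum over positions i = 1..n of (2^rel(i) - 1) / log2(i + 1).
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(grades[:n]))

def ndcg_at_n(grades, n):
    # Normalize by the ideal DCG@n, obtained by sorting grades in decreasing order.
    ideal = dcg_at_n(sorted(grades, reverse=True), n)
    return dcg_at_n(grades, n) / ideal if ideal > 0 else 0.0

print(ndcg_at_n([3, 2, 0, 1], n=3))  # grades of the ranked list for one query
```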

85 Training pairs were used for each query in both the web search data and the enterprise search data. [sent-485, score-0.557]

86 The pairwise kernel outperforms BM25 and query expansion, which indicates that it is better to learn a relevance model in search. [sent-494, score-0.701]

87 It seems that Robust BM25 can effectively deal with term mismatch through its two mechanisms: query similarity and document similarity. [sent-536, score-0.877]

88 Because there is a mismatch between the query and the page, the basic relevance model BM25 cannot give a high score to the page (note that there is a difference between the query term “wallmart” and the document term “walmart”). [sent-544, score-1.378]

89 It is the most clicked web page with respect to the original query in the click-through data. [sent-550, score-0.533]

90 Because it does not have sufficient knowledge to break “Walmartstores” into “walmart” and “stores”, query expansion cannot add good terms to the original query. [sent-552, score-0.409]

91 When query expansion adds more terms to the original query, “walmart” will appear, but at the same time noise terms will also be included. [sent-553, score-0.409]

92 The query is “mensmagazines”, which is a tail query and does not have a similar query found in the click-through data. [sent-556, score-1.077]

93 There is a mismatch, because there is not sufficient knowledge to break query “mensmagazines” into “mens” and “magazines”. [sent-560, score-0.359]

94 Compared with the pairwise kernel, Robust BM25 successfully leveraged the traditional matching model BM25 when its score reliably reflects the relevance between query and document. [sent-583, score-0.625]

95 This is because the pairwise kernel does not consider the match between query and documents using BM25, while Robust BM25 does. [sent-594, score-0.61]

96 We have proposed a new kernel method for learning a robust relevance model as an S-function for search. [sent-598, score-0.44]

97 The learned model can deal with the term mismatch problem which traditional relevance models suffer from. [sent-599, score-0.371]

98 ∀z ∈ H_Y with ‖z‖_{H_Y} < ∞, ‖Γ_N(z)‖²_{H_X} = ∑_{i=0}^{∞} ∑_{k,j=1}^{N} α_k α_j g(x_k, y_k) g(x_j, y_j) c_i k_X^i(x_k, x_j) ⟨ψ_Y^i(y_k), z_i⟩_{H_Y^i} ⟨ψ_Y^i(y_j), z_i⟩_{H_Y^i}. [sent-732, score-0.329]

99 (Γ_1(z_1), . . . , Γ_n(z_n)), where H_X^i and H_Y^i are the Hilbert spaces with respect to the feature mappings ψ_X^i(·) and ψ_Y^i(·) of the Mercer kernels √c_i k_X^i and √c_i k_Y^i, respectively, ⟨·, ·⟩_{H_Y^i} is the inner product defined in H_Y^i, and Γ_i(z_i) = ∑_{j=1}^{N} α_j g(x_j, y_j) ψ_X^i(x_j) ⟨ψ_Y^i(y_j), z_i⟩_{H_Y^i}. [sent-793, score-0.354]

100 ∀z ∈ H_Y, ‖Γ_N(z)‖²_{H_X} = ∑_{i=1}^{n} ∑_{k,j=1}^{N} α_k α_j g(x_k, y_k) g(x_j, y_j) c_i k_X^i(x_k, x_j) ⟨ψ_Y^i(y_k), z_i⟩_{H_Y^i} ⟨ψ_Y^i(y_j), z_i⟩_{H_Y^i}. [sent-795, score-0.329]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('query', 0.359), ('ky', 0.284), ('hyi', 0.206), ('document', 0.202), ('kx', 0.202), ('relevance', 0.196), ('lmir', 0.192), ('hy', 0.178), ('mismatch', 0.175), ('kix', 0.15), ('hx', 0.149), ('oyama', 0.142), ('urls', 0.142), ('vsm', 0.142), ('similarity', 0.141), ('robust', 0.128), ('tube', 0.125), ('queries', 0.123), ('kp', 0.12), ('elevance', 0.117), ('di', 0.116), ('kernel', 0.116), ('kq', 0.114), ('yk', 0.113), ('ci', 0.112), ('youtube', 0.108), ('km', 0.106), ('documents', 0.105), ('qi', 0.101), ('ndcg', 0.092), ('xu', 0.087), ('earch', 0.076), ('kd', 0.076), ('walmart', 0.075), ('www', 0.075), ('search', 0.072), ('ranking', 0.069), ('web', 0.069), ('kn', 0.069), ('sing', 0.068), ('xk', 0.063), ('ethods', 0.062), ('odel', 0.059), ('ernel', 0.059), ('hilbert', 0.058), ('clicked', 0.058), ('enterprise', 0.057), ('sigir', 0.053), ('hi', 0.053), ('zi', 0.052), ('wu', 0.052), ('mercer', 0.051), ('expansion', 0.05), ('len', 0.05), ('robertson', 0.05), ('sale', 0.05), ('rel', 0.05), ('page', 0.047), ('dot', 0.047), ('ranked', 0.044), ('limn', 0.044), ('kernels', 0.044), ('croft', 0.042), ('wallmart', 0.042), ('score', 0.04), ('conventional', 0.038), ('salton', 0.038), ('objects', 0.038), ('yi', 0.038), ('cauchy', 0.036), ('dcg', 0.036), ('li', 0.035), ('spaces', 0.034), ('aircraft', 0.033), ('clicks', 0.033), ('hxi', 0.033), ('koide', 0.033), ('magazines', 0.033), ('mensmagazines', 0.033), ('ponte', 0.033), ('southwest', 0.033), ('utube', 0.033), ('zhai', 0.033), ('hofmann', 0.032), ('asymmetric', 0.032), ('rank', 0.031), ('retrieval', 0.03), ('pairwise', 0.03), ('ong', 0.03), ('engine', 0.029), ('earning', 0.029), ('grangier', 0.028), ('cao', 0.028), ('intent', 0.028), ('com', 0.028), ('mart', 0.028), ('title', 0.027), ('bipartite', 0.026), ('retrieved', 0.026), ('videos', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama

Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel

2 0.11272023 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding

Abstract: Gaussian process regression is a flexible and powerful tool for machine learning, but the high computational complexity hinders its broader applications. In this paper, we propose a new approach for fast computation of Gaussian process regression with a focus on large spatial data sets. The approach decomposes the domain of a regression function into small subdomains and infers a local piece of the regression function for each subdomain. We explicitly address the mismatch problem of the local pieces on the boundaries of neighboring subdomains by imposing continuity constraints. The new approach has comparable or better computation complexity as other competing methods, but it is easier to be parallelized for faster computation. Moreover, the method can be adaptive to non-stationary features because of its local nature and, in particular, its use of different hyperparameters of the covariance function for different local regions. We illustrate application of the method and demonstrate its advantages over existing methods using two synthetic data sets and two real spatial data sets. Keywords: domain decomposition, boundary value problem, Gaussian process regression, parallel computation, spatial prediction

3 0.10982375 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

4 0.063028812 66 jmlr-2011-Multiple Kernel Learning Algorithms

Author: Mehmet Gönen, Ethem Alpaydın

Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning

5 0.061666857 105 jmlr-2011-lp-Norm Multiple Kernel Learning

Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN

6 0.058407199 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes

7 0.052883878 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

8 0.050508991 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

9 0.050448753 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data

10 0.045632441 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

11 0.045296628 16 jmlr-2011-Clustering Algorithms for Chains

12 0.044218492 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

13 0.040343657 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

14 0.038026966 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

15 0.037080012 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

16 0.035425935 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

17 0.034482293 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

18 0.034115877 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

19 0.032059569 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

20 0.03177413 91 jmlr-2011-The Sample Complexity of Dictionary Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.181), (1, -0.083), (2, 0.05), (3, -0.178), (4, 0.043), (5, 0.002), (6, -0.051), (7, 0.065), (8, -0.044), (9, -0.043), (10, -0.009), (11, 0.113), (12, 0.04), (13, 0.018), (14, 0.0), (15, 0.075), (16, 0.085), (17, 0.075), (18, 0.149), (19, -0.071), (20, -0.266), (21, 0.121), (22, -0.276), (23, -0.121), (24, -0.168), (25, 0.176), (26, -0.056), (27, 0.017), (28, 0.092), (29, 0.074), (30, -0.202), (31, 0.216), (32, 0.057), (33, -0.057), (34, 0.048), (35, 0.091), (36, 0.148), (37, 0.097), (38, -0.085), (39, -0.039), (40, 0.032), (41, -0.011), (42, -0.083), (43, -0.031), (44, -0.022), (45, -0.115), (46, 0.057), (47, -0.013), (48, -0.108), (49, -0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9625923 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama

Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel

2 0.54580706 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding

Abstract: Gaussian process regression is a flexible and powerful tool for machine learning, but the high computational complexity hinders its broader applications. In this paper, we propose a new approach for fast computation of Gaussian process regression with a focus on large spatial data sets. The approach decomposes the domain of a regression function into small subdomains and infers a local piece of the regression function for each subdomain. We explicitly address the mismatch problem of the local pieces on the boundaries of neighboring subdomains by imposing continuity constraints. The new approach has comparable or better computation complexity as other competing methods, but it is easier to be parallelized for faster computation. Moreover, the method can be adaptive to non-stationary features because of its local nature and, in particular, its use of different hyperparameters of the covariance function for different local regions. We illustrate application of the method and demonstrate its advantages over existing methods using two synthetic data sets and two real spatial data sets. Keywords: domain decomposition, boundary value problem, Gaussian process regression, parallel computation, spatial prediction

3 0.43519044 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

4 0.3765282 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes

Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros

Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy

5 0.30459136 16 jmlr-2011-Clustering Algorithms for Chains

Author: Antti Ukkonen

Abstract: We consider the problem of clustering a set of chains to k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the significance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artificial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd’s algorithm, orders, preference statements, planted partition model, randomization testing

6 0.3044607 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

7 0.29213905 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

8 0.26261273 66 jmlr-2011-Multiple Kernel Learning Algorithms

9 0.23416759 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data

10 0.23375587 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

11 0.22107533 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

12 0.21851955 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning

13 0.21149781 105 jmlr-2011-lp-Norm Multiple Kernel Learning

14 0.21005 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

15 0.20314479 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

16 0.19678725 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

17 0.19313318 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

18 0.18334809 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

19 0.17119879 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

20 0.17113087 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.041), (9, 0.036), (10, 0.027), (24, 0.032), (31, 0.06), (32, 0.026), (36, 0.011), (41, 0.023), (60, 0.012), (65, 0.011), (71, 0.018), (73, 0.075), (78, 0.053), (90, 0.031), (98, 0.472)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79689229 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama

Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel

2 0.26811296 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

Author: Liwei Wang

Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combining with existing results, it shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function

3 0.24916665 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

4 0.24664016 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

5 0.24503037 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

6 0.24205182 48 jmlr-2011-Kernel Analysis of Deep Networks

7 0.24192874 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

8 0.23881979 66 jmlr-2011-Multiple Kernel Learning Algorithms

9 0.23832454 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

10 0.23726878 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

11 0.23471545 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

12 0.23177291 12 jmlr-2011-Bayesian Co-Training

13 0.23133479 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

14 0.23070155 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

15 0.2304896 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

16 0.23029095 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

17 0.2298096 105 jmlr-2011-lp-Norm Multiple Kernel Learning

18 0.22702648 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

19 0.22678494 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

20 0.22570568 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs