nips nips2009 nips2009-190 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Mehryar Mohri
Abstract: We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods.
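The memory argument in the abstract can be checked with simple arithmetic. The sketch below compares a dense D x D weight matrix with two N x D embedding matrices U and V, assuming 4-byte floats. The dictionary size of 30,000 comes from the text; the embedding dimension N = 200 is a hypothetical value chosen only for illustration, and the function names are not from the paper.

```python
def full_rank_bytes(dict_size: int, bytes_per_weight: int = 4) -> int:
    """Bytes needed to store a dense D x D matrix W (degree-2 model)."""
    return dict_size * dict_size * bytes_per_weight


def low_rank_bytes(dict_size: int, embed_dim: int, bytes_per_weight: int = 4) -> int:
    """Bytes needed for two N x D matrices U and V.

    The identity term of the diagonal-preserving model needs no storage.
    """
    return 2 * embed_dim * dict_size * bytes_per_weight


D = 30_000  # dictionary size used in the limited-dictionary experiments
N = 200     # hypothetical embedding dimension (an assumption, not from the source)

print(f"full rank: {full_rank_bytes(D) / 2**30:.2f} GiB")
print(f"low rank:  {low_rank_bytes(D, N) / 2**20:.2f} MiB")
```

Under these assumptions the dense matrix needs 3.6 billion bytes, which is consistent with the few-gigabyte figure quoted in the text, while the two embedding matrices need 48 million bytes.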
Reference: text
sentIndex sentText sentNum sentScore
1 We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. [sent-5, score-0.426]
2 Dealing with polynomial models on word features is computationally challenging. [sent-6, score-0.227]
3 We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. [sent-7, score-0.201]
4 We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods. [sent-8, score-0.158]
5 1 Introduction Ranking text documents given a text-based query is one of the key tasks in information retrieval. [sent-9, score-0.519]
6 model queries and target documents using a vector representation; and then (ii) choose (or learn) a similarity metric that operates in this vector space. [sent-12, score-0.428]
7 Ranking is then performed by sorting the documents based on their similarity score with the query. [sent-13, score-0.28]
8 via tf-idf) as the feature space, and the cosine similarity for ranking. [sent-18, score-0.122]
9 This type of model often performs remarkably well, but suffers from the fact that only exact matches of words between query and target texts contribute to the similarity score. [sent-20, score-0.461]
10 More recently, supervised models for ranking texts have been proposed that can be trained on a supervised signal (i. [sent-25, score-0.439]
11 , labeled data) to provide a ranking of a database of documents given a query. [sent-27, score-0.465]
12 Or, if one is interested in finding documents related to a given query document, one can use known hyperlinks to learn a model that performs well on this task. [sent-29, score-0.462]
13 In this work, we investigate an orthogonal research direction, as we analyze supervised methods that are based on words only. [sent-33, score-0.122]
14 To deal with this space we propose low rank (but diagonal preserving) representations of our polynomial models to induce feasible memory and computation requirements, resulting in a method that both exhibits strong performance and is tractable to train and test. [sent-40, score-0.381]
15 We show experimentally on retrieval tasks derived from Wikipedia that our method strongly outperforms other word based models, including tf-idf vector space models, LSI, query expansion, margin rank perceptrons and Hash Kernels. [sent-41, score-0.672]
16 2 Polynomial Semantic Indexing Let us denote the set of documents in the corpus as $\{d_t\}_{t=1} \subset \mathbb{R}^D$ and a query text as $q \in \mathbb{R}^D$, where $D$ is the dictionary size, and the $j$-th dimension of a vector indicates the frequency of occurrence of the $j$-th word, e. [sent-44, score-0.639]
17 Given a query q and a document d we wish to learn a (nonlinear) function $f(q, d)$ that returns a score measuring the relevance of d given q. [sent-47, score-0.429]
18 Let us first consider the naive approach of concatenating (q, d) into a single vector and using $f(q, d) = w^\top [q, d]$ as a linear ranking model. [sent-48, score-0.242]
19 This clearly does not learn anything useful as it would result in the same document ordering for any query, given fixed parameters w. [sent-49, score-0.19]
20 However, considering a polynomial model: $f(q, d) = w^\top \Phi_k([q, d])$ where $\Phi_k(\cdot)$ is a feature map that considers all possible $k$-degree terms: $\Phi_k(x_1,$ . [sent-50, score-0.154]
21 For example, for degree k = 2 we obtain: $f(q, d) = \sum_{ij} w^1_{ij} q_i q_j + \sum_{ij} w^2_{ij} d_i q_j + \sum_{ij} w^3_{ij} d_i d_j$, where $w$ has been rewritten as $w^1 \in \mathbb{R}^{D \times D}$, $w^2 \in \mathbb{R}^{D \times D}$ and $w^3 \in \mathbb{R}^{D \times D}$. [sent-60, score-0.309]
22 The ranking order of documents d given a fixed query q is independent of $w^1$, and the value of the term with $w^3$ is independent of the query, so in the following we will consider models containing only terms with both q and d. [sent-61, score-0.704]
23 In particular, we will consider the following degree k = 2 model: $f^2(q, d) = \sum_{i,j=1}^{D} W_{ij} q_i d_j = q^\top W d$ (1), where $W \in \mathbb{R}^{D \times D}$, and the degree k = 3 model: $f^3(q, d) = \sum_{i,j,k=1}^{D} W_{ijk} q_i d_j d_k + f^2(q, d)$ (2). [sent-62, score-0.366]
24 Note that if W is the identity matrix in equation (1), we obtain the cosine similarity with tf-idf weighting. [sent-63, score-0.122]
25 the word “jagger” in the query and “stones” in the target could be given a large value during training. [sent-68, score-0.366]
26 Note also that in equation (2) we could just as easily have considered pairs of words in the query (rather than the document). [sent-71, score-0.34]
27 If the dictionary size is D = 30000, then, for k = 2 this requires 3. [sent-74, score-0.12]
28 4GB of RAM (assuming floats), and if the dictionary size is 2. [sent-75, score-0.12]
29 • Further, U and V differ so it does not assume the query and target document should be embedded in the same way. [sent-87, score-0.469]
30 This can hence model when the query text distribution is very different to the document text distribution, e. [sent-88, score-0.543]
31 queries are often short and have different word occurrence and co-occurrence statistics. [sent-90, score-0.195]
32 In the extreme case of cross-language retrieval, query and target texts are in different languages yet are naturally modeled in this setup. [sent-91, score-0.601]
33 This is important because the diagonal of the W matrix gives the specificity of picking out when a word co-occurs in both documents (indeed, setting W = I is equivalent to cosine similarity using tf-idf). [sent-93, score-0.459]
34 The matrix I is full rank and therefore cannot be approximated with the low-rank model $U^\top V$, so our model combines both terms in the approximation. [sent-94, score-0.135]
35 Typically, one caches the N-dimensional representation for each document to use at query time. [sent-96, score-0.429]
36 Clearly, we can approximate any degree k polynomial using a product of k linear embeddings in such a scheme. [sent-102, score-0.153]
37 Note that at test time one can again cache the N-dimensional representation for each document by computing the product between the V and Y terms, and one is then still left with only N multiplications per document for the embedding term at query time. [sent-103, score-0.68]
38 Interestingly, one can view this model as a “product of experts”: the document is projected twice i. [sent-104, score-0.19]
39 Suppose we are given a set of tuples R (labeled data), where each tuple contains a query q, a relevant document $d^+$ and a non-relevant (or lower ranked) document $d^-$. [sent-110, score-0.666]
40 We thus employ the margin ranking loss [17], which has already been used in several IR methods [20, 5, 14], and minimize: $\max(0, 1 - f(q, d^+) + f(q, d^-))$. [sent-112, score-0.322]
41 Note that researchers have also explored optimizing alternatives to the ranking loss, including normalized discounted cumulative gain (NDCG) and mean average precision (MAP) [5, 4, 6, 28]. [sent-123, score-0.368]
42 In fact, one could use those optimization strategies to train our models instead of optimizing the ranking loss. [sent-124, score-0.323]
43 The authors of [15] used a model similar to the naive full rank model (1), but for the task of image retrieval, and [13] also used a related (regression-based) method for advert placement. [sent-134, score-0.173]
44 In all cases, the task of document retrieval and the use of low-rank approximation or polynomial features are not studied. [sent-137, score-0.368]
45 Methods like LMNN [27] also learn a model similar to the naive full rank model (1), i. [sent-140, score-0.135]
46 This would hence not be scalable for large-scale text ranking experiments. [sent-147, score-0.299]
47 Nevertheless, [7] compared LMNN [27], LEGO [19] and MCML [12] to a stochastic gradient method with a full matrix W (identical to the model (1)) on a small image ranking task and reported in fact that the stochastic method provides both improved results and efficiency. [sent-148, score-0.28]
48 However, in the case of [23] we note their method is rather slow, and a dictionary size of only 2000 was used. [sent-157, score-0.12]
49 This provides supervision at the document level (via a class label or regression value), which is not a learning-to-rank task, whereas here we study supervision at the (query, documents) level. [sent-159, score-0.228]
50 These include first applying machine translation and then a conventional retrieval method such as LSI [16], a direct method of applying LSI for this task called CL-LSI [9], or using Kernel Canonical Correlation Analysis, KCCA [26]. [sent-166, score-0.301]
51 Standard retrieval datasets like TREC or LETOR [22] contain only a few hundred training queries, and are hence too small for that purpose. [sent-169, score-0.188]
52 We hence conducted experiments on Wikipedia and used links within Wikipedia to build a large-scale ranking task. [sent-173, score-0.347]
53 We considered several tasks: document-document and query-document retrieval described in Section 4. [sent-174, score-0.186]
54 In these experiments we compared our approach, Polynomial Semantic Indexing (PSI), to the following methods: tf-idf + cosine similarity (TFIDF), Query Expansion (QE), LSI, $\alpha$LSI + $(1 - \alpha)$TFIDF, and the margin ranking perceptron and Hash Kernels with hash size h using model (1). [sent-177, score-0.571]
55 Query Expansion involves applying TFIDF and then adding the mean vector $\frac{\beta}{E} \sum_{i=1}^{E} d_{r_i}$ of the top E retrieved documents, multiplied by a weighting β, to the query, and applying TFIDF again. [sent-178, score-0.223]
56 For each method, we measured the ranking loss (the percentage of tuples in R that are incorrectly ordered), precision P (n) at position n = 10 (P@10) and the mean average precision (MAP), as well as their standard deviations. [sent-180, score-0.316]
57 For computational reasons, MAP and P@10 were measured by averaging over a fixed set of 1000 test queries, and the true test links and random subsets of 10,000 documents were used as the database, rather than the whole testing set. [sent-181, score-0.328]
58 The ranking loss is measured using 100,000 testing tuples. [sent-182, score-0.269]
59 1 Document Retrieval We considered a set of 1,828,645 English Wikipedia documents as a database, and split the 24,667,286 links randomly into two portions, 70% for training (plus validation) and 30% for test3 http://trec. [sent-184, score-0.386]
60 Table 1: Document-document ranking results on Wikipedia (limited dictionary size of 30,000 words). [sent-189, score-0.362]
61 Table 2: Empirical results for document-document ranking on Wikipedia (unlimited dictionary size). [sent-226, score-0.362]
62 Table 3: Empirical results for document-document ranking in two train/test setups: partitioning into train+test sets of links, or into train+test sets of documents with no cross-links (limited dictionary size of 30,000 words). [sent-257, score-0.585]
63 Table 4: Empirical results for query-document ranking on Wikipedia where query has n keywords (this experiment uses a limited dictionary size of 30,000 words). [sent-270, score-0.601]
64 For each n we measure the ranking loss, MAP and P@10 metrics. [sent-271, score-0.242]
65 We then considered the following task: given a query document q, rank the other documents such that if q links to d then d is highly ranked. [sent-300, score-0.92]
66 This allowed us to compare to a margin ranking perceptron using model (1) which would otherwise not fit in memory. [sent-302, score-0.345]
67 The margin rank perceptron using (1) can be seen as a full rank version of PSI for k = 2 (with W unconstrained) but is outperformed by its low-rank counterpart, probably because it has too much capacity. [sent-306, score-0.409]
68 Degree k = 3 outperforms k = 2, indicating that the higher order nonlinearities captured provide better ranking scores. [sent-307, score-0.242]
69 In terms of other techniques, LSI is slightly better than TFIDF but QE in this case does not improve much over TFIDF, perhaps because of the difficulty of this task (there may too often be many irrelevant documents in the top E documents initially retrieved for QE to help). [sent-309, score-0.484]
70 Table 5: The closest five words in the document embedding space to some example query words. [sent-311, score-0.563]
71 In this setting we compare to Hash Kernels which can deal with these dictionary sizes. [sent-314, score-0.12]
72 The results, given in Table 2, show the same trends, indicating that the dictionary size restriction in the previous experiment did not bias the results in favor of any one algorithm. [sent-315, score-0.12]
73 In some cases, one might be worried that our experimental setup has split training and testing data only by partitioning the links, but not the documents, hence performance of our model when new unseen documents are added to the database might be in question. [sent-318, score-0.293]
74 We therefore also tested an experimental setup where the test set of documents is completely separate from the training set of documents, by completely removing all training set links between training and testing documents. [sent-319, score-0.458]
75 One might wonder if the reported improvements also hold in a setup where queries consist of only a few keywords. [sent-322, score-0.148]
76 We used the same setup as before but we constructed queries by keeping only n random words from query documents in an attempt to mimic a “keyword search”. [sent-324, score-0.683]
77 Table 4 reports the results for keyword queries of length n = 5, 10 and 20. [sent-325, score-0.139]
78 PSI yields similar improvements as in the document-document retrieval case over the baselines. [sent-326, score-0.158]
79 Word Embedding The document embedding $Vd$ in equation (3) (similarly for the query embedding $Uq$) can be viewed as $Vd = \sum_i V_{\cdot i} d_i$, in which each column $V_{\cdot i}$ is the embedding of the word $d_i$. [sent-327, score-0.765]
80 The first column contains query words; on the right are the 5 words with smallest Euclidean distance in the embedded space. [sent-330, score-0.312]
81 2 Cross Language Document Retrieval Cross Language Retrieval [16] is the task of retrieving documents in a target language E given a query in a different source language F . [sent-333, score-0.624]
82 This is an interesting case for word-based learning to rank models which can naturally deal with this task without the need for machine translation as they directly learn the correspondence between the two languages from bi-lingual labeled data in the form of tuples R. [sent-335, score-0.361]
83 We then consider a cross language retrieval task that is analogous to the task in Section 4. [sent-343, score-0.31]
84 1: given a Japanese query document qJap that is the mate of the English document qEng , rank the English 6 http://translate. [sent-344, score-0.782]
85 com/translate_s 7 Table 6: Cross-lingual Japanese document-English document ranking (limited dictionary size of 30,000 words). [sent-346, score-0.552]
86 documents so that the documents linked to qEng appear above the others. [sent-402, score-0.446]
87 The document qEng is removed and not considered during training or testing. [sent-403, score-0.248]
88 We considered three methods of machine translation: Google’s API or Fujitsu’s ATLAS was used to translate each query document, or we translated each word in the Japanese dictionary using ATLAS and then applied this word-based translation to a query. [sent-407, score-0.626]
89 For PSI, we considered two cases: (i) apply the ATLAS machine translation tool first, and then use PSI trained on the task in Section 4. [sent-409, score-0.218]
90 the model given in equation (3) (PSIEngEng), which was trained on English queries and English target documents; or (ii) train PSI directly with Japanese queries and English target documents using model (3) without the identity, which we call PSIJapEng. [sent-412, score-0.611]
91 The dictionary size was again limited to the 30,000 most frequent words in both languages for ease of comparison with CL-LSI. [sent-415, score-0.229]
92 Machine translation followed by PSIEngEng outperformed all these methods; however, the direct PSIJapEng, which required no machine translation tool at all, improved results even further. [sent-418, score-0.21]
93 We conjecture that this is because translation mistakes generate noisy features which PSIJapEng circumvents. [sent-419, score-0.141]
94 However, we also considered combining PSIJapEng with TFIDF or PSIEngEng using a mixing parameter α and this provided further gains at the expense of requiring a machine translation tool. [sent-420, score-0.133]
95 [9], typically measure the performance of finding a “mate”, the same document in another language, whereas our experiment tries to model a query-based retrieval task. [sent-423, score-0.348]
96 5 Conclusion We described a versatile, powerful set of discriminatively trained models for document ranking based on polynomial features over words, which was made feasible with a low-rank (but diagonal preserving) approximation. [sent-428, score-0.675]
97 The rate adapting Poisson (RAP) model for information retrieval and object recognition. [sent-509, score-0.158]
98 A discriminative kernel-based approach to rank images from text queries. [sent-530, score-0.192]
99 Advances in Large Margin Classifiers, chapter Large margin rank boundaries for ordinal regression. [sent-541, score-0.188]
100 LETOR: Benchmark dataset for research on learning to rank for information retrieval. [sent-571, score-0.135]
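The extracted sentences describe a degree-2 model whose full matrix W is replaced by a low-rank term plus the identity, trained with the margin ranking loss max(0, 1 − f(q, d+) + f(q, d−)). A minimal sketch of that idea follows, reading the low-rank model as W = UᵀV + I with U and V of size N x D, so that f(q, d) = (Uq)·(Vd) + q·d. The plain stochastic gradient step, the learning rate, and all function names are assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # N rows (embedding dimension) by D columns (dictionary size)


def dot(x: Vector, y: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def matvec(m: Matrix, x: Vector) -> Vector:
    """Matrix-vector product, one dot product per row."""
    return [dot(row, x) for row in m]


def score(U: Matrix, V: Matrix, q: Vector, d: Vector) -> float:
    """f(q, d) = q^T (U^T V + I) d = (Uq).(Vd) + q.d"""
    return dot(matvec(U, q), matvec(V, d)) + dot(q, d)


def margin_loss(U: Matrix, V: Matrix, q: Vector, d_pos: Vector, d_neg: Vector) -> float:
    """Margin ranking loss for one (query, relevant, non-relevant) tuple."""
    return max(0.0, 1.0 - score(U, V, q, d_pos) + score(U, V, q, d_neg))


def sgd_step(U: Matrix, V: Matrix, q: Vector, d_pos: Vector, d_neg: Vector,
             lr: float = 0.01) -> None:
    """One in-place subgradient step; does nothing if the margin is already met."""
    if margin_loss(U, V, q, d_pos, d_neg) <= 0.0:
        return
    diff = [p - n for p, n in zip(d_pos, d_neg)]
    uq = matvec(U, q)         # both projections are computed before any update
    v_diff = matvec(V, diff)
    for n in range(len(U)):
        for j in range(len(q)):
            U[n][j] += lr * v_diff[n] * q[j]   # descent direction: V(d+ - d-) q^T
            V[n][j] += lr * uq[n] * diff[j]    # descent direction: (Uq)(d+ - d-)^T


if __name__ == "__main__":
    # Toy example with D = 2 words and N = 1 embedding dimension (hypothetical values).
    U = [[1.0, 0.0]]
    V = [[0.0, 1.0]]
    q, d_pos, d_neg = [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]
    print("loss before:", margin_loss(U, V, q, d_pos, d_neg))
    sgd_step(U, V, q, d_pos, d_neg, lr=0.5)
    print("loss after: ", margin_loss(U, V, q, d_pos, d_neg))
```

Because Vd can be computed once per document and cached, scoring at query time under this reading costs one projection of the query plus N multiplications per document for the embedding term, in addition to the sparse q·d term.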
wordName wordTfidf (topN-words)
[('lsi', 0.331), ('psi', 0.295), ('tfidf', 0.249), ('ranking', 0.242), ('query', 0.239), ('documents', 0.223), ('wikipedia', 0.217), ('document', 0.19), ('atlas', 0.162), ('retrieval', 0.158), ('flr', 0.145), ('psijapeng', 0.145), ('rank', 0.135), ('semantic', 0.125), ('tfidfengeng', 0.124), ('dictionary', 0.12), ('indexing', 0.109), ('queries', 0.108), ('links', 0.105), ('translation', 0.105), ('polynomial', 0.104), ('hash', 0.104), ('psiengeng', 0.103), ('qe', 0.103), ('japanese', 0.103), ('english', 0.096), ('word', 0.087), ('words', 0.073), ('cosine', 0.065), ('jagger', 0.062), ('qeng', 0.062), ('stones', 0.062), ('embedding', 0.061), ('text', 0.057), ('similarity', 0.057), ('preserving', 0.054), ('grangier', 0.054), ('plsa', 0.054), ('sigir', 0.053), ('margin', 0.053), ('texts', 0.052), ('perceptron', 0.05), ('map', 0.05), ('wijk', 0.05), ('wij', 0.05), ('supervised', 0.049), ('degree', 0.049), ('trained', 0.047), ('translated', 0.047), ('tuples', 0.047), ('train', 0.045), ('qi', 0.043), ('language', 0.042), ('kunihiko', 0.041), ('lsiengeng', 0.041), ('uli', 0.041), ('vlj', 0.041), ('yanjun', 0.041), ('latent', 0.041), ('memory', 0.041), ('title', 0.04), ('setup', 0.04), ('target', 0.04), ('mohri', 0.039), ('burges', 0.039), ('task', 0.038), ('languages', 0.036), ('iij', 0.036), ('lowrank', 0.036), ('optimizing', 0.036), ('features', 0.036), ('google', 0.035), ('engines', 0.035), ('lda', 0.035), ('cross', 0.034), ('corinna', 0.033), ('di', 0.033), ('dj', 0.033), ('pages', 0.031), ('kernels', 0.031), ('dumais', 0.031), ('bai', 0.031), ('keyword', 0.031), ('expansion', 0.031), ('training', 0.03), ('letor', 0.029), ('polynomials', 0.029), ('url', 0.029), ('agnostic', 0.029), ('feasible', 0.029), ('qin', 0.028), ('collobert', 0.028), ('mate', 0.028), ('setups', 0.028), ('ij', 0.028), ('considered', 0.028), ('loss', 0.027), ('diagonal', 0.027), ('lmnn', 0.027), ('triple', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 190 nips-2009-Polynomial Semantic Indexing
Author: Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Mehryar Mohri
Abstract: We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods.
2 0.19897208 87 nips-2009-Exponential Family Graph Matching and Ranking
Author: James Petterson, Jin Yu, Julian J. McAuley, Tibério S. Caetano
Abstract: We present a method for learning max-weight matching predictors in bipartite graphs. The method consists of performing maximum a posteriori estimation in exponential families with sufficient statistics that encode permutations and data features. Although inference is in general hard, we show that for one very relevant application, document ranking, exact inference is efficient. For general model instances, an appropriate sampler is readily available. Contrary to existing max-margin matching models, our approach is statistically consistent and, in addition, experiments with increasing sample sizes indicate superior improvement over such models. We apply the method to graph matching in computer vision as well as to a standard benchmark dataset for learning document ranking, in which we obtain state-of-the-art results, in particular improving on max-margin variants. The drawback of this method with respect to max-margin alternatives is its runtime for large graphs, which is comparatively high.
3 0.18358038 136 nips-2009-Learning to Rank by Optimizing NDCG Measure
Author: Hamed Valizadegan, Rong Jin, Ruofei Zhang, Jianchang Mao
Abstract: Learning to rank is a relatively new field of study, aiming to learn a ranking function from a set of training data with relevancy labels. The ranking algorithms are often evaluated using information retrieval measures, such as Normalized Discounted Cumulative Gain (NDCG) [1] and Mean Average Precision (MAP) [2]. Until recently, most learning to rank algorithms were not using a loss function related to the above mentioned evaluation measures. The main difficulty in direct optimization of these measures is that they depend on the ranks of documents, not the numerical values output by the ranking function. We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents. A relaxation strategy is used to approximate the average of NDCG over the space of permutation, and a bound optimization approach is proposed to make the computation efficient. Extensive experiments show that the proposed algorithm outperforms state-of-the-art ranking algorithms on several benchmark data sets.
4 0.17586398 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank
Author: Wei Chen, Tie-yan Liu, Yanyan Lan, Zhi-ming Ma, Hang Li
Abstract: Learning to rank has become an important research topic in machine learning. While most learning-to-rank methods learn the ranking functions by minimizing loss functions, it is the ranking measures (such as NDCG and MAP) that are used to evaluate the performance of the learned ranking functions. In this work, we reveal the relationship between ranking measures and loss functions in learning-to-rank methods, such as Ranking SVM, RankBoost, RankNet, and ListMLE. We show that the loss functions of these methods are upper bounds of the measure-based ranking errors. As a result, the minimization of these loss functions will lead to the maximization of the ranking measures. The key to obtaining this result is to model ranking as a sequence of classification tasks, and define a so-called essential loss for ranking as the weighted sum of the classification errors of individual tasks in the sequence. We have proved that the essential loss is both an upper bound of the measure-based ranking errors, and a lower bound of the loss functions in the aforementioned methods. Our proof technique also suggests a way to modify existing loss functions to make them tighter bounds of the measure-based ranking errors. Experimental results on benchmark datasets show that the modifications can lead to better ranking performances, demonstrating the correctness of our theoretical analysis.
5 0.14953814 260 nips-2009-Zero-shot Learning with Semantic Output Codes
Author: Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, Tom M. Mitchell
Abstract: We consider the problem of zero-shot learning, where the goal is to learn a classifier f : X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes. We provide a formalism for this type of classifier and study its theoretical properties in a PAC framework, showing conditions under which the classifier can accurately predict novel classes. As a case study, we build a SOC classifier for a neural decoding task and show that it can often predict words that people are thinking about from functional magnetic resonance images (fMRI) of their neural activity, even without training examples for those words.
6 0.13614686 230 nips-2009-Statistical Consistency of Top-k Ranking
7 0.13253482 96 nips-2009-Filtering Abstract Senses From Image Search Results
8 0.12831225 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
9 0.1272562 204 nips-2009-Replicated Softmax: an Undirected Topic Model
10 0.11907933 104 nips-2009-Group Sparse Coding
11 0.10683411 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
12 0.10026145 166 nips-2009-Noisy Generalized Binary Search
13 0.093661211 135 nips-2009-Learning to Hash with Binary Reconstructive Embeddings
14 0.088673152 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions
15 0.085198179 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning
16 0.081867747 205 nips-2009-Rethinking LDA: Why Priors Matter
17 0.080516741 211 nips-2009-Segmenting Scenes by Matching Image Composites
18 0.076598093 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model
19 0.07489647 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
20 0.073172368 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora
topicId topicWeight
[(0, -0.208), (1, -0.004), (2, -0.143), (3, -0.107), (4, 0.024), (5, -0.086), (6, -0.385), (7, -0.063), (8, 0.035), (9, 0.031), (10, 0.056), (11, 0.072), (12, -0.028), (13, 0.155), (14, -0.014), (15, -0.06), (16, -0.051), (17, 0.129), (18, -0.129), (19, -0.055), (20, 0.027), (21, 0.018), (22, 0.047), (23, 0.067), (24, 0.021), (25, 0.048), (26, 0.015), (27, 0.004), (28, 0.065), (29, -0.004), (30, -0.06), (31, 0.011), (32, 0.01), (33, -0.057), (34, -0.022), (35, -0.029), (36, -0.008), (37, -0.04), (38, -0.032), (39, -0.096), (40, 0.006), (41, 0.132), (42, -0.018), (43, 0.011), (44, 0.046), (45, 0.025), (46, -0.121), (47, 0.099), (48, 0.052), (49, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.96135145 190 nips-2009-Polynomial Semantic Indexing
Author: Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Mehryar Mohri
Abstract: We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods.
2 0.67584229 87 nips-2009-Exponential Family Graph Matching and Ranking
Author: James Petterson, Jin Yu, Julian J. McAuley, Tibério S. Caetano
Abstract: We present a method for learning max-weight matching predictors in bipartite graphs. The method consists of performing maximum a posteriori estimation in exponential families with sufficient statistics that encode permutations and data features. Although inference is in general hard, we show that for one very relevant application, document ranking, exact inference is efficient. For general model instances, an appropriate sampler is readily available. Contrary to existing max-margin matching models, our approach is statistically consistent and, in addition, experiments with increasing sample sizes indicate superior improvement over such models. We apply the method to graph matching in computer vision as well as to a standard benchmark dataset for learning document ranking, in which we obtain state-of-the-art results, in particular improving on max-margin variants. The drawback of this method with respect to max-margin alternatives is its runtime for large graphs, which is comparatively high.
3 0.6494475 136 nips-2009-Learning to Rank by Optimizing NDCG Measure
Author: Hamed Valizadegan, Rong Jin, Ruofei Zhang, Jianchang Mao
Abstract: Learning to rank is a relatively new field of study, aiming to learn a ranking function from a set of training data with relevancy labels. The ranking algorithms are often evaluated using information retrieval measures, such as Normalized Discounted Cumulative Gain (NDCG) [1] and Mean Average Precision (MAP) [2]. Until recently, most learning to rank algorithms were not using a loss function related to the above mentioned evaluation measures. The main difficulty in direct optimization of these measures is that they depend on the ranks of documents, not the numerical values output by the ranking function. We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents. A relaxation strategy is used to approximate the average of NDCG over the space of permutation, and a bound optimization approach is proposed to make the computation efficient. Extensive experiments show that the proposed algorithm outperforms state-of-the-art ranking algorithms on several benchmark data sets.
4 0.54479545 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank
Author: Wei Chen, Tie-yan Liu, Yanyan Lan, Zhi-ming Ma, Hang Li
Abstract: Learning to rank has become an important research topic in machine learning. While most learning-to-rank methods learn the ranking functions by minimizing loss functions, it is the ranking measures (such as NDCG and MAP) that are used to evaluate the performance of the learned ranking functions. In this work, we reveal the relationship between ranking measures and loss functions in learning-to-rank methods, such as Ranking SVM, RankBoost, RankNet, and ListMLE. We show that the loss functions of these methods are upper bounds of the measure-based ranking errors. As a result, the minimization of these loss functions will lead to the maximization of the ranking measures. The key to obtaining this result is to model ranking as a sequence of classification tasks, and define a so-called essential loss for ranking as the weighted sum of the classification errors of individual tasks in the sequence. We have proved that the essential loss is both an upper bound of the measure-based ranking errors, and a lower bound of the loss functions in the aforementioned methods. Our proof technique also suggests a way to modify existing loss functions to make them tighter bounds of the measure-based ranking errors. Experimental results on benchmark datasets show that the modifications can lead to better ranking performances, demonstrating the correctness of our theoretical analysis.
5 0.52145594 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora
Author: Shuang-hong Yang, Hongyuan Zha, Bao-gang Hu
Abstract: We propose Dirichlet-Bernoulli Alignment (DBA), a generative model for corpora in which each pattern (e.g., a document) contains a set of instances (e.g., paragraphs in the document) and belongs to multiple classes. By casting predefined classes as latent Dirichlet variables (i.e., instance level labels), and modeling the multi-label of each pattern as Bernoulli variables conditioned on the weighted empirical average of topic assignments, DBA automatically aligns the latent topics discovered from data to human-defined classes. DBA is useful for both pattern classification and instance disambiguation, which are tested on text classification and named entity disambiguation in web search queries respectively.
6 0.51590079 260 nips-2009-Zero-shot Learning with Semantic Output Codes
7 0.50894338 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
8 0.49649426 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions
9 0.49251002 96 nips-2009-Filtering Abstract Senses From Image Search Results
10 0.48389241 230 nips-2009-Statistical Consistency of Top-k Ranking
11 0.47600016 204 nips-2009-Replicated Softmax: an Undirected Topic Model
12 0.46982256 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning
13 0.46071401 233 nips-2009-Streaming Pointwise Mutual Information
14 0.41789845 24 nips-2009-Adapting to the Shifting Intent of Search Queries
15 0.41485026 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model
16 0.36641788 104 nips-2009-Group Sparse Coding
17 0.36453736 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
18 0.36452544 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
19 0.36336854 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information
20 0.33573762 90 nips-2009-Factor Modeling for Advertisement Targeting
topicId topicWeight
[(24, 0.026), (25, 0.036), (35, 0.035), (36, 0.052), (39, 0.038), (58, 0.063), (71, 0.05), (81, 0.014), (86, 0.569), (91, 0.017)]
simIndex simValue paperId paperTitle
1 0.98426759 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
Author: Peter Sollich, Matthew Urry, Camille Coti
Abstract: We investigate how well Gaussian process regression can learn functions defined on graphs, using large regular random graphs as a paradigmatic example. Random-walk based kernels are shown to have some non-trivial properties: within the standard approximation of a locally tree-like graph structure, the kernel does not become constant, i.e. neighbouring function values do not become fully correlated, when the lengthscale $\sigma$ of the kernel is made large. Instead the kernel attains a non-trivial limiting form, which we calculate. The fully correlated limit is reached only once loops become relevant, and we estimate where the crossover to this regime occurs. Our main subject is the learning curves of Bayes error versus training set size. We show that these are qualitatively well predicted by a simple approximation using only the spectrum of a large tree as input, and generically scale with $n/V$, the number of training examples per vertex. We also explore how this behaviour changes for kernel lengthscales that are large enough for loops to become important.

1 Motivation and Outline

Gaussian processes (GPs) have become a standard part of the machine learning toolbox [1]. Learning curves are a convenient way of characterizing their capabilities: they give the generalization error $\epsilon$ as a function of the number of training examples $n$, averaged over all datasets of size $n$ under appropriate assumptions about the process generating the data. We focus here on the case of GP regression, where a real-valued output function $f(x)$ is to be learned. The general behaviour of GP learning curves is then relatively well understood for the scenario where the inputs $x$ come from a continuous space, typically $\mathbb{R}^n$ [2, 3, 4, 5, 6, 7, 8, 9, 10]. For large $n$, the learning curves then typically decay as a power law $\epsilon \propto n^{-\alpha}$ with an exponent $\alpha \le 1$ that depends on the dimensionality $n$ of the space as well as the smoothness properties of the function $f(x)$ as encoded in the covariance function.
But there are many interesting application domains that involve discrete input spaces, where $x$ could be a string, an amino acid sequence (with $f(x)$ some measure of secondary structure or biological function), a research paper (with $f(x)$ related to impact), a web page (with $f(x)$ giving a score used to rank pages), etc. In many such situations, similarity between different inputs (which will govern our prior beliefs about how closely related the corresponding function values are) can be represented by edges in a graph. One would then like to know how well GP regression can work in such problem domains; see also [11] for a related online regression algorithm. We study this problem here theoretically by focussing on the paradigmatic example of random regular graphs, where every node has the same connectivity. Sec. 2 discusses the properties of random-walk inspired kernels [12] on such random graphs. These are analogous to the standard radial basis function kernels $\exp[-(x - x')^2/(2\sigma^2)]$, but we find that they have surprising properties on large graphs. In particular, while loops in large random graphs are long and can be neglected for many purposes, by approximating the graph structure as locally tree-like, here this leads to a non-trivial limiting form of the kernel for $\sigma \to \infty$ that is not constant. The fully correlated limit, where the kernel is constant, is obtained only because of the presence of loops, and we estimate when the crossover to this regime takes place. In Sec. 3 we move on to the learning curves themselves. A simple approximation based on the graph eigenvalues, using only the known spectrum of a large tree as input, works well qualitatively and predicts the exact asymptotics for large numbers of training examples. When the kernel lengthscale is not too large, below the crossover discussed in Sec. 2 for the covariance kernel, the learning curves depend on the number of examples per vertex.
We also explore how this behaviour changes as the kernel lengthscale is made larger. Sec. 4 summarizes the results and discusses some open questions.

2 Kernels on graphs and trees

We assume that we are trying to learn a function defined on the vertices of a graph. Vertices are labelled by $i = 1, \ldots, V$, instead of the generic input label $x$ we used in the introduction, and the associated function values are denoted $f_i \in \mathbb{R}$. By taking the prior $P(f)$ over these functions $f = (f_1, \ldots, f_V)$ as a (zero mean) Gaussian process we are saying that $P(f) \propto \exp(-\frac{1}{2} f^T C^{-1} f)$. The covariance function or kernel $C$ is then, in our graph setting, just a positive definite $V \times V$ matrix. The graph structure is characterized by a $V \times V$ adjacency matrix, with $A_{ij} = 1$ if nodes $i$ and $j$ are connected by an edge, and 0 otherwise. All links are assumed to be undirected, so that $A_{ij} = A_{ji}$, and there are no self-loops ($A_{ii} = 0$). The degree of each node is then defined as $d_i = \sum_{j=1}^{V} A_{ij}$. The covariance kernels we discuss in this paper are the natural generalizations of the squared-exponential kernel in Euclidean space [12]. They can be expressed in terms of the normalized graph Laplacian, defined as $L = \mathbf{1} - D^{-1/2} A D^{-1/2}$, where $D$ is a diagonal matrix with entries $d_1, \ldots, d_V$ and $\mathbf{1}$ is the $V \times V$ identity matrix. An advantage of $L$ over the unnormalized Laplacian $D - A$, which was used in the earlier paper [13], is that the eigenvalues of $L$ (again a $V \times V$ matrix) lie in the interval $[0, 2]$ (see e.g. [14]). From the graph Laplacian, the covariance kernels we consider here are constructed as follows. The $p$-step random walk kernel is (for $a \ge 2$)

$$C \propto (\mathbf{1} - a^{-1} L)^p = \left[ (1 - a^{-1}) \mathbf{1} + a^{-1} D^{-1/2} A D^{-1/2} \right]^p \tag{1}$$

while the diffusion kernel is given by

$$C \propto \exp\left( -\tfrac{1}{2} \sigma^2 L \right) \propto \exp\left( \tfrac{1}{2} \sigma^2 D^{-1/2} A D^{-1/2} \right) \tag{2}$$

We will always normalize these so that $(1/V) \sum_i C_{ii} = 1$, which corresponds to setting the average (over vertices) prior variance of the function to be learned to unity.
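As an illustration of definition (1) and the normalization convention, the following is a minimal pure-Python sketch, not code from the paper. The function name `p_step_kernel` and the ring-graph example are assumptions made for illustration, and the sketch assumes every vertex has at least one neighbour so that $D^{-1/2}$ exists.

```python
import math

def p_step_kernel(adj, p, a=2.0):
    """Normalized p-step random-walk kernel (1) for an undirected graph.

    adj is a symmetric 0/1 adjacency matrix (list of lists) without
    self-loops; every vertex is assumed to have degree >= 1.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    # One-step matrix (1 - 1/a) * identity + (1/a) * D^{-1/2} A D^{-1/2}
    step = [[(1.0 - 1.0 / a) * (i == j)
             + adj[i][j] / (a * math.sqrt(deg[i] * deg[j]))
             for j in range(n)] for i in range(n)]
    # Raise the one-step matrix to the p-th power by repeated multiplication
    kernel = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(p):
        kernel = [[sum(kernel[i][k] * step[k][j] for k in range(n))
                   for j in range(n)] for i in range(n)]
    # Normalize so that the average prior variance (1/V) sum_i C_ii is 1
    scale = sum(kernel[i][i] for i in range(n)) / n
    return [[kernel[i][j] / scale for j in range(n)] for i in range(n)]

# Hypothetical example: a ring of 6 vertices (a 2-regular graph)
V = 6
ring = [[1 if (i - j) % V in (1, V - 1) else 0 for j in range(V)]
        for i in range(V)]
C = p_step_kernel(ring, p=3, a=2.0)
print([round(C[0][j], 4) for j in range(V)])
```

The covariance decays with distance along the ring, as expected for a kernel built from a short random walk. The diffusion kernel (2) is not implemented here; it would need a matrix exponential.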
To see the connection of the above kernels to random walks, assume we have a walker on the graph who at each time step selects randomly one of the neighbouring vertices and moves to it. The probability for a move from vertex $j$ to $i$ is then $A_{ij}/d_j$. The transition matrix after $s$ steps follows as $(AD^{-1})^s$: its $ij$-element gives the probability of being on vertex $i$, having started at $j$. We can now compare this with the $p$-step kernel by expanding the $p$-th power in (1):

$$C \propto \sum_{s=0}^{p} \binom{p}{s} a^{-s} (1 - a^{-1})^{p-s} (D^{-1/2} A D^{-1/2})^s = D^{-1/2} \sum_{s=0}^{p} \binom{p}{s} a^{-s} (1 - a^{-1})^{p-s} (AD^{-1})^s D^{1/2} \tag{3}$$

Thus $C$ is essentially a random walk transition matrix, averaged over the number of steps $s$ with

$$s \sim \mathrm{Binomial}(p, 1/a) \tag{4}$$

This shows that $1/a$ can be interpreted as the probability of actually taking a step at each of $p$ "attempts". To obtain the actual $C$ the resulting averaged transition matrix is premultiplied by $D^{-1/2}$ and postmultiplied by $D^{1/2}$, which ensures that the kernel $C$ is symmetric.

Figure 1: (Left) Random walk kernel $C_{\ell,p}$ plotted vs distance $\ell$ along graph, for increasing number of steps $p$ and $a = 2$, $d = 3$. Note the convergence to a limiting shape for large $p$ that is not the naive fully correlated limit $C_{\ell,p\to\infty} = 1$. (Right) Numerical results for average covariance $K_1$ between neighbouring nodes, averaged over neighbours and over randomly generated regular graphs. [Plot not reproduced. Left panel: $C_{\ell,p}$ versus $\ell$ for $p = 1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 500, \infty$. Right panel: $K_1$ versus $p/a$ for $d = 3$, with $a = 2, 4$ and $V = 500, \infty$; an arrow marks $\ln V / \ln(d-1)$.]

For the diffusion kernel, one finds an analogous result but the number of random walk steps is now distributed as $s \sim \mathrm{Poisson}(\sigma^2/2)$. This implies in particular that the diffusion kernel is the limit of the $p$-step kernel for $p, a \to \infty$ at constant $p/a = \sigma^2/2$. Accordingly, we discuss mainly the $p$-step kernel in this paper because results for the diffusion kernel can be retrieved as limiting cases.
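The limiting relation between the two kernels can be checked numerically on the step-number distributions alone. The sketch below is an illustration rather than part of the paper: it compares the Binomial($p$, $1/a$) weights from (3) and (4) with Poisson($\sigma^2/2$) weights while holding $p/a = \sigma^2/2$ fixed. The function names and the choice $\sigma^2 = 4$ are assumptions.

```python
import math

def binomial_step_weights(p, a, s_max):
    """Weights of s = 0..s_max random-walk steps in the p-step kernel (3)-(4)."""
    return [math.comb(p, s) * (1.0 / a) ** s * (1.0 - 1.0 / a) ** (p - s)
            for s in range(s_max + 1)]

def poisson_step_weights(mean, s_max):
    """Weights of s = 0..s_max steps for the diffusion kernel, mean = sigma^2 / 2."""
    return [math.exp(-mean) * mean ** s / math.factorial(s)
            for s in range(s_max + 1)]

sigma2 = 4.0                      # illustrative value of sigma^2
gaps = []
for p in (10, 100, 10000):
    a = 2.0 * p / sigma2          # keeps p / a = sigma^2 / 2 fixed
    binom = binomial_step_weights(p, a, s_max=10)
    poisson = poisson_step_weights(sigma2 / 2.0, s_max=10)
    # Largest pointwise difference between the two step distributions
    gaps.append(max(abs(b - q) for b, q in zip(binom, poisson)))
print(gaps)
```

The largest pointwise gap shrinks as $p$ grows, consistent with the diffusion kernel being the $p, a \to \infty$ limit of the $p$-step kernel.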
In the limit of a large number of steps $s$, the random walk on a graph will reach its stationary distribution $p_\infty \propto De$ where $e = (1, \ldots, 1)$. (This form of $p_\infty$ can be verified by checking that it remains unchanged after multiplication with the transition matrix $AD^{-1}$.) The $s$-step transition matrix for large $s$ is then $p_\infty e^T \propto D e e^T$ because we converge from any starting vertex to the stationary distribution. It follows that for large $p$ or $\sigma^2$ the covariance kernel becomes $C \propto D^{1/2} e e^T D^{1/2}$, i.e. $C_{ij} \propto (d_i d_j)^{1/2}$. This is consistent with the interpretation of $\sigma$ or $(p/a)^{1/2}$ as a lengthscale over which the random walk can diffuse along the graph: once this lengthscale becomes large, the covariance kernel $C_{ij}$ is essentially independent of the distance (along the graph) between the vertices $i$ and $j$, and the function $f$ becomes fully correlated across the graph. (Explicitly $f = v D^{1/2} e$ under the prior, with $v$ a single Gaussian random variable.) As we next show, however, the approach to this fully correlated limit as $p$ or $\sigma$ are increased is non-trivial. We focus in this paper on kernels on random regular graphs. This means we consider adjacency matrices $A$ which are regular in the sense that they give for each vertex the same degree, $d_i = d$. A uniform probability distribution is then taken across all $A$ that obey this constraint [15]. What will the above kernels look like on typical samples drawn from this distribution? Such random regular graphs will have long loops, of length of order $\ln(V)$ or larger if $V$ is large. Their local structure is then that of a regular tree of degree $d$, which suggests that it should be possible to calculate the kernel accurately within a tree approximation. In a regular tree all nodes are equivalent, so the kernel can only depend on the distance $\ell$ between two nodes $i$ and $j$.
Denoting this kernel value $C_{\ell,p}$ for a $p$-step random walk kernel, one has then $C_{\ell,p=0} = \delta_{\ell,0}$ and

$$\gamma_{p+1} C_{0,p+1} = \left(1 - \frac{1}{a}\right) C_{0,p} + \frac{1}{a} C_{1,p} \tag{5}$$

$$\gamma_{p+1} C_{\ell,p+1} = \frac{1}{ad} C_{\ell-1,p} + \left(1 - \frac{1}{a}\right) C_{\ell,p} + \frac{d-1}{ad} C_{\ell+1,p} \quad \text{for } \ell \ge 1 \tag{6}$$

where $\gamma_p$ is chosen to achieve the desired normalization $C_{0,p} = 1$ of the prior variance for every $p$. Fig. 1(left) shows results obtained by iterating this recursion numerically, for a regular graph (in the tree approximation) with degree $d = 3$, and $a = 2$. As expected the kernel becomes more long-ranged initially as $p$ increases, but eventually it is seen to approach a non-trivial limiting form. This can be calculated as

$$C_{\ell,p\to\infty} = \left[ 1 + \ell (d-2)/d \right] (d-1)^{-\ell/2} \tag{7}$$

and is also plotted in the figure, showing good agreement with the numerical iteration. (One can check directly that this form is a steady state of (5) and (6), with $\gamma = 1 - 1/a + 2(d-1)^{1/2}/(ad)$.) There are (at least) two ways of obtaining the result (7). One is to take the limit $\sigma \to \infty$ of the integral representation of the diffusion kernel on regular trees given in [16] (which is also quoted in [13] but with a typographical error that effectively removes the factor $(d-1)^{-\ell/2}$). Another route is to find the steady state of the recursion for $C_{\ell,p}$. This is easy to do but requires as input the unknown steady state value of $\gamma_p$. To determine this, one can map from $C_{\ell,p}$ to the total random walk probability $S_{\ell,p}$ in each "shell" of vertices at distance $\ell$ from the starting vertex, changing variables to $S_{0,p} = C_{0,p}$ and $S_{\ell,p} = d(d-1)^{\ell-1} C_{\ell,p}$ ($\ell \ge 1$). Omitting the factors $\gamma_p$, this results in a recursion for $S_{\ell,p}$ that simply describes a biased random walk on $\ell = 0, 1, 2, \ldots$, with a probability of $1 - 1/a$ of remaining at the current $\ell$, probability $1/(ad)$ of moving to the left and probability $(d-1)/(ad)$ of moving to the right. The point $\ell = 0$ is a reflecting barrier where only moves to the right are allowed, with probability $1/a$. The time evolution of this random walk starting from $\ell = 0$ can now be analysed as in [17].
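The recursion (5) and (6) is straightforward to iterate numerically. The following is a minimal sketch under stated assumptions, not the authors' code: the function names `tree_kernel` and `limiting_kernel` are hypothetical, and the array is sized so that the truncation at large $\ell$ is exact (after $p$ steps the walk cannot be more than $p$ shells away from its start).

```python
def tree_kernel(p, a=2.0, d=3):
    """Iterate recursion (5)-(6); return [C_0, ..., C_{p+1}] after p steps."""
    l_max = p + 1
    c = [1.0] + [0.0] * (l_max + 1)   # C_{l,0} = delta_{l,0}, plus one zero pad
    for _ in range(p):
        new = [0.0] * len(c)
        new[0] = (1.0 - 1.0 / a) * c[0] + c[1] / a                  # eq. (5)
        for l in range(1, len(c) - 1):                              # eq. (6)
            new[l] = (c[l - 1] / (a * d) + (1.0 - 1.0 / a) * c[l]
                      + (d - 1.0) / (a * d) * c[l + 1])
        gamma = new[0]                # gamma_{p+1} enforces C_{0,p+1} = 1
        c = [x / gamma for x in new]
    return c[:l_max + 1]

def limiting_kernel(l, d=3):
    """Large-p limit (7) of the tree kernel at distance l."""
    return (1.0 + l * (d - 2.0) / d) * (d - 1.0) ** (-l / 2.0)

c = tree_kernel(500)
for l in range(4):
    print(l, round(c[l], 4), round(limiting_kernel(l), 4))
```

For $d = 3$ the values at small $\ell$ approach the limit (7) only slowly as $p$ grows, which matches the gradual convergence described for Fig. 1(left).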
As expected from the balance of moves to the left and right, $S_{\ell,p}$ for large $p$ is peaked around the average position of the walk, $\ell = p(d-2)/(ad)$. For $\ell$ smaller than this $S_{\ell,p}$ has a tail behaving as $\propto (d-1)^{\ell/2}$, and converting back to $C_{\ell,p}$ gives the large-$\ell$ scaling of $C_{\ell,p\to\infty} \propto (d-1)^{-\ell/2}$; this in turn fixes the value of $\gamma_{p\to\infty}$ and so eventually gives (7). The above analysis shows that for large $p$ the random walk kernel, calculated in the absence of loops, does not approach the expected fully correlated limit; given that all vertices have the same degree, the latter would correspond to $C_{\ell,p\to\infty} = 1$. This implies, conversely, that the fully correlated limit is reached only because of the presence of loops in the graph. It is then interesting to ask at what point, as $p$ is increased, the tree approximation for the kernel breaks down. To estimate this, we note that a regular tree of depth $\ell$ has $V = 1 + d(d-1)^{\ell-1}$ nodes. So a regular graph can be tree-like at most out to $\ell \approx \ln(V)/\ln(d-1)$. Comparing with the typical number of steps our random walk takes, which is $p/a$ from (4), we then expect loop effects to appear in the covariance kernel when

$$p/a \approx \ln(V)/\ln(d-1) \tag{8}$$

To check this prediction, we measure the analogue of $C_{1,p}$ on randomly generated [15] regular graphs. Because of the presence of loops, the local kernel values are not all identical, so the appropriate estimate of what would be $C_{1,p}$ on a tree is $K_1 = C_{ij}/\sqrt{C_{ii} C_{jj}}$ for neighbouring nodes $i$ and $j$. Averaging over all pairs of such neighbours, and then over a number of randomly generated graphs we find the results in Fig. 1(right). The results for $K_1$ (symbols) accurately track the tree predictions (lines) for small $p/a$, and start to deviate just around the values of $p/a$ expected from (8), as marked by the arrow. The deviations manifest themselves in larger values of $K_1$, which eventually (now that $p/a$ is large enough for the kernel to "notice" the loops) approach the fully correlated limit $K_1 = 1$.
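To give a sense of scale for the estimate (8), here is a short worked calculation for the graph size used in the figures. The helper name `crossover_steps` is an assumption introduced for illustration only.

```python
import math

def crossover_steps(V, d, a):
    """Number of kernel steps p at which (8) predicts loop effects to set in."""
    return a * math.log(V) / math.log(d - 1)

# Parameters from the figures: V = 500 vertices, degree d = 3, a = 2
print(round(crossover_steps(500, 3, 2.0), 2))
```

For these values $p/a \approx 9$, i.e. $p \approx 18$, so a kernel with $p = 10$ sits below the estimated crossover while one with $p = 200$ is far beyond it.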
3 Learning curves

We now turn to the analysis of learning curves for GP regression on random regular graphs. We assume that the target function $f^*$ is drawn from a GP prior with a $p$-step random walk covariance kernel $C$. Training examples are input-output pairs $(i_\mu, f^*_{i_\mu} + \xi_\mu)$ where $\xi_\mu$ is i.i.d. Gaussian noise of variance $\sigma^2$; the distribution of training inputs $i_\mu$ is taken to be uniform across vertices. Inference from a data set $\mathcal{D}$ of $n$ such examples $\mu = 1, \ldots, n$ takes place using the prior defined by $C$ and a Gaussian likelihood with noise variance $\sigma^2$. We thus assume an inference model that is matched to the data generating process. This is obviously an over-simplification but is appropriate for the present first exploration of learning curves on random graphs. We emphasize that as $n$ is increased we see more and more function values from the same graph, which is fixed by the problem domain; the graph does not grow. The generalization error $\epsilon$ is the squared difference between the estimated function $\hat{f}_i$ and the target $f^*_i$, averaged across the (uniform) input distribution, the posterior distribution of $f^*$ given $\mathcal{D}$, the distribution of datasets $\mathcal{D}$, and finally (in our non-Euclidean setting) the random graph ensemble. Given the assumption of a matched inference model, this is just the average Bayes error, or the average posterior variance, which can be expressed explicitly as [1]

$$\epsilon(n) = \left\langle V^{-1} \sum_i \left[ C_{ii} - k(i)^T K^{-1} k(i) \right] \right\rangle_{\mathcal{D},\text{graphs}} \tag{9}$$

where the average is over data sets and over graphs, $K$ is an $n \times n$ matrix with elements $K_{\mu\mu'} = C_{i_\mu, i_{\mu'}} + \sigma^2 \delta_{\mu\mu'}$ and $k(i)$ is a vector with entries $k_\mu(i) = C_{i, i_\mu}$. The resulting learning curve depends, in addition to $n$, on the graph structure as determined by $V$ and $d$, and the kernel and noise level as specified by $p$, $a$ and $\sigma^2$. We fix $d = 3$ throughout to avoid having too many parameters to vary, although similar results are obtained for larger $d$.
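The quantity inside the average in (9) can be evaluated for a single training set as follows. This is a minimal pure-Python sketch and not the simulation code behind the figures; `solve` and `bayes_error` are hypothetical helper names, the small ring kernel is an illustrative stand-in for a random regular graph, and the noise variance is assumed positive so that $K$ is invertible even when a vertex is sampled more than once.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[r]) + [rhs[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def bayes_error(C, train_vertices, noise_var):
    """Posterior variance averaged over vertices for one training set, cf. (9)."""
    V, n = len(C), len(train_vertices)
    # K_{mu mu'} = C_{i_mu, i_mu'} + sigma^2 delta_{mu mu'}
    K = [[C[train_vertices[m]][train_vertices[q]]
          + (noise_var if m == q else 0.0) for q in range(n)]
         for m in range(n)]
    total = 0.0
    for i in range(V):
        k = [C[i][v] for v in train_vertices]          # k_mu(i) = C_{i, i_mu}
        total += C[i][i] - sum(ki * xi for ki, xi in zip(k, solve(K, k)))
    return total / V

# Illustrative kernel on a ring of 6 vertices: the p = 1, a = 2 kernel (1)
V = 6
C = [[1.0 if i == j else 0.5 if (i - j) % V in (1, V - 1) else 0.0
      for j in range(V)] for i in range(V)]
e1 = bayes_error(C, [0], 0.1)
e2 = bayes_error(C, [0, 3], 0.1)
print(round(e1, 4), round(e2, 4))
```

Estimating the full learning curve would additionally require averaging this quantity over randomly drawn training sets and graphs, which the sketch leaves out.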
Exact prediction of learning curves by analytical calculation is very difficult due to the complicated way in which the random selection of training inputs enters the matrix $K$ and vector $k$ in (9). However, by first expressing these quantities in terms of kernel eigenvalues (see below) and then approximating the average over datasets, one can derive the approximation [3, 6]

$$\epsilon = g\left( \frac{n}{\epsilon + \sigma^2} \right), \qquad g(h) = \sum_{\alpha=1}^{V} (\lambda_\alpha^{-1} + h)^{-1} \tag{10}$$

This equation for $\epsilon$ has to be solved self-consistently because $\epsilon$ also appears on the r.h.s. In the Euclidean case the resulting predictions approximate the true learning curves quite reliably. The derivation of (10) for inputs on a fixed graph is unchanged from [3], provided the kernel eigenvalues $\lambda_\alpha$ appearing in the function $g(h)$ are defined appropriately, by the eigenfunction condition $\langle C_{ij} \phi_j \rangle = \lambda \phi_i$; the average here is over the input distribution, i.e. $\langle \ldots \rangle = V^{-1} \sum_j \ldots$ From the definition (1) of the $p$-step kernel, we see that then $\lambda_\alpha = \kappa V^{-1} (1 - \lambda^L_\alpha/a)^p$ in terms of the corresponding eigenvalue $\lambda^L_\alpha$ of the graph Laplacian $L$. The constant $\kappa$ has to be chosen to enforce our normalization convention $\sum_\alpha \lambda_\alpha = \langle C_{jj} \rangle = 1$. Fortunately, for large $V$ the spectrum of the Laplacian of a random regular graph can be approximated by that of the corresponding large regular tree, which has spectral density [14]

$$\rho(\lambda^L) = \frac{d \sqrt{\frac{4(d-1)}{d^2} - (\lambda^L - 1)^2}}{2\pi \lambda^L (2 - \lambda^L)} \tag{11}$$

in the range $\lambda^L \in [\lambda^L_-, \lambda^L_+]$, $\lambda^L_\pm = 1 \pm 2d^{-1}(d-1)^{1/2}$, where the term under the square root is positive. This density integrates to one, and its prefactor is the one consistent with the square-root coefficient $r$ quoted in the Appendix. (There are also two isolated eigenvalues $\lambda^L = 0, 2$ but these have weight $1/V$ each and so can be ignored for large $V$.) Rewriting (10) as $\epsilon = V^{-1} \sum_\alpha [(V\lambda_\alpha)^{-1} + (n/V)(\epsilon + \sigma^2)^{-1}]^{-1}$ and then replacing the average over kernel eigenvalues by an integral over the spectral density leads to the following prediction for the learning curve:

$$\epsilon = \int d\lambda^L \, \rho(\lambda^L) \left[ \kappa^{-1} (1 - \lambda^L/a)^{-p} + \nu/(\epsilon + \sigma^2) \right]^{-1} \tag{12}$$

with $\nu = n/V$ and $\kappa$ determined from $\kappa \int d\lambda^L \, \rho(\lambda^L)(1 - \lambda^L/a)^p = 1$.
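Equation (12) can be solved numerically with a one-dimensional quadrature and a bisection search. The sketch below is one possible way to do this, not the authors' implementation: `tree_spectrum` and `learning_curve` are hypothetical names, the substitution $\lambda^L = 1 + R\cos\theta$ with $R = 2(d-1)^{1/2}/d$ is a choice made here to tame the square-root endpoints of the density (11), and bisection relies on the right-hand side of (12) being increasing and concave in $\epsilon$, so that there is a single crossing in $(0, 1]$.

```python
import math

def tree_spectrum(d, n_grid=2000):
    """Midpoint-rule nodes and weights for integrals over the density (11)."""
    radius = 2.0 * math.sqrt(d - 1) / d
    nodes, weights = [], []
    for k in range(n_grid):
        theta = math.pi * (k + 0.5) / n_grid
        lam = 1.0 + radius * math.cos(theta)
        # rho(lam) * |d lam / d theta| after substituting lam = 1 + R cos(theta)
        integrand = (d * (radius * math.sin(theta)) ** 2
                     / (2.0 * math.pi * lam * (2.0 - lam)))
        nodes.append(lam)
        weights.append(integrand * math.pi / n_grid)
    return nodes, weights

def learning_curve(nu, noise_var, p=10, a=2.0, d=3, n_grid=2000):
    """Solve the self-consistent equation (12) for epsilon by bisection."""
    nodes, weights = tree_spectrum(d, n_grid)
    decay = [(1.0 - lam / a) ** p for lam in nodes]
    kappa = 1.0 / sum(w * f for w, f in zip(weights, decay))

    def rhs(eps):
        h = nu / (eps + noise_var)
        # [1/(kappa f) + h]^{-1} rewritten as kappa f / (1 + kappa f h)
        return sum(w * kappa * f / (1.0 + kappa * f * h)
                   for w, f in zip(weights, decay))

    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if rhs(mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for nu in (0.1, 1.0, 10.0):
    print(nu, learning_curve(nu, noise_var=0.1))
```

Because only $\nu = n/V$ enters, the same call describes graphs of any large size $V$, which is the scaling property the text goes on to discuss.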
A general consequence of the form of this result is that the learning curve depends on $n$ and $V$ only through the ratio $\nu = n/V$, i.e. the number of training examples per vertex. The approximation (12) also predicts that the learning curve will have two regimes, one for small $\nu$ where $\epsilon \gg \sigma^2$ and the generalization error will be essentially independent of $\sigma^2$; and another for large $\nu$ where $\epsilon \ll \sigma^2$ so that $\epsilon$ can be neglected on the r.h.s. and one has a fully explicit expression for $\epsilon$. We compare the above prediction in Fig. 2(left) to the results of numerical simulations of the learning curves, averaged over datasets and random regular graphs. The two regimes predicted by the approximation are clearly visible; the approximation works well inside each regime but less well in the crossover between the two. One striking observation is that the approximation seems to predict the asymptotic large-$n$ behaviour exactly; this is distinct from the Euclidean case, where generally only the power law of the $n$-dependence but not its prefactor comes out accurately. To see why, we exploit that for large $n$ (where $\epsilon \ll \sigma^2$) the approximation (10) effectively neglects fluctuations in the training input "density" of a randomly drawn set of training inputs [3, 6]. This is justified in the graph case for large $\nu = n/V$, because the number of training inputs each vertex receives, $\mathrm{Binomial}(n, 1/V)$, has negligible relative fluctuations away from its mean $\nu$. In the Euclidean case there is no similar result, because all training inputs are different with probability one even for large $n$. Fig. 2(right) illustrates that for larger $a$ the difference in the crossover region between the true (numerically simulated) learning curves and our approximation becomes larger. This is because the average number of steps $p/a$ of the random walk kernel then decreases: we get closer to the limit of uncorrelated function values ($a \to \infty$, $C_{ij} = \delta_{ij}$).
In that limit and for low $\sigma^2$ and large $V$ the true learning curve is $\epsilon = \exp(-\nu)$, reflecting the probability of a training input set not containing a particular vertex, while the approximation can be shown to predict $\epsilon = \max\{1 - \nu, 0\}$, i.e. a decay of the error to zero at $\nu = 1$. Plotting these two curves (not displayed here) indeed shows the same "shape" of disagreement as in Fig. 2(right), with the approximation underestimating the true generalization error.

Figure 2: (Left) Learning curves for GP regression on random regular graphs with degree $d = 3$ and $V = 500$ (small filled circles) and $V = 1000$ (empty circles) vertices. Plotting generalization error versus $\nu = n/V$ superimposes the results for both values of $V$, as expected from the approximation (12). The lines are the quantitative predictions of this approximation. Noise level as shown, kernel parameters $a = 2$, $p = 10$. (Right) As on the left but with $V = 500$ only and for larger $a = 4$. [Plot not reproduced. Both panels: $\epsilon$ versus $\nu = n/V$ on logarithmic axes, for $\sigma^2 = 0.1, 0.01, 0.001, 0.0001, 0$.]

Figure 3: (Left) Learning curves for GP regression on random regular graphs with degree $d = 3$ and $V = 500$, and kernel parameters $a = 2$, $p = 20$; noise level $\sigma^2$ as shown. Circles: numerical simulations; lines: approximation (12). (Right) As on the left but for much larger $p = 200$ and for a single random graph, with $\sigma^2 = 0.1$. Dotted line: naive estimate $\epsilon = 1/(1 + n/\sigma^2)$. Dashed line: approximation (10) using the tree spectrum and the large $p$-limit, see (17). Solid line: (10) with numerically determined graph eigenvalues $\lambda^L_\alpha$ as input. [Plot not reproduced. Left panel: $\epsilon$ versus $\nu = n/V$ for $\sigma^2 = 0.1, 0.01, 0.001, 0.0001, 0$. Right panel: $\epsilon$ versus $n$, with curves labelled simulation, $1/(1 + n/\sigma^2)$, theory (tree) and theory (eigenv.).]

Increasing $p$ has the effect of making the kernel longer ranged, giving an effect opposite to that of increasing $a$. In line with this, larger values of $p$ improve the accuracy of the approximation (12): see Fig. 3(left). One may ask about the shape of the learning curves for large number of training examples (per vertex) $\nu$. The roughly straight lines on the right of the log-log plots discussed so far suggest that $\epsilon \propto 1/\nu$ in this regime. This is correct in the mathematical limit $\nu \to \infty$ because the graph kernel has a nonzero minimal eigenvalue $\lambda_- = \kappa V^{-1} (1 - \lambda^L_+/a)^p$: for $\nu \gg \sigma^2/(V\lambda_-)$, the square bracket in (12) can then be approximated by $\nu/(\epsilon + \sigma^2)$ and one gets (because also $\epsilon \ll \sigma^2$ in the asymptotic regime) $\epsilon \approx \sigma^2/\nu$. However, once $p$ becomes reasonably large, $V\lambda_-$ can be shown (by analysing the scaling of $\kappa$, see Appendix) to be extremely (exponentially in $p$) small; for the parameter values in Fig. 3(left) it is around $4 \times 10^{-30}$. The "terminal" asymptotic regime $\epsilon \approx \sigma^2/\nu$ is then essentially unreachable. A more detailed analysis of (12) for large $p$ and large (but not exponentially large) $\nu$, as sketched in the Appendix, yields

$$\epsilon \propto (c\sigma^2/\nu) \ln^{3/2}(\nu/(c\sigma^2)), \qquad c \propto p^{-3/2} \tag{13}$$

This shows that there are logarithmic corrections to the naive $\sigma^2/\nu$ scaling that would apply in the true terminal regime. More intriguing is the scaling of the coefficient $c$ with $p$, which implies that to reach a specified (low) generalization error one needs a number of training examples per vertex of order $\nu \propto c\sigma^2 \propto p^{-3/2}\sigma^2$. Even though the covariance kernel $C_{\ell,p}$ (in the same tree approximation that also went into (12)) approaches a limiting form for large $p$ as discussed in Sec. 2, generalization performance thus continues to improve with increasing $p$. The explanation for this must presumably be that $C_{\ell,p}$ converges to the limit (7) only at fixed $\ell$, while in the tail $\ell \propto p$, it continues to change.
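The uncorrelated limit ($a \to \infty$, $C_{ij} = \delta_{ij}$, vanishing noise) mentioned earlier in this section is simple enough to tabulate directly. In this limit a vertex contributes unit error if it is absent from the training set and zero otherwise, so for finite $V$ the exact error is $(1 - 1/V)^n$, which tends to $\exp(-\nu)$. Setting all $\lambda_\alpha = 1/V$ and $\sigma^2 \to 0$ in (10) gives $\epsilon V + n = V$, i.e. $\epsilon = 1 - \nu$ for $\nu < 1$. The function names below are assumptions for illustration.

```python
import math

def exact_uncorrelated_error(n, V):
    """Probability that a given vertex is missing from n uniform draws."""
    return (1.0 - 1.0 / V) ** n

def approx_uncorrelated_error(nu):
    """Prediction of approximation (10) in the same limit."""
    return max(1.0 - nu, 0.0)

V = 1000
for nu in (0.25, 0.5, 1.0, 2.0):
    n = int(nu * V)
    print(nu, exact_uncorrelated_error(n, V), math.exp(-nu),
          approx_uncorrelated_error(nu))
```

Since $1 - \nu \le \exp(-\nu)$ for all $\nu$, the approximation lies below the true curve everywhere in this limit, in line with the underestimate noted for Fig. 2(right).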
For finite graph sizes $V$ we know of course that loops will eventually become important as $p$ increases, around the crossover point estimated in (8). The approximation for the learning curve in (12) should then break down. The most naive estimate beyond this point would be to say that the kernel becomes nearly fully correlated, $C_{ij} \propto (d_i d_j)^{1/2}$ which in the regular case simplifies to $C_{ij} = 1$. With only one function value to learn, and correspondingly only one nonzero kernel eigenvalue $\lambda_{\alpha=1} = 1$, one would predict $\epsilon = 1/(1 + n/\sigma^2)$. Fig. 3(right) shows, however, that this significantly underestimates the actual generalization error, even though for this graph $\lambda_{\alpha=1} = 0.994$ is very close to unity so that the other eigenvalues sum to no more than 0.006. An almost perfect prediction is obtained, on the other hand, from the approximation (10) with the numerically calculated values of the Laplacian (and hence kernel) eigenvalues. The presence of the small kernel eigenvalues is again seen to cause logarithmic corrections to the naive $\epsilon \propto 1/n$ scaling. Using the tree spectrum as an approximation and exploiting the large-$p$ limit, one finds indeed (see Appendix, Eq. (17)) that $\epsilon \propto (c'\sigma^2/n) \ln^{3/2}(n/(c'\sigma^2))$ where now $n$ enters rather than $\nu = n/V$, $c'$ being a constant dependent only on $p$ and $a$: informally, the function to be learned only has a finite (rather than $\propto V$) number of degrees of freedom. The approximation (17) in fact provides a qualitatively accurate description of the data in Fig. 3(right), as the dashed line in the figure shows. We thus have the somewhat unusual situation that the tree spectrum is enough to give a good description of the learning curves even when loops are important, while (see Sec. 2) this is not so as far as the evaluation of the covariance kernel itself is concerned.
4 Summary and Outlook

We have studied theoretically the generalization performance of GP regression on graphs, focussing on the paradigmatic case of random regular graphs where every vertex has the same degree $d$. Our initial concern was with the behaviour of $p$-step random walk kernels on such graphs. If these are calculated within the usual approximation of a locally tree-like structure, then they converge to a non-trivial limiting form (7) when $p$ (or the corresponding lengthscale $\sigma$ in the closely related diffusion kernel) becomes large. The limit of full correlation between all function values on the graph is only reached because of the presence of loops, and we have estimated in (8) the values of $p$ around which the crossover to this loop-dominated regime occurs; numerical data for correlations of function values on neighbouring vertices support this result. In the second part of the paper we concentrated on the learning curves themselves. We assumed that inference is performed with the correct parameters describing the data generating process; the generalization error is then just the Bayes error. The approximation (12) gives a good qualitative description of the learning curve using only the known spectrum of a large regular tree as input. It predicts in particular that the key parameter that determines the generalization error is $\nu = n/V$, the number of training examples per vertex. We demonstrated also that the approximation is in fact more useful than in the Euclidean case because it gives exact asymptotics for the limit $\nu \gg 1$. Quantitatively, we found that the learning curves decay as $\epsilon \propto \sigma^2/\nu$ with non-trivial logarithmic correction terms. Slower power laws $\epsilon \propto \nu^{-\alpha}$ with $\alpha < 1$, as in the Euclidean case, do not appear. We attribute this to the fact that on a graph there is no analogue of the local roughness of a target function because there is a minimum distance (one step along the graph) between different input points.
Finally we looked at the learning curves for larger $p$, where loops become important. These can still be predicted quite accurately by using the tree eigenvalue spectrum as an approximation, if one keeps track of the zero graph Laplacian eigenvalue which we were able to ignore previously; the approximation shows that the generalization error scales as $\sigma^2/n$ with again logarithmic corrections. In future work we plan to extend our analysis to graphs that are not regular, including ones from application domains as well as artificial ones with power-law tails in the distribution of degrees $d$, where qualitatively new effects are to be expected. It would also be desirable to improve the predictions for the learning curve in the crossover region $\epsilon \approx \sigma^2$, which should be achievable using iterative approaches based on belief propagation that have already been shown to give accurate approximations for graph eigenvalue spectra [18]. These tools could then be further extended to study e.g. the effects of model mismatch in GP regression on random graphs, and how these are mitigated by tuning appropriate hyperparameters.

Appendix

We sketch here how to derive (13) from (12) for large $p$. Eq. (12) writes $\epsilon = g(\nu V/(\epsilon + \sigma^2))$ with

$$g(h) = \int_{\lambda^L_-}^{\lambda^L_+} d\lambda^L \, \rho(\lambda^L) \left[ \kappa^{-1} (1 - \lambda^L/a)^{-p} + hV^{-1} \right]^{-1} \tag{14}$$

and $\kappa$ determined from the condition $g(0) = 1$. (This $g(h)$ is the tree spectrum approximation to the $g(h)$ of (10).) Turning first to $g(0)$, the factor $(1 - \lambda^L/a)^p$ decays quickly to zero as $\lambda^L$ increases above $\lambda^L_-$. One can then approximate this factor according to $(1 - \lambda^L_-/a)^p [(a - \lambda^L)/(a - \lambda^L_-)]^p \approx (1 - \lambda^L_-/a)^p \exp[-(\lambda^L - \lambda^L_-)p/(a - \lambda^L_-)]$. In the regime near $\lambda^L_-$ one can also approximate the spectral density (11) by its leading square-root increase, $\rho(\lambda^L) = r(\lambda^L - \lambda^L_-)^{1/2}$, with $r = (d-1)^{1/4} d^{5/2}/[\pi(d-2)^2]$.
Switching then to a new integration variable $y = (\lambda^L - \lambda^L_-)p/(a - \lambda^L_-)$ and extending the integration limit to $\infty$ gives

$$1 = g(0) = \kappa r (1 - \lambda^L_-/a)^p \left[ p/(a - \lambda^L_-) \right]^{-3/2} \int_0^\infty dy \, \sqrt{y} \, e^{-y} \tag{15}$$

and this fixes $\kappa$. Proceeding similarly for $h > 0$ gives

$$g(h) = \kappa r (1 - \lambda^L_-/a)^p \left[ p/(a - \lambda^L_-) \right]^{-3/2} F\left( h\kappa V^{-1} (1 - \lambda^L_-/a)^p \right), \qquad F(z) = \int_0^\infty dy \, \sqrt{y} \, (e^y + z)^{-1} \tag{16}$$

Dividing by $g(0) = 1$ shows that simply $g(h) = F(hV^{-1}c^{-1})/F(0)$, where $c = 1/[\kappa(1 - \lambda^L_-/a)^p] = rF(0)[p/(a - \lambda^L_-)]^{-3/2}$ which scales as $p^{-3/2}$. In the asymptotic regime $\epsilon \ll \sigma^2$ we then have $\epsilon = g(\nu V/\sigma^2) = F(\nu/(c\sigma^2))/F(0)$ and the desired result (13) follows from the large-$z$ behaviour of $F(z) \approx z^{-1} \ln^{3/2}(z)$. One can proceed similarly for the regime where loops become important. Clearly the zero Laplacian eigenvalue with weight $1/V$ then has to be taken into account. If we assume that the remainder of the Laplacian spectrum can still be approximated by that of a tree [18], we get

$$g(h) = \frac{(V + h\kappa)^{-1} + r(1 - \lambda^L_-/a)^p \left[ p/(a - \lambda^L_-) \right]^{-3/2} F\left( h\kappa V^{-1}(1 - \lambda^L_-/a)^p \right)}{V^{-1} + r(1 - \lambda^L_-/a)^p \left[ p/(a - \lambda^L_-) \right]^{-3/2} F(0)} \tag{17}$$

The denominator here is $\kappa^{-1}$ and the two terms are proportional respectively to the covariance kernel eigenvalue $\lambda_1$, corresponding to $\lambda^L_1 = 0$ and the constant eigenfunction, and to $1 - \lambda_1$. Dropping the first terms in the numerator and denominator of (17) by taking $V \to \infty$ leads back to the previous analysis as it should. For a situation as in Fig. 3(right), on the other hand, where $\lambda_1$ is close to unity, we have $\kappa \approx V$ and so

$$g(h) \approx (1 + h)^{-1} + rV(1 - \lambda^L_-/a)^p \left[ p/(a - \lambda^L_-) \right]^{-3/2} F\left( h(1 - \lambda^L_-/a)^p \right) \tag{18}$$

The second term, coming from the small kernel eigenvalues, is the more slowly decaying because it corresponds to fine detail of the target function that needs many training examples to learn accurately. It will therefore dominate the asymptotic behaviour of the learning curve: $\epsilon = g(n/\sigma^2) \propto F(n/(c'\sigma^2))$ with $c' = (1 - \lambda^L_-/a)^{-p}$ independent of $V$. The large-$n$ tail of the learning curve in Fig.
3(right) is consistent with this form.

References

[1] C E Rasmussen and C K I Williams. Gaussian processes for regression. In D S Touretzky, M C Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press.
[2] M Opper. Regression with Gaussian processes: Average case performance. In I K Kwok-Yee, M Wong, I King, and Dit-Yun Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, pages 17–23. Springer, 1997.
[3] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, Advances in Neural Information Processing Systems 11, pages 344–350, Cambridge, MA, 1999. MIT Press.
[4] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, Advances in Neural Information Processing Systems 11, pages 302–308, Cambridge, MA, 1999. MIT Press.
[5] C K I Williams and F Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Mach. Learn., 40(1):77–102, 2000.
[6] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, Advances in Neural Information Processing Systems 13, pages 273–279, Cambridge, MA, 2001. MIT Press.
[7] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 463–469, Cambridge, MA, 2002. MIT Press.
[8] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002.
[9] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 519–526, Cambridge, MA, 2002. MIT Press.
[10] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. 2005.
[11] M Herbster, M Pontil, and L Wainer. Online learning over graphs. In ICML ’05: Proceedings of the 22nd International Conference on Machine Learning, pages 305–312, New York, NY, USA, 2005. ACM.
[12] A J Smola and R Kondor. Kernels and regularization on graphs. In M Warmuth and B Schölkopf, editors, Proc. Conference on Learning Theory (COLT), Lect. Notes Comp. Sci., pages 144–158. Springer, Heidelberg, 2003.
[13] R I Kondor and J D Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML ’02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann.
[14] F R K Chung. Spectral graph theory. Number 92 in Regional Conference Series in Mathematics. American Mathematical Society, 1997.
[15] A Steger and N C Wormald. Generating random regular graphs quickly. Combinator. Probab. Comput., 8(4):377–396, 1999.
[16] F Chung and S-T Yau. Coverings, heat kernels and spanning trees. The Electronic Journal of Combinatorics, 6(1):R12, 1999.
[17] C Monthus and C Texier. Random walk on the Bethe lattice and hyperbolic Brownian motion. J. Phys. A, 29(10):2399–2409, 1996.
[18] T Rogers, I Perez Castillo, R Kuehn, and K Takeda. Cavity approach to the spectral density of sparse symmetric random matrices. Phys. Rev. E, 78(3):031116, 2008.
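The appendix above rests on two properties of the function F(z), defined there as the integral of sqrt(y)/(e^y + z) over y from 0 to infinity: its value at z = 0, and its large-z decay of the form z^{-1} ln^{3/2}(z). The following is a minimal numerical sketch for checking both, not code from the paper. The function names, the Simpson-rule discretisation, the cutoff `t_max` and the explicit prefactor 2/3 in `leading_order` are assumptions introduced here for illustration; the source states the asymptotic form only up to a constant.

```python
import math


def F(z, t_max=8.0, n=20000):
    """Approximate F(z) = int_0^inf sqrt(y) / (exp(y) + z) dy.

    The substitution y = t**2 turns the integrand into the smooth function
    2 t^2 / (exp(t^2) + z), which is integrated over [0, t_max] with the
    composite Simpson rule. t_max and n are illustrative choices.
    """
    if n % 2:
        n += 1  # Simpson's rule needs an even number of intervals
    h = t_max / n

    def integrand(t):
        return 2.0 * t * t / (math.exp(t * t) + z)

    total = integrand(0.0) + integrand(t_max)
    for i in range(1, n):
        weight = 4.0 if i % 2 else 2.0
        total += weight * integrand(i * h)
    return total * h / 3.0


def leading_order(z):
    """Leading large-z form (2/3) * ln(z)**1.5 / z (prefactor assumed here)."""
    return (2.0 / 3.0) * math.log(z) ** 1.5 / z


if __name__ == "__main__":
    # F(0) is the integral of sqrt(y) * exp(-y), i.e. Gamma(3/2) = sqrt(pi)/2.
    print("F(0) =", F(0.0), " sqrt(pi)/2 =", math.sqrt(math.pi) / 2)
    # The ratio to the leading-order form should drift towards 1 as z grows.
    for z in (1e2, 1e3, 1e4, 1e6):
        print("z =", z, " F(z)/leading_order(z) =", F(z) / leading_order(z))
```

The ratio approaches 1 only slowly, because the corrections to the leading form are suppressed by inverse powers of ln(z) rather than of z. This fits the appendix's description of the result as an asymptotic one.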
2 0.97879004 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification
Author: Sennay Ghebreab, Steven Scholte, Victor Lamme, Arnold Smeulders
Abstract: Contrast statistics of the majority of natural images conform to a Weibull distribution. This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. Here we investigated whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. In a learning phase, we measured EEG activity of 32 subjects viewing brief flashes of 700 natural scenes. From these neural measurements and the contrast statistics of the natural image stimuli, we derived an across subject Weibull response model. We used this model to predict the EEG responses to 100 new natural scenes and estimated which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. In almost 90 percent of the cases our model accurately predicted the observed scene. Moreover, in most failed cases, the scene mistaken for the observed scene was visually similar to the observed scene itself. Similar results were obtained in a separate experiment in which 16 other subjects were presented with artificial occlusion models of natural images. Together, these results suggest that Weibull contrast statistics of natural images contain a considerable amount of visual gist information to warrant rapid image identification.
same-paper 3 0.96407419 190 nips-2009-Polynomial Semantic Indexing
Author: Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Mehryar Mohri
Abstract: We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods.
4 0.96359122 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming
Author: Xiao-ming Wu, Anthony M. So, Zhenguo Li, Shuo-yen R. Li
Abstract: Kernel learning is a powerful framework for nonlinear data modeling. Using the kernel trick, a number of problems have been formulated as semidefinite programs (SDPs). These include Maximum Variance Unfolding (MVU) (Weinberger et al., 2004) in nonlinear dimensionality reduction, and Pairwise Constraint Propagation (PCP) (Li et al., 2008) in constrained clustering. Although in theory SDPs can be efficiently solved, the high computational complexity incurred in numerically processing the huge linear matrix inequality constraints has rendered the SDP approach unscalable. In this paper, we show that a large class of kernel learning problems can be reformulated as semidefinite-quadratic-linear programs (SQLPs), which only contain a simple positive semidefinite constraint, a second-order cone constraint and a number of linear constraints. These constraints are much easier to process numerically, and the gain in speedup over previous approaches is at least of the order $m^{2.5}$, where $m$ is the matrix dimension. Experimental results are also presented to show the superb computational efficiency of our approach.
5 0.94531661 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks
Author: Honglak Lee, Peter Pham, Yan Largman, Andrew Y. Ng
Abstract: In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. In the case of speech data, we show that the learned features correspond to phones/phonemes. In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks.
6 0.93573898 176 nips-2009-On Invariance in Hierarchical Models
7 0.81519681 151 nips-2009-Measuring Invariances in Deep Networks
8 0.76604211 119 nips-2009-Kernel Methods for Deep Learning
9 0.73419058 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning
10 0.71174574 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
11 0.71006519 137 nips-2009-Learning transport operators for image manifolds
12 0.69951957 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters
13 0.69803935 196 nips-2009-Quantification and the language of thought
14 0.69337642 95 nips-2009-Fast subtree kernels on graphs
15 0.69247043 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection
16 0.68226928 104 nips-2009-Group Sparse Coding
17 0.6778003 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs
18 0.6702559 2 nips-2009-3D Object Recognition with Deep Belief Nets
19 0.66642714 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
20 0.65951067 96 nips-2009-Filtering Abstract Senses From Image Search Results