nips nips2007 nips2007-83 knowledge-graph by maker-knowledge-mining

83 nips-2007-Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks


Source: pdf

Author: Ben Carterette, Rosie Jones

Abstract: We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the standard evaluation metric discounted cumulative gain (DCG), so comparisons can be made across time and datasets. This contrasts with previous methods which can provide only pair-wise relevance judgments between results shown for the same query. When no relevance judgments are available, we can identify the better of two ranked lists up to 82% of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94% of the time. While our experiments are on sponsored search results, which is the financial backbone of web search, our method is general enough to be applicable to algorithmic web search results as well. Furthermore, we give an algorithm to guide the selection of additional documents to judge to improve confidence. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 com Abstract We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. [sent-5, score-0.853]

2 This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. [sent-6, score-1.322]

3 After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. [sent-7, score-1.633]

4 These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the standard evaluation metric discounted cumulative gain (DCG), so comparisons can be made across time and datasets. [sent-8, score-0.451]

5 This contrasts with previous methods which can provide only pair-wise relevance judgments between results shown for the same query. [sent-9, score-0.797]

6 When no relevance judgments are available, we can identify the better of two ranked lists up to 82% of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94% of the time. [sent-10, score-1.765]

7 While our experiments are on sponsored search results, which is the financial backbone of web search, our method is general enough to be applicable to algorithmic web search results as well. [sent-11, score-0.474]

8 Furthermore, we give an algorithm to guide the selection of additional documents to judge to improve confidence. [sent-12, score-0.192]

9 1 Introduction Web search engine evaluation is an expensive process: it requires relevance judgments that indicate the degree of relevance of each document retrieved for each query in a testing set. [sent-13, score-1.715]

10 In addition, reusing old relevance judgements to evaluate an updated ranking function can be problematic, since documents disappear or become obsolete, and the distribution of queries entered changes [15]. [sent-14, score-0.835]

11 Click data from web searchers, used in aggregate, can provide valuable evidence about the relevance of each document. [sent-15, score-0.567]

12 The general problem with using clicks as relevance judgments is that clicks are biased. [sent-16, score-1.605]

13 They are biased to the top of the ranking [12], to trusted sites, to attractive abstracts; they are also biased by the type of query and by other things shown on the results page. [sent-17, score-0.263]

14 To cope with this, we introduce a family of models relating clicks to relevance. [sent-18, score-0.413]

15 By conditioning on clicks, we can predict the relevance of a document or a set of documents. [sent-19, score-0.625]

16 [12] used eye-tracking devices to track what documents users looked at before clicking. [sent-21, score-0.234]

17 They found that users tend to look at results ranked higher than the one they click on more often than they look at results ranked lower, and this information can in principle be used to train a search engine using these “preference judgments”[10]. [sent-22, score-0.539]

18 The problem with using preference judgments inferred from clicks for learning is that they will tend to learn to reverse the list. [sent-23, score-0.839]

19 A click at the lowest rank is preferred to everything else, while a click at the highest rank is preferred to nothing ∗ Work done while author was at Yahoo! [sent-24, score-0.5]

20 Radlinski and Joachims [13] suggest an antidote to this: randomly swapping adjacent pairs of documents. [sent-26, score-0.024]

21 This ensures that users will not prefer document i to document i + 1 solely because of rank. [sent-27, score-0.315]

22 However, we may not wish to show a suboptimal document ordering in order acquire data. [sent-28, score-0.198]

23 Our approach instead will be to use discounted cumulative gain (DCG [9]), an evaluation metric commonly used in search engine evaluation. [sent-29, score-0.5]

24 Using click data, we can estimate the confidence that a difference in DCG exists between two rankings without having any relevance judgments for the documents ranked. [sent-30, score-1.134]

25 We will show how a comparison of ranking functions can be performed when clicks are available but complete relevance judgments are not. [sent-31, score-1.322]

26 After an initial training phase with a few relevance judgments, the relevance of unjudged documents can be predicted from clickthrough rates. [sent-32, score-1.195]

27 The confidence in the evaluation can be estimated with the knowledge of which documents are most frequently clicked. [sent-33, score-0.255]

28 Confidence can be dramatically increased with only a few more judiciously chosen relevance judgments. [sent-34, score-0.461]

29 Section 2 covers previous work on using clickthrough rates and on estimating evaluation metrics. [sent-36, score-0.195]

30 Section 3 describes the evaluation of web retrieval systems using the metric discounted cumulative gain (DCG) and shows how to estimate the confidence that a difference exists when relevance judgments are missing. [sent-37, score-1.227]

31 Our model for predicting relevance from clicks is described in Section 4. [sent-38, score-0.895]

32 We discuss our data in Section 5 and in Section 6 we return to the task of estimating relevance for the evaluation of search engines. [sent-39, score-0.636]

33 Our experiments are conducted in the context of sponsored search, but the methods we use are general enough to translate to general web search engines. [sent-40, score-0.27]

34 2 Previous Work There has been a great deal of work on low-cost evaluation in TREC-type settings ([20, 6, 16, 5] are a few), but we are aware of little for the web. [sent-41, score-0.078]

35 As discussed above, Joachims [10, 12] and Radlinski and Joachims [13] conducted seminal work on using clicks to infer user preferences between documents. [sent-42, score-0.521]

36 [2, 1] used and applied models of user interaction to predict preference relationships and to improve ranking functions. [sent-44, score-0.311]

37 They use many features beyond clickthrough rate, and show that they can learn preference relationships using these features. [sent-45, score-0.208]

38 Our work is superficially similar, but we explicitly model dependencies among clicks for results at different ranks with the purpose of learning probabilistic relevance judgments. [sent-46, score-0.893]

39 These relevance judgments are a stronger result than preference ordering, since preference ordering can be derived from them. [sent-47, score-1.009]

40 In addition, given a strong probabilistic model of relevance from clicks, better combined models can be built. [sent-48, score-0.461]

41 [7] give a theoretical model for the rank-position effects of click-through rate, and build theoretical models for search engine quality using them. [sent-50, score-0.199]

42 They do not evaluate estimates of document quality, while we empirically compare relevance estimated from clicks to manual relevance judgments. [sent-51, score-1.503]

43 Joachims [11] investigated the use of clickthrough rates for evaluation, showing that relative differences in performance could be measured by interleaving results from two ranking functions, then observing which function produced results that are more frequently clicked. [sent-52, score-0.353]

44 As we will show, interleaving results can change user behavior, and not necessarily in a way that will lead to the user clicking more relevant documents. [sent-53, score-0.147]

45 Soboroff [15] proposed methods for maintaining the relevance judgments in a corpus that is constantly changing. [sent-54, score-0.798]

46 [3] investigated minimum variance unbiased estimators of system performance, and Carterette et al. [sent-56, score-0.049]

47 [5] introduced the idea of treating an evaluation measure as a random variable with a distribution over all possible relevance judgments. [sent-57, score-0.539]

48 This can be used to create an optimal sampling strategy to obtain judgments, and to estimate the confidence in an evaluation measure. [sent-58, score-0.078]

49 DCG is defined as the sum of the “gain” of presenting a particular document times a “discount” of presenting it at a particular rank, up to some maximum rank : DCG = i=1 gaini discounti . [sent-61, score-0.253]

50 DCG = rel1 + i=2 reli log2 i The constants reli are the relevance scores. [sent-63, score-0.583]

51 Human assessors typically judge documents on an ordinal scale, with labels such as “Perfect”, “Excellent”, “Good”, “Fair”, and “Bad”. [sent-64, score-0.231]

52 These are then mapped to a numeric scale for use in DCG computation. [sent-65, score-0.019]

53 We will denote five levels of relevance aj , with a1 > a2 > a3 > a4 > a5 . [sent-66, score-0.523]

54 In this section we will show that we can compare ranking functions without having labeled all the documents. [sent-67, score-0.13]

55 1 Estimating DCG from Incomplete Information DCG requires that the ranked documents have been judged with respect to a query. [sent-69, score-0.238]

56 If the index has recently been updated, or a new algorithm is retrieving new results, we have documents that have not been judged. [sent-70, score-0.176]

57 Rather than ask a human assessor for a judgment, we may be able to infer something about DCG based on the judgments we already have. [sent-71, score-0.378]

58 Let Xi be a random variable representing the relevance of document i. [sent-72, score-0.598]

59 Since relevance is ordinal, the distribution of Xi is multinomial. [sent-73, score-0.461]

60 We will define pij = p(Xi = aj ) for 1 ≤ j ≤ 5 with 5 5 j=1 pij = 1. [sent-74, score-0.274]

61 The expectation of Xi is E[Xi ] = j=1 pij aj , and its variance is V ar[Xi ] = 5 2 2 j=1 pij aj − E[Xi ] . [sent-75, score-0.336]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dcg', 0.476), ('relevance', 0.461), ('clicks', 0.413), ('judgments', 0.318), ('click', 0.16), ('documents', 0.156), ('document', 0.137), ('ranking', 0.13), ('pij', 0.106), ('web', 0.106), ('engine', 0.102), ('clickthrough', 0.097), ('search', 0.097), ('joachims', 0.094), ('preference', 0.087), ('ar', 0.078), ('evaluation', 0.078), ('dence', 0.075), ('discounted', 0.074), ('gain', 0.065), ('aj', 0.062), ('carterette', 0.061), ('interleaving', 0.061), ('radlinski', 0.061), ('reli', 0.061), ('ranked', 0.059), ('engines', 0.054), ('xi', 0.054), ('formalization', 0.053), ('rank', 0.05), ('cumulative', 0.047), ('sponsored', 0.045), ('user', 0.043), ('amherst', 0.043), ('query', 0.041), ('retrieval', 0.041), ('users', 0.041), ('ordinal', 0.039), ('yahoo', 0.039), ('rankings', 0.039), ('con', 0.039), ('ordering', 0.038), ('metric', 0.037), ('judge', 0.036), ('putting', 0.036), ('cov', 0.034), ('presenting', 0.033), ('discount', 0.031), ('manual', 0.031), ('preferred', 0.03), ('score', 0.03), ('predict', 0.027), ('obsolete', 0.027), ('reciprocal', 0.027), ('biased', 0.026), ('et', 0.025), ('investigated', 0.024), ('judgment', 0.024), ('judgement', 0.024), ('abstracts', 0.024), ('judgements', 0.024), ('swapping', 0.024), ('ave', 0.024), ('relationships', 0.024), ('human', 0.023), ('seminal', 0.023), ('acquire', 0.023), ('entered', 0.023), ('trusted', 0.023), ('backbone', 0.023), ('judged', 0.023), ('conducted', 0.022), ('disappear', 0.021), ('nancial', 0.021), ('frequently', 0.021), ('evaluating', 0.021), ('tend', 0.021), ('predicting', 0.021), ('retrieving', 0.02), ('devices', 0.02), ('everything', 0.02), ('retrieved', 0.02), ('ben', 0.02), ('phase', 0.02), ('rates', 0.02), ('infer', 0.02), ('updated', 0.02), ('numeric', 0.019), ('constantly', 0.019), ('ranks', 0.019), ('leverages', 0.019), ('jones', 0.019), ('lists', 0.018), ('sites', 0.018), ('contrasts', 0.018), ('derived', 0.018), ('looked', 0.017), ('things', 0.017), ('cially', 0.017), ('something', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 83 nips-2007-Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks

Author: Ben Carterette, Rosie Jones

Abstract: We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the standard evaluation metric discounted cumulative gain (DCG), so comparisons can be made across time and datasets. This contrasts with previous methods which can provide only pair-wise relevance judgments between results shown for the same query. When no relevance judgments are available, we can identify the better of two ranked lists up to 82% of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94% of the time. While our experiments are on sponsored search results, which is the financial backbone of web search, our method is general enough to be applicable to algorithmic web search results as well. Furthermore, we give an algorithm to guide the selection of additional documents to judge to improve confidence. 1

2 0.31244475 126 nips-2007-McRank: Learning to Rank Using Multiple Classification and Gradient Boosting

Author: Ping Li, Qiang Wu, Christopher J. Burges

Abstract: We cast the ranking problem as (1) multiple classification (“Mc”) (2) multiple ordinal classification, which lead to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact that perfect classifications result in perfect DCG scores and the DCG errors are bounded by classification errors. We propose using the Expected Relevance to convert class probabilities into ranking scores. The class probabilities are learned using a gradient boosting tree algorithm. Evaluations on large-scale datasets show that our approach can improve LambdaRank [5] and the regressions-based ranker [6], in terms of the (normalized) DCG scores. An efficient implementation of the boosting tree algorithm is also presented.

3 0.23181422 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

Author: Zhaohui Zheng, Hongyuan Zha, Tong Zhang, Olivier Chapelle, Keke Chen, Gordon Sun

Abstract: We present a general boosting method extending functional gradient boosting to optimize complex loss functions that are encountered in many machine learning problems. Our approach is based on optimization of quadratic upper bounds of the loss functions which allows us to present a rigorous convergence analysis of the algorithm. More importantly, this general framework enables us to use a standard regression base learner such as single regression tree for £tting any loss function. We illustrate an application of the proposed method in learning ranking functions for Web search by combining both preference data and labeled data for training. We present experimental results for Web search using data from a commercial search engine that show signi£cant improvements of our proposed methods over some existing methods. 1

4 0.12579961 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

Author: Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, Alex J. Smola

Abstract: In this paper, we consider collaborative filtering as a ranking problem. We present a method which uses Maximum Margin Matrix Factorization and optimizes ranking instead of rating. We employ structured output prediction to optimize directly for ranking scores. Experimental results show that our method gives very good ranking scores and scales well on collaborative filtering tasks. 1

5 0.075346626 142 nips-2007-Non-parametric Modeling of Partially Ranked Data

Author: Guy Lebanon, Yi Mao

Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. In particular, we demonstrate for the first time a non-parametric coherent and consistent model capable of efficiently aggregating partially ranked data of different types. 1

6 0.065667525 19 nips-2007-Active Preference Learning with Discrete Choice Data

7 0.056361835 183 nips-2007-Spatial Latent Dirichlet Allocation

8 0.054876473 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data

9 0.053352822 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

10 0.049506031 189 nips-2007-Supervised Topic Models

11 0.047215044 129 nips-2007-Mining Internet-Scale Software Repositories

12 0.036617786 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index

13 0.036330707 39 nips-2007-Boosting the Area under the ROC Curve

14 0.032891404 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines

15 0.031835631 8 nips-2007-A New View of Automatic Relevance Determination

16 0.031616267 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

17 0.02837869 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling

18 0.027173225 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation

19 0.026336242 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs

20 0.025779536 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.105), (1, 0.039), (2, -0.072), (3, -0.01), (4, 0.056), (5, 0.066), (6, 0.166), (7, -0.136), (8, 0.131), (9, -0.091), (10, -0.121), (11, 0.031), (12, 0.398), (13, 0.173), (14, 0.007), (15, 0.012), (16, -0.086), (17, 0.023), (18, -0.061), (19, 0.045), (20, -0.019), (21, -0.149), (22, 0.045), (23, -0.055), (24, -0.074), (25, -0.022), (26, 0.06), (27, -0.017), (28, 0.081), (29, 0.032), (30, -0.012), (31, -0.02), (32, 0.014), (33, -0.03), (34, 0.092), (35, -0.014), (36, -0.111), (37, -0.082), (38, -0.019), (39, -0.003), (40, 0.056), (41, 0.065), (42, -0.076), (43, -0.127), (44, 0.08), (45, 0.069), (46, 0.001), (47, 0.044), (48, 0.033), (49, -0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98202556 83 nips-2007-Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks

Author: Ben Carterette, Rosie Jones

Abstract: We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the standard evaluation metric discounted cumulative gain (DCG), so comparisons can be made across time and datasets. This contrasts with previous methods which can provide only pair-wise relevance judgments between results shown for the same query. When no relevance judgments are available, we can identify the better of two ranked lists up to 82% of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94% of the time. While our experiments are on sponsored search results, which is the financial backbone of web search, our method is general enough to be applicable to algorithmic web search results as well. Furthermore, we give an algorithm to guide the selection of additional documents to judge to improve confidence. 1

2 0.76559681 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

Author: Zhaohui Zheng, Hongyuan Zha, Tong Zhang, Olivier Chapelle, Keke Chen, Gordon Sun

Abstract: We present a general boosting method extending functional gradient boosting to optimize complex loss functions that are encountered in many machine learning problems. Our approach is based on optimization of quadratic upper bounds of the loss functions which allows us to present a rigorous convergence analysis of the algorithm. More importantly, this general framework enables us to use a standard regression base learner such as single regression tree for £tting any loss function. We illustrate an application of the proposed method in learning ranking functions for Web search by combining both preference data and labeled data for training. We present experimental results for Web search using data from a commercial search engine that show signi£cant improvements of our proposed methods over some existing methods. 1

3 0.75913477 126 nips-2007-McRank: Learning to Rank Using Multiple Classification and Gradient Boosting

Author: Ping Li, Qiang Wu, Christopher J. Burges

Abstract: We cast the ranking problem as (1) multiple classification (“Mc”) (2) multiple ordinal classification, which lead to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact that perfect classifications result in perfect DCG scores and the DCG errors are bounded by classification errors. We propose using the Expected Relevance to convert class probabilities into ranking scores. The class probabilities are learned using a gradient boosting tree algorithm. Evaluations on large-scale datasets show that our approach can improve LambdaRank [5] and the regressions-based ranker [6], in terms of the (normalized) DCG scores. An efficient implementation of the boosting tree algorithm is also presented.

4 0.60731411 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

Author: Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, Alex J. Smola

Abstract: In this paper, we consider collaborative filtering as a ranking problem. We present a method which uses Maximum Margin Matrix Factorization and optimizes ranking instead of rating. We employ structured output prediction to optimize directly for ranking scores. Experimental results show that our method gives very good ranking scores and scales well on collaborative filtering tasks. 1

5 0.37735367 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index

Author: Harald Steck, Balaji Krishnapuram, Cary Dehing-oberije, Philippe Lambin, Vikas C. Raykar

Abstract: In this paper, we show that classical survival analysis involving censored data can naturally be cast as a ranking problem. The concordance index (CI), which quantifies the quality of rankings, is the standard performance measure for model assessment in survival analysis. In contrast, the standard approach to learning the popular proportional hazard (PH) model is based on Cox’s partial likelihood. We devise two bounds on CI–one of which emerges directly from the properties of PH models–and optimize them directly. Our experimental results suggest that all three methods perform about equally well, with our new approach giving slightly better results. We also explain why a method designed to maximize the Cox’s partial likelihood also ends up (approximately) maximizing the CI. 1

6 0.3605026 19 nips-2007-Active Preference Learning with Discrete Choice Data

7 0.33060861 142 nips-2007-Non-parametric Modeling of Partially Ranked Data

8 0.23785666 39 nips-2007-Boosting the Area under the ROC Curve

9 0.20641965 85 nips-2007-Experience-Guided Search: A Theory of Attentional Control

10 0.19887431 129 nips-2007-Mining Internet-Scale Software Repositories

11 0.18323378 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data

12 0.16053885 183 nips-2007-Spatial Latent Dirichlet Allocation

13 0.1581116 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events

14 0.15186572 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation

15 0.14961867 16 nips-2007-A learning framework for nearest neighbor search

16 0.13827422 62 nips-2007-Convex Learning with Invariances

17 0.13560452 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines

18 0.13557065 189 nips-2007-Supervised Topic Models

19 0.13141774 27 nips-2007-Anytime Induction of Cost-sensitive Trees

20 0.13064159 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.039), (13, 0.048), (16, 0.033), (21, 0.097), (31, 0.011), (34, 0.062), (35, 0.022), (36, 0.011), (41, 0.306), (47, 0.073), (49, 0.012), (83, 0.071), (85, 0.01), (87, 0.054), (90, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76840532 83 nips-2007-Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks

Author: Ben Carterette, Rosie Jones

Abstract: We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the standard evaluation metric discounted cumulative gain (DCG), so comparisons can be made across time and datasets. This contrasts with previous methods which can provide only pair-wise relevance judgments between results shown for the same query. When no relevance judgments are available, we can identify the better of two ranked lists up to 82% of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94% of the time. While our experiments are on sponsored search results, which is the financial backbone of web search, our method is general enough to be applicable to algorithmic web search results as well. Furthermore, we give an algorithm to guide the selection of additional documents to judge to improve confidence. 1

2 0.64308608 58 nips-2007-Consistent Minimization of Clustering Objective Functions

Author: Ulrike V. Luxburg, Stefanie Jegelka, Michael Kaufmann, Sébastien Bubeck

Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal. As an alternative, we suggest the paradigm of “nearest neighbor clustering”. Instead of selecting the best out of all partitions of the sample, it only considers partitions in some restricted function class. Using tools from statistical learning theory we prove that nearest neighbor clustering is statistically consistent. Moreover, its worst case complexity is polynomial by construction, and it can be implemented with small average case complexity using branch and bound. 1

3 0.44510719 126 nips-2007-McRank: Learning to Rank Using Multiple Classification and Gradient Boosting

Author: Ping Li, Qiang Wu, Christopher J. Burges

Abstract: We cast the ranking problem as (1) multiple classification (“Mc”) (2) multiple ordinal classification, which lead to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact that perfect classifications result in perfect DCG scores and the DCG errors are bounded by classification errors. We propose using the Expected Relevance to convert class probabilities into ranking scores. The class probabilities are learned using a gradient boosting tree algorithm. Evaluations on large-scale datasets show that our approach can improve LambdaRank [5] and the regressions-based ranker [6], in terms of the (normalized) DCG scores. An efficient implementation of the boosting tree algorithm is also presented.

4 0.4401311 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

Author: Bill Triggs, Jakob J. Verbeek

Abstract: Conditional Random Fields (CRFs) are an effective tool for a variety of different data segmentation and labeling tasks including visual scene interpretation, which seeks to partition images into their constituent semantic-level regions and assign appropriate class labels to each region. For accurate labeling it is important to capture the global context of the image as well as local information. We introduce a CRF based scene labeling model that incorporates both local features and features aggregated over the whole image or large sections of it. Secondly, traditional CRF learning requires fully labeled datasets which can be costly and troublesome to produce. We introduce a method for learning CRFs from datasets with many unlabeled nodes by marginalizing out the unknown labels so that the log-likelihood of the known ones can be maximized by gradient ascent. Loopy Belief Propagation is used to approximate the marginals needed for the gradient and log-likelihood calculations and the Bethe free-energy approximation to the log-likelihood is monitored to control the step size. Our experimental results show that effective models can be learned from fragmentary labelings and that incorporating top-down aggregate features significantly improves the segmentations. The resulting segmentations are compared to the state-of-the-art on three different image datasets. 1

5 0.44001466 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

Author: Omer Bobrowski, Ron Meir, Shy Shoham, Yonina Eldar

Abstract: It is becoming increasingly evident that organisms acting in uncertain dynamical environments often employ exact or approximate Bayesian statistical calculations in order to continuously estimate the environmental state, integrate information from multiple sensory modalities, form predictions and choose actions. What is less clear is how these putative computations are implemented by cortical neural networks. An additional level of complexity is introduced because these networks observe the world through spike trains received from primary sensory afferents, rather than directly. A recent line of research has described mechanisms by which such computations can be implemented using a network of neurons whose activity directly represents a probability distribution across the possible “world states”. Much of this work, however, uses various approximations, which severely restrict the domain of applicability of these implementations. Here we make use of rigorous mathematical results from the theory of continuous time point process filtering, and show how optimal real-time state estimation and prediction may be implemented in a general setting using linear neural networks. We demonstrate the applicability of the approach with several examples, and relate the required network properties to the statistical nature of the environment, thereby quantifying the compatibility of a given network with its environment. 1

6 0.43994769 59 nips-2007-Continuous Time Particle Filtering for fMRI

7 0.43568036 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

8 0.43434206 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

9 0.43295014 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

10 0.43270853 56 nips-2007-Configuration Estimates Improve Pedestrian Finding

11 0.43161291 69 nips-2007-Discriminative Batch Mode Active Learning

12 0.43135118 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

13 0.43044996 164 nips-2007-Receptive Fields without Spike-Triggering

14 0.43036324 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

15 0.42909941 189 nips-2007-Supervised Topic Models

16 0.42742908 86 nips-2007-Exponential Family Predictive Representations of State

17 0.42640468 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

18 0.42594564 19 nips-2007-Active Preference Learning with Discrete Choice Data

19 0.42499277 100 nips-2007-Hippocampal Contributions to Control: The Third Way

20 0.42433947 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons