nips nips2010 nips2010-277 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d. sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after being represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms.
Reference: text
sentIndex sentText sentNum sentScore
1 Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. [sent-10, score-1.165]
2 Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. [sent-19, score-0.439]
3 However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after being represented by features extracted from query-document matching. [sent-20, score-1.433]
4 The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. [sent-22, score-0.67]
5 Each document is represented by a set of features, measuring the matching between document and query. [sent-26, score-0.446]
6 Widely-used features include the frequency of query terms in the document and the query likelihood given by the language model of the document. [sent-27, score-0.871]
7 A ranking function, which combines the features to predict the relevance of a document to the query, is learned by minimizing a loss function defined on the training data. [sent-28, score-0.673]
8 Then for a new query, the ranking function is used to rank its associated documents according to their predicted relevance. [sent-29, score-0.924]
9 Many learning to rank algorithms have been proposed, among which the pairwise ranking algorithms such as Ranking SVMs [12, 13], RankBoost [11], and RankNet [5] have been widely applied. [sent-30, score-0.605]
10 To understand existing ranking algorithms, and to guide the development of new ones, people have studied the learning theory for ranking, in particular, the generalization ability of ranking methods. [sent-31, score-0.976]
11 Generalization ability is usually represented by a bound of the deviation between the expected and empirical risks for an arbitrary ranking function in the hypothesis space. [sent-32, score-0.493]
12 E.g., the generalization bounds of RankBoost [11], stable pairwise ranking algorithms like Ranking SVMs, … (∗ The work was performed when the first author was an intern at Microsoft Research Asia.) [sent-37, score-0.807]
13 We call these generalization bounds “document-level generalization bounds”, which converge to zero when the number of documents in the training set approaches infinity. [sent-39, score-0.865]
14 E.g., the generalization bounds of stable pairwise ranking algorithms like Ranking SVMs and IR-SVM [6] and listwise algorithms were obtained in [15] and [14]. [sent-43, score-0.923]
15 When analyzing the query-level generalization bounds, the documents associated with each query are usually regarded as a deterministic set [10, 14], and no random sampling of documents is assumed. [sent-45, score-1.411]
16 As a result, query-level generalization bounds converge to zero only when the number of queries approaches infinity, no matter how many documents are associated with them. [sent-46, score-0.902]
17 While the existing generalization bounds can explain the behaviors of some ranking algorithms, they also have their limitations. [sent-47, score-0.714]
18 (1) The i.i.d. assumption on documents makes the document-level generalization bounds not directly applicable to ranking in IR. [sent-51, score-0.652]
19 This is because it has been widely accepted that the documents associated with different queries do not follow the same distribution [17] and the documents associated with the same query are no longer independent after being represented by document-query matching features. [sent-52, score-1.387]
20 (2) It is not reasonable for query-level generalization bounds to assume that one can obtain the document set associated with each query in a deterministic manner. [sent-53, score-0.843]
21 For example, in the labeling process of TREC, the ranking results submitted by all TREC participants were put together and then a proportion of them were selected and presented to human annotators for labeling. [sent-55, score-0.48]
22 In this process, the number of participants, the ranking result given by each participant, the overlap between different ranking results, the labeling budget, and the selection methodology can all influence which documents and how many documents are labeled for each query. [sent-56, score-1.581]
23 As a result, it is more reasonable to assume a random sampling process for the generation of labeled documents per query. [sent-57, score-0.563]
24 In the first layer, queries are i.i.d. sampled from the query space according to a fixed but unknown probability distribution. [sent-62, score-0.39]
25-27 In the second layer, for each query, documents are i.i.d. sampled from the document space according to a fixed but unknown conditional probability distribution determined by the query (i.e., documents associated with different queries do not have the identical distribution). [sent-63, score-0.389] [sent-66, score-0.601] [sent-68, score-0.62]
28 Note that the feature representations of the documents with the same query, as random variables, are not independent any longer. [sent-70, score-0.413]
29 But they are conditionally independent if the query is given. [sent-71, score-0.354]
30 Based on the framework, we have performed two-layer generalization analysis for pairwise ranking algorithms. [sent-73, score-0.698]
31 However, the task is non-trivial mainly because the two-layer sampling does not correspond to a typical empirical process: the documents for different queries are not identically distributed while the documents for the same query are not independent. [sent-74, score-1.436]
32 To tackle the challenge, we carefully decompose the expected risk according to the query and document layers, and employ a new concept called two-layer Rademacher average. [sent-76, score-0.717]
33 The new concept accurately describes the complexity in the two-layer setting, and its reduced versions can be used to derive meaningful bounds for the query-layer error and the document-layer error, respectively. [sent-77, score-0.784]
34 (Related Work: Pairwise Learning to Rank) Pairwise ranking is one of the major approaches to learning to rank, and has been widely adopted in real applications [5, 11, 12, 13]. [sent-81, score-0.391]
35 The process of pairwise ranking can be described as follows. [sent-82, score-0.55]
36 Each query $q_i$ is associated with $m_i$ documents $\{d_1^i, \cdots, d_{m_i}^i\}$ and their judgments $\{y_1^i, \cdots, y_{m_i}^i\}$, where $y_j^i \in \mathcal{Y}$. [sent-84, score-1.52]
37 Each document $d_j^i$ is represented by a set of features $x_j^i = \psi(d_j^i, q_i) \in \mathcal{X}$, measuring the matching between document $d_j^i$ and query $q_i$. [sent-85, score-1.504]
38 Widely-used features include the frequency of query terms in the document and the query likelihood given by the language model of the document. [sent-86, score-0.871]
39 Then the training set can be denoted as $S = \{S_1, \cdots, S_n\}$, where $S_i = \{z_j^i \in \mathcal{Z}\}_{j=1,\cdots,m_i}$ is the document sample for query $q_i$. [sent-88, score-0.889]
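To make the per-query notation concrete, the following is a minimal sketch (not code from the paper): the feature dimensionality, the linear scorer, and the hinge-style pairwise surrogate loss are illustrative assumptions, but the snippet shows how a training set $S = \{S_1, \cdots, S_n\}$ and the within-query pair construction of pairwise ranking might look in code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_features = 3, 5

# Training set S = {S_1, ..., S_n}; S_i holds (x_j^i, y_j^i) pairs for query q_i.
S = []
for i in range(n_queries):
    m_i = int(rng.integers(4, 8))                 # number of documents sampled for query i
    X_i = rng.normal(size=(m_i, n_features))      # x_j^i = psi(d_j^i, q_i), matching features
    y_i = rng.integers(0, 3, size=m_i)            # judgments y_j^i in Y = {0, 1, 2}
    S.append((X_i, y_i))

def scores(X, w):
    # A linear ranking function f(x) = <w, x> (an illustrative choice of hypothesis space).
    return X @ w

def pairwise_hinge(s_j, s_k):
    # Surrogate loss l(z_j, z_k; f) for a pair in which z_j is judged more relevant than z_k.
    return max(0.0, 1.0 - (s_j - s_k))

def query_empirical_loss(X_i, y_i, w):
    # Pairs are formed only within a query; documents of different queries are never compared.
    s = scores(X_i, w)
    losses = [pairwise_hinge(s[j], s[k])
              for j in range(len(y_i)) for k in range(len(y_i)) if y_i[j] > y_i[k]]
    return float(np.mean(losses)) if losses else 0.0

w = rng.normal(size=n_features)
for i, (X_i, y_i) in enumerate(S):
    print(f"query {i}: mean pairwise loss = {query_empirical_loss(X_i, y_i, w):.3f}")
```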
40 (Document-level Generalization Analysis) In document-level generalization analysis, it is assumed that the documents are i.i.d. [sent-93, score-0.56]
41 Then the expected risk of pairwise ranking algorithms can be defined as below: $R_D^l(f) = \int_{\mathcal{Z}^2} l(z, z'; f)\, dP^2(z, z')$, where $P^2(z, z')$ is the product probability of $P(z)$ on the product space $\mathcal{Z}^2$. [sent-97, score-0.628]
42 The document-level generalization bound usually takes the following form: with probability at least $1 - \delta$, $R_D^l(f) \le \frac{1}{m(m-1)} \sum_{j \ne k} l(z_j, z_k; f) + \epsilon(\delta, \mathcal{F}, m)$, $\forall f \in \mathcal{F}$, where $\epsilon(\delta, \mathcal{F}, m) \to 0$ when the document number $m \to \infty$. [sent-98, score-0.467]
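As a rough illustration (not from the paper), the empirical term of this document-level bound is simply the average of a bounded pairwise loss over all ordered pairs of one pooled document sample; the 0-1 pairwise loss below is one standard choice, and all names are made up for the sketch.

```python
import numpy as np

def pairwise_01_loss(s_j, s_k, y_j, y_k):
    # 1 if the predicted order of (z_j, z_k) disagrees with the judged order, 0 otherwise.
    return 0.0 if y_j == y_k else float((s_j - s_k) * (y_j - y_k) <= 0)

def document_level_empirical_risk(scores, labels):
    # (1 / (m (m - 1))) * sum_{j != k} l(z_j, z_k; f) over a single pooled document sample.
    m = len(scores)
    total = sum(pairwise_01_loss(scores[j], scores[k], labels[j], labels[k])
                for j in range(m) for k in range(m) if j != k)
    return total / (m * (m - 1))

rng = np.random.default_rng(1)
m = 20
scores = rng.normal(size=m)          # scores assigned by some fixed ranking function f
labels = rng.integers(0, 3, size=m)  # relevance judgments
print(f"document-level empirical risk: {document_level_empirical_risk(scores, labels):.3f}")
```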
43 As representative work, the generalization bounds for the pairwise 0-1 loss were derived in [1, 9] and the generalization bounds for RankBoost and Ranking SVMs were obtained in [2] and [11]. [sent-99, score-0.684]
44 This assumption, however, does not hold in IR, which makes the document-level generalization bounds not directly applicable to ranking in IR. [sent-103, score-0.652]
45 Even if the assumption holds, the document-level generalization bounds still cannot be used to explain existing pairwise ranking algorithms. [sent-104, score-0.825]
46 Actually, according to the document-level generalization bound, what we can obtain is: with probability at least $1 - \delta$, $R_D^l(f) \le \frac{1}{\sum_i m_i (\sum_i m_i - 1)} \sum_{(i,j) \ne (i',k)} l(z_j^i, z_k^{i'}; f) + \epsilon(\delta, \mathcal{F}, \textstyle\sum_i m_i)$, $\forall f \in \mathcal{F}$. [sent-105, score-0.519]
47 The empirical risk is the average of the pairwise losses on all the document pairs. [sent-106, score-0.755]
48 This is clearly not the real empirical risk of ranking in IR, where documents associated with different queries cannot be compared with each other, and pairs are constructed only from documents associated with the same query. [sent-107, score-1.505]
49 (Query-level Generalization Analysis) In existing query-level generalization analysis [14], it is assumed that each query $q_i$ is represented by a deterministic document set $S_i$ with the same number of documents (i.e., $m_i \equiv m$). [sent-109, score-1.4]
50 The query-level generalization bound usually takes the following form: with probability at least $1 - \delta$, $R_Q^l(f) \le \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m(m-1)} \sum_{j \ne k} l(z_j^i, z_k^i; f) + \epsilon(\delta, \mathcal{F}, n)$, $\forall f \in \mathcal{F}$, where $\epsilon(\delta, \mathcal{F}, n) \to 0$ as the query number $n \to \infty$. [sent-116, score-0.586]
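For contrast with the pooled document-level quantity above, here is an equally small illustrative sketch of the query-level empirical risk: pairs are formed only within each query's document set, and the per-query averages are then averaged over the n queries.

```python
import numpy as np

def pairwise_01_loss(s_j, s_k, y_j, y_k):
    return 0.0 if y_j == y_k else float((s_j - s_k) * (y_j - y_k) <= 0)

def per_query_risk(scores, labels):
    # (1 / (m_i (m_i - 1))) * sum_{j != k} l(z_j^i, z_k^i; f) for one query.
    m = len(scores)
    total = sum(pairwise_01_loss(scores[j], scores[k], labels[j], labels[k])
                for j in range(m) for k in range(m) if j != k)
    return total / (m * (m - 1))

def query_level_empirical_risk(queries):
    # (1 / n) * sum_i per_query_risk(S_i): documents of different queries are never paired.
    return float(np.mean([per_query_risk(s, y) for s, y in queries]))

rng = np.random.default_rng(2)
queries = []
for _ in range(5):                                   # n = 5 queries
    m_i = int(rng.integers(5, 10))                   # m_i documents for this query
    queries.append((rng.normal(size=m_i), rng.integers(0, 3, size=m_i)))
print(f"query-level empirical risk: {query_level_empirical_risk(queries):.3f}")
```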
51 As representative work, the query-level generalization bounds for stable pairwise ranking algorithms such as Ranking SVMs and IR-SVM and for listwise ranking algorithms were derived in [15] and [14]. [sent-117, score-1.314]
52 As mentioned in the introduction, the assumption that each query is associated with a deterministic set of documents is not reasonable. [sent-118, score-0.76]
53 The fact is that many random factors can influence what kinds of documents and how many documents are labeled for each query. [sent-119, score-0.778]
54 For example, when more labeled documents are added to the training set, the generalization bounds of stable pairwise ranking algorithms derived in [15] do not change and the generalization bounds of some of the listwise ranking algorithms derived in [14] get even looser. [sent-121, score-1.989]
55 First, queries are randomly sampled from query logs of search engines. [sent-127, score-0.555]
56 Then for each query, documents that are potentially relevant to the query are sampled (e.g., …). [sent-128, score-0.754]
57 First, queries are i.i.d. sampled from the query space $\mathcal{Q}$ according to distribution $P(q)$. [sent-136, score-0.39]
58-59 Second, for each query $q_i$, its associated documents and their relevance judgments $\{(d_1^i, y_1^i), \cdots, (d_{m_i}^i, y_{m_i}^i)\}$ are i.i.d. sampled from the document space $\mathcal{D}$ according to a conditional distribution $P(d|q_i)$, where $m_i$ is the number of sampled documents. [sent-137, score-1.132] [sent-140, score-0.629]
60 As in Section 2.1, we use $z_j^i = (x_j^i, y_j^i)$ to represent document $d_j^i$ and its label, and denote the training data for query $q_i$ as $S_i = \{z_j^i\}_{j=1,\cdots,m_i}$. [sent-145, score-1.036]
61 Unlike i.i.d. samples, the random variables $\{z_j^i\}_{j=1,\cdots,m_i}$ are no longer independent because they share the same query $q_i$. [sent-149, score-0.653]
62 We call the above data generation process two-layer sampling, and denote the training data generated in this way as (Q, S), where Q is the query sample and S = {Si }i=1,··· ,n is the document sample. [sent-151, score-0.669]
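A minimal simulation of this two-layer generation process might look as follows; the query space, the query-conditional document distributions, the feature map psi, and the toy relevance rule are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_query():
    """First layer: q_i ~ P(q) i.i.d.; here a query is just a random 'topic vector'."""
    return rng.normal(size=4)

def sample_documents(q, m):
    """Second layer: documents ~ P(d | q) i.i.d. given q; the conditional mean depends on q."""
    return q + rng.normal(scale=0.5, size=(m, 4))

def psi(d, q):
    """Matching features x = psi(d, q); through q, the features of documents for the same
    query are dependent, but they are conditionally independent once q is given."""
    return np.concatenate([d * q, [d @ q]])

n = 4
Q = [sample_query() for _ in range(n)]                 # query sample Q
S = []                                                 # document sample S = {S_i}
for q in Q:
    m_i = int(rng.integers(3, 6))
    docs = sample_documents(q, m_i)
    relevance = (docs @ q > 0).astype(int)             # toy relevance judgments
    S.append([(psi(d, q), int(y)) for d, y in zip(docs, relevance)])

print("documents per query:", [len(S_i) for S_i in S])
```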
63 As a result, the generalization bound they obtained is a query-level generalization bound, but not a two-layer generalization bound. [sent-224, score-0.553]
64 (ii) As compared to the sampling in query-level generalization analysis, two-layer sampling considers the sampling of documents for each query. [sent-253, score-0.833]
65 To some extent, the aforementioned two-layer sampling is related to directly sampling from the product space of query and document, and to the (n, m)-sampling proposed in [4]. [sent-254, score-0.529]
66 Firstly, it is clear that directly sampling from the product space of query and document does not describe the real data generation process. [sent-256, score-0.686]
67 Furthermore, even if we sample a large number of documents in this way, it is not guaranteed that we can have a sufficient number of documents for each single query. [sent-257, score-0.802]
68 However, in two-layer sampling, documents (if represented by matching features) associated with the same query are not independent of each other. [sent-263, score-0.899]
69 (Two-Layer Generalization Ability) With the probabilistic assumption of two-layer sampling, we define the expected risk for pairwise ranking as follows: $R^l(f) = \int_{\mathcal{Q}} \int_{\mathcal{Z}^2} l(z, z'; f)\, dP(z, z'|q)\, dP(q)$ (2), where $P(z, z'|q)$ is the product probability of $P(z|q)$ on the product space $\mathcal{Z}^2$. [sent-265, score-0.628]
70 In the next section, we will show our theoretical results on the two-layer generalization abilities of typical pairwise ranking algorithms. [sent-268, score-0.698]
71 (Main Theoretical Result) In this section, we show our results on the two-layer generalization ability of ERM learning with pairwise ranking losses (either the pairwise 0-1 loss or pairwise surrogate losses). [sent-269, score-1.064]
72 With the above definitions, we have the following theorem, which describes when and how the two-layer generalization bounds of pairwise ranking algorithms converge to zero. [sent-279, score-0.788]
73 Remark: The condition of the existence of upper bounds for $E[R_m(l \circ \mathcal{F})]$ can be satisfied in many situations. [sent-283, score-0.413]
74 For example, for a ranking function class $\mathcal{F}$ that satisfies $VC(\tilde{\mathcal{F}}) = V$, where $\tilde{\mathcal{F}} = \{f(x, x') = f(x) - f(x');\ f \in \mathcal{F}\}$, $VC(\cdot)$ denotes the VC dimension, and $|f(x)| \le B$, it has been proved in [3, 9] that $D(l_{0\text{-}1} \circ \mathcal{F}, m) = c_1 \sqrt{V/m}$ and $D(l_\phi \circ \mathcal{F}, m) = c_2 B_\phi(B) \sqrt{V/m}$, where $c_1$ and $c_2$ are both constants. [sent-284, score-0.391]
75 (Proof of Theorem 1) Note that the proof of Theorem 1 is non-trivial because documents generated by two-layer sampling are neither independent nor identically distributed, as aforementioned. [sent-286, score-0.533]
76 For a two-layer sample $(Q, S)$, the two-layer RA of $l \circ \mathcal{F}$ is defined as follows: $R_{n; m_1, \cdots, m_n}(l \circ \mathcal{F}(Q, S)) = E_\sigma \sup_{f \in \mathcal{F}} \frac{2}{n} \sum_{i=1}^{n} \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} \sigma_j^i\, l(z_j^i, z_{\lfloor m_i/2 \rfloor + j}^i; f)$, where $\{\sigma_j^i\}$ are independent Rademacher random variables, independent of the data sample. [sent-292, score-1.127]
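For a small finite function class, the two-layer RA can be estimated by Monte Carlo over the Rademacher variables. The sketch below is illustrative only: it assumes the reconstructed definition above (including the 2/n factor), uses a finite set of linear scorers so that the supremum becomes a simple max, and clips the pairwise hinge loss so that it is bounded.

```python
import numpy as np

rng = np.random.default_rng(4)

def two_layer_ra(loss_fn, functions, S, n_mc=500):
    """Monte Carlo estimate of R_{n; m_1,...,m_n}(l o F(Q, S)):
    E_sigma sup_f (2/n) sum_i (1/floor(m_i/2)) sum_j sigma_j^i l(z_j^i, z_{floor(m_i/2)+j}^i; f)."""
    n = len(S)
    estimates = []
    for _ in range(n_mc):
        sigmas = [rng.choice([-1.0, 1.0], size=len(S_i) // 2) for S_i in S]
        best = -np.inf
        for f in functions:
            total = 0.0
            for S_i, sigma_i in zip(S, sigmas):
                half = len(S_i) // 2
                pair_losses = np.array([loss_fn(S_i[j], S_i[half + j], f) for j in range(half)])
                total += (sigma_i * pair_losses).sum() / half
            best = max(best, 2.0 * total / n)
        estimates.append(best)
    return float(np.mean(estimates))

def loss_fn(z, z_prime, w):
    # Pairwise hinge loss clipped to [0, 1] so that it is bounded; z = (x, y).
    (x, y), (xp, yp) = z, z_prime
    if y == yp:
        return 0.0
    margin = (x - xp) @ w if y > yp else (xp - x) @ w
    return float(np.clip(1.0 - margin, 0.0, 1.0))

functions = [rng.normal(size=3) for _ in range(8)]     # a small finite class of weight vectors
S = []
for _ in range(3):                                     # n = 3 queries
    m_i = int(rng.integers(4, 7))
    S.append([(rng.normal(size=3), int(rng.integers(0, 2))) for _ in range(m_i)])

print(f"estimated two-layer RA: {two_layer_ra(loss_fn, functions, S):.3f}")
```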
77 Proof. By introducing a ghost query sample $\tilde{Q} = \{\tilde{q}_1, \cdots, \tilde{q}_n\}$, we have … [sent-323, score-0.433]
78 (Document-layer Error Bound) In order to obtain the bound for the document-layer error, we consider the fact that documents are independent if the query sample is given. [sent-328, score-0.807]
79 Given the query sample Q, all the documents in the document sample will become independent. [sent-334, score-0.978]
80 Denote $S'$ as the document sample obtained by replacing document $z_{j_0}^{i_0}$ in $S$ with a new document $\tilde{z}_{j_0}^{i_0}$. [sent-335, score-0.657]
81 It is clear that $\sup_{S, S'} |G(S) - G(S')| \le \sup_{S, S'} \sup_{f \in \mathcal{F}} |\hat{R}_{n; m_1, \cdots, m_n}(f; S) - \hat{R}_{n; m_1, \cdots, m_n}(f; S')| \le \sup_{S, S'} \sup_{f \in \mathcal{F}} \frac{1}{n\, m_{i_0}(m_{i_0} - 1)} \sum_{k \ne j_0} \big| l(z_{j_0}^{i_0}, z_k^{i_0}; f) - l(\tilde{z}_{j_0}^{i_0}, z_k^{i_0}; f) \big| \le \frac{2M}{n\, m_{i_0}}$. [sent-336, score-0.43]
82 Assume $\mathbb{S}_{m_i}$ is the symmetric group of degree $m_i$, and $\pi_i \in \mathbb{S}_{m_i}$ $(i = 1, \cdots, n)$ permutes the $m_i$ documents associated with $q_i$. [sent-345, score-1.334]
83-84 Since documents associated with the same query follow the identical distribution, we have $\frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i(m_i - 1)} \sum_{j \ne k} l(z_j^i, z_k^i; f) \stackrel{p}{=} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i!} \sum_{\pi_i} \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} l(z_{\pi_i(j)}^i, z_{\pi_i(\lfloor m_i/2 \rfloor + j)}^i; f)$ (6), where $\stackrel{p}{=}$ means identity in distribution. [sent-346, score-1.451] [sent-347, score-0.969]
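The identity in Eq. (6) rests on splitting each query's documents into $\lfloor m_i/2 \rfloor$ disjoint pairs via a random permutation. The following toy check (illustrative, not from the paper) compares the all-pairs average with the average over many random splits; the two agree up to Monte Carlo error, since each split pair is a uniformly random ordered pair of distinct documents.

```python
import numpy as np

rng = np.random.default_rng(5)

def loss(z, z_prime):
    # Any bounded pairwise loss works for this check; here a simple 0-1 style loss.
    return float(abs(z - z_prime) > 1.0)

z = rng.normal(size=9)                                  # m_i = 9 documents of one query
m = len(z)

# All-pairs average: (1 / (m (m - 1))) * sum_{j != k} l(z_j, z_k).
all_pairs = np.mean([loss(z[j], z[k]) for j in range(m) for k in range(m) if j != k])

# Average over random permutations pi_i of (1 / floor(m/2)) * sum_j l(z_{pi(j)}, z_{pi(floor(m/2)+j)}).
half = m // 2
split_avgs = []
for _ in range(20000):
    pi = rng.permutation(m)
    split_avgs.append(np.mean([loss(z[pi[j]], z[pi[half + j]]) for j in range(half)]))

print(f"all-pairs average:         {all_pairs:.4f}")
print(f"permutation-split average: {np.mean(split_avgs):.4f}")   # matches up to MC error
```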
85 Define a function $\tilde{G}(S_i)$ on each $S_i$ as follows: $\tilde{G}(S_i) = \sup_{f \in \mathcal{F}} \Big( \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} l(z_j^i, z_{\lfloor m_i/2 \rfloor + j}^i; f) - E_{z, z'|q_i}\, l(z, z'; f) \Big)$. [sent-348, score-1.055]
86-87 Based on Eq. (6), we can decompose $E_{S|Q}[G(S)]$ into the sum of the $E_{S_i|q_i}[\tilde{G}(S_i)]$ as below: $E_{S|Q}[G(S)] \le \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i!} \sum_{\pi_i} E_{S_i|q_i} \sup_{f \in \mathcal{F}} \Big( \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} l(z_{\pi_i(j)}^i, z_{\pi_i(\lfloor m_i/2 \rfloor + j)}^i; f) - \int_{\mathcal{Z}^2} l(z, z'; f)\, dP(z, z'|q_i) \Big) = \frac{1}{n} \sum_{i=1}^{n} E_{S_i|q_i} \big[ \tilde{G}(S_i) \big]$. [sent-351, score-0.674] [sent-352, score-0.732]
88 Let $\tilde{S}_i = \{\tilde{z}_j^i\}$ be a ghost document sample for $q_i$, and assume $\sigma_1^i, \cdots, \sigma_{\lfloor m_i/2 \rfloor}^i$ are independent Rademacher random variables, independent of $S_i$ and $\tilde{S}_i$. [sent-355, score-0.502]
89 Then, $E_{S_i|q_i}\big[\tilde{G}(S_i)\big] \le E_{S_i, \tilde{S}_i|q_i} \sup_{f \in \mathcal{F}} \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} \big( l(z_j^i, z_{\lfloor m_i/2 \rfloor + j}^i; f) - l(\tilde{z}_j^i, \tilde{z}_{\lfloor m_i/2 \rfloor + j}^i; f) \big) = E_{S_i, \tilde{S}_i, \sigma^i|q_i} \sup_{f \in \mathcal{F}} \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} \sigma_j^i \big( l(z_j^i, z_{\lfloor m_i/2 \rfloor + j}^i; f) - l(\tilde{z}_j^i, \tilde{z}_{\lfloor m_i/2 \rfloor + j}^i; f) \big) \le E_{S_i, \sigma^i|q_i} \sup_{f \in \mathcal{F}} \frac{2}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} \sigma_j^i\, l(z_j^i, z_{\lfloor m_i/2 \rfloor + j}^i; f) = E_{S_i|q_i} \big[ R_{1; m_i}(l \circ \mathcal{F}(q_i, S_i)) \big]$. [sent-356, score-2.564]
90 (Combining the Bounds) Considering Theorem 3 and taking expectation over the query sample Q, we can obtain that with probability at least $1 - \delta$, $\hat{R}_n^l(f) - \hat{R}_{n; m_1, \cdots, m_n}^l(f) \le \frac{1}{n} \sum_{i=1}^{n} E_{S_i|q_i} \big[ R_{1; m_i}(l \circ \mathcal{F}(q_i, S_i)) \big] + \sum_{i=1}^{n} \sqrt{\frac{2M^2 \log\frac{2}{\delta}}{m_i n^2}}$, $\forall f \in \mathcal{F}$. [sent-361, score-0.354]
91 Furthermore, if the conventional RA has an upper bound $D(l \circ \mathcal{F}, \cdot)$ for an arbitrary sample distribution, the document-layer reduced two-layer RA can be upper bounded by $D(l \circ \mathcal{F}, n)$ and the query-layer reduced two-layer RA can be bounded by $D(l \circ \mathcal{F}, \lfloor m_i/2 \rfloor)$. [sent-362, score-0.842]
92 (1) Increasing the number of either queries or documents per query in the training data will enhance the two-layer generalization ability. [sent-366, score-1.148]
93 (2) Only if n → ∞ and mi → ∞ simultaneously does the two-layer generalization bound uniformly converge. [sent-368, score-0.534]
94 (3) If we only have a limited budget to label C documents in total, then according to Theorem 1 there is an optimal trade-off between the number of training queries and the number of training documents per query. [sent-370, score-1.118]
95 Actually, one can attain the optimal trade-off by solving the following optimization problem: $\min_{n, m_1, \cdots, m_n} D(l \circ \mathcal{F}, n) + \sum_{i=1}^{n} \sqrt{\frac{2M^2 \log\frac{2}{\delta}}{m_i n^2}} + \frac{1}{n} \sum_{i=1}^{n} D(l \circ \mathcal{F}, \lfloor m_i/2 \rfloor)$, s.t. $\sum_{i=1}^{n} m_i \le C$. [sent-372, score-0.675]
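A numerical sketch of this trade-off is given below; it is illustrative only, assuming the VC-type bound D(l∘F, k) = c·sqrt(V/k) from the remark above, an equal split m_i = C/n of the labeling budget, and made-up constants.

```python
import numpy as np

def bound_value(n, C, V, M=1.0, delta=0.05, c=1.0):
    """Trade-off objective with an equal split m = C / n of the labeling budget:
    D(l o F, n) + n * sqrt(2 M^2 log(2/delta) / (m n^2)) + D(l o F, floor(m/2)),
    using the illustrative VC-type form D(l o F, k) = c * sqrt(V / k)."""
    m = C / n
    if m < 2:
        return np.inf
    query_layer = c * np.sqrt(V / n)
    deviation = n * np.sqrt(2 * M**2 * np.log(2 / delta) / (m * n**2))
    document_layer = c * np.sqrt(V / np.floor(m / 2))
    return query_layer + deviation + document_layer

C, V = 10_000, 50                                      # total labeling budget and VC dimension
candidates = np.arange(2, C // 2)
values = np.array([bound_value(int(n), C, V) for n in candidates])
n_star = int(candidates[np.argmin(values)])
print(f"optimal number of queries ~ {n_star}, documents per query ~ {C / n_star:.0f}")
```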
96 For example, if the ranking function class $\mathcal{F}$ satisfies $VC(\tilde{\mathcal{F}}) = V$, the optimal solution takes the form $m^* \equiv C/n^*$, with $m^*$ determined by $\sqrt{C}$, $c_1\sqrt{V}$, and $\sqrt{2\log(4/\delta)}$, where $c_1$ is a constant. [sent-375, score-0.391]
97 That is, we should label fewer queries and more documents per query when the hypothesis space is larger. [sent-378, score-0.932]
98 That is, we should label more queries if we want the bound to hold with a larger probability. [sent-380, score-0.37]
99 The above findings can be used to explain the behavior of existing pairwise ranking algorithms, and can be used to guide the construction of training set for learning to rank. [sent-381, score-0.589]
100 (Conclusions and Discussions) In this paper, we have proposed conducting two-layer generalization analysis for ranking, and proved a two-layer generalization bound for ERM learning with pairwise losses. [sent-382, score-0.518]
wordName wordTfidf (topN-words)
[('ranking', 0.391), ('documents', 0.389), ('query', 0.33), ('mi', 0.323), ('qi', 0.299), ('document', 0.211), ('queries', 0.19), ('generalization', 0.171), ('lf', 0.142), ('pairwise', 0.136), ('esi', 0.131), ('zi', 0.131), ('ra', 0.128), ('listwise', 0.116), ('zj', 0.106), ('si', 0.101), ('sampling', 0.091), ('bounds', 0.09), ('sup', 0.086), ('rn', 0.084), ('rank', 0.078), ('ir', 0.076), ('rademacher', 0.074), ('ezi', 0.066), ('rankboost', 0.066), ('di', 0.065), ('dp', 0.061), ('zm', 0.061), ('trec', 0.058), ('rl', 0.055), ('qn', 0.052), ('risk', 0.046), ('zk', 0.045), ('judgments', 0.044), ('sigir', 0.044), ('associated', 0.041), ('supf', 0.04), ('bound', 0.04), ('explain', 0.037), ('generation', 0.037), ('svms', 0.035), ('sampled', 0.035), ('erm', 0.034), ('retrieval', 0.034), ('concept', 0.033), ('ranknet', 0.033), ('liu', 0.032), ('identically', 0.029), ('qin', 0.029), ('ymi', 0.029), ('smi', 0.029), ('mcdiarmid', 0.029), ('trade', 0.029), ('reduced', 0.028), ('decompose', 0.028), ('bounded', 0.028), ('layer', 0.027), ('shallow', 0.027), ('graepel', 0.027), ('ghost', 0.027), ('loss', 0.026), ('ndings', 0.026), ('behaviors', 0.025), ('according', 0.025), ('training', 0.025), ('annotators', 0.025), ('cao', 0.025), ('matching', 0.024), ('surrogate', 0.024), ('independent', 0.024), ('lan', 0.024), ('herbrich', 0.024), ('sample', 0.024), ('theorem', 0.023), ('ability', 0.023), ('per', 0.023), ('rq', 0.023), ('budget', 0.023), ('tackle', 0.023), ('process', 0.023), ('matter', 0.021), ('labeling', 0.021), ('losses', 0.021), ('expected', 0.021), ('participants', 0.02), ('discussions', 0.02), ('conventional', 0.02), ('enhance', 0.02), ('relevance', 0.02), ('stable', 0.019), ('error', 0.019), ('call', 0.019), ('prove', 0.019), ('empirical', 0.018), ('chinese', 0.018), ('mn', 0.018), ('ii', 0.018), ('proceedings', 0.017), ('agarwal', 0.017), ('product', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d. sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after being represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms.
2 0.1594799 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
Author: Gilbert Leung, Novi Quadrianto, Kostas Tsioutsiouliklis, Alex J. Smola
Abstract: We present a fast online solver for large scale parametric max-flow problems as they occur in portfolio optimization, inventory management, computer vision, and logistics. Our algorithm solves an integer linear program in an online fashion. It exploits total unimodularity of the constraint matrix and a Lagrangian relaxation to solve the problem as a convex online game. The algorithm generates approximate solutions of max-flow problems by performing stochastic gradient descent on a set of flows. We apply the algorithm to optimize tier arrangement of over 84 million web pages on a layered set of caches to serve an incoming query stream optimally. 1
3 0.15646471 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
Author: Eric Wang, Dehong Liu, Jorge Silva, Lawrence Carin, David B. Dunson
Abstract: We consider problems for which one has incomplete binary matrices that evolve with time (e.g., the votes of legislators on particular legislation, with each year characterized by a different such matrix). An objective of such analysis is to infer structure and inter-relationships underlying the matrices, here defined by latent features associated with each axis of the matrix. In addition, it is assumed that documents are available for the entities associated with at least one of the matrix axes. By jointly analyzing the matrices and documents, one may be used to inform the other within the analysis, and the model offers the opportunity to predict matrix values (e.g., votes) based only on an associated document (e.g., legislation). The research presented here merges two areas of machine-learning that have previously been investigated separately: incomplete-matrix analysis and topic modeling. The analysis is performed from a Bayesian perspective, with efficient inference constituted via Gibbs sampling. The framework is demonstrated by considering all voting data and available documents (legislation) during the 220-year lifetime of the United States Senate and House of Representatives. 1
4 0.14964001 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
Author: Matthew Blaschko, Andrea Vedaldi, Andrew Zisserman
Abstract: A standard approach to learning object category detectors is to provide strong supervision in the form of a region of interest (ROI) specifying each instance of the object in the training images [17]. In this work our goal is to learn from heterogeneous labels, in which some images are only weakly supervised, specifying only the presence or absence of the object or a weak indication of object location, whilst others are fully annotated. To this end we develop a discriminative learning approach and make two contributions: (i) we propose a structured output formulation for weakly annotated images where full annotations are treated as latent variables; and (ii) we propose to optimize a ranking objective function, allowing our method to more effectively use negatively labeled images to improve detection average precision performance. The method is demonstrated on the benchmark INRIA pedestrian detection dataset of Dalal and Triggs [14] and the PASCAL VOC dataset [17], and it is shown that for a significant proportion of weakly supervised images the performance achieved is very similar to the fully supervised (state of the art) results.
5 0.1399457 150 nips-2010-Learning concept graphs from text with stick-breaking priors
Author: America Chambers, Padhraic Smyth, Mark Steyvers
Abstract: We present a generative probabilistic model for learning general graph structures, which we term concept graphs, from text. Concept graphs provide a visual summary of the thematic content of a collection of documents—a task that is difficult to accomplish using only keyword search. The proposed model can learn different types of concept graph structures and is capable of utilizing partial prior knowledge about graph structure as well as labeled documents. We describe a generative model that is based on a stick-breaking process for graphs, and a Markov Chain Monte Carlo inference procedure. Experiments on simulated data show that the model can recover known graph structure when learning in both unsupervised and semi-supervised modes. We also show that the proposed model is competitive in terms of empirical log likelihood with existing structure-based topic models (hPAM and hLDA) on real-world text data sets. Finally, we illustrate the application of the model to the problem of updating Wikipedia category graphs. 1
6 0.12789418 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
7 0.12234057 27 nips-2010-Agnostic Active Learning Without Constraints
8 0.11559714 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
9 0.10943149 243 nips-2010-Smoothness, Low Noise and Fast Rates
10 0.10814212 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
11 0.10685648 286 nips-2010-Word Features for Latent Dirichlet Allocation
12 0.099306293 151 nips-2010-Learning from Candidate Labeling Sets
13 0.095816553 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
14 0.09559685 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
15 0.094358593 194 nips-2010-Online Learning for Latent Dirichlet Allocation
16 0.091972359 197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
17 0.091280438 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
18 0.087377034 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
19 0.079158202 215 nips-2010-Probabilistic Deterministic Infinite Automata
20 0.077542946 100 nips-2010-Gaussian Process Preference Elicitation
topicId topicWeight
[(0, 0.181), (1, 0.039), (2, 0.106), (3, -0.039), (4, -0.196), (5, 0.136), (6, 0.002), (7, -0.075), (8, -0.016), (9, -0.149), (10, 0.13), (11, -0.04), (12, -0.12), (13, 0.007), (14, -0.062), (15, -0.021), (16, -0.067), (17, -0.035), (18, -0.031), (19, 0.1), (20, 0.122), (21, 0.059), (22, -0.04), (23, -0.086), (24, 0.015), (25, 0.01), (26, -0.079), (27, -0.129), (28, 0.063), (29, 0.118), (30, -0.081), (31, 0.009), (32, 0.093), (33, -0.055), (34, 0.137), (35, 0.103), (36, 0.137), (37, 0.09), (38, -0.003), (39, 0.059), (40, -0.154), (41, 0.195), (42, -0.061), (43, 0.034), (44, -0.062), (45, -0.077), (46, 0.011), (47, 0.055), (48, 0.057), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.97366917 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d. sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after being represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms.
2 0.63695055 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
Author: Thomas Peel, Sandrine Anthoine, Liva Ralaivola
Abstract: We present original empirical Bernstein inequalities for U-statistics with bounded symmetric kernels q. They are expressed with respect to empirical estimates of either the variance of q or the conditional variance that appears in the Bernsteintype inequality for U-statistics derived by Arcones [2]. Our result subsumes other existing empirical Bernstein inequalities, as it reduces to them when U-statistics of order 1 are considered. In addition, it is based on a rather direct argument using two applications of the same (non-empirical) Bernstein inequality for U-statistics. We discuss potential applications of our new inequalities, especially in the realm of learning ranking/scoring functions. In the process, we exhibit an efficient procedure to compute the variance estimates for the special case of bipartite ranking that rests on a sorting argument. We also argue that our results may provide test set bounds and particularly interesting empirical racing algorithms for the problem of online learning of scoring functions. 1
3 0.61967325 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
Author: Gilbert Leung, Novi Quadrianto, Kostas Tsioutsiouliklis, Alex J. Smola
Abstract: We present a fast online solver for large scale parametric max-flow problems as they occur in portfolio optimization, inventory management, computer vision, and logistics. Our algorithm solves an integer linear program in an online fashion. It exploits total unimodularity of the constraint matrix and a Lagrangian relaxation to solve the problem as a convex online game. The algorithm generates approximate solutions of max-flow problems by performing stochastic gradient descent on a set of flows. We apply the algorithm to optimize tier arrangement of over 84 million web pages on a layered set of caches to serve an incoming query stream optimally. 1
4 0.5974564 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
Author: Eric Wang, Dehong Liu, Jorge Silva, Lawrence Carin, David B. Dunson
Abstract: We consider problems for which one has incomplete binary matrices that evolve with time (e.g., the votes of legislators on particular legislation, with each year characterized by a different such matrix). An objective of such analysis is to infer structure and inter-relationships underlying the matrices, here defined by latent features associated with each axis of the matrix. In addition, it is assumed that documents are available for the entities associated with at least one of the matrix axes. By jointly analyzing the matrices and documents, one may be used to inform the other within the analysis, and the model offers the opportunity to predict matrix values (e.g., votes) based only on an associated document (e.g., legislation). The research presented here merges two areas of machine-learning that have previously been investigated separately: incomplete-matrix analysis and topic modeling. The analysis is performed from a Bayesian perspective, with efficient inference constituted via Gibbs sampling. The framework is demonstrated by considering all voting data and available documents (legislation) during the 220-year lifetime of the United States Senate and House of Representatives. 1
5 0.56365108 9 nips-2010-A New Probabilistic Model for Rank Aggregation
Author: Tao Qin, Xiubo Geng, Tie-yan Liu
Abstract: This paper is concerned with rank aggregation, which aims to combine multiple input rankings to get a better ranking. A popular approach to rank aggregation is based on probabilistic models on permutations, e.g., the Luce model and the Mallows model. However, these models have their limitations in either poor expressiveness or high computational complexity. To avoid these limitations, in this paper, we propose a new model, which is defined with a coset-permutation distance, and models the generation of a permutation as a stagewise process. We refer to the new model as coset-permutation distance based stagewise (CPS) model. The CPS model has rich expressiveness and can therefore be used in versatile applications, because many different permutation distances can be used to induce the coset-permutation distance. The complexity of the CPS model is low because of the stagewise decomposition of the permutation probability and the efficient computation of most coset-permutation distances. We apply the CPS model to supervised rank aggregation, derive the learning and inference algorithms, and empirically study their effectiveness and efficiency. Experiments on public datasets show that the derived algorithms based on the CPS model can achieve state-ofthe-art ranking accuracy, and are much more efficient than previous algorithms.
6 0.50167429 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
7 0.48558789 197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
8 0.42190567 150 nips-2010-Learning concept graphs from text with stick-breaking priors
9 0.39594597 100 nips-2010-Gaussian Process Preference Elicitation
10 0.39295629 27 nips-2010-Agnostic Active Learning Without Constraints
11 0.38392389 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
12 0.38214469 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
13 0.37779513 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
14 0.36575314 286 nips-2010-Word Features for Latent Dirichlet Allocation
15 0.3607035 129 nips-2010-Inter-time segment information sharing for non-homogeneous dynamic Bayesian networks
16 0.35070837 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
17 0.34616467 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
18 0.33969814 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
19 0.33340299 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
20 0.3313694 194 nips-2010-Online Learning for Latent Dirichlet Allocation
topicId topicWeight
[(13, 0.03), (17, 0.011), (24, 0.161), (27, 0.121), (30, 0.053), (35, 0.016), (45, 0.223), (50, 0.05), (52, 0.023), (60, 0.019), (77, 0.06), (78, 0.091), (90, 0.045)]
simIndex simValue paperId paperTitle
1 0.89682704 183 nips-2010-Non-Stochastic Bandit Slate Problems
Author: Satyen Kale, Lev Reyzin, Robert E. Schapire
Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions √ of the problem, and give efficient algorithms which have regret O( T ), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recom√ mendations for slates in every round, and give algorithms with O( T ) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms. 1
same-paper 2 0.8788451 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d. sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after being represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms.
3 0.85113865 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
Author: Katya Scheinberg, Shiqian Ma, Donald Goldfarb
Abstract: Gaussian graphical models are of great interest in statistical learning. Because the conditional independencies between different nodes correspond to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data, by solving a convex maximum likelihood problem with an ℓ1 -regularization term. In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem’s special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an ϵ-optimal solution in O(1/ϵ) iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms. 1
4 0.84695393 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
Author: Alessandro Chiuso, Gianluigi Pillonetto
Abstract: We introduce a new Bayesian nonparametric approach to identification of sparse dynamic linear systems. The impulse responses are modeled as Gaussian processes whose autocovariances encode the BIBO stability constraint, as defined by the recently introduced “Stable Spline kernel”. Sparse solutions are obtained by placing exponential hyperpriors on the scale factors of such kernels. Numerical experiments regarding estimation of ARMAX models show that this technique provides a definite advantage over a group LAR algorithm and state-of-the-art parametric identification techniques based on prediction error minimization. 1
5 0.83847505 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
Author: Mahdi M. Fard, Joelle Pineau
Abstract: This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. These bounds hold regardless of the correctness of the prior distribution. We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading. 1
6 0.82126093 98 nips-2010-Functional form of motion priors in human motion perception
7 0.8196305 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
8 0.81751519 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers
9 0.8172819 151 nips-2010-Learning from Candidate Labeling Sets
10 0.81675971 17 nips-2010-A biologically plausible network for the computation of orientation dominance
11 0.81665272 268 nips-2010-The Neural Costs of Optimal Control
12 0.81470025 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
13 0.81376791 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
14 0.81340241 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
15 0.81271523 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
16 0.81169844 161 nips-2010-Linear readout from a neural population with partial correlation data
17 0.80933827 282 nips-2010-Variable margin losses for classifier design
18 0.80794001 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
19 0.80732322 194 nips-2010-Online Learning for Latent Dirichlet Allocation
20 0.8072415 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts