jmlr jmlr2013 jmlr2013-60 knowledge-graph by maker-knowledge-mining

60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents


Source: pdf

Author: Wei Wu, Zhengdong Lu, Hang Li

Abstract: The task of matching data from two heterogeneous domains naturally arises in various areas such as web search, collaborative filtering, and drug design. In web search, existing work has designed relevance models to match queries and documents by exploiting either user clicks or content of queries and documents. To the best of our knowledge, however, there has been little work on principled approaches to leveraging both clicks and content to learn a matching model for search. In this paper, we propose a framework for learning to match heterogeneous objects. The framework learns two linear mappings for two objects respectively, and matches them via the dot product of their images after mapping. Moreover, when different regularizations are enforced, the framework renders a rich family of matching models. With orthonormal constraints on mapping functions, the framework subsumes Partial Least Squares (PLS) as a special case. Alternatively, with a ℓ1 +ℓ2 regularization, we obtain a new model called Regularized Mapping to Latent Structures (RMLS). RMLS enjoys many advantages over PLS, including lower time complexity and easy parallelization. To further understand the matching framework, we conduct generalization analysis and apply the result to both PLS and RMLS. We apply the framework to web search and implement both PLS and RMLS using a click-through bipartite with metadata representing features of queries and documents. We test the efficacy and scalability of RMLS and PLS on large scale web search problems. The results show that both PLS and RMLS can significantly outperform baseline methods, while RMLS substantially speeds up the learning process. Keywords: web search, partial least squares, regularized mapping to latent structures, generalization analysis

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In web search, existing work has designed relevance models to match queries and documents by exploiting either user clicks or content of queries and documents. [sent-10, score-0.508]

2 We apply the framework to web search and implement both PLS and RMLS using a click-through bipartite with metadata representing features of queries and documents. [sent-19, score-0.301]

3 Existing models in web search use information from different sources to match queries and documents. [sent-29, score-0.232]

4 , 1994), and Language Models for Information Retrieval (LMIR) (Ponte and Croft, 1998; Zhai and Lafferty, 2004), match queries and documents based on their content. [sent-31, score-0.217]

5 Specifically, queries and documents are represented as feature vectors in a Euclidean space, and conventional relevance models match them by the dot products of their feature vectors (Xu et al. [sent-32, score-0.283]

6 On the other hand, a click-through bipartite graph, which represents users’ implicit judgments on query-document relevance, has proven to be a very valuable resource for matching queries and documents. [sent-35, score-0.253]

7 Existing models rely on either features or the click-through bipartite graph to match queries and documents. [sent-38, score-0.228]

8 Specifically, we implement both PLS and RMLS using a click-through bipartite graph with metadata on the nodes representing features of queries and documents. [sent-55, score-0.236]

9 We take click numbers as a response and learn linear mappings for matching queries and documents, each represented by heterogeneous feature vectors consisting of both keywords and click numbers. [sent-56, score-0.296]
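
As a concrete illustration of how such a sample set might be assembled, the sketch below aggregates a hypothetical click log into (query, document, click-number) triples, with the click number playing the role of the response; the log records, field layout, and names are assumptions for illustration only, not the commercial search log used in the paper.

```python
from collections import defaultdict

# Hypothetical click-log records: (query text, clicked URL, click count).
click_log = [
    ("machine learning", "en.wikipedia.org/wiki/Machine_learning", 12),
    ("machine learning", "www.jmlr.org", 3),
    ("web search", "en.wikipedia.org/wiki/Web_search_engine", 7),
]

# Aggregate clicks per (query, document) pair; the aggregated click number
# plays the role of the response r_ij in the sample set S = {(x_i, y_ij, r_ij)}.
response = defaultdict(int)
for query, url, clicks in click_log:
    response[(query, url)] += clicks

samples = [(query, url, r) for (query, url), r in response.items()]
print(samples)
```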

10 However, we focus on learning to match queries and documents, while learning to rank has been more concerned with optimizing the ranking model. [sent-91, score-0.196]

11 In web search, existing work for matching queries and documents can be roughly categorized into two groups: feature based methods and graph based methods. [sent-93, score-0.329]

12 , 1990) can be employed, which uses SVD to project queries and documents in a click-through bipartite graph into a latent space, and calculates query-document matching scores through the dot product of their images in the latent space. [sent-101, score-0.403]

13 In this paper, we propose a general framework for matching queries and documents. [sent-103, score-0.194]

14 In information retrieval, some recent work also considers leveraging both user clicks and content of queries and documents. [sent-106, score-0.197]

15 (2009) propose learning a low rank model for ranking documents, which is like matching queries and documents. [sent-108, score-0.238]

16 For web search, the objects are queries and documents, and the response can be judgment from human labelers or the click number from user logs. [sent-128, score-0.258]

17 Under Assumption 1, we have a sample set $S = \{(x_i, y_{ij}, r_{ij})\}$, with $1 \le i \le n^x$, and for any given $i$, $1 \le j \le n^y_i$. [sent-136, score-0.412]

18 We intend to find a linear mapping pair $(L_x, L_y)$, so that the corresponding images $L_x^\top x$ and $L_y^\top y$ are in the same $d$-dimensional latent space $\mathcal{L}$ (with $d \ll \min\{d_x, d_y\}$), and the degree of matching between $x$ and $y$ can be reduced to the dot product in $\mathcal{L}$: $\mathrm{match}_{L_x,L_y}(x, y) = x^\top L_x L_y^\top y$. [sent-146, score-0.218]
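
The bilinear score above can be evaluated either directly or as a dot product of the two images in the latent space; a minimal NumPy sketch follows, in which the dimensions, random mappings, and variable names are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

# Illustrative sizes: query dim d_x, document dim d_y, latent dim d.
d_x, d_y, d = 1000, 2000, 100

rng = np.random.default_rng(0)
L_x = rng.standard_normal((d_x, d))  # linear mapping for queries
L_y = rng.standard_normal((d_y, d))  # linear mapping for documents
x = rng.standard_normal(d_x)         # query feature vector
y = rng.standard_normal(d_y)         # document feature vector

# match(x, y) = x^T L_x L_y^T y, i.e., the dot product of the images
# L_x^T x and L_y^T y in the d-dimensional latent space.
score_direct = x @ L_x @ L_y.T @ y
score_latent = (L_x.T @ x) @ (L_y.T @ y)
assert np.isclose(score_direct, score_latent)
```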

19 The expectation in (1) can be estimated as $\frac{1}{n^x}\sum_{i=1}^{n^x}\frac{1}{n^y_i}\sum_{j=1}^{n^y_i} r_{ij}\, x_i^\top L_x L_y^\top y_{ij}$. [sent-157, score-0.494]

20 The learning problem is $\arg\max_{L_x \in \mathcal{H}_x,\ L_y \in \mathcal{H}_y}\ \frac{1}{n^x}\sum_{i=1}^{n^x}\frac{1}{n^y_i}\sum_{j=1}^{n^y_i} r_{ij}\, x_i^\top L_x L_y^\top y_{ij}$ (2), where $\mathcal{H}_x$ and $\mathcal{H}_y$ are hypothesis spaces for $L_x$ and $L_y$ respectively. [sent-160, score-0.494]

21 $\arg\max\ \frac{1}{n^x}\sum_{i=1}^{n^x} x_i^\top L_x L_y^\top y_i' = \mathrm{trace}\big(L_y^\top\big(\frac{1}{n^x}\sum_{i=1}^{n^x} y_i' x_i^\top\big)L_x\big)$, where $y_i' = \frac{1}{n^y_i}\sum_{j=1}^{n^y_i} r_{ij}\, y_{ij}$. [sent-172, score-0.405]

22 The objective function in (2) becomes $\mathrm{trace}\big(L_y^\top\big(\sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i} r_{ij}\, y_{ij} x_i^\top\big)L_x\big)$ after ignoring $n^x$ and $n^y_i$, which is exactly the objective for the SVD in LSI assuming the same orthonormal $\mathcal{H}_x$ and $\mathcal{H}_y$ defined for PLS. [sent-179, score-0.494]

23 More specifically, we define the following hypothesis spaces $\mathcal{H}_x = \{L_x \mid |l_{xu}| \le \lambda_x,\ \|l_{xu}\| \le \theta_x,\ u = 1, \ldots, d_x\}$ and $\mathcal{H}_y = \{L_y \mid |l_{yv}| \le \lambda_y,\ \|l_{yv}\| \le \theta_y,\ v = 1, \ldots, d_y\}$. [sent-186, score-0.501]

24 Here $|\cdot|$ and $\|\cdot\|$ are respectively the $\ell_1$-norm and $\ell_2$-norm, $l_{xu}$ and $l_{yv}$ are respectively the $u$th and $v$th rows of $L_x$ and $L_y$, and $\{\lambda_x, \theta_x, \lambda_y, \theta_y\}$ are parameters. [sent-192, score-0.599]

25 For $x = [x^{(1)}, \ldots, x^{(d_x)}]^\top$, its image in $\mathcal{L}$ is $L_x^\top x = \sum_{u=1}^{d_x} x^{(u)} l_{xu}^\top$. [sent-199, score-0.283]

26 When both $x$ and $l_{xu}$ are sparse, $L_x^\top x$ is the sum of a few sparse vectors, and therefore likely to be sparse itself. [sent-200, score-0.283]

27 $\arg\max_{L_x, L_y}\ \frac{1}{n^x}\sum_{i=1}^{n^x}\frac{1}{n^y_i}\sum_{j=1}^{n^y_i} r_{ij}\, x_i^\top L_x L_y^\top y_{ij}$ (3), subject to $|l_{xu}| \le \lambda_x$, $\|l_{xu}\| \le \theta_x$, $|l_{yv}| \le \lambda_y$, $\|l_{yv}\| \le \theta_y$, $1 \le u \le d_x$, $1 \le v \le d_y$. [sent-206, score-1.201]

28 In practice, we instead solve the following penalized variant of (3) for easier optimization: $\arg\min_{L_x, L_y}\ -\frac{1}{n^x}\sum_{i=1}^{n^x}\frac{1}{n^y_i}\sum_{j=1}^{n^y_i} r_{ij}\, x_i^\top L_x L_y^\top y_{ij} + \beta\sum_{u=1}^{d_x}|l_{xu}| + \gamma\sum_{v=1}^{d_y}|l_{yv}|$, s.t. [sent-211, score-0.266]

29 $\|l_{xu}\| \le \theta_x$, $\|l_{yv}\| \le \theta_y$, $1 \le u \le d_x$, $1 \le v \le d_y$ (4), where $\beta > 0$ and $\gamma > 0$ control the trade-off between the objective and the penalty. [sent-213, score-0.707]

30 Specifically, for a fixed $L_y$, the objective function of problem (4) can be re-written as $\sum_{u=1}^{d_x}\Big[-\big(\sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i}\frac{1}{n^x n^y_i}\, x_i^{(u)} r_{ij}\, L_y^\top y_{ij}\big)^\top l_{xu}^\top + \beta|l_{xu}|\Big]$. [sent-216, score-0.885]

31 Writing $\sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i}\frac{1}{n^x n^y_i}\, x_i^{(u)} r_{ij}\, L_y^\top y_{ij}$ as $\omega_u = [\omega_u^{(1)}, \ldots, \omega_u^{(d)}]^\top$, the optimal $l_{xu}$ is given by $l_{xu}^{*(z)} = C_u \cdot \max(|\omega_u^{(z)}| - \beta,\, 0)\,\mathrm{sign}(\omega_u^{(z)})$, $1 \le z \le d$ (5), where $l_{xu}^{(z)}$ represents the $z$th element of $l_{xu}$. [sent-220, score-1.132]

32 $C_u$ is a constant that makes $\|l_{xu}^*\| = \theta_x$ if there are nonzero elements in $l_{xu}^*$, otherwise $C_u = 0$. [sent-222, score-0.566]

33 Similarly, for a fixed $L_x$, the objective function of problem (4) can be re-written as $\sum_{v=1}^{d_y}\Big[-\big(\sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i}\frac{1}{n^x n^y_i}\, y_{ij}^{(v)} r_{ij}\, L_x^\top x_i\big)^\top l_{yv}^\top + \gamma|l_{yv}|\Big]$. [sent-223, score-0.765]

34 Writing $\sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i}\frac{1}{n^x n^y_i}\, y_{ij}^{(v)} r_{ij}\, L_x^\top x_i$ as $\eta_v = [\eta_v^{(1)}, \ldots, \eta_v^{(d)}]^\top$, [sent-224, score-0.473]

35 the optimal $l_{yv}$ is given by $l_{yv}^{*(z)} = C_v \cdot \max(|\eta_v^{(z)}| - \gamma,\, 0)\,\mathrm{sign}(\eta_v^{(z)})$, $1 \le z \le d$ (6), where $l_{yv}^{(z)}$ represents the $z$th element of $l_{yv}$. [sent-227, score-0.872]

36 $C_v$ is a constant that makes $\|l_{yv}^*\| = \theta_y$ if there are nonzero elements in $l_{yv}^*$, otherwise $C_v = 0$. [sent-228, score-0.218]

37 Note that $\sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i}\frac{1}{n^x n^y_i}\, x_i^{(u)} r_{ij}\, L_y^\top y_{ij} = L_y^\top w_{xu}$, where $w_{xu} = \sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i}\frac{1}{n^x n^y_i}\, x_i^{(u)} r_{ij}\, y_{ij}$ does not rely on the update of $L_x$ and $L_y$ and can be pre-calculated to save time. [sent-229, score-0.333]

38 Similarly, we pre-calculate $w_{yv} = \sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i}\frac{1}{n^x n^y_i}\, y_{ij}^{(v)} r_{ij}\, x_i$. [sent-230, score-0.193]

39 Algorithm 1 (Preprocessing). Input: $S = \{(x_i, y_{ij}, r_{ij})\}$, $1 \le i \le n^x$, $1 \le j \le n^y_i$. Initialize $w_{xu} \leftarrow 0$ for $u = 1 : d_x$ and $w_{yv} \leftarrow 0$ for $v = 1 : d_y$; then for $u = 1 : d_x$, $i = 1 : n^x$, $j = 1 : n^y_i$, update $w_{xu} \leftarrow w_{xu} + \frac{1}{n^x n^y_i}\, x_i^{(u)} r_{ij}\, y_{ij}$. [sent-232, score-1.449]

40 For $v = 1 : d_y$, $i = 1 : n^x$, $j = 1 : n^y_i$, update $w_{yv} \leftarrow w_{yv} + \frac{1}{n^x n^y_i}\, y_{ij}^{(v)} r_{ij}\, x_i$. Output: $\{w_{xu}\}_{u=1}^{d_x}$ and $\{w_{yv}\}_{v=1}^{d_y}$. [sent-233, score-0.683]
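
Putting the pre-computed statistics of Algorithm 1 together with the closed-form row updates (5) and (6), one alternating pass of RMLS might look like the sketch below; the dense random matrix is only a stand-in for the real pre-computed statistics, and all sizes and names are assumptions, not the authors' (parallel) implementation.

```python
import numpy as np

def soft_threshold_rows(W, penalty, theta):
    """Row update of Eqs. (5)-(6): element-wise soft-thresholding followed by
    rescaling every nonzero row to l2-norm theta (zero rows stay zero)."""
    T = np.sign(W) * np.maximum(np.abs(W) - penalty, 0.0)
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    scale = np.where(norms > 0, theta / np.maximum(norms, 1e-12), 0.0)
    return T * scale

# Illustrative sizes (assumptions, not values from the paper).
d_x, d_y, d = 50, 60, 10
rng = np.random.default_rng(0)

# Stand-in for the statistics of Algorithm 1: row u of W_x is w_xu (length d_y),
# and the matrix stacking the w_yv vectors is its transpose.
W_x = rng.standard_normal((d_x, d_y))
W_y = W_x.T.copy()

L_x = rng.standard_normal((d_x, d))
L_y = rng.standard_normal((d_y, d))
beta = gamma = 0.1
theta_x = theta_y = 1.0

for _ in range(10):               # alternating optimization
    Omega = W_x @ L_y             # row u is omega_u = L_y^T w_xu
    L_x = soft_threshold_rows(Omega, beta, theta_x)
    Eta = W_y @ L_x               # row v is eta_v = L_x^T w_yv
    L_y = soft_threshold_rows(Eta, gamma, theta_y)
```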

41 In web search, it is usually the case that queries (x here) and documents (y here) are of high dimension (e. [sent-244, score-0.252]

42 In other words, both cx and cy are small despite large dx and dy . [sent-247, score-0.206]

43 Moreover, it is quite common that for each x, there are only a few y that have response with it and vice versa, rendering quite small ny and nx . [sent-248, score-0.352]

44 This situation is easy to understand in the context of web search, since for each query only a small number of documents are retrieved and viewed, and each document can only be retrieved with a few queries and get viewed. [sent-249, score-0.324]

45 For example, in web search, with the features extracted from the content of queries and documents, each word only relates to a few queries and documents. [sent-251, score-0.473]

46 We formally define $D(S)$ as the gap between the expected objective and the empirical objective over all $L_x$ and $L_y$, $D(S) \triangleq \sup_{L_x, L_y}\big|\frac{1}{n^x}\sum_{i=1}^{n^x}\frac{1}{n^y_i}\sum_{j=1}^{n^y_i} r_{ij}\, x_i^\top L_x L_y^\top y_{ij} - \mathbb{E}_{x,y}\,r(x, y)\, x^\top L_x L_y^\top y\big|$, and bound it. [sent-278, score-0.264]

47 1. $\sup_{L_x,L_y}\big|\frac{1}{n^x}\sum_{i=1}^{n^x}\big(\frac{1}{n^y_i}\sum_{j=1}^{n^y_i} f_{L_x,L_y}(x_i, y_{ij}) - \mathbb{E}_{y|\{x_i\}} f_{L_x,L_y}(x_i, y)\big)\big|$, denoted as $D_1(S)$; [sent-285, score-0.388]

48 2. $\sup_{L_x,L_y}\big|\frac{1}{n^x}\sum_{i=1}^{n^x}\mathbb{E}_{y|\{x_i\}} f_{L_x,L_y}(x_i, y) - \mathbb{E}_{x,y} f_{L_x,L_y}(x, y)\big|$, denoted as $D_2(\{x_i\}_{i=1}^{n^x})$. [sent-286, score-0.268]

49 We have Theorem 1: given an arbitrarily small positive number $\delta$, with probability at least $1 - \delta$, the following inequality holds: $D_1(S) \le \frac{2CR}{\sqrt{n^x \bar{n}^y}} + \frac{RB\sqrt{2\log\frac{1}{\delta}}}{\sqrt{n^x \bar{n}^y}}$, where $\bar{n}^y$ represents the harmonic mean of $\{n^y_i\}_{i=1}^{n^x}$. [sent-290, score-0.204]

50 Combining Theorem 1 and Theorem 2, we are able to bound $D(S)$: Theorem 3 states that, given an arbitrarily small positive number $\delta$, with probability at least $1 - 2\delta$, $D(S) \le \big(2CR + RB\sqrt{2\log\frac{1}{\delta}}\big)\big(\frac{1}{\sqrt{n^x \bar{n}^y}} + \frac{1}{\sqrt{n^x}}\big)$. [sent-292, score-0.536]

51 Since $\bar{n}^y = n^x / \sum_{i=1}^{n^x} (1/n^y_i)$, the bound tells us that to make the gap between the empirical objective and the expected objective small enough, we not only need large $n^x$, but also need large $n^y_i$ for each $x_i$, which is consistent with our intuition. [sent-294, score-0.733]

52 Theorem 5: suppose that $\mathcal{H}_x = \{L_x \mid |l_{xu}| \le \lambda_x,\ \|l_{xu}\| \le \theta_x,\ 1 \le u \le d_x\}$ and $\mathcal{H}_y = \{L_y \mid |l_{yv}| \le \lambda_y,\ \|l_{yv}\| \le \theta_y,\ 1 \le v \le d_y\}$. [sent-299, score-0.206]

53 If we suppose that the numbers of nonzero elements in $x$ and $y$ are respectively bounded by $m_x$ and $m_y$, then $B = \sqrt{m_x m_y}\,\min(d\lambda_x\lambda_y,\, \theta_x\theta_y)$ and $C = \sqrt{d_x d_y}\,\min(\lambda_x\lambda_y,\, \theta_x\theta_y)$. [sent-300, score-0.294]

54 Thus, the generalization bound for RMLS is given by $D(S) \le \big(\frac{1}{\sqrt{n^x \bar{n}^y}} + \frac{1}{\sqrt{n^x}}\big)\big(2\sqrt{d_x d_y}\,\min(\lambda_x\lambda_y, \theta_x\theta_y)R + \sqrt{m_x m_y}\,\min(d\lambda_x\lambda_y, \theta_x\theta_y)R\sqrt{2\log\tfrac{1}{\delta}}\big)$. [sent-301, score-0.25]

55 Figure 1: Click-through bipartite graph with metadata on nodes, representing queries and documents in feature spaces and their associations. [sent-302, score-0.28]

56 The features may stand for the content of queries and documents and the clicks of queries and documents on the bipartite graph (Baeza-Yates and Tiberi, 2007), as seen below. [sent-309, score-0.535]

57 In this case, we actually leveraged both user clicks and features of queries and documents to perform matching. [sent-315, score-0.268]

58 After filtering out noise, there are 94,022 queries and 111,631 documents in the one-week data set, and 6,372,254 queries and 4,599,849 documents in the half-year data set. [sent-319, score-0.474]

59 For the word feature, we represented queries and documents as tf-idf vectors (Salton and McGill, 1986) in a word space, where words are extracted from queries, URLs and the titles of documents. [sent-321, score-0.237]

60 For the click feature, we followed (Baeza-Yates and Tiberi, 2007) and took the number of clicks of documents as a feature of queries, and the number of clicks of queries as a feature of documents. [sent-323, score-0.329]
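
A small sketch of deriving such cross-type click features from a click-through bipartite is given below; the click matrix and its layout are illustrative assumptions, not the feature pipeline used in the experiments.

```python
import numpy as np

# Hypothetical click matrix C: C[i, j] = number of clicks between query i
# and document j on the bipartite graph.
C = np.array([
    [5, 0, 2],
    [0, 7, 0],
    [1, 0, 3],
])

# In the spirit of Baeza-Yates and Tiberi (2007): a query is described by its
# clicks on documents, and a document by the clicks it receives from queries.
query_click_features = C    # row i: click feature vector of query i
doc_click_features = C.T    # row j: click feature vector of document j

# These can be concatenated with word (tf-idf) features to form the
# heterogeneous query and document representations x and y.
print(query_click_features.shape, doc_click_features.shape)
```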

61 (2011), and it can leverage both the user clicks and the content of queries and documents. [sent-339, score-0.197]

62 It learns a large margin perceptron to map queries and documents into a latent space and measures their similarity in the space. [sent-341, score-0.228]

63 For the one-week data, we obtained 4,445 judged queries and each query has on average 11. [sent-350, score-0.227]

64 There are 57,514 judged queries and each query has on average 13. [sent-353, score-0.191]

65 The reason is that PLS requires SVD and has a complexity of at least $O(d\,c\,d_x d_y + d^2\max(d_x, d_y))$, where $c$ represents the density of the matrix for SVD. [sent-465, score-0.196]

66 , large dx and dy ) and the quadratic growth with respect to d still make SVD quite expensive. [sent-468, score-0.206]

67 We further tested the performance of RMLS and PLS on a half-year data set with millions of queries and documents. [sent-477, score-0.192]

68 In this section, we investigate the effect of matching models as features in a state-of-the-art learning-to-rank algorithm and the performance of matching models across queries with different numbers of clicks. [sent-513, score-0.293]

69 An interesting question is therefore how different matching models perform across queries with different numbers of clicks. [sent-566, score-0.194]

70 We took four levels: totalclick ≤ 10, 10 < totalclick ≤ 100, 100 < totalclick ≤ 1000, and totalclick > 1000. [sent-568, score-0.224]

71 From Table 6, we can see that RMLS and PLS beat other baseline methods on queries with moderate and large numbers of clicks, but lose to RW and RW+BM25 when queries only have relatively few clicks (less than 100). [sent-571, score-0.336]

72 , logarithm) on click numbers. [sent-576, score-0.22]

73 Table 6: Evaluation on different query bins on one week data. [sent-708, score-0.263]

74 We applied both PLS and RMLS to web search, leveraging a click-through bipartite graph with metadata representing features of queries and documents to learn relevance models. [sent-777, score-0.395]

75 Results on a small data set and a large data set with millions of queries and documents show the promising performance of PLS and RMLS, and particularly demonstrate the advantage of RMLS on scalability. [sent-778, score-0.213]

76 If we define $a_{uv} = \sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i}\frac{r_{ij}}{n^x n^y_i}\, x_i^{(u)} y_{ij}^{(v)}$, the objective of problem (8) can be re-written as $\sum_{u=1}^{d_x}\sum_{v=1}^{d_y} a_{uv}\, l_{xu} l_{yv}^\top$. [sent-785, score-0.908]

77 u=1 v=1 k=1 With the existence of the upper bound, we can see that if Lx = λx ex l ⊤ and Ly = λy ey l ⊤ , the value of the objective (8) is dx ∑ dy ⊤ ∑ auv lxu lyv = u=1 v=1 dx dy ∑ ∑ auv λx λy l u=1 v=1 2536 2 dx = dy ∑ ∑ auv λx λy . [sent-789, score-1.322]

78 Proof of Theorem 1. Theorem 1: given an arbitrarily small positive number $\delta$, with probability at least $1 - \delta$, the following inequality holds: $D_1(S) \le \frac{2CR}{\sqrt{n^x \bar{n}^y}} + \frac{RB\sqrt{2\log\frac{1}{\delta}}}{\sqrt{n^x \bar{n}^y}}$, where $\bar{n}^y$ represents the harmonic mean of $\{n^y_i\}_{i=1}^{n^x}$. [sent-794, score-0.204]

79 To prove this theorem, we need two lemmas. Lemma 1: given $\varepsilon > 0$, the following inequality holds: $P\big(D_1(S) - \mathbb{E}_{\{y_{ij}\}|\{x_i\}} D_1(S) \ge \varepsilon \mid \{x_i\}\big) \le \exp\big(-\frac{\varepsilon^2 n^x \bar{n}^y}{2R^2B^2}\big)$. [sent-795, score-0.336]

80 Given $\{x_i\}_{i=1}^{n^x}$, the $\{y_{ij}\}$ are independent. [sent-798, score-0.253]

81 $\mathbb{E}_{\{y_{ij}, y'_{ij}, \sigma_{ij}\}|\{x_i\}}\sup_{L_x,L_y}\big|\sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i}\frac{1}{n^x n^y_i}\,\sigma_{ij}\big(f(x_i, y_{ij}) - f(x_i, y'_{ij})\big)\big| \le 2\,\mathbb{E}_{\{y_{ij},\sigma_{ij}\}|\{x_i\}}\sup_{L_x,L_y}\big|\sum_{i=1}^{n^x}\sum_{j=1}^{n^y_i}\frac{1}{n^x n^y_i}\,\sigma_{ij}\, f(x_i, y_{ij})\big|$. [sent-811, score-1.136]

82 Note that $\sigma_{ij}\, f(x_i, y_{ij}) = \sigma_{ij}\, r(x_i, y_{ij})\, x_i^\top L_x L_y^\top y_{ij} = \sigma_{ij}\,\langle \mathrm{vec}(L_x L_y^\top),\ r(x_i, y_{ij})\,\mathrm{vec}(y_{ij} \otimes x_i)\rangle$, where $y_{ij} \otimes x_i$ represents the tensor of column vectors $y_{ij}$ and $x_i$, and $\mathrm{vec}(\cdot)$ is the vectorization of a matrix. [sent-812, score-0.771]

83 Letting $\exp\big(-\frac{\varepsilon^2 n^x \bar{n}^y}{2R^2B^2}\big) = \delta$, we have $-\frac{\varepsilon^2 n^x \bar{n}^y}{2R^2B^2} = \log\delta$, so $\varepsilon^2 = \frac{2R^2B^2\log\frac{1}{\delta}}{n^x \bar{n}^y}$ and $\varepsilon = \frac{RB\sqrt{2\log\frac{1}{\delta}}}{\sqrt{n^x \bar{n}^y}}$. [sent-817, score-1.008]

84 Since $\{x_i\}_{i=1}^{n^x}$ are i.i.d. random variables, by McDiarmid's inequality (Bartlett and Mendelson, 2002) we know $P\big(D_2(\{x_i\}_{i=1}^{n^x}) - \mathbb{E}_{\{x_i\}} D_2(\{x_i\}_{i=1}^{n^x}) \ge \varepsilon\big) \le \exp\big(-\frac{2\varepsilon^2}{\sum_{i=1}^{n^x} 4R^2B^2/(n^x)^2}\big) = \exp\big(-\frac{\varepsilon^2 n^x}{2R^2B^2}\big)$; Lemma 4 then bounds $\mathbb{E}_{\{x_i\}} D_2(\{x_i\}_{i=1}^{n^x})$. [sent-826, score-0.536]

85 Since $\sup_{L_x,L_y}\|\mathrm{vec}(L_x L_y^\top)\| \le C$, we have $2\,\mathbb{E}_{\{x_i\},\{\sigma_i\}}\,\mathbb{E}_{y|\{x_i\}}\sup_{L_x,L_y}\|\mathrm{vec}(L_x L_y^\top)\|\sqrt{\frac{1}{(n^x)^2}\sum_{i=1}^{n^x}\sum_{j=1}^{n^x}\sigma_i\sigma_j\, r(x_i, y) r(x_j, y)\,\langle x_i, x_j\rangle\langle y, y\rangle} \le \frac{2CR}{\sqrt{n^x}}$. [sent-838, score-0.645]

86 With Lemma 3 and Lemma 4, we can prove Theorem 2. Proof: combining the conclusions of Lemma 3 and Lemma 4, we have $P\big(D_2(\{x_i\}_{i=1}^{n^x}) - \frac{2CR}{\sqrt{n^x}} \ge \varepsilon\big) \le P\big(D_2(\{x_i\}_{i=1}^{n^x}) - \mathbb{E}_{\{x_i\}} D_2(\{x_i\}_{i=1}^{n^x}) \ge \varepsilon\big) \le \exp\big(-\frac{\varepsilon^2 n^x}{2R^2B^2}\big)$. [sent-840, score-0.268]

87 Given a small number $\delta > 0$, by letting $\exp\big(-\frac{\varepsilon^2 n^x}{2R^2B^2}\big) = \delta$, we have $-\frac{\varepsilon^2 n^x}{2R^2B^2} = \log\delta$, so $\varepsilon^2 = \frac{2R^2B^2\log\frac{1}{\delta}}{n^x}$ and $\varepsilon = \frac{RB\sqrt{2\log\frac{1}{\delta}}}{\sqrt{n^x}}$. [sent-841, score-0.536]

88 Thus, with probability at least $1 - \delta$, $D_2(\{x_i\}_{i=1}^{n^x}) \le \frac{2CR}{\sqrt{n^x}} + \frac{RB\sqrt{2\log\frac{1}{\delta}}}{\sqrt{n^x}}$ holds true. [sent-842, score-0.536]

89 Here $\{l_x^k\}_{k=1}^{d}$ and $\{l_y^k\}_{k=1}^{d}$ represent the columns of $L_x$ and $L_y$ respectively. [sent-854, score-0.349]

90 Note that $\|L_x^\top x\|^2 = \sum_{k=1}^{d}\big(x^\top l_x^k\big)^2$ and $\|L_y^\top y\|^2 = \sum_{k=1}^{d}\big(y^\top l_y^k\big)^2$. [sent-855, score-0.921]

91 Proof of Theorem 5. Theorem 5: suppose that $\mathcal{H}_x = \{L_x \mid |l_{xu}| \le \lambda_x,\ \|l_{xu}\| \le \theta_x,\ 1 \le u \le d_x\}$ and $\mathcal{H}_y = \{L_y \mid |l_{yv}| \le \lambda_y,\ \|l_{yv}\| \le \theta_y,\ 1 \le v \le d_y\}$. [sent-865, score-0.206]

92 If we suppose that the numbers of nonzero elements in $x$ and $y$ are respectively bounded by $m_x$ and $m_y$, then $B = \sqrt{m_x m_y}\,\min(d\lambda_x\lambda_y,\, \theta_x\theta_y)$ and $C = \sqrt{d_x d_y}\,\min(\lambda_x\lambda_y,\, \theta_x\theta_y)$. [sent-866, score-0.294]

93 Thus, the generalization bound for RMLS is given by $D(S) \le \big(2\sqrt{d_x d_y}\,\min(\lambda_x\lambda_y, \theta_x\theta_y)R + \sqrt{m_x m_y}\,\min(d\lambda_x\lambda_y, \theta_x\theta_y)R\sqrt{2\log\tfrac{1}{\delta}}\big)\big(\frac{1}{\sqrt{n^x \bar{n}^y}} + \frac{1}{\sqrt{n^x}}\big)$. Proof: remember that $B$ is defined by $\sup_{x,y,L_x,L_y}\|L_x^\top x\|\,\|L_y^\top y\|$. [sent-867, score-0.586]

94 Since $\|L_x^\top x\|^2 = \|\sum_{u=1}^{d_x} x^{(u)} l_{xu}^\top\|^2$ and $\|L_y^\top y\|^2 = \|\sum_{v=1}^{d_y} y^{(v)} l_{yv}^\top\|^2$, where $x = [x^{(1)}, x^{(2)}, \ldots, x^{(d_x)}]^\top$. [sent-869, score-0.707]

95 Since $\|x\| \le 1$, $\|l_{xu}\| \le \theta_x$, and $\#\{x^{(u)} \mid x^{(u)} \ne 0\} \le m_x$, we have $\|L_x^\top x\|^2 = \|\sum_{u=1}^{d_x} x^{(u)} l_{xu}^\top\|^2 \le \big(\sum_{u=1}^{d_x}(x^{(u)})^2\big)\big(\sum_{u=1}^{d_x}\mathbf{1}[x^{(u)} \ne 0]\,\|l_{xu}\|^2\big) \le \|x\|^2\, m_x\,\theta_x^2 \le m_x\theta_x^2$. [sent-884, score-1.127]

96 Similarly, since $\|y\| \le 1$, $\|l_{yv}\|^2 \le \theta_y^2$, and $\#\{y^{(v)} \mid y^{(v)} \ne 0\} \le m_y$, we have $\|L_y^\top y\|^2 \le m_y\theta_y^2$, and hence $\|L_x^\top x\|\,\|L_y^\top y\| \le \sqrt{m_x m_y}\,\theta_x\theta_y$. [sent-885, score-0.262]

97 Alternatively, using $|l_{xu}| \le \lambda_x$, each coordinate satisfies $|(L_x^\top x)^{(k)}| \le \sum_{u=1}^{d_x}\mathbf{1}[x^{(u)} \ne 0]\,|x^{(u)}|\,|l_{xu}^{(k)}| \le \lambda_x\sqrt{m_x}$, so summing over the $d$ coordinates gives $\|L_x^\top x\|^2 \le d\,\lambda_x^2\, m_x$. [sent-888, score-0.216]

98 Similarly, since $\|y\| \le 1$, $|l_{yv}| \le \lambda_y$ for all $1 \le v \le d_y$, and $\#\{y^{(v)} \mid y^{(v)} \ne 0\} \le m_y$, we have $\|L_y^\top y\|^2 \le d\,\lambda_y^2\, m_y$. [sent-890, score-0.206]

99 Since $\|l_{xu}\| \le \theta_x$ and $\|l_{yv}\| \le \theta_y$ for all $1 \le u \le d_x$ and $1 \le v \le d_y$, we have $\|\mathrm{vec}(L_x L_y^\top)\|^2 = \mathrm{trace}(L_y L_x^\top L_x L_y^\top) = \sum_{u=1}^{d_x}\sum_{v=1}^{d_y}\big(\sum_{k=1}^{d} l_{xu}^{(k)} l_{yv}^{(k)}\big)^2 \le \sum_{u=1}^{d_x}\sum_{v=1}^{d_y}\|l_{xu}\|^2\|l_{yv}\|^2 \le d_x d_y\,\theta_x^2\theta_y^2$. [sent-895, score-1.814]

100 Therefore, we have $\sup_{L_x,L_y}\|\mathrm{vec}(L_x L_y^\top)\| \le \sqrt{d_x d_y}\,\min(\lambda_x\lambda_y,\, \theta_x\theta_y)$, and we can choose $C = \sqrt{d_x d_y}\,\min(\lambda_x\lambda_y,\, \theta_x\theta_y)$. [sent-897, score-1.231]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lx', 0.572), ('ly', 0.349), ('rmls', 0.33), ('lxu', 0.283), ('nx', 0.268), ('pls', 0.25), ('lyv', 0.218), ('queries', 0.132), ('dx', 0.108), ('dy', 0.098), ('vec', 0.083), ('ni', 0.082), ('ndcg', 0.074), ('ny', 0.068), ('ilinear', 0.065), ('documents', 0.065), ('matching', 0.062), ('xi', 0.061), ('wxu', 0.06), ('atching', 0.06), ('ueries', 0.06), ('ssi', 0.056), ('totalclick', 0.056), ('wyv', 0.056), ('ocuments', 0.056), ('web', 0.055), ('auv', 0.051), ('svdfeature', 0.051), ('ey', 0.05), ('clicks', 0.05), ('hy', 0.05), ('flx', 0.046), ('hx', 0.046), ('rw', 0.044), ('mx', 0.044), ('bltm', 0.042), ('bipartite', 0.04), ('relevance', 0.039), ('query', 0.039), ('odel', 0.039), ('ri', 0.038), ('yi', 0.038), ('lsiqd', 0.037), ('week', 0.036), ('document', 0.033), ('click', 0.032), ('latent', 0.031), ('year', 0.029), ('ranking', 0.028), ('metadata', 0.028), ('nonzeros', 0.028), ('dot', 0.027), ('search', 0.025), ('semantic', 0.025), ('earning', 0.024), ('lsi', 0.024), ('jv', 0.024), ('rb', 0.024), ('sup', 0.024), ('craswell', 0.023), ('salton', 0.023), ('objects', 0.023), ('heterogeneous', 0.022), ('baseline', 0.022), ('svd', 0.022), ('id', 0.022), ('features', 0.021), ('judged', 0.02), ('word', 0.02), ('collaborative', 0.02), ('robertson', 0.02), ('match', 0.02), ('judgments', 0.019), ('mcgill', 0.019), ('ponte', 0.019), ('schreier', 0.019), ('suplx', 0.019), ('sigir', 0.017), ('deerwester', 0.016), ('mart', 0.016), ('szummer', 0.016), ('wx', 0.016), ('yiu', 0.016), ('zhai', 0.016), ('millions', 0.016), ('walk', 0.016), ('rank', 0.016), ('response', 0.016), ('half', 0.015), ('indexing', 0.015), ('lt', 0.015), ('content', 0.015), ('graph', 0.015), ('cao', 0.014), ('xu', 0.014), ('wu', 0.014), ('degenerative', 0.014), ('grangier', 0.014), ('huawei', 0.014), ('lmir', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999905 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents

Author: Wei Wu, Zhengdong Lu, Hang Li

Abstract: The task of matching data from two heterogeneous domains naturally arises in various areas such as web search, collaborative filtering, and drug design. In web search, existing work has designed relevance models to match queries and documents by exploiting either user clicks or content of queries and documents. To the best of our knowledge, however, there has been little work on principled approaches to leveraging both clicks and content to learn a matching model for search. In this paper, we propose a framework for learning to match heterogeneous objects. The framework learns two linear mappings for two objects respectively, and matches them via the dot product of their images after mapping. Moreover, when different regularizations are enforced, the framework renders a rich family of matching models. With orthonormal constraints on mapping functions, the framework subsumes Partial Least Squares (PLS) as a special case. Alternatively, with a ℓ1 +ℓ2 regularization, we obtain a new model called Regularized Mapping to Latent Structures (RMLS). RMLS enjoys many advantages over PLS, including lower time complexity and easy parallelization. To further understand the matching framework, we conduct generalization analysis and apply the result to both PLS and RMLS. We apply the framework to web search and implement both PLS and RMLS using a click-through bipartite with metadata representing features of queries and documents. We test the efficacy and scalability of RMLS and PLS on large scale web search problems. The results show that both PLS and RMLS can significantly outperform baseline methods, while RMLS substantially speeds up the learning process. Keywords: web search, partial least squares, regularized mapping to latent structures, generalization analysis

2 0.044167973 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies

Author: Joachim Niehren, Jérôme Champavère, Aurélien Lemay, Rémi Gilleron

Abstract: Inference algorithms for tree automata that define node selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We will distinguish the class of regular queries that are stable under a given schemaguided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justified by a formal learning model, our learning algorithm for stable queries also performs very well in practice of XML information extraction. Keywords: XML information extraction, XML schemas, interactive learning, tree automata, grammatical inference

3 0.041063901 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections

Author: Aleksandrs Slivkins, Filip Radlinski, Sreenivas Gollapudi

Abstract: Most learning to rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with several scalable algorithms that explicitly takes document similarity and ranking context into account. Our formulation is a non-trivial common generalization of two multi-armed bandit models from the literature: ranked bandits (Radlinski et al., 2008) and Lipschitz bandits (Kleinberg et al., 2008b). We present theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches. Keywords: online learning, clickthrough data, diversity, multi-armed bandits, contextual bandits, regret, metric spaces

4 0.034388129 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

Author: Sivan Sabato, Nathan Srebro, Naftali Tishby

Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence

5 0.031756204 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels

Author: Kenji Fukumizu, Le Song, Arthur Gretton

Abstract: A kernel method for realizing Bayes’ rule is proposed, based on representations of probabilities in reproducing kernel Hilbert spaces. Probabilities are uniquely characterized by the mean of the canonical map to the RKHS. The prior and conditional probabilities are expressed in terms of RKHS functions of an empirical sample: no explicit parametric model is needed for these quantities. The posterior is likewise an RKHS mean of a weighted sample. The estimator for the expectation of a function of the posterior is derived, and rates of consistency are shown. Some representative applications of the kernel Bayes’ rule are presented, including Bayesian computation without likelihood and filtering with a nonparametric state-space model. Keywords: kernel method, Bayes’ rule, reproducing kernel Hilbert space

6 0.029629432 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

7 0.029525617 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

8 0.026172135 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

9 0.025384204 33 jmlr-2013-Dimension Independent Similarity Computation

10 0.02474313 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

11 0.022623654 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

12 0.02252521 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

13 0.022308294 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

14 0.021084905 104 jmlr-2013-Sparse Single-Index Model

15 0.019328458 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

16 0.018576875 108 jmlr-2013-Stochastic Variational Inference

17 0.018464565 68 jmlr-2013-Machine Learning with Operational Costs

18 0.01824614 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

19 0.017736489 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

20 0.016829496 58 jmlr-2013-Language-Motivated Approaches to Action Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.096), (1, 0.019), (2, -0.01), (3, 0.021), (4, -0.014), (5, 0.018), (6, 0.01), (7, 0.014), (8, -0.03), (9, 0.003), (10, -0.006), (11, -0.068), (12, 0.02), (13, 0.053), (14, -0.046), (15, 0.079), (16, -0.053), (17, -0.023), (18, 0.078), (19, 0.014), (20, -0.072), (21, 0.088), (22, 0.068), (23, 0.14), (24, 0.177), (25, -0.014), (26, 0.076), (27, -0.1), (28, 0.026), (29, 0.016), (30, 0.336), (31, 0.0), (32, 0.147), (33, 0.152), (34, -0.123), (35, -0.003), (36, 0.03), (37, -0.147), (38, 0.065), (39, 0.0), (40, 0.081), (41, 0.053), (42, -0.048), (43, -0.213), (44, 0.213), (45, -0.031), (46, -0.106), (47, -0.07), (48, 0.297), (49, -0.251)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.957515 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents

Author: Wei Wu, Zhengdong Lu, Hang Li

Abstract: The task of matching data from two heterogeneous domains naturally arises in various areas such as web search, collaborative filtering, and drug design. In web search, existing work has designed relevance models to match queries and documents by exploiting either user clicks or content of queries and documents. To the best of our knowledge, however, there has been little work on principled approaches to leveraging both clicks and content to learn a matching model for search. In this paper, we propose a framework for learning to match heterogeneous objects. The framework learns two linear mappings for two objects respectively, and matches them via the dot product of their images after mapping. Moreover, when different regularizations are enforced, the framework renders a rich family of matching models. With orthonormal constraints on mapping functions, the framework subsumes Partial Least Squares (PLS) as a special case. Alternatively, with a ℓ1 +ℓ2 regularization, we obtain a new model called Regularized Mapping to Latent Structures (RMLS). RMLS enjoys many advantages over PLS, including lower time complexity and easy parallelization. To further understand the matching framework, we conduct generalization analysis and apply the result to both PLS and RMLS. We apply the framework to web search and implement both PLS and RMLS using a click-through bipartite with metadata representing features of queries and documents. We test the efficacy and scalability of RMLS and PLS on large scale web search problems. The results show that both PLS and RMLS can significantly outperform baseline methods, while RMLS substantially speeds up the learning process. Keywords: web search, partial least squares, regularized mapping to latent structures, generalization analysis

2 0.42512929 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies

Author: Joachim Niehren, Jérôme Champavère, Aurélien Lemay, Rémi Gilleron

Abstract: Inference algorithms for tree automata that define node selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We will distinguish the class of regular queries that are stable under a given schemaguided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justified by a formal learning model, our learning algorithm for stable queries also performs very well in practice of XML information extraction. Keywords: XML information extraction, XML schemas, interactive learning, tree automata, grammatical inference

3 0.37244812 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels

Author: Kenji Fukumizu, Le Song, Arthur Gretton

Abstract: A kernel method for realizing Bayes’ rule is proposed, based on representations of probabilities in reproducing kernel Hilbert spaces. Probabilities are uniquely characterized by the mean of the canonical map to the RKHS. The prior and conditional probabilities are expressed in terms of RKHS functions of an empirical sample: no explicit parametric model is needed for these quantities. The posterior is likewise an RKHS mean of a weighted sample. The estimator for the expectation of a function of the posterior is derived, and rates of consistency are shown. Some representative applications of the kernel Bayes’ rule are presented, including Bayesian computation without likelihood and filtering with a nonparametric state-space model. Keywords: kernel method, Bayes’ rule, reproducing kernel Hilbert space

4 0.25197843 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections

Author: Aleksandrs Slivkins, Filip Radlinski, Sreenivas Gollapudi

Abstract: Most learning to rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with several scalable algorithms that explicitly takes document similarity and ranking context into account. Our formulation is a non-trivial common generalization of two multi-armed bandit models from the literature: ranked bandits (Radlinski et al., 2008) and Lipschitz bandits (Kleinberg et al., 2008b). We present theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches. Keywords: online learning, clickthrough data, diversity, multi-armed bandits, contextual bandits, regret, metric spaces

5 0.23872051 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

Author: Ery Arias-Castro, Bruno Pelletier

Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.

6 0.22848707 15 jmlr-2013-Bayesian Canonical Correlation Analysis

7 0.22096175 95 jmlr-2013-Ranking Forests

8 0.19580542 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

9 0.19466206 33 jmlr-2013-Dimension Independent Similarity Computation

10 0.15704082 104 jmlr-2013-Sparse Single-Index Model

11 0.14552554 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

12 0.14515291 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

13 0.14490555 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

14 0.13584888 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

15 0.13573483 68 jmlr-2013-Machine Learning with Operational Costs

16 0.13562283 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

17 0.13070269 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

18 0.12868594 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

19 0.1265327 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

20 0.1263932 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.023), (5, 0.078), (6, 0.025), (10, 0.045), (20, 0.02), (23, 0.036), (41, 0.014), (53, 0.021), (63, 0.458), (68, 0.017), (70, 0.033), (73, 0.012), (75, 0.034), (85, 0.037), (87, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73126429 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents

Author: Wei Wu, Zhengdong Lu, Hang Li

Abstract: The task of matching data from two heterogeneous domains naturally arises in various areas such as web search, collaborative filtering, and drug design. In web search, existing work has designed relevance models to match queries and documents by exploiting either user clicks or content of queries and documents. To the best of our knowledge, however, there has been little work on principled approaches to leveraging both clicks and content to learn a matching model for search. In this paper, we propose a framework for learning to match heterogeneous objects. The framework learns two linear mappings for two objects respectively, and matches them via the dot product of their images after mapping. Moreover, when different regularizations are enforced, the framework renders a rich family of matching models. With orthonormal constraints on mapping functions, the framework subsumes Partial Least Squares (PLS) as a special case. Alternatively, with a ℓ1 +ℓ2 regularization, we obtain a new model called Regularized Mapping to Latent Structures (RMLS). RMLS enjoys many advantages over PLS, including lower time complexity and easy parallelization. To further understand the matching framework, we conduct generalization analysis and apply the result to both PLS and RMLS. We apply the framework to web search and implement both PLS and RMLS using a click-through bipartite with metadata representing features of queries and documents. We test the efficacy and scalability of RMLS and PLS on large scale web search problems. The results show that both PLS and RMLS can significantly outperform baseline methods, while RMLS substantially speeds up the learning process. Keywords: web search, partial least squares, regularized mapping to latent structures, generalization analysis

2 0.2393553 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

3 0.23747711 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

Author: Sébastien Gerchinovitz

Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds

4 0.2365344 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

Author: Partha Niyogi

Abstract: Manifold regularization (Belkin et al., 2006) is a geometrically motivated framework for machine learning within which several semi-supervised algorithms have been constructed. Here we try to provide some theoretical understanding of this approach. Our main result is to expose the natural structure of a class of problems on which manifold regularization methods are helpful. We show that for such problems, no supervised learner can learn effectively. On the other hand, a manifold based learner (that knows the manifold or “learns” it from unlabeled examples) can learn with relatively few labeled examples. Our analysis follows a minimax style with an emphasis on finite sample results (in terms of n: the number of labeled examples). These results allow us to properly interpret manifold regularization and related spectral and geometric algorithms in terms of their potential use in semi-supervised learning. Keywords: semi-supervised learning, manifold regularization, graph Laplacian, minimax rates

5 0.23434468 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

6 0.23380713 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

7 0.23292382 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

8 0.23289765 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

9 0.23215337 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

10 0.23186074 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

11 0.23062812 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

12 0.23041424 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

13 0.22945024 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

14 0.22943303 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

15 0.2291922 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

16 0.22898267 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

17 0.22894706 32 jmlr-2013-Differential Privacy for Functions and Functional Data

18 0.22866058 108 jmlr-2013-Stochastic Variational Inference

19 0.2286444 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

20 0.22831976 68 jmlr-2013-Machine Learning with Operational Costs