nips nips2010 nips2010-74 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas Peel, Sandrine Anthoine, Liva Ralaivola
Abstract: We present original empirical Bernstein inequalities for U-statistics with bounded symmetric kernels q. They are expressed with respect to empirical estimates of either the variance of q or the conditional variance that appears in the Bernstein-type inequality for U-statistics derived by Arcones [2]. Our result subsumes other existing empirical Bernstein inequalities, as it reduces to them when U-statistics of order 1 are considered. In addition, it is based on a rather direct argument using two applications of the same (non-empirical) Bernstein inequality for U-statistics. We discuss potential applications of our new inequalities, especially in the realm of learning ranking/scoring functions. In the process, we exhibit an efficient procedure to compute the variance estimates for the special case of bipartite ranking that rests on a sorting argument. We also argue that our results may provide test set bounds and particularly interesting empirical racing algorithms for the problem of online learning of scoring functions. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present original empirical Bernstein inequalities for U-statistics with bounded symmetric kernels q. [sent-12, score-0.331]
2 They are expressed with respect to empirical estimates of either the variance of q or the conditional variance that appears in the Bernstein-type inequality for U-statistics derived by Arcones [2]. [sent-13, score-0.247]
3 Our result subsumes other existing empirical Bernstein inequalities, as it reduces to them when U-statistics of order 1 are considered. [sent-14, score-0.066]
4 In addition, it is based on a rather direct argument using two applications of the same (non-empirical) Bernstein inequality for U-statistics. [sent-15, score-0.057]
5 In the process, we exhibit an efficient procedure to compute the variance estimates for the special case of bipartite ranking that rests on a sorting argument. [sent-17, score-0.355]
6 We also argue that our results may provide test set bounds and particularly interesting empirical racing algorithms for the problem of online learning of scoring functions. [sent-18, score-0.264]
7 Among those, we especially have in mind the task of ranking, where one is interested in learning a ranking function capable of predicting an accurate ordering of objects according to some attached relevance information. [sent-20, score-0.14]
8 Tackling such problems generally implies the use of loss functions other than the 0-1 misclassification loss such as, for example, a misranking loss [6] or a surrogate thereof. [sent-21, score-0.193]
9 (e.g., X = R^d and Y = R), the misranking loss ℓ_rank and a surrogate convex loss ℓ_sur may be defined for a scoring function f ∈ Y^X as: ℓ_rank(f, (x, y), (x′, y′)) := 1_{(y−y′)(f(x)−f(x′))<0}, (1) and ℓ_sur(f, (x, y), (x′, y′)) := (1 − (y − y′)(f(x) − f(x′)))². (2) [sent-24, score-0.451]
10 In practice, this naturally brings up the empirical estimate R̂_ℓ(f, Z^n) := (1/(n(n − 1))) Σ_{i≠j} ℓ(f, (X_i, Y_i), (X_j, Y_j)), (3) which is a U-statistic [6, 10]. [sent-26, score-0.066]
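As an illustrative sketch (ours, not the paper's code; the function names and the linear scorer are hypothetical), the order-2 U-statistic of Eq. (3) with the misranking kernel of Eq. (1) can be computed naively in O(n²):

```python
import numpy as np

def misranking_kernel(f, xi, yi, xj, yj):
    # q_f of Eq. (1): 1 if f misorders the pair, 0 otherwise.
    return float((yi - yj) * (f(xi) - f(xj)) < 0)

def u_statistic_order2(f, X, Y):
    # Eq. (3): average of q_f over all ordered pairs i != j.
    n = len(X)
    total = sum(misranking_kernel(f, X[i], Y[i], X[j], Y[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# Hypothetical usage with a random sample and a linear scoring function.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = rng.choice([-1.0, 1.0], size=50)
f = lambda x: x @ np.array([0.5, -0.2, 0.1])
print(u_statistic_order2(f, X, Y))
```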
11 It is therefore of the utmost importance to have at hand tail inequalities that are sharp; it is just as important that these inequalities rely as much as possible on empirical quantities. [sent-29, score-0.588]
12 Here, we propose new empirical Bernstein inequalities for U-statistics. [sent-30, score-0.262]
13 Our new inequalities generalize those of [3] and [13], which also feature points (i) and (ii) (but not (iii)), while being based on simple arguments. [sent-32, score-0.196]
14 Section 2 introduces the notations and briefly recalls the basics of U-statistics as well as tail inequalities our results are based upon. [sent-35, score-0.326]
15 Our empirical Bernstein inequalities are presented in Section 3; we also provide an efficient way of computing the empirical variance when the U-statistics considered are based on the misranking loss ℓ_rank of (1). [sent-36, score-0.563]
16 Section 4 discusses two applications of our new results: test set bounds for bipartite ranking and online ranking. [sent-37, score-0.319]
17 A kernel q is symmetric when q(z_1, . . . , z_m) is independent of the order of the z_i's in z. [sent-54, score-0.258]
18 E_{Z^m} q(Z_1, . . . , Z_m) = E_{Z^n} Û_q(Z^n); in addition, Û_q(Z^n) is a lowest-variance unbiased estimate of E_{Z^m} q(Z_1, . . . , Z_m). [sent-65, score-0.05]
19 Eq. (3) is a U-statistic of order 2 with kernel q_f(Z, Z′) := ℓ(f, Z, Z′). [sent-70, score-0.627]
20 Figure 1: First two plots: values of the right-hand side of (5) and (6) (curves ε_Hoeffding and ε_Bernstein), for D_uni and kernel q_m, for m = 2 and m = 10 (see Example 1), as functions of n. [sent-109, score-0.065]
21 We recall below the tail inequalities (Eqs. (5), (6), (7)) that hold for U-statistics with symmetric and bounded kernels q. [sent-113, score-0.069]
22 Normally, these inequalities make explicit use of the length q_max − q_min of the range [q_min, q_max] of q. [sent-114, score-0.291]
23 To simplify the reading, we will consider without loss of generality that q has range [0, 1] (an easy way of retrieving the results for bounded q is to consider q/‖q‖_∞). [sent-115, score-0.068]
24 One key quantity that appears in the original versions of tail inequalities (5) and (6) below is ⌊n/m⌋, the integer part of the ratio n/m – this quantity might be thought of as the effective number of data. [sent-116, score-0.326]
25 Theorem 1 (First-order tail inequality for Û_q, [11]). [sent-118, score-0.187]
26 Hoeffding proved the following: ∀ε > 0, P_{Z^n}( |E_{Z^n} Û_q(Z^n) − Û_q(Z^n)| ≥ ε ) ≤ 2 exp(−⌊n/m⌋ ε²); (4) hence, ∀δ ∈ (0, 1], with probability at least 1 − δ over the random draw of Z^n: |E_{Z^n} Û_q(Z^n) − Û_q(Z^n)| ≤ √( (1/⌊n/m⌋) ln(2/δ) ). (5) [sent-120, score-0.202]
27 To go from the tail inequality (4) to the bound version (5), it suffices to make use of the elementary inequality reversal lemma (Lemma 1) provided in Section 3, used also for the bounds given below. [sent-121, score-0.368]
28 Hoeffding [11] and, later, Arcones [2] refined the previous result in the form of Bernstein-type inequalities of the form: ∀ε > 0, P_{Z^n}( |E_{Z^n} Û_q(Z^n) − Û_q(Z^n)| ≥ ε ) ≤ a exp( −⌊n/m⌋ ε² / (2ϑ_{q,m} + b_m ε) ). For Hoeffding, a = 2 and ϑ_{q,m} = Σ²_q, where Σ²_q is the variance of q(Z_1, . . . , Z_m). [sent-123, score-0.308]
29 Hence, ∀δ ∈ (0, 1], with probability at least 1 − δ: |E_{Z^n} Û_q(Z^n) − Û_q(Z^n)| ≤ √( (2Σ²_q/⌊n/m⌋) ln(2/δ) ) + (2/(3⌊n/m⌋)) ln(2/δ). (6) [sent-127, score-0.362]
30 For Arcones, a = 4 and ϑ_{q,m} = mσ²_q, where σ²_q is the variance of E_{Z_2,...,Z_m} q(Z_1, Z_2, . . . , Z_m) (this is a function of Z_1). [sent-128, score-0.05]
31 In addition, b_m = 2^{m+3} m^{m−1} + (2/3) m^{−2}. [sent-134, score-0.062]
32 ∀δ ∈ (0, 1], with probability at least 1 − δ: |E_{Z^n} Û_q(Z^n) − Û_q(Z^n)| ≤ √( (2mσ²_q/⌊n/m⌋) ln(4/δ) ) + (b_m/⌊n/m⌋) ln(4/δ). (7) [sent-135, score-0.424]
33 It provides a more accurate Bernstein-type inequality than that of Eq. (6). [sent-142, score-0.057]
34 Thus, we detail our results focusing on an empirical version of (6). [sent-150, score-0.066]
35 To illustrate how the use of the variance information provides smaller confidence intervals, consider the kernel q_m(z_1, . . . , z_m) := ∏_{i=1}^{m} z_i and two distributions D_uni and D_Ber(p). [sent-152, score-0.288]
36 D_Ber(p) is the Bernoulli distribution with parameter p ∈ [0, 1], for which Σ²_q = p^m(1 − p^m). [sent-154, score-0.062]
37 Observe that the variance information renders the bound smaller. [sent-156, score-0.083]
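The comparison of Example 1 can be reproduced numerically; the sketch below is ours, with the constants of (5) and (6) as stated above (treat them as our reading of the bounds, not a verbatim transcription):

```python
import math

def hoeffding_rhs(n, m, delta):
    # Right-hand side of (5): sqrt(ln(2/delta) / floor(n/m)).
    return math.sqrt(math.log(2 / delta) / (n // m))

def bernstein_rhs(n, m, var, delta):
    # Right-hand side of (6) with var = Sigma_q^2.
    eff = n // m  # effective number of data
    return (math.sqrt(2 * var * math.log(2 / delta) / eff)
            + 2 * math.log(2 / delta) / (3 * eff))

# D_Ber(p) example: Sigma_q^2 = p^m (1 - p^m) for the product kernel q_m.
p, m, delta = 0.5, 10, 0.05
var = p**m * (1 - p**m)
for n in (100, 1000, 10000):
    print(n, hoeffding_rhs(n, m, delta), bernstein_rhs(n, m, var, delta))
```

For small Σ²_q the Bernstein curve sits well below the Hoeffding one, as in Figure 1.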
38 We first introduce the inequality reversal lemma, which allows one to transform tail inequalities into upper bounds (or confidence intervals), as in (5)-(7). [sent-158, score-0.457]
39 Let X be a random variable and a, b > 0, c, d ≥ 0 such that ∀ε > 0, P_X(|X| ≥ ε) ≤ a exp( −bε²/(c + dε) ); (8) then, with probability at least 1 − δ, |X| ≤ √( (c/b) ln(a/δ) ) + (d/b) ln(a/δ). (9) [sent-160, score-0.362]
40 Indeed, solving (8) for ε at confidence level δ gives |X| ≤ (1/(2b))( d ln(a/δ) + √( d² ln²(a/δ) + 4bc ln(a/δ) ) ), and √(a + b) ≤ √a + √b then yields (9). [sent-163, score-0.362]
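For completeness, here is our reconstruction of the reversal step behind Lemma 1 (a sketch of the argument, not the paper's verbatim proof):

```latex
% Set the tail bound (8) equal to \delta and solve for \varepsilon:
a \exp\Big(-\frac{b\varepsilon^2}{c + d\varepsilon}\Big) = \delta
\iff b\varepsilon^2 - d \ln\tfrac{a}{\delta}\,\varepsilon - c \ln\tfrac{a}{\delta} = 0 .
% The positive root of this quadratic in \varepsilon is
\varepsilon = \frac{1}{2b}\Big( d \ln\tfrac{a}{\delta}
   + \sqrt{ d^2 \ln^2\tfrac{a}{\delta} + 4bc \ln\tfrac{a}{\delta} } \Big),
% and \sqrt{u+v} \le \sqrt{u} + \sqrt{v} then yields the bound (9):
|X| \le \sqrt{ \tfrac{c}{b}\ln\tfrac{a}{\delta} } + \tfrac{d}{b}\ln\tfrac{a}{\delta}.
```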
41 3 Empirical Bernstein Inequalities Let us now define the empirical variances we will use in our main result. [sent-165, score-0.066]
42 With probability at least 1 − δ over Z^n, |E_{Z^n} Û_q(Z^n) − Û_q(Z^n)| ≤ √( (2Σ̂²_q(Z^n)/⌊n/m⌋) ln(4/δ) ) + (5/⌊n/m⌋) ln(4/δ). (12) [sent-186, score-0.362]
43 And, also, with probability at least 1 − δ (b_m is the same as in (7)): |E_{Z^n} Û_q(Z^n) − Û_q(Z^n)| ≤ √( (2mσ̂²_q(Z^n)/⌊n/m⌋) ln(8/δ) ) + ((5√m + b_m)/⌊n/m⌋) ln(8/δ). (13) [sent-187, score-0.424]
44 We provide the proof of (12) for the upper bound of the confidence interval; the same reasoning carries over to prove the lower bound. [sent-189, score-0.075]
45 An equivalent symmetric kernel for Σ̂²_q is Q_sym: Q_sym(Z_1, . [sent-202, score-0.067]
46 This kernel is symmetric (and has range [0, 1]) and Theorem 2 can be applied to bound Σ²_q as follows: with prob. [sent-216, score-0.1]
47 at least 1 − δ, Σ²_q = E_{Z^{2m}} Q_sym(Z^{2m}) = E_{Z^n} Σ̂²_q(Z^n) ≤ Σ̂²_q(Z^n) + √( (2V(Q_sym)/⌊n/2m⌋) ln(2/δ) ) + (2/(3⌊n/2m⌋)) ln(2/δ), where V(Q_sym) is the variance of Q_sym. [sent-217, score-0.362]
48 As Q_sym has range [0, 1], V(Q_sym) = E[Q²_sym] − E²[Q_sym] ≤ E[Q²_sym] ≤ E[Q_sym] = Σ²_q, and therefore Σ²_q ≤ Σ̂²_q(Z^n) + √( (4Σ²_q/⌊n/m⌋) ln(2/δ) ) + (4/(3⌊n/m⌋)) ln(2/δ). [sent-218, score-0.362]
49 Following the approach of [13], we view this as a quadratic inequality in √(Σ²_q); taking the square root of both sides and using √(7/3) + 1 < 3 together with √(a + b) ≤ √a + √b again gives Σ_q ≤ √( Σ̂²_q(Z^n) ) + 3 √( (1/⌊n/m⌋) ln(2/δ) ). [sent-220, score-0.362]
50 (n/m) δ ˆ ˆ We now apply Theorem 2 to bound |EZ n Uq (Z n )−Uq (Z n )|, and plug in the latter equation, adjusting δ to δ/2 so the obtained inequality still holds with probability 1 − δ. [sent-221, score-0.09]
51 In addition to providing an empirical Bernstein bound for U-statistics based on arbitrary bounded kernels, our result differs from that of Maurer and Pontil [13] by the way we derive it. [sent-224, score-0.118]
52 Here, we apply the same tail inequality twice, taking advantage of the fact that estimates for the variances we are interested in are also U-statistics. [sent-225, score-0.211]
53 Maurer and Pontil use a tail inequality on self-bounded random variables and do not explicitly take advantage of the fact that the estimates they use are U-statistics. [sent-226, score-0.23]
54 3.2 Efficient Computation of the Variance Estimate for Bipartite Ranking We have just shown how empirical Bernstein inequalities can be derived for U-statistics. [sent-228, score-0.118]
55 In other words, we address the bipartite ranking problem. [sent-232, score-0.224]
56 ∀f and ∀n, the computation of Σ̂²_{q_f}(z^n) = (1/(2|A⁴_n|)) Σ_{i∈A⁴_n} ( 1_{(y_{i1}−y_{i2})(f(x_{i1})−f(x_{i2}))<0} − 1_{(y_{i3}−y_{i4})(f(x_{i3})−f(x_{i4}))<0} )² can be performed in O(n ln n). [sent-235, score-0.202]
57 We have Σ̂²_{q_f}(z^n) ∝ Σ_{i,j,k,l: i≠j≠k≠l} ( q_f(z_i, z_j) − q_f(z_k, z_l) )² = Σ_{i,j,k,l: i≠j≠k≠l} [ q²_f(z_i, z_j) − 2 q_f(z_i, z_j) q_f(z_k, z_l) + q²_f(z_k, z_l) ] = 2(n − 2)(n − 3) Σ_{i,j} q_f(z_i, z_j) − 2 Σ_{i,j,k,l: i≠j≠k≠l} q_f(z_i, z_j) q_f(z_k, z_l), since q²_f = q_f and q_f(z, z) = 0. [sent-240, score-0.202]
58 There exist efficient ways (O(n ln n)) to compute this first term, based on sorting the values of the f(x_i)'s. [sent-242, score-0.238]
59 We show how to deal with the second term, using sorting arguments as well. [sent-243, score-0.057]
60 Note that Σ_{i,j,k,l: i≠j≠k≠l} q_f(z_i, z_j) q_f(z_k, z_l) = ( Σ_{i,j} q_f(z_i, z_j) )² − 4 Σ_{i,j,k: i≠j≠k} q_f(z_i, z_j) q_f(z_i, z_k) − 2 Σ_{i,j} q²_f(z_i, z_j). (14) [sent-244, score-0.202]
61 We also have subtracted all the products q_f(z_i, z_j) q_f(z_k, z_l) where i = k and j = l, or i = l and j = k, in which case the product reduces to q²_f(z_i, z_j) (hence the factor 2) – this gives the last term. [sent-246, score-0.202]
62 Recalling that q_f(z_i, z_j) = 1_{(y_i − y_j)(f(x_i) − f(x_j)) < 0}, we observe that q_f(z_i, z_j) q_f(z_i, z_k) = 1 ⇔ y_i = −1, y_j = y_k = +1 and f(x_i) > f(x_j), f(x_k), or y_i = +1, y_j = y_k = −1 and f(x_i) < f(x_j), f(x_k). (15) [sent-248, score-0.202]
63 Let us define E⁺(i) and E⁻(i) as E⁺(i) := {j : y_j = −1, f(x_j) > f(x_i)}, and E⁻(i) := {j : y_j = +1, f(x_j) < f(x_i)}. [sent-249, score-0.1]
64 For i such that y_i = +1, κ⁺_i := |E⁺(i)| is the number of negative instances that have been scored higher than x_i by f. [sent-251, score-0.074]
65 From (15), we see that the contribution of i to the last term of (14) corresponds to the number κ⁺_i(κ⁺_i − 1) of ordered pairs of indices in E⁺(i) (similarly for κ⁻_i, with y_i = −1). [sent-252, score-0.052]
66 The cost of computing Σ̂²_{q_f} is therefore that of sorting the scores, which has cost O(n ln n). [sent-256, score-0.238]
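The whole procedure might be sketched as follows (our reconstruction: the function name, the no-ties assumption, and the normalization by |A⁴_n| are ours, following the expansion above):

```python
import numpy as np

def variance_estimate_bipartite(scores, labels):
    # O(n log n) estimate of Sigma^2 for the misranking kernel q_f.
    # Assumes no ties among scores; normalization follows our reading of (14).
    n = len(scores)
    order = np.argsort(scores)               # ascending scores
    y = np.asarray(labels)[order]
    # kappa^+ for positives: negatives scored strictly higher;
    # kappa^- for negatives: positives scored strictly lower.
    neg_above = np.cumsum((y == -1)[::-1])[::-1] - (y == -1)
    pos_below = np.cumsum(y == +1) - (y == +1)
    kappa = np.where(y == +1, neg_above, pos_below)
    S1 = 2.0 * neg_above[y == +1].sum()      # sum_{i != j} q_f(z_i, z_j)
    C = float((kappa * (kappa - 1)).sum())   # sum_i of ordered pairs in E(i)
    D = S1**2 - 4.0 * C - 2.0 * S1           # 4-tuple cross terms, cf. (14)
    return (((n - 2) * (n - 3) * S1 - D)
            / (n * (n - 1) * (n - 2) * (n - 3)))
```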
67 4 Applications and Discussion Here, we mention potential applications of the new empirical inequalities we have just presented. [sent-257, score-0.28]
68 Right: half the confidence interval of the Hoeffding bound and that of the empirical Bernstein bound as functions of n_test. [sent-264, score-0.243]
69 4.1 Test Set Bounds A direct use of the empirical Bernstein inequalities is to draw test set bounds. [sent-266, score-0.304]
70 In this scenario, a sample Z^n is split into a training set Z_train := Z^{1:n_train} of n_train data and a hold-out set Z_test := Z^{n_train+1:n} of size n_test. [sent-267, score-0.25]
71 Z_train is used to train a model f that minimizes an empirical risk based on a U-statistic-inducing loss (such as in (1) or (2)) and Z_test is used to compute a confidence interval on the expected risk of f. [sent-268, score-0.185]
72 Figure 2 displays the behavior of such test set bounds as n_test grows for the UCI banana dataset. [sent-270, score-0.189]
73 Of course, a purely linear scoring function would not make it possible to achieve good ranking accuracy, so we in fact work in the reproducing kernel Hilbert space associated with the Gaussian kernel k(x, x′) = exp(−‖x − x′‖²/2). [sent-273, score-0.242]
74 We train our scoring function on n_train = 1000 data points and evaluate the test set bound on n_test = 100, 500, 1000, 5000, 10000 data points. [sent-274, score-0.256]
75 Figure 2 (right) reports the size of half the confidence interval of the Hoeffding bound (5) and that of the empirical Bernstein bound given in (16). [sent-275, score-0.152]
76 Just as in the situation described in Example 1, the use of variance information gives rise to smaller confidence intervals, even for moderate sizes of test sets. [sent-276, score-0.108]
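A test set bound of this kind can be assembled as below (a sketch under our assumptions: the half-widths follow (5) and (12) as stated above, with m = 2; `variance_estimate_bipartite` is the hypothetical helper sketched in Section 3.2):

```python
import math

def hoeffding_halfwidth(n_test, delta=0.05, m=2):
    # Half-width of the Hoeffding interval, following (5).
    return math.sqrt(math.log(2 / delta) / (n_test // m))

def emp_bernstein_halfwidth(var_hat, n_test, delta=0.05, m=2):
    # Half-width of the empirical Bernstein interval, following (12).
    eff = n_test // m
    return (math.sqrt(2 * var_hat * math.log(4 / delta) / eff)
            + 5 * math.log(4 / delta) / eff)

# Hypothetical usage on held-out scores/labels of a trained scorer f:
# var_hat = variance_estimate_bipartite(scores_test, labels_test)
# print(hoeffding_halfwidth(len(scores_test)),
#       emp_bernstein_halfwidth(var_hat, len(scores_test)))
```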
77 4.2 Online Ranking and Empirical Racing Algorithms Another application that we would like to describe is online bipartite ranking. [sent-278, score-0.143]
78 Due to space limitations, we only provide the main ideas on how we think our empirical tail inequalities and the efficient computation of the variance estimates we propose might be particularly useful in this scenario. [sent-279, score-0.466]
79 First, let us make precise what we mean by online bipartite ranking. [sent-280, score-0.143]
80 Obviously, this means that Y = {−1, +1} and that the loss of interest is ℓ_rank. [sent-281, score-0.116]
81 In addition, it means that given a training set Z = {Z_i := (X_i, Y_i)}_{i=1}^{n}, the learning procedure will process the data of Z incrementally to give rise to hypotheses f_1, f_2, . [sent-282, score-0.095]
82 As ℓ_rank entails a kernel of order m = 2, we assume that n = 2T and that we process the data from Z pair by pair, i.e., [sent-286, score-0.126]
83 (Z_1, Z_2) are used to learn f_1, (Z_3, Z_4) and f_1 are used to learn f_2 and, more generally, (Z_{2t−1}, Z_{2t}) and f_{t−1} are used to produce f_t (there exist more clever ways to handle the data, but this goes beyond the scope of the present paper). [sent-288, score-0.12]
84 Rank-1 update formulas for inverses of matrices easily provide means to incrementally solve this problem as new data arrive (this is the main reason why we have mentioned this surrogate function). [sent-290, score-0.081]
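Such an incremental solver might rely on the Sherman-Morrison identity (a standard rank-1 formula; the regularized least-squares setup below is our assumption about how the surrogate (2) would be handled, not code from the paper):

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    # Rank-1 update: returns (A + u v^T)^{-1} given A^{-1}.
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

# Hypothetical usage: maintain (lam*I + sum_x x x^T)^{-1} as pairs arrive,
# where x could be a feature difference x_i - x_j for the surrogate (2).
d, lam = 3, 1.0
A_inv = np.eye(d) / lam
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=d)
    A_inv = sherman_morrison_update(A_inv, x, x)
```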
85 As noted in [5], a nice feature of online learning is that the expected risk of hypothesis f_t can be estimated on the n − 2t examples of Z it was not trained on. [sent-291, score-0.163]
86 Consider hypotheses f_1, . . . , f_τ; for t < τ, with probability at least 1 − δ: |R_{ℓ_rank}(f_t) − R̂_{ℓ_rank}(f_t, Z^{2t:2τ})| ≤ √( (2Σ̂²_{q_{f_t}}(Z^{2t:2τ})/(τ − t)) ln(4/δ) ) + (5/(τ − t)) ln(4/δ). [sent-295, score-0.353]
87 τ −t δ If one wants to have these confidence intervals to simultaneously hold for all t and all τ with probability 1 − δ, basic computations to calculate the number of pairs (t, τ ), with 1 ≤ t < τ ≤ n show that it suffices to adjust δ to 4δ/(n + 1)2 . [sent-296, score-0.056]
88 Hence, with probability at least 1 − δ: ∀ 1 ≤ t < τ ≤ n, |R_{ℓ_rank}(f_t) − R̂_{ℓ_rank}(f_t, Z^{2t:2τ})| ≤ √( (4Σ̂²_{q_{f_t}}(Z^{2t:2τ})/(τ − t)) ln((n + 1)/δ) ) + (10/(τ − t)) ln((n + 1)/δ). [sent-297, score-0.353]
89 This corresponds to a racing algorithm as described in [12]. [sent-300, score-0.061]
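The elimination rule of such a race might look as follows (hypothetical helper; the rule, dropping any f_t whose optimistic risk already exceeds the best pessimistic one, is the standard scheme of [12], not verbatim from the paper):

```python
def race_step(candidates, risk_hat, halfwidth):
    # Keep only hypotheses whose confidence interval overlaps the best one.
    # risk_hat[t], halfwidth[t]: empirical risk of f_t on data it was not
    # trained on, and the half-width from the simultaneous bound above.
    best_upper = min(risk_hat[t] + halfwidth[t] for t in candidates)
    return [t for t in candidates if risk_hat[t] - halfwidth[t] <= best_upper]
```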
90 Theoretically analyzing the relevance of such a race can be easily done with the results of [14], which deal with empirical Bernstein racing, but for non-U-statistics. [sent-301, score-0.086]
91 The remaining question is whether it is possible to share such structures across hypotheses to summarize the sorted lists of scores (recall that the scores are computed on the same data). [sent-304, score-0.112]
92 5 Conclusion We have proposed new empirical Bernstein inequalities designed for U-statistics. [sent-306, score-0.262]
93 They generalize the empirical inequalities of [13] and [3] while merely resulting from two applications of the same non-empirical tail inequality for U-statistics. [sent-307, score-0.449]
94 We also show how, in the bipartite ranking situation, the empirical variance can be efficiently computed. [sent-308, score-0.34]
95 We mention potential applications, with illustrative results for the case of test set bounds in the realm of bipartite ranking. [sent-309, score-0.204]
96 In addition to the possible extensions discussed in the previous section, we wonder whether it is possible to draw similar empirical inequalities for other types of rich statistics such as, e.g. [sent-310, score-0.283]
97 Obviously, we plan to work on establishing generalization bounds derived from the new concentration inequalities presented. [sent-313, score-0.231]
98 Such new bounds would be compared with those proposed in [1, 6, 7, 15] for the bipartite ranking and/or pairwise classification problems. [sent-315, score-0.259]
99 Finally, we also plan to carry out intensive simulations, in particular for the task of online ranking, to get even more insight into the relevance of our contribution. [sent-316, score-0.059]
100 Margin-based ranking and an equivalence between AdaBoost and RankBoost. [sent-411, score-0.12]
wordName wordTfidf (topN-words)
[('qf', 0.587), ('bernstein', 0.297), ('uq', 0.297), ('ez', 0.235), ('inequalities', 0.196), ('ln', 0.181), ('zi', 0.173), ('zj', 0.161), ('hoeffding', 0.159), ('qsym', 0.155), ('tail', 0.13), ('ranking', 0.12), ('zk', 0.114), ('zl', 0.104), ('bipartite', 0.104), ('ft', 0.1), ('ntest', 0.091), ('zim', 0.086), ('rank', 0.086), ('zm', 0.085), ('zn', 0.078), ('misranking', 0.069), ('ntrain', 0.069), ('empirical', 0.066), ('bm', 0.062), ('racing', 0.061), ('dence', 0.059), ('inequality', 0.057), ('sorting', 0.057), ('yi', 0.052), ('arcones', 0.052), ('dber', 0.052), ('duni', 0.052), ('joliot', 0.052), ('variance', 0.05), ('yj', 0.05), ('incrementally', 0.047), ('marseille', 0.045), ('curie', 0.042), ('banana', 0.042), ('scoring', 0.042), ('kernel', 0.04), ('online', 0.039), ('reversal', 0.039), ('maurer', 0.039), ('sur', 0.037), ('intervals', 0.036), ('rue', 0.035), ('bounds', 0.035), ('anthoine', 0.035), ('qmin', 0.035), ('sym', 0.035), ('surrogate', 0.034), ('bound', 0.033), ('pm', 0.031), ('remark', 0.03), ('pz', 0.03), ('qmax', 0.03), ('xj', 0.03), ('loss', 0.03), ('hypotheses', 0.028), ('universit', 0.028), ('lif', 0.028), ('symmetric', 0.027), ('realm', 0.026), ('france', 0.025), ('qm', 0.025), ('subtracted', 0.025), ('risk', 0.024), ('estimates', 0.024), ('scores', 0.023), ('algorithmic', 0.023), ('kernels', 0.023), ('im', 0.022), ('xi', 0.022), ('reasoning', 0.021), ('test', 0.021), ('pontil', 0.021), ('obviously', 0.021), ('draw', 0.021), ('con', 0.021), ('copies', 0.021), ('lists', 0.021), ('carries', 0.021), ('relevance', 0.02), ('interval', 0.02), ('computations', 0.02), ('rise', 0.02), ('goes', 0.02), ('szepesv', 0.019), ('bounded', 0.019), ('reading', 0.019), ('simplify', 0.019), ('april', 0.018), ('mention', 0.018), ('lemma', 0.017), ('sorted', 0.017), ('moderate', 0.017), ('yk', 0.017), ('iii', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
Author: Thomas Peel, Sandrine Anthoine, Liva Ralaivola
Abstract: We present original empirical Bernstein inequalities for U-statistics with bounded symmetric kernels q. They are expressed with respect to empirical estimates of either the variance of q or the conditional variance that appears in the Bernstein-type inequality for U-statistics derived by Arcones [2]. Our result subsumes other existing empirical Bernstein inequalities, as it reduces to them when U-statistics of order 1 are considered. In addition, it is based on a rather direct argument using two applications of the same (non-empirical) Bernstein inequality for U-statistics. We discuss potential applications of our new inequalities, especially in the realm of learning ranking/scoring functions. In the process, we exhibit an efficient procedure to compute the variance estimates for the special case of bipartite ranking that rests on a sorting argument. We also argue that our results may provide test set bounds and particularly interesting empirical racing algorithms for the problem of online learning of scoring functions. 1
2 0.12789418 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. 1
3 0.11879946 144 nips-2010-Learning Efficient Markov Networks
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
4 0.084909871 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
5 0.083522297 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
6 0.082113191 151 nips-2010-Learning from Candidate Labeling Sets
7 0.078961581 217 nips-2010-Probabilistic Multi-Task Feature Selection
8 0.077869408 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
9 0.075474799 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
10 0.072036423 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
11 0.068125226 182 nips-2010-New Adaptive Algorithms for Online Classification
12 0.067949668 243 nips-2010-Smoothness, Low Noise and Fast Rates
13 0.067114569 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
14 0.063919015 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
15 0.06323418 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
16 0.059258651 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
17 0.058311649 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
18 0.055367302 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
19 0.054004125 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
20 0.053745743 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
topicId topicWeight
[(0, 0.143), (1, 0.023), (2, 0.107), (3, -0.006), (4, -0.007), (5, 0.071), (6, -0.029), (7, -0.021), (8, -0.004), (9, 0.028), (10, -0.018), (11, -0.042), (12, -0.039), (13, 0.058), (14, -0.041), (15, 0.022), (16, 0.017), (17, -0.023), (18, 0.05), (19, 0.045), (20, 0.031), (21, -0.024), (22, -0.0), (23, -0.035), (24, 0.021), (25, 0.104), (26, -0.045), (27, -0.116), (28, 0.054), (29, 0.084), (30, -0.034), (31, -0.065), (32, -0.002), (33, -0.016), (34, 0.107), (35, 0.133), (36, 0.054), (37, 0.025), (38, -0.049), (39, 0.168), (40, -0.078), (41, 0.21), (42, 0.055), (43, 0.031), (44, -0.045), (45, -0.111), (46, -0.155), (47, 0.007), (48, 0.085), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.93079585 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
Author: Thomas Peel, Sandrine Anthoine, Liva Ralaivola
Abstract: We present original empirical Bernstein inequalities for U-statistics with bounded symmetric kernels q. They are expressed with respect to empirical estimates of either the variance of q or the conditional variance that appears in the Bernstein-type inequality for U-statistics derived by Arcones [2]. Our result subsumes other existing empirical Bernstein inequalities, as it reduces to them when U-statistics of order 1 are considered. In addition, it is based on a rather direct argument using two applications of the same (non-empirical) Bernstein inequality for U-statistics. We discuss potential applications of our new inequalities, especially in the realm of learning ranking/scoring functions. In the process, we exhibit an efficient procedure to compute the variance estimates for the special case of bipartite ranking that rests on a sorting argument. We also argue that our results may provide test set bounds and particularly interesting empirical racing algorithms for the problem of online learning of scoring functions. 1
2 0.6397531 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. 1
3 0.62405139 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
Author: Malik Magdon-Ismail
Abstract: We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based (like the maximum discrepancy) on dependent sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.
4 0.48059532 9 nips-2010-A New Probabilistic Model for Rank Aggregation
Author: Tao Qin, Xiubo Geng, Tie-yan Liu
Abstract: This paper is concerned with rank aggregation, which aims to combine multiple input rankings to get a better ranking. A popular approach to rank aggregation is based on probabilistic models on permutations, e.g., the Luce model and the Mallows model. However, these models have their limitations in either poor expressiveness or high computational complexity. To avoid these limitations, in this paper, we propose a new model, which is defined with a coset-permutation distance, and models the generation of a permutation as a stagewise process. We refer to the new model as coset-permutation distance based stagewise (CPS) model. The CPS model has rich expressiveness and can therefore be used in versatile applications, because many different permutation distances can be used to induce the coset-permutation distance. The complexity of the CPS model is low because of the stagewise decomposition of the permutation probability and the efficient computation of most coset-permutation distances. We apply the CPS model to supervised rank aggregation, derive the learning and inference algorithms, and empirically study their effectiveness and efficiency. Experiments on public datasets show that the derived algorithms based on the CPS model can achieve state-ofthe-art ranking accuracy, and are much more efficient than previous algorithms.
5 0.45596832 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
6 0.39481455 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
7 0.37569904 108 nips-2010-Graph-Valued Regression
8 0.36384735 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
9 0.35902348 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
10 0.35863313 144 nips-2010-Learning Efficient Markov Networks
11 0.34291667 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
12 0.33745542 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
13 0.3368699 129 nips-2010-Inter-time segment information sharing for non-homogeneous dynamic Bayesian networks
14 0.33644223 183 nips-2010-Non-Stochastic Bandit Slate Problems
15 0.33441246 290 nips-2010-t-logistic regression
16 0.33204705 217 nips-2010-Probabilistic Multi-Task Feature Selection
17 0.32724726 63 nips-2010-Distributed Dual Averaging In Networks
18 0.32535011 243 nips-2010-Smoothness, Low Noise and Fast Rates
19 0.32127571 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates
20 0.32104889 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
topicId topicWeight
[(13, 0.041), (17, 0.018), (27, 0.032), (30, 0.056), (45, 0.14), (50, 0.048), (52, 0.023), (60, 0.033), (77, 0.049), (78, 0.434), (90, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.74916375 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
Author: Thomas Peel, Sandrine Anthoine, Liva Ralaivola
Abstract: We present original empirical Bernstein inequalities for U-statistics with bounded symmetric kernels q. They are expressed with respect to empirical estimates of either the variance of q or the conditional variance that appears in the Bernstein-type inequality for U-statistics derived by Arcones [2]. Our result subsumes other existing empirical Bernstein inequalities, as it reduces to them when U-statistics of order 1 are considered. In addition, it is based on a rather direct argument using two applications of the same (non-empirical) Bernstein inequality for U-statistics. We discuss potential applications of our new inequalities, especially in the realm of learning ranking/scoring functions. In the process, we exhibit an efficient procedure to compute the variance estimates for the special case of bipartite ranking that rests on a sorting argument. We also argue that our results may provide test set bounds and particularly interesting empirical racing algorithms for the problem of online learning of scoring functions. 1
2 0.68794334 151 nips-2010-Learning from Candidate Labeling Sets
Author: Jie Luo, Francesco Orabona
Abstract: In many real world applications we do not have access to fully-labeled training data, but only to a list of possible labels. This is the case, e.g., when learning visual classifiers from images downloaded from the web, using just their text captions or tags as learning oracles. In general, these problems can be very difficult. However most of the time there exist different implicit sources of information, coming from the relations between instances and labels, which are usually dismissed. In this paper, we propose a semi-supervised framework to model this kind of problems. Each training sample is a bag containing multi-instances, associated with a set of candidate labeling vectors. Each labeling vector encodes the possible labels for the instances in the bag, with only one being fully correct. The use of the labeling vectors provides a principled way not to exclude any information. We propose a large margin discriminative formulation, and an efficient algorithm to solve it. Experiments conducted on artificial datasets and a real-world images and captions dataset show that our approach achieves performance comparable to an SVM trained with the ground-truth labels, and outperforms other baselines.
3 0.6618284 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
Author: Alessandro Chiuso, Gianluigi Pillonetto
Abstract: We introduce a new Bayesian nonparametric approach to identification of sparse dynamic linear systems. The impulse responses are modeled as Gaussian processes whose autocovariances encode the BIBO stability constraint, as defined by the recently introduced “Stable Spline kernel”. Sparse solutions are obtained by placing exponential hyperpriors on the scale factors of such kernels. Numerical experiments regarding estimation of ARMAX models show that this technique provides a definite advantage over a group LAR algorithm and state-of-the-art parametric identification techniques based on prediction error minimization. 1
4 0.62319314 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
Author: Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman
Abstract: We consider the problem of retrieving the database points nearest to a given hyperplane query without exhaustively scanning the database. We propose two hashingbased solutions. Our first approach maps the data to two-bit binary keys that are locality-sensitive for the angle between the hyperplane normal and a database point. Our second approach embeds the data into a vector space where the Euclidean norm reflects the desired distance between the original points and hyperplane query. Both use hashing to retrieve near points in sub-linear time. Our first method’s preprocessing stage is more efficient, while the second has stronger accuracy guarantees. We apply both to pool-based active learning: taking the current hyperplane classifier as a query, our algorithm identifies those points (approximately) satisfying the well-known minimal distance-to-hyperplane selection criterion. We empirically demonstrate our methods’ tradeoffs, and show that they make it practical to perform active selection with millions of unlabeled points. 1
5 0.57864118 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
6 0.5273065 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers
7 0.5041576 25 nips-2010-Active Learning by Querying Informative and Representative Examples
8 0.49115905 36 nips-2010-Avoiding False Positive in Multi-Instance Learning
9 0.46846628 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
10 0.46740365 23 nips-2010-Active Instance Sampling via Matrix Partition
11 0.46623844 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
12 0.4619346 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio
13 0.4464353 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation
14 0.43587655 22 nips-2010-Active Estimation of F-Measures
15 0.43511885 282 nips-2010-Variable margin losses for classifier design
16 0.43472254 243 nips-2010-Smoothness, Low Noise and Fast Rates
17 0.43230897 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
18 0.42793858 1 nips-2010-(RF)^2 -- Random Forest Random Field
19 0.42680204 27 nips-2010-Agnostic Active Learning Without Constraints
20 0.42610252 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision