nips nips2009 nips2009-3 knowledge-graph by maker-knowledge-mining

3 nips-2009-AUC optimization and the two-sample problem


Source: pdf

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençon

Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the one-dimensional framework. We show that the learning step of the procedure affects neither the consistency of the test nor its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are finally displayed in order to show the efficiency of the method.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 AUC optimization and the two-sample problem Stéphan Clémençon, Telecom Paristech (TSI) - LTCI UMR Institut Telecom/CNRS 5141, stephan. [sent-1, score-0.116]

2 Abstract. The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. [sent-8, score-0.313]

3 From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. [sent-10, score-0.275]

4 A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. [sent-11, score-0.186]

5 Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. [sent-12, score-0.238]

6 We show that the learning step of the procedure affects neither the consistency of the test nor its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. [sent-14, score-0.24]

7 1 Introduction The statistical problem of testing homogeneity of two samples arises in a wide variety of applications, ranging from bioinformatics to psychometrics through database attribute matching for instance. [sent-16, score-0.288]

8 Practitioners may rely upon a wide range of nonparametric tests for detecting differences in distribution (or location) between two one-dimensional samples, among which tests based on linear rank statistics, such as the celebrated Mann-Whitney Wilcoxon test. [sent-17, score-0.166]

9 Being a (locally) optimal procedure, the latter is the most widely used in homogeneity testing. [sent-18, score-0.21]

10 Such rank statistics were originally introduced because they are distribution-free under the null hypothesis, thus permitting one to set critical values in a non-asymptotic fashion for any given level. [sent-19, score-0.141]

11 Beyond this simple fact, the crucial advantage of rank-based tests lies in their asymptotic efficiency in a variety of nonparametric situations. [sent-20, score-0.103]

12 We refer for instance to [15] for an account of asymptotically (locally) uniformly most powerful tests and a comprehensive treatment of asymptotic optimality of R-statistics. [sent-21, score-0.139]

13 The problem of ranking such data, also known as the bipartite ranking problem, has recently gained increasing attention in the machine-learning literature, see [5, 10, 19]. [sent-23, score-0.349]

14 Here, the goal is to learn, based on a pooled set of labeled examples, how to rank novel data with unknown labels, by means of a scoring function s : X → R, so that positive instances appear at the top of the list. [sent-24, score-0.255]
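To make this step concrete, here is a minimal sketch of one way to learn such a scoring function: a linear scorer s(x) = w·x trained by gradient descent on a pairwise logistic surrogate of the AUC. This is an illustration only, not the method of any of the cited references; the name fit_linear_scorer and all hyperparameters are our own choices.

```python
import numpy as np

def fit_linear_scorer(X_pos, X_neg, n_iter=300, lr=0.5):
    """Learn s(x) = w.x by minimizing the pairwise logistic surrogate of
    1 - AUC: the mean over pairs (i, j) of log(1 + exp(-(s_i^+ - s_j^-)))."""
    w = np.zeros(X_pos.shape[1])
    for _ in range(n_iter):
        sp, sn = X_pos @ w, X_neg @ w
        margins = sp[:, None] - sn[None, :]              # (n, m) pairwise margins
        g = 1.0 / (1.0 + np.exp(np.clip(margins, -30, 30)))  # sigmoid(-margin)
        # gradient of the mean pairwise loss with respect to w
        grad = (g.sum(axis=0) @ X_neg - g.sum(axis=1) @ X_pos) / g.size
        w -= lr * grad
    return w

# toy usage: positives shifted away from negatives in R^4
rng = np.random.default_rng(0)
w = fit_linear_scorer(rng.normal(0.5, 1, (100, 4)), rng.normal(0, 1, (100, 4)))
```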

15 Over the last few years, this global learning problem has been the subject of intensive research, involving issues related to the design of appropriate criteria reflecting ranking performance or valid extensions of the Empirical Risk Minimization approach (ERM) to this framework [2, 6, 11]. [sent-25, score-0.153]

16 In most applications, however, the gold standard for measuring the capacity of a scoring function s to discriminate between the class populations remains the area under the ROC curve (AUC) criterion, and most ranking/scoring methods boil down to maximizing its empirical counterpart. [sent-26, score-0.351]

17 The empirical AUC may be viewed as the Mann-Whitney statistic based on the images of the multivariate samples by s, see [13, 9, 12, 18]. [sent-27, score-0.19]

18 The purpose of this paper is to investigate how ranking methods for multivariate data with binary labels may be exploited in order to extend the rank-based test approach for testing homogeneity between two samples to a multidimensional setting. [sent-28, score-0.565]

19 More precisely, the testing principle promoted in this paper is described through an extension of the Mann-Whitney Wilcoxon test, based on a preliminary ranking of the data through empirical AUC maximization. [sent-29, score-0.33]

20 The consistency of the test is proved to hold as soon as the learning procedure is consistent in the AUC sense, and its capacity to detect "small" deviations from the homogeneity assumption is illustrated by a simulation example. [sent-30, score-0.321]

21 In Section 2, the homogeneity testing problem is formulated and standard approaches are recalled, with focus on the one-dimensional case. [sent-32, score-0.288]

22 In Section 4, we describe the proposed testing procedure and set preliminary grounds for its theoretical validity. [sent-34, score-0.11]

23 In particular, special attention is paid to the one-dimensional case, for which two-sample linear rank statistics allow for constructing locally optimal tests in a variety of situations. [sent-38, score-0.127]

24 The problem considered in this paper is to test the hypothesis that two independent i.i.d. samples X_1^+, …, X_n^+ and X_1^-, …, X_m^- are drawn from a common distribution. [sent-40, score-0.092]

25 The testing problem is tackled here from a nonparametric perspective, meaning that the distributions G(dx) and H(dx) are assumed to be unknown. [sent-52, score-0.133]

26 We suppose in addition that G(dx) and H(dx) are continuous distributions and the asymptotics are described as follows: we set N = m + n and suppose that n/N → p ∈ (0, 1) as n, m tend to infinity. [sent-53, score-0.091]

27 Formally, the problem is to test the null hypothesis H0 : G = H against the alternative H1 : G ≠ H, based on the two data sets. [sent-54, score-0.183]

28 A possible approach is to consider a probability (pseudo-)metric D on the space of probability distributions on R^d. [sent-57, score-0.055]

29 Based on the simple observation that D(G, H) = 0 under the null hypothesis, possible testing procedures consist of computing estimates Gn and Hm of the underlying distributions and rejecting H0 for "large" values of the statistic D(Gn, Hm), see [3] for instance. [sent-58, score-0.341]
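In dimension one, the Kolmogorov-Smirnov two-sample test is the classical instance of this plug-in principle, with D the sup-norm between empirical cdfs; a minimal illustration on toy data of our own:

```python
import numpy as np
from scipy.stats import ks_2samp

# plug-in principle in dimension 1: estimate both cdfs and reject H0 for
# large values of D(Gn, Hm) = sup_t |Gn(t) - Hm(t)| (Kolmogorov-Smirnov)
rng = np.random.default_rng(5)
stat, pval = ks_2samp(rng.normal(0.0, 1.0, 200), rng.normal(0.3, 1.0, 200))
print(stat, pval)       # reject H0 at level alpha when pval < alpha
```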

30 Beyond computational difficulties and the necessity of identifying a proper standardization in order to make the statistic asymptotically pivotal (i.e., with a parameter-free limiting distribution). [sent-59, score-0.15]

31 Indeed, plug-in procedures involve the consistent estimation of distributions on a feature space of possibly very large dimension d ∈ N∗ . [sent-62, score-0.088]

32 Various metrics or pseudo-metrics can be considered for measuring dissimilarity between two probability distributions. [sent-63, score-0.106]

33 The view promoted in the present paper for the two-sample problem is very different in nature and is inspired from traditional procedures in the particular one-dimensional case. [sent-67, score-0.064]

34 A classical approach to the two-sample problem in the one-dimensional setup lies in ordering the observed data using the natural order on the real line R and then basing the decision on the ranks R_i of the positive instances X_i^+, i ∈ {1, …, n}, among the pooled sample. [sent-69, score-0.087]

35 This approach is grounded in invariance considerations, practical simplicity and optimality of tests based on R-estimates for this problem, depending on the class of alternative hypotheses considered. [sent-73, score-0.093]

36 Assuming the distributions G and H continuous, the idea underlying such tests lies in the simple fact that, under the null hypothesis, the ranks of positive instances are uniformly distributed over {1, …, N}. [sent-74, score-0.237]
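A quick Monte-Carlo check of this fact (our own illustration, with both samples drawn from the same N(0, 1) so that G = H):

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(1)
n, m, n_rep = 5, 5, 20000
counts = np.zeros(n + m)
for _ in range(n_rep):
    pooled = rng.normal(size=n + m)          # H0: both samples ~ N(0, 1)
    for r in rankdata(pooled)[:n]:           # ranks of the n "positive" points
        counts[int(r) - 1] += 1
print(counts / (n_rep * n))                  # each entry ~ 1/N = 0.1
```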

37 A popular choice is to consider the sum of "positive ranks", leading to the well-known rank-sum Wilcoxon statistic [22], W_{n,m} = Σ_{i=1}^{n} R_i, which is distribution-free under H0, see Section 6. [sent-78, score-0.114]
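Computationally the statistic is a one-liner; a sketch under our own naming:

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_rank_sum(x_pos, x_neg):
    """W_{n,m}: sum of the (average, hence tie-robust) ranks of the
    positive sample within the pooled sample."""
    pooled = np.concatenate([x_pos, x_neg])
    return rankdata(pooled)[: len(x_pos)].sum()
```

scipy.stats.ranksums(x_pos, x_neg) applies the large-sample normal approximation to the same statistic; the exact null tables mentioned below avoid that approximation for small n and m.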

38 We also recall that the validity framework of the rank-sum test classically extends to the case where some observations are tied (i.e., equal). [sent-80, score-0.072]

39 We shall denote by 𝒲_{n,m} the distribution of the (average-rank version of the) Wilcoxon statistic W_{n,m} under the homogeneity hypothesis. [sent-83, score-0.391]

40 Since tables for the distributions 𝒲_{n,m} are available, no asymptotic approximation result is needed for building a test of appropriate level. [sent-84, score-0.142]

41 As will be recalled below, the test based on the R-statistic W_{n,m} has appealing optimality properties for certain classes of alternatives. [sent-85, score-0.094]

42 Linear rank statistics (i.e., functions of the R_i's) form a very rich collection of statistics but, for lack of space, we restrict our attention to the two-sample Wilcoxon statistic in this paper. [sent-88, score-0.138]

43 We may now give a first insight into the way we shall tackle the problem in the multidimensional case. [sent-90, score-0.064]

44 Now that the dimension of the problem has been brought down to 1, observations can be ranked and one may perform, for instance, a basic two-sample Wilcoxon test based on the data sets s(X_1^+), …, s(X_n^+) and s(X_1^-), …, s(X_m^-). [sent-92, score-0.072]

45 We point out that this is precisely the task Linear Discriminant Analysis (LDA) tries to perform, albeit in a restrictive Gaussian framework (namely, when G and H are normal distributions with the same covariance structure). [sent-100, score-0.055]

46 As this concept is at the heart of the ranking issue in the binary setting, which forms the first stage of the testing procedure sketched above, we recall its definition precisely. [sent-104, score-0.294]

47 The ROC curve related to the distributions g(dt) and h(dt) is the graph of the mapping ROC((g, h), ·) : α ∈ [0, 1] ↦ 1 − g ∘ h^{-1}(1 − α), denoting by f^{-1}(u) = inf{t ∈ R : f(t) ≥ u} the generalized inverse of any càdlàg function f : R → R. [sent-106, score-0.124]

48 When the distributions g(dt) and h(dt) are continuous, it can alternatively be defined as the parametric curve t ∈ R ↦ (1 − h(t), 1 − g(t)). [sent-107, score-0.124]
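The parametric form translates directly into code; a minimal sketch with empirical cdfs in place of g and h (the function name and conventions are ours; positives are assumed to score high):

```python
import numpy as np

def empirical_roc(s_pos, s_neg):
    """Empirical ROC as the parametric curve t -> (1 - h(t), 1 - g(t)),
    with g, h the empirical cdfs of the positive/negative scores."""
    ts = np.sort(np.concatenate([s_pos, s_neg]))
    h = np.searchsorted(np.sort(s_neg), ts, side="right") / len(s_neg)
    g = np.searchsorted(np.sort(s_pos), ts, side="right") / len(s_pos)
    return 1.0 - h, 1.0 - g    # (false positive rate, true positive rate)
```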

49 The notion of ROC curve provides a functional measure of dissimilarity between distributions on R: the closer to the corners of the unit square the curve ROC ((g, h), ·) is, the more dissimilar the distributions g and h are. [sent-113, score-0.305]

50 For instance, it exactly coincides with the upper left-hand corner of the unit square, namely the curve α ∈ [0, 1] ↦ I{α ∈ ]0, 1]}, when there exists l ∈ R such that the support of the distribution g(dt) is a subset of [l, ∞[, while ]−∞, l] contains the support of h. [sent-114, score-0.069]

51 Hence, the distance of ROC((g, h), ·) to the diagonal may naturally be used to quantify departure from the homogeneous situation. [sent-116, score-0.089]

52 The L1-norm provides a convenient way of measuring such a distance, leading to the classical AUC criterion (AUC standing for area under the ROC curve): AUC(g, h) = ∫_0^1 ROC((g, h), α) dα. [sent-117, score-0.107]

53 Proposition 1 Let g and h be two distributions on R. [sent-120, score-0.055]

54 We recall that the homogeneous situation corresponds to the case where AUC(g, h) = 1/2 and that the Mann-Whitney statistic [16], U_{n,m} = (1/(nm)) Σ_{i=1}^{n} Σ_{j=1}^{m} [ I{X_j^- < X_i^+} + (1/2) I{X_j^- = X_i^+} ], is exactly the empirical counterpart of AUC(g, h). [sent-122, score-0.239]
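The identity between U_{n,m} and the empirical AUC is easy to check numerically; the sketch below (our own toy check) also uses the standard relation U = (W − n(n+1)/2)/(nm) linking it to the rank-sum:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(2)
x_pos = rng.normal(1.0, 1.0, 200)
x_neg = rng.normal(0.0, 1.0, 300)
n, m = len(x_pos), len(x_neg)

# brute-force pairwise version of U_{n,m} (ties get weight 1/2)
u = (np.mean(x_neg[None, :] < x_pos[:, None])
     + 0.5 * np.mean(x_neg[None, :] == x_pos[:, None]))

# same number from the rank-sum: U = (W - n(n+1)/2) / (nm)
w_stat = rankdata(np.concatenate([x_pos, x_neg]))[:n].sum()
assert np.isclose(u, (w_stat - n * (n + 1) / 2) / (n * m))
print(u)    # the empirical AUC of the identity scoring function
```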

55 For this reason, the related test of hypotheses is called the Mann-Whitney Wilcoxon (MWW) test. [sent-124, score-0.094]

56 In the multivariate setup, the notion of ROC curve can be extended in the following way. [sent-126, score-0.109]

57 Let H(dx) and G(dx) be two given distributions on R^d and S = {s : X → R | s Borel measurable}. [sent-127, score-0.055]

58 For any scoring function s ∈ S, we denote by H_s(dt) and G_s(dt) the images of H(dx) and G(dx) under the mapping s(x). [sent-128, score-0.186]

59 Clearly, the families of univariate distributions {Gs }s∈S and {Hs }s∈S entirely characterize the multivariate probability measures G and H. [sent-132, score-0.095]

60 One may thus consider evaluating the dissimilarity between H(dx) and G(dx) on R^d through the family of curves {ROC(s, ·) : s ∈ S}. [sent-133, score-0.101]

61 Going back to the homogeneity testing problem, the null assumption may be reformulated as "H0 : ∀s ∈ S, AUC(s) = 1/2" versus "H1 : ∃s ∈ S such that AUC(s) > 1/2". [sent-135, score-0.349]

62 The set S* = {T ∘ φ | T : R → R strictly increasing} defines the collection of optimal scoring functions, in the sense that ∀s ∈ S, ∀α ∈ [0, 1], ROC(s, α) ≤ ROC*(α) and AUC(s) ≤ AUC*, with the notations ROC*(·) and AUC* for the optimal ROC curve and its AUC. [sent-139, score-0.186]

63 Notice that, as dG/dH(X) = (dG_φ/dH_φ)(φ(X)), replacing X by s*(X) with s* ∈ S* leaves the optimal ROC curve untouched. [sent-143, score-0.069]

64 Consequently, the homogeneity testing problem may be seen as closely related to the problem of estimating the optimal AUC*, since it may be re-formulated as follows: "H0 : AUC* = 1/2" versus "H1 : AUC* > 1/2". [sent-146, score-0.288]

65 Notice that, for any s* ∈ S*, the scoring function S* = F_{s*} ∘ s* is still optimal and the score variable S*(X) is uniformly distributed on [0, 1] under the mixture distribution F (in addition, it may easily be shown to be independent of the choice of s* ∈ S*). [sent-149, score-0.186]

66 Observe in addition that AUC* − 1/2 may be viewed as the Earth Mover's distance between the class distributions H_{S*} and G_{S*} for this "normalization": AUC* − 1/2 = ∫_0^1 {H_{S*}(t) − G_{S*}(t)} dt. [sent-150, score-0.084]

67 A natural way of inferring the value of AUC* and/or selecting a scoring function ŝ with AUC nearly as large as AUC* is to maximize an empirical version of the AUC criterion over a set S_0 of candidate scoring functions. [sent-152, score-0.408]

68 We recall that, under such assumptions, universal consistency results have been established for empirical AUC maximizers, together with distribution-free generalization bounds, see [2, 6] for instance. [sent-154, score-0.124]

69 We point out that this approach can be extended to other relevant ranking criteria. [sent-155, score-0.153]

70 The contours of a theory guaranteeing the statistical performance of the ERM approach for empirical risk functionals defined by R-estimates have been sketched in [8]. [sent-156, score-0.102]

71 4 The two-stage testing procedure. Assume that data have been split into two subsamples: the first data set D_{n0,m0} = {X_1^+, …, X_{n0}^+, X_1^-, …, X_{m0}^-} will be used for deriving a scoring function on X, and the second data set D_{n1,m1} = {X_{n0+1}^+, …, X_{n0+n1}^+, X_{m0+1}^-, …, X_{m0+m1}^-} will serve to compute a pseudo two-sample Wilcoxon test statistic from the ranked data. [sent-157, score-0.078] [sent-163, score-0.186] [sent-169, score-0.186]

74 We set N_0 = n_0 + m_0 and N_1 = n_1 + m_1, and suppose that n_i/N_i → p as n_i and m_i tend to infinity, for i ∈ {0, 1}. [sent-170, score-0.2]

75 The testing procedure at level α is then performed in two steps, as follows. [sent-172, score-0.078]

76 From dataset D_{n0,m0}, perform empirical AUC maximization over S_0 ⊂ S, yielding the scoring function ŝ(x) = ŝ_{n0,m0}(x). [sent-175, score-0.248]

77 Compute the ranks of the positively labeled data among the sample D_{n1,m1}, once sorted by increasing order of magnitude of their score: R̂_i = N_1 · Ŝ(X_{n0+i}^+) for 1 ≤ i ≤ n_1, where F̂_ŝ(t) = N_1^{-1} ( Σ_{i=1}^{n_1} I{ŝ(X_{n0+i}^+) ≤ t} + Σ_{j=1}^{m_1} I{ŝ(X_{m0+j}^-) ≤ t} ) and Ŝ = F̂_ŝ ∘ ŝ. [sent-176, score-0.058]
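In words, N_1 · Ŝ(x) is just the rank of ŝ(x) among the N_1 scored points of the second half-sample, as this toy check (our own) illustrates:

```python
import numpy as np
from scipy.stats import rankdata

# N1 * S_hat(x) equals the rank of s_hat(x) among all N1 scored points
s_all = np.array([0.3, 1.2, -0.7, 0.9, 0.1])   # scores s_hat on D_{n1,m1}

def F_hat(t):
    return np.mean(s_all <= t)                  # pooled empirical cdf

assert all(len(s_all) * F_hat(t) == r
           for t, r in zip(s_all, rankdata(s_all)))
```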

78 Reject the homogeneity hypothesis H0 when Ŵ_{n1,m1} ≥ Q_{n1,m1}(α), where Ŵ_{n1,m1} = Σ_{i=1}^{n_1} R̂_i and Q_{n1,m1}(α) denotes the (1 − α)-quantile of the distribution 𝒲_{n1,m1}. [sent-179, score-0.255]
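Putting the steps together, a hedged end-to-end sketch under stated assumptions: the scoring stage uses the Fisher/LDA direction as a stand-in for empirical AUC maximization over S_0 (the paper itself notes that LDA performs this task in a restrictive Gaussian framework), and scipy.stats.mannwhitneyu plays the role of the score-based rank-sum test; the function two_stage_test and all defaults are our own.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def two_stage_test(X_pos, X_neg, alpha=0.05, seed=0):
    """Two-stage homogeneity test: (1) learn a scoring rule on one
    half-sample, (2) one-sided MWW test on the scores of the other half."""
    rng = np.random.default_rng(seed)
    ip, im = rng.permutation(len(X_pos)), rng.permutation(len(X_neg))
    n0, m0 = len(X_pos) // 2, len(X_neg) // 2
    Xp0, Xn0 = X_pos[ip[:n0]], X_neg[im[:m0]]
    # stage 1: linear scoring rule s(x) = w.x (LDA direction as a stand-in
    # for any AUC-consistent ranking method)
    pooled = np.vstack([Xp0 - Xp0.mean(0), Xn0 - Xn0.mean(0)])
    w = np.linalg.solve(np.cov(pooled.T), Xp0.mean(0) - Xn0.mean(0))
    # stage 2: rank the scored second half-sample and test
    res = mannwhitneyu(X_pos[ip[n0:]] @ w, X_neg[im[m0:]] @ w,
                       alternative="greater")
    return res.pvalue < alpha, res.pvalue
```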

79 The next result shows that the learning step does not affect the consistency property, provided it outputs a universally consistent scoring rule. [sent-180, score-0.273]

80 Theorem 2 Let α ∈ (0, 1/2) and suppose that the ranking/scoring method involved at step 1 yields a universally consistent scoring rule ŝ in the AUC sense. [sent-181, score-0.233]

81 The score-based rank-sum Wilcoxon test Φ = I{Ŵ_{n1,m1} ≥ Q_{n1,m1}(α)} is universally consistent, as n_i and m_i tend to ∞ for i ∈ {0, 1}, at level α, in the following sense. [sent-182, score-0.237]

82 It is of level α for all ni and mi , i ∈ {0, 1}: P(H,H) {Φ = +1} ≤ α for any H(dx). [sent-184, score-0.107]

83 Its power converges to 1 as n_i and m_i, i ∈ {0, 1}, tend to infinity under every alternative: lim_{n_i, m_i → ∞} P_{(G,H)}{Φ = +1} = 1 for every pair of distinct distributions (G, H). [sent-186, score-0.248]

84 Under adequate complexity assumptions on the set S_0 over which empirical AUC maximization or one of its variants is performed, distribution-free rate bounds for the generalization ability of scoring rules may be established in terms of AUC, see Corollary 6 in [2] or Corollary 3 in [6]. [sent-188, score-0.248]

85 We point out that the method presented here is by no means restricted to the case where X is of finite dimension, but may be applied to functional input data, provided an AUC-consistent ranking procedure can be applied in this context. [sent-192, score-0.153]

86 Samples of equal size m = n = N/2 have been generated from two conditional 4-dimensional Gaussian distributions on the hypercube [−2, 2]^4. [sent-199, score-0.055]

87 In the second example, we test homogeneity under an alternative "fairly far" from H0, where µ− = µ1, µ+ = (0. [sent-217, score-0.257]

88 2 gives Monte-Carlo estimates of the power of three testing procedures when α = 0. [sent-244, score-0.111]

89 The increasing difficulty of the testing problems considered is controlled through the Euclidean distance between the means, ∆µ = ||µ+ − µ−||, and is described by Fig. [sent-252, score-0.107]
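A Monte-Carlo power curve in the same spirit can be sketched by reusing the two_stage_test function from Section 4 above (an illustrative setup of our own; the exact mixture, the truncation to the hypercube, and the competing tests of the paper are not reproduced):

```python
import numpy as np

def power_estimate(delta_mu, N=200, n_rep=500, alpha=0.05, seed=3):
    """Fraction of rejections of the two-stage test (sketched above) for
    two 4-d Gaussians whose means differ by delta_mu on one coordinate."""
    rng = np.random.default_rng(seed)
    mu = np.array([delta_mu, 0.0, 0.0, 0.0])
    rejections = 0
    for _ in range(n_rep):
        X_pos = rng.normal(0.0, 1.0, (N // 2, 4)) + mu
        X_neg = rng.normal(0.0, 1.0, (N // 2, 4))
        rejections += two_stage_test(X_pos, X_neg, alpha=alpha)[0]
    return rejections / n_rep

print([power_estimate(d) for d in (0.0, 0.2, 0.5)])  # ~alpha at d = 0
```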

90 6 Conclusion. We have provided a sound strategy, involving a preliminary bipartite ranking stage, for extending classical rank-based approaches to homogeneity testing to a multidimensional setup. [sent-262, score-0.611]

91 Consistency of the extended version of the popular MWW test has been established, under the assumption of universal consistency of the ranking method in the AUC sense. [sent-263, score-0.263]

92 This principle can be applied to other R-statistics, standing as natural criteria for the bipartite ranking problem [8]. [sent-264, score-0.243]

93 Beyond the illustrative preliminary simulation example displayed in this paper, we intend to investigate the relative efficiency of such tests with respect to other tests standing as natural candidates in this setup. [sent-265, score-0.231]

94 Appendix - Proof of Theorem 2. Observe that, conditioned upon the first sample D_{n0,m0}, the statistic Ŵ_{n1,m1} is distributed according to 𝒲_{n1,m1} under the null hypothesis. [sent-266, score-0.175]

95 In particular, for any pair of distributions (G, H), the centered random variable √N {U_{n1,m1}(s) − AUC(s)} is asymptotically normal with limit variance σ_s^2(G, H) = Var(H_s(s(X_1^+)))/p + Var(G_s(s(X_1^-)))/(1 − p) under P_{(G,H)}. [sent-278, score-0.127]
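The asymptotic normality claim can be probed numerically (our own sketch, for the identity scoring function on R, where the true AUC of N(1, 1) versus N(0, 1) is Φ(1/√2)):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n = m = 150
true_auc = norm.cdf(1 / np.sqrt(2))   # P(X^- < X^+), X^+ ~ N(1,1), X^- ~ N(0,1)
draws = []
for _ in range(2000):
    xp, xn = rng.normal(1, 1, n), rng.normal(0, 1, m)
    u = np.mean(xn[None, :] < xp[:, None])     # two-sample U-statistic
    draws.append(np.sqrt(n + m) * (u - true_auc))
draws = np.asarray(draws)
print(draws.mean(), draws.std())      # ~0 and ~sigma_s(G, H)
```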

96 We now place ourselves under an alternative hypothesis described by a pair of distinct distributions (G, H), so that AUC* > 1/2. [sent-282, score-0.075]

97 Now, the fact that the type II error of Φ converges to zero as n_i and m_i tend to ∞ for i ∈ {0, 1} follows immediately from the assumed universal consistency of ŝ(x) in the AUC sense, the CLT for two-sample U-statistics, and the dominated convergence theorem. [sent-285, score-0.206]

98 On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. [sent-313, score-0.154]

99 Learning decision trees using the area under the ROC curve. [sent-377, score-0.383]

100 On a test of whether one of two random variables is stochastically larger than the other. [sent-405, score-0.073]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('auc', 0.664), ('roc', 0.349), ('homogeneity', 0.21), ('scoring', 0.186), ('wilcoxon', 0.164), ('mww', 0.155), ('ranking', 0.153), ('dx', 0.139), ('hs', 0.137), ('gs', 0.118), ('mencon', 0.116), ('statistic', 0.114), ('dt', 0.104), ('testing', 0.078), ('cl', 0.075), ('curve', 0.069), ('tests', 0.063), ('null', 0.061), ('ranks', 0.058), ('sups', 0.058), ('ni', 0.057), ('dissimilarity', 0.057), ('distributions', 0.055), ('fs', 0.052), ('mi', 0.05), ('rd', 0.048), ('test', 0.047), ('gn', 0.047), ('ank', 0.047), ('mmd', 0.047), ('recalled', 0.047), ('standing', 0.047), ('universally', 0.047), ('hypothesis', 0.045), ('curves', 0.044), ('umr', 0.044), ('bipartite', 0.043), ('multivariate', 0.04), ('rank', 0.04), ('consistency', 0.04), ('asymptotic', 0.04), ('clt', 0.039), ('dmom', 0.039), ('ltci', 0.039), ('telecom', 0.039), ('tsi', 0.039), ('sketched', 0.038), ('multidimensional', 0.037), ('asymptotically', 0.036), ('tend', 0.036), ('empirical', 0.036), ('hm', 0.034), ('diagonals', 0.034), ('kfda', 0.034), ('area', 0.034), ('ri', 0.033), ('situation', 0.033), ('remark', 0.033), ('procedures', 0.033), ('preliminary', 0.032), ('homogeneous', 0.031), ('rbf', 0.031), ('paristech', 0.031), ('meir', 0.031), ('promoted', 0.031), ('alternative', 0.03), ('departure', 0.029), ('erm', 0.029), ('pooled', 0.029), ('distance', 0.029), ('risk', 0.028), ('auer', 0.027), ('institut', 0.027), ('ree', 0.027), ('eventually', 0.027), ('shall', 0.027), ('cortes', 0.026), ('lugosi', 0.026), ('mann', 0.026), ('stochastically', 0.026), ('student', 0.026), ('supremum', 0.026), ('measuring', 0.026), ('displayed', 0.026), ('maximization', 0.026), ('recall', 0.025), ('discriminant', 0.025), ('corollary', 0.025), ('xj', 0.025), ('ranked', 0.025), ('lecture', 0.024), ('soon', 0.024), ('bootstrap', 0.024), ('rich', 0.024), ('locally', 0.024), ('xi', 0.023), ('notes', 0.023), ('op', 0.023), ('metrics', 0.023), ('universal', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençon

Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the one-dimensional framework. We show that the learning step of the procedure affects neither the consistency of the test nor its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are finally displayed in order to show the efficiency of the method.

2 0.13567089 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

Author: Manqi Zhao, Venkatesh Saligrama

Abstract: We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. Anomalies are declared whenever the score of a test sample falls below α, which is supposed to be the desired false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the specified false alarm level, α, for the case when the anomaly density is a mixture of the nominal and a known density. Our algorithm is computationally efficient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artificial and real data sets in high dimensional feature spaces.

3 0.12373457 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur

Abstract: A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of χ2 random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.

4 0.091542378 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank

Author: Wei Chen, Tie-yan Liu, Yanyan Lan, Zhi-ming Ma, Hang Li

Abstract: Learning to rank has become an important research topic in machine learning. While most learning-to-rank methods learn the ranking functions by minimizing loss functions, it is the ranking measures (such as NDCG and MAP) that are used to evaluate the performance of the learned ranking functions. In this work, we reveal the relationship between ranking measures and loss functions in learningto-rank methods, such as Ranking SVM, RankBoost, RankNet, and ListMLE. We show that the loss functions of these methods are upper bounds of the measurebased ranking errors. As a result, the minimization of these loss functions will lead to the maximization of the ranking measures. The key to obtaining this result is to model ranking as a sequence of classification tasks, and define a so-called essential loss for ranking as the weighted sum of the classification errors of individual tasks in the sequence. We have proved that the essential loss is both an upper bound of the measure-based ranking errors, and a lower bound of the loss functions in the aforementioned methods. Our proof technique also suggests a way to modify existing loss functions to make them tighter bounds of the measure-based ranking errors. Experimental results on benchmark datasets show that the modifications can lead to better ranking performances, demonstrating the correctness of our theoretical analysis. 1

5 0.081964605 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the model comparison. We require identifiability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. The effectiveness of our approach is demonstrated through extensive experiments on artificial and biological data showing that our approach outperform a number of state of the art methods. 1

6 0.081047468 230 nips-2009-Statistical Consistency of Top-k Ranking

7 0.080516346 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

8 0.078492858 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

9 0.071162216 87 nips-2009-Exponential Family Graph Matching and Ranking

10 0.067375608 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

11 0.067228518 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

12 0.065631777 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

13 0.061963547 190 nips-2009-Polynomial Semantic Indexing

14 0.061183412 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

15 0.060998604 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

16 0.059954785 161 nips-2009-Nash Equilibria of Static Prediction Games

17 0.056768693 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

18 0.053638406 136 nips-2009-Learning to Rank by Optimizing NDCG Measure

19 0.053600527 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

20 0.0534526 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.156), (1, 0.06), (2, -0.029), (3, 0.018), (4, -0.007), (5, -0.08), (6, -0.109), (7, -0.033), (8, -0.033), (9, -0.082), (10, 0.057), (11, -0.025), (12, -0.004), (13, -0.063), (14, -0.018), (15, -0.026), (16, -0.028), (17, 0.069), (18, 0.04), (19, 0.008), (20, -0.045), (21, 0.084), (22, -0.068), (23, -0.068), (24, -0.164), (25, 0.128), (26, 0.025), (27, 0.03), (28, -0.001), (29, 0.019), (30, 0.02), (31, -0.058), (32, 0.124), (33, 0.045), (34, 0.052), (35, 0.066), (36, -0.082), (37, -0.17), (38, 0.025), (39, -0.155), (40, -0.104), (41, -0.074), (42, 0.21), (43, -0.044), (44, -0.074), (45, -0.066), (46, -0.148), (47, -0.022), (48, 0.059), (49, -0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95470113 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon

Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework. We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method. 1

2 0.72894061 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

Author: Manqi Zhao, Venkatesh Saligrama

Abstract: We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. Anomalies are declared whenever the score of a test sample falls below α, which is supposed to be the desired false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the specified false alarm level, α, for the case when the anomaly density is a mixture of the nominal and a known density. Our algorithm is computationally efficient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artificial and real data sets in high dimensional feature spaces.

3 0.53043193 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur

Abstract: A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of χ2 random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.

4 0.5125097 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

Author: Marco Cuturi, Jean-philippe Vert, Alexandre D'aspremont

Abstract: We propose new methodologies to detect anomalies in discrete-time processes taking values in a probability space. These methods are based on the inference of functionals whose evaluations on successive states visited by the process are stationary and have low autocorrelations. Deviations from this behavior are used to flag anomalies. The candidate functionals are estimated in a subspace of a reproducing kernel Hilbert space associated with the original probability space considered. We provide experimental results on simulated datasets which show that these techniques compare favorably with other algorithms.

5 0.44492659 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the model comparison. We require identifiability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. The effectiveness of our approach is demonstrated through extensive experiments on artificial and biological data showing that our approach outperform a number of state of the art methods. 1

6 0.42735565 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

7 0.41020593 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

8 0.40683991 230 nips-2009-Statistical Consistency of Top-k Ranking

9 0.3808161 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

10 0.37482402 206 nips-2009-Riffled Independence for Ranked Data

11 0.35432416 152 nips-2009-Measuring model complexity with the prior predictive

12 0.34634274 90 nips-2009-Factor Modeling for Advertisement Targeting

13 0.34309691 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank

14 0.33681279 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information

15 0.32083592 195 nips-2009-Probabilistic Relational PCA

16 0.30923682 97 nips-2009-Free energy score space

17 0.30312401 72 nips-2009-Distribution Matching for Transduction

18 0.29940456 190 nips-2009-Polynomial Semantic Indexing

19 0.29375163 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

20 0.29171202 25 nips-2009-Adaptive Design Optimization in Experiments with People


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.222), (7, 0.018), (24, 0.069), (25, 0.057), (35, 0.082), (36, 0.065), (39, 0.046), (49, 0.012), (58, 0.085), (61, 0.038), (71, 0.07), (81, 0.014), (86, 0.104), (91, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82061875 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon

Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework. We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method. 1

2 0.79727072 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

Author: Peter Carbonetto, Matthew King, Firas Hamze

Abstract: We describe a new algorithmic framework for inference in probabilistic models, and apply it to inference for latent Dirichlet allocation (LDA). Our framework adopts the methodology of variational inference, but unlike existing variational methods such as mean field and expectation propagation it is not restricted to tractable classes of approximating distributions. Our approach can also be viewed as a “population-based” sequential Monte Carlo (SMC) method, but unlike existing SMC methods there is no need to design the artificial sequence of distributions. Significantly, our framework offers a principled means to exchange the variance of an importance sampling estimate for the bias incurred through variational approximation. We conduct experiments on a difficult inference problem in population genetics, a problem that is related to inference for LDA. The results of these experiments suggest that our method can offer improvements in stability and accuracy over existing methods, and at a comparable cost. 1

3 0.65846992 113 nips-2009-Improving Existing Fault Recovery Policies

Author: Guy Shani, Christopher Meek

Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1

4 0.65793097 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these samplingbased methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1

5 0.65242767 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

Author: Keith Bush, Joelle Pineau

Abstract: Interesting real-world datasets often exhibit nonlinear, noisy, continuous-valued states that are unexplorable, are poorly described by first principles, and are only partially observable. If partial observability can be overcome, these constraints suggest the use of model-based reinforcement learning. We experiment with manifold embeddings to reconstruct the observable state-space in the context of offline, model-based reinforcement learning. We demonstrate that the embedding of a system can change as a result of learning, and we argue that the best performing embeddings well-represent the dynamics of both the uncontrolled and adaptively controlled system. We apply this approach to learn a neurostimulation policy that suppresses epileptic seizures on animal brain slices. 1

6 0.64994305 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

7 0.64847261 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

8 0.64600915 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

9 0.6426692 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

10 0.64195138 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

11 0.64167094 16 nips-2009-A Smoothed Approximate Linear Program

12 0.64112365 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

13 0.63953322 112 nips-2009-Human Rademacher Complexity

14 0.638771 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies

15 0.63847744 260 nips-2009-Zero-shot Learning with Semantic Output Codes

16 0.63836944 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

17 0.63803697 90 nips-2009-Factor Modeling for Advertisement Targeting

18 0.63787615 230 nips-2009-Statistical Consistency of Top-k Ranking

19 0.63713741 104 nips-2009-Group Sparse Coding

20 0.63703954 100 nips-2009-Gaussian process regression with Student-t likelihood