nips nips2002 nips2002-178 knowledge-graph by maker-knowledge-mining

178 nips-2002-Robust Novelty Detection with Single-Class MPM


Source: pdf

Author: Laurent E. Ghaoui, Michael I. Jordan, Gert R. Lanckriet

Abstract: In this paper we consider the problem of novelty detection, presenting an algorithm that aims to find a minimal region in input space containing a fraction α of the probability mass underlying a data set. This algorithm- the

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: In this paper we consider the problem of novelty detection, presenting an algorithm that aims to find a minimal region in input space containing a fraction α of the probability mass underlying a data set. [sent-16, score-0.404]

2 We present a robust approach to estimating the mean and covariance matrix within the general two-class MPM setting, and show how this approach specializes to the single-class problem. [sent-18, score-0.441]

3 1 Introduction Novelty detection is an important unsupervised learning problem in which test data are to be judged as having been generated from the same or a different process as that which generated the training data. [sent-20, score-0.084]

4 This formulation of novelty detection in terms of quantile estimation is to be compared to the (costly) approach of estimating a density based on the training data and thresholding the estimated density. [sent-22, score-0.509]

5 Although of reduced complexity when compared to density estimation, multivariate quantile estimation is still a challenging problem, necessitating computationally efficient methods for representing and manipulating sets in high dimensions. [sent-23, score-0.231]

6 A significant step forward in this regard was provided by Schölkopf and Smola (2001), who treated novelty detection as a "single-class" classification problem in which data are separated from the origin in feature space. [sent-24, score-0.365]

7 This allowed them to invoke the computationally-efficient technology of support vector machines. [sent-25, score-0.028]

8 In the current paper we adopt the "single-class" perspective of Schölkopf and Smola (2001), but make use of a different kernel-based technique for finding discriminant boundaries: the minimax probability machine (MPM) of Lanckriet et al. [sent-26, score-0.202]

9 To see why the MPM should be particularly appropriate for quantile estimation, consider the following theorem, which lies at the core of the MPM. [sent-28, score-0.163]

10 Given a random vector $y$ with mean $\bar{y}$ and covariance matrix $\Sigma_y$, and given arbitrary constants $a \neq 0$, $b$ such that $a^T \bar{y} \le b$, we have (for a proof, see Lanckriet et al. [sent-29, score-0.306]

11 , 2002): $\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \le b\} \ge \alpha \iff b - a^T \bar{y} \ge \kappa(\alpha)\sqrt{a^T \Sigma_y a}$, (1) where $\kappa(\alpha) = \sqrt{\alpha/(1-\alpha)}$, and $\alpha \in [0, 1)$. [sent-30, score-0.207]
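
As an illustration of Eq. (1), the sketch below (a minimal numpy example, not from the paper; the half-space $(a,b)$, the moments, and the helper names are hypothetical) computes $\kappa(\alpha)$ and checks the deterministic condition $b - a^T\bar{y} \ge \kappa(\alpha)\sqrt{a^T\Sigma_y a}$ that certifies the distribution-free coverage level $\alpha$.

    import numpy as np

    def kappa(alpha):
        # kappa(alpha) = sqrt(alpha / (1 - alpha)), for alpha in [0, 1)
        return np.sqrt(alpha / (1.0 - alpha))

    def coverage_certified(a, b, y_bar, Sigma_y, alpha):
        # Deterministic condition of Eq. (1): b - a^T y_bar >= kappa(alpha) * sqrt(a^T Sigma_y a)
        margin = b - a @ y_bar
        spread = np.sqrt(a @ Sigma_y @ a)
        return margin >= kappa(alpha) * spread

    # toy check with hypothetical moments
    a = np.array([1.0, -0.5])
    y_bar = np.array([0.0, 0.0])
    Sigma_y = np.array([[1.0, 0.2], [0.2, 0.5]])
    print(coverage_certified(a, b=3.0, y_bar=y_bar, Sigma_y=Sigma_y, alpha=0.9))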

12 Note that this is a "distribution-free" result: the infimum is taken over all distributions for $y$ having mean $\bar{y}$ and covariance matrix $\Sigma_y$ (assumed to be positive definite for simplicity). [sent-31, score-0.311]

13 Although Lanckriet et al. (2002) were able to exploit this theorem to design a binary classification algorithm, it is clear that the theorem provides even more direct leverage on the "single-class" problem: it directly bounds the probability of an observation falling outside of a given set. [sent-33, score-0.288]

14 There is one important aspect of the MPM formulation that needs further consideration, however, if we wish to apply the approach to the novelty detection problem. [sent-34, score-0.323]

15 In particular, $\bar{y}$ and $\Sigma_y$ are usually unknown in practice and must be estimated from data. [sent-35, score-0.029]

16 Lanckriet et al. (2002) successfully made use of plug-in estimates of these quantities: in some sense the bias incurred by the use of plug-in estimates in the two classes appears to "cancel" and have diminished overall impact on the discriminant boundary. [sent-37, score-0.104]

17 In the one-class setting, however, the uncertainty due to estimation of $\bar{y}$ and $\Sigma_y$ translates directly into movement of the discriminant boundary and cannot be neglected. [sent-38, score-0.173]

18 We begin in Section 2 by revisiting the MPM and showing how to account for uncertainty in the means and covariance matrices within the framework of robust estimation. [sent-39, score-0.412]

19 Section 3 then applies this robust estimation approach to the single-class MPM problem. [sent-40, score-0.169]

20 We wish to determine a hyperplane $\mathcal{H}(a, b) = \{z \mid a^T z = b\}$, where $a \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$, that maximizes the worst-case probability $\alpha$ that future data points are classified correctly with respect to all distributions having these means and covariance matrices: [sent-43, score-0.389]

21 $\max_{\alpha, a \neq 0, b} \alpha$ s.t. $\inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{a^T x \ge b\} \ge \alpha$, $\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \le b\} \ge \alpha$, (2) where $x \sim (\bar{x}, \Sigma_x)$ refers to the class of distributions that have mean $\bar{x}$ and covariance $\Sigma_x$, but are otherwise arbitrary; likewise for $y$. [sent-45, score-0.696]

22 The worst-case probability of misclassification is explicitly obtained and given by $1 - \alpha$. [sent-46, score-0.262]

23 Solving this optimization problem involves converting the probabilistic constraints in Eq. [sent-47, score-0.054]

24 (2) into deterministic constraints, a step which is achieved via the theorem referred to earlier in Eq. [sent-48, score-0.035]

25 This eventually leads to the following convex optimization problem, whose solution determines an optimal hyperplane H(a, b) (Lanckriet et al. [sent-50, score-0.285]

26 , 2002): (3) where $b$ is set to the value $b_* = a_*^T \bar{x} - \kappa_* \sqrt{a_*^T \Sigma_x a_*}$, with $a_*$ an optimal solution of Eq. [sent-51, score-0.03]

27 The optimal worst-case misclassification probability is obtained via $1 - \alpha_* = 1/(1 + \kappa_*^2)$. [sent-53, score-0.292]

28 Once an optimal hyperplane is found, classification of a new data point $z_{\mathrm{new}}$ is done by evaluating $\operatorname{sign}(a_*^T z_{\mathrm{new}} - b_*)$: if this is $+1$, $z_{\mathrm{new}}$ is classified as belonging to class $x$, otherwise $z_{\mathrm{new}}$ is classified as belonging to class $y$. [sent-54, score-0.357]
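
A minimal sketch of this plug-in decision rule, assuming an optimal $a_*$ and $\kappa_*$ have already been obtained (for example by solving Eq. (3) with an off-the-shelf convex solver); the function names are hypothetical and only the threshold, the sign test, and the nominal error bound are shown.

    import numpy as np

    def mpm_threshold(a_star, kappa_star, x_bar, Sigma_x):
        # b_* = a_*^T x_bar - kappa_* * sqrt(a_*^T Sigma_x a_*)
        return a_star @ x_bar - kappa_star * np.sqrt(a_star @ Sigma_x @ a_star)

    def mpm_classify(z_new, a_star, b_star):
        # sign(a_*^T z_new - b_*): +1 -> class x, -1 -> class y
        return 1 if a_star @ z_new - b_star >= 0 else -1

    def worst_case_error(kappa_star):
        # nominal (non-robust) bound: 1 - alpha_* = 1 / (1 + kappa_*^2)
        return 1.0 / (1.0 + kappa_star ** 2)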

29 While in our earlier work, we simply computed sample-based estimates of means and covariance matrices and plugged them into the MPM optimization problem in Eq. [sent-55, score-0.286]

30 (3), we now show how to treat this estimation problem within the framework of robust optimization. [sent-56, score-0.169]

31 Assume the mean and covariance matrix of each class are unknown but lie within specified convex sets: $(\bar{x}, \Sigma_x) \in \mathcal{X}$, with $\mathcal{X} \subset \mathbb{R}^n \times \{M \in \mathbb{R}^{n \times n} \mid M = M^T, M \succeq 0\}$, and $(\bar{y}, \Sigma_y) \in \mathcal{Y}$, with $\mathcal{Y} \subset \mathbb{R}^n \times \{M \in \mathbb{R}^{n \times n} \mid M = M^T, M \succeq 0\}$. [sent-57, score-0.391]

32 (2) to be robust against variations of the mean and covariance matrix within these sets: [sent-59, score-0.125]

33 $\max_{\alpha, a \neq 0, b} \alpha$ s.t. $\inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{a^T x \ge b\} \ge \alpha \;\; \forall (\bar{x}, \Sigma_x) \in \mathcal{X}$, $\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \le b\} \ge \alpha \;\; \forall (\bar{y}, \Sigma_y) \in \mathcal{Y}$. (4) [sent-61, score-0.488]

34 In other words, we would like to guarantee a worst-case misclassification probability for all distributions which have unknown-but-bounded mean and covariance matrix, but which are otherwise arbitrary. [sent-62, score-0.47]

35 The complexity of this problem depends obviously on the structure of the uncertainty sets X, y. [sent-63, score-0.089]

36 The matrix norm is the Frobenius norm: $\|A\|_F^2 = \mathrm{Tr}(A^T A)$. [sent-66, score-0.112]

37 The covariance matrix belongs to a matrix norm ball - a convex set - centered around $\Sigma_y^0$. [sent-68, score-0.436]

38 This uncertainty model is perhaps less classical from a statistical viewpoint, but it will lead to a regularization term of a classical form. [sent-69, score-0.128]

39 (1) and notice that $b - a^T \bar{y} \ge \kappa(\alpha)\sqrt{a^T \Sigma_y a} \;\; \forall (\bar{y}, \Sigma_y) \in \mathcal{Y} \iff b - \max_{(\bar{y}, \Sigma_y) \in \mathcal{Y}} a^T \bar{y} \ge \kappa(\alpha) \max_{(\bar{y}, \Sigma_y) \in \mathcal{Y}} \sqrt{a^T \Sigma_y a}$, where the right-hand side guarantees the constraint for the worst-case estimate of the mean and covariance matrix within the bounded set $\mathcal{Y}$. [sent-72, score-0.475]

40 $\nabla_{\bar{y}} \mathcal{L}(\bar{y}, \lambda) = 0$, leading to $\bar{y} = \bar{y}^0 + \frac{1}{2\lambda}\Sigma_y a$ and $\lambda = \sqrt{a^T \Sigma_y a / (4\nu^2)}$, which eventually leads to Eq. [sent-79, score-0.037]
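
The worst-case mean term can be checked numerically in the same spirit. The sketch below assumes an ellipsoidal mean-uncertainty model $(\bar{y} - \bar{y}^0)^T \Sigma_y^{-1} (\bar{y} - \bar{y}^0) \le \nu^2$ (the set is not spelled out in the sentences above, so treat this as an assumption), under which $\sup a^T\bar{y} = a^T\bar{y}^0 + \nu\sqrt{a^T\Sigma_y a}$, attained at $\bar{y} = \bar{y}^0 + \nu\,\Sigma_y a / \sqrt{a^T\Sigma_y a}$.

    import numpy as np

    rng = np.random.default_rng(1)
    n, nu = 3, 0.5
    a = rng.normal(size=n)
    y0 = rng.normal(size=n)
    A = rng.normal(size=(n, n))
    Sigma_y = A @ A.T + np.eye(n)              # a positive definite covariance

    s = np.sqrt(a @ Sigma_y @ a)
    y_worst = y0 + nu * (Sigma_y @ a) / s      # candidate maximizer of a^T y_bar
    dev = y_worst - y0
    # it lies on the boundary of the assumed uncertainty ellipsoid ...
    assert np.isclose(dev @ np.linalg.solve(Sigma_y, dev), nu ** 2)
    # ... and attains the claimed supremum a^T y0 + nu * sqrt(a^T Sigma_y a)
    print(np.isclose(a @ y_worst, a @ y0 + nu * s))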

41 We then obtain $\max_{\Sigma_y : \|\Sigma_y - \Sigma_y^0\|_F \le \rho} a^T \Sigma_y a = a^T \Sigma_y^0 a + \rho \max_{\|\Delta\Sigma\|_F \le 1} a^T \Delta\Sigma\, a \le a^T \Sigma_y^0 a + \rho\, a^T a$, (8) [sent-83, score-0.074]

42 using the Cauchy-Schwarz inequality and the compatibility of the Frobenius matrix norm and the Euclidean vector norm: $a^T \Delta\Sigma\, a \le \|a\|_2 \|\Delta\Sigma\, a\|_2 \le \|a\|_2 \|\Delta\Sigma\|_F \|a\|_2 \le \|a\|_2^2$, because $\|\Delta\Sigma\|_F \le 1$. [sent-87, score-0.21]
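
The bound in Eq. (8) is tight and easy to verify numerically: over the Frobenius ball $\|\Sigma_y - \Sigma_y^0\|_F \le \rho$, the worst-case value of $a^T\Sigma_y a$ is $a^T\Sigma_y^0 a + \rho\, a^T a$, attained at the rank-one perturbation $\Delta\Sigma = \rho\, a a^T / \|a\|_2^2$. A small sketch with hypothetical moments:

    import numpy as np

    rng = np.random.default_rng(0)
    n, rho = 4, 0.3
    a = rng.normal(size=n)
    Sigma0 = np.eye(n)

    # worst-case perturbation saturating the Frobenius ball: Delta = rho * a a^T / ||a||^2
    Delta = rho * np.outer(a, a) / (a @ a)
    assert np.isclose(np.linalg.norm(Delta, 'fro'), rho)

    worst_case = a @ (Sigma0 + Delta) @ a
    bound = a @ Sigma0 @ a + rho * (a @ a)
    print(np.isclose(worst_case, bound))   # the bound is attained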

43 (1): $\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \le b\} \ge \alpha, \;\; \forall (\bar{y}, \Sigma_y) \in \mathcal{Y} \iff b - a^T \bar{y}^0 \ge (\kappa(\alpha) + \nu)\sqrt{a^T (\Sigma_y^0 + \rho I_n) a}$. [sent-92, score-0.207]

44 (4) thus shows that the optimal robust minimax probability classifier for X, Y given by Eq. [sent-94, score-0.274]

45 If $\kappa_*^{-1}$ is the optimal value of that problem, the corresponding worst-case misclassification probability is $1 - \alpha_* = \frac{1}{1 + \max(0, \kappa_* - \nu)^2}$. [sent-97, score-0.292]
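
A one-line helper for this robust worst-case misclassification probability, assuming $\kappa_*$ and $\nu$ are given (a sketch, not from the paper):

    def robust_worst_case_error(kappa_star, nu):
        # 1 - alpha_* = 1 / (1 + max(0, kappa_star - nu)^2)
        return 1.0 / (1.0 + max(0.0, kappa_star - nu) ** 2)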

46 With only uncertainty in the mean ($\rho = 0$), the robust hyperplane is the same as the non-robust one; the only change is in the increase in the worst-case misclassification probability. [sent-98, score-0.578]

47 Uncertainty in the covariance matrix adds a term $\rho I_n$ to the covariance matrices, which can be interpreted as a regularization term. [sent-99, score-0.41]

48 This affects the hyperplane and increases the worst-case misclassification probability as well. [sent-100, score-0.348]

49 (i.e., $\kappa_* < \nu$), the robust version is not feasible: no hyperplane can be found that separates the two classes in the robust minimax probabilistic sense, and the worst-case misclassification probability is $1 - \alpha_* = 1$. [sent-103, score-0.683]

50 This robust approach can be readily generalized to allow nonlinear decision boundaries via the use of Mercer kernels (Lanckriet et al. [sent-104, score-0.168]

51 3 Single-class MPM for robust novelty detection We now turn to the quantile estimation problem. [sent-106, score-0.609]

52 Recall that for $\alpha \in (0,1]$, we wish to find a small region $\mathcal{Q}$ such that $\Pr\{x \in \mathcal{Q}\} = \alpha$. [sent-107, score-0.154]

53 $(\bar{x}, \Sigma_x)$, and let us focus (for now) on the linear case where $\mathcal{Q}$ is a half-space not containing the origin. [sent-111, score-0.023]

54 We seek a half-space $\mathcal{Q}(a,b) = \{z \mid a^T z \ge b\}$, with $a \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$, and not containing $0$, such that with probability at least $\alpha$, the data lies in $\mathcal{Q}$, for every distribution having mean $\bar{x}$ and covariance matrix $\Sigma_x$. [sent-112, score-0.32]

55 We assume again that the real $\bar{x}$, $\Sigma_x$ are unknown but bounded in a set $\mathcal{X}$ as specified in Eq. [sent-113, score-0.067]

56 (5): $\inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{a^T x \ge b\} \ge \alpha \;\; \forall (\bar{x}, \Sigma_x) \in \mathcal{X}$. [sent-114, score-0.207]

57 We want the region $\mathcal{Q}$ to be tight, so we maximize its Mahalanobis distance (with respect to $\Sigma_x$) to the origin in a robust way, i. [sent-115, score-0.222]

58 , for the worst-case estimate of $\Sigma_x$, the matrix that gives us the smallest Mahalanobis distance, s.t. [sent-117, score-0.055]

59 $\inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{a^T x \ge b\} \ge \alpha \;\; \forall (\bar{x}, \Sigma_x) \in \mathcal{X}$. [sent-119, score-0.207]

60 (9) and get (where the superscript 0 for the estimates has been omitted): $\min_a \sqrt{a^T (\Sigma_x + \rho I_n) a}$ subject to [sent-127, score-0.055]

61 $a^T \bar{x} \ge 1 + (\kappa(\alpha) + \nu)\sqrt{a^T (\Sigma_x + \rho I_n) a}$, (11) where the restriction $a \neq 0$ can be omitted since the constraint never holds in that case. [sent-131, score-0.034]

62 Again, we obtain a (convex) second order cone programming problem. [sent-132, score-0.024]

63 The worst-case probability of occurrence outside region $\mathcal{Q}$ is given by $1 - \alpha$. [sent-133, score-0.191]

64 Notice that the particular choice of $\alpha \in (0,1]$ must be feasible, i. [sent-134, score-0.061]

65 For $\rho \neq 0$, $\Sigma_x + \rho I_n$ is certainly positive definite and the half-space is unique. [sent-139, score-0.074]

66 In other words, the optimal $a$ is parallel to $\bar{x}$, in the form $a = \lambda (\Sigma_x + \rho I_n)^{-1} \bar{x}$, and the problem reduces to the one-dimensional problem: $\min_\lambda |\lambda|\, \|(\Sigma_x + \rho I_n)^{-1/2} \bar{x}\|_2$ : $\lambda\, \bar{x}^T (\Sigma_x + \rho I_n)^{-1} \bar{x} \ge 1 + (\kappa(\alpha) + \nu)\, \|(\Sigma_x + \rho I_n)^{-1/2} \bar{x}\|_2\, |\lambda|$. [sent-148, score-0.027]

67 The constraint implies that $\lambda \ge 0$, hence the problem reduces to $\min_\lambda \lambda$ : $\lambda\,(\zeta^2 - (\kappa(\alpha) + \nu)\,\zeta) \ge 1$, $\lambda \ge 0$, [sent-150, score-0.035]

68 with $\zeta^2 = \bar{x}^T (\Sigma_x + \rho I_n)^{-1} \bar{x} > 0$ (because Eq. [sent-154, score-0.057]

69 $\zeta^2 - (\kappa(\alpha) + \nu)\,\zeta \ge 0$, which is nothing other than the feasibility condition for $\alpha$: if this is fulfilled, the optimization in Eq. [sent-158, score-0.028]
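
Since $\kappa(\alpha) = \sqrt{\alpha/(1-\alpha)}$ is increasing, the feasibility condition $\zeta \ge \kappa(\alpha) + \nu$ can be inverted to give the largest achievable coverage level $\alpha$. The helper below is a sketch under that reading (the function name and the plug-in arguments are hypothetical, not from the paper):

    import numpy as np

    def max_feasible_alpha(x_bar, Sigma_x, rho, nu):
        # zeta^2 = x_bar^T (Sigma_x + rho I)^{-1} x_bar; feasible iff kappa(alpha) <= zeta - nu
        n = len(x_bar)
        zeta = np.sqrt(x_bar @ np.linalg.solve(Sigma_x + rho * np.eye(n), x_bar))
        if zeta <= nu:
            return 0.0            # no alpha in (0, 1] is feasible
        t = zeta - nu
        # kappa(alpha) <= t  <=>  alpha <= t^2 / (1 + t^2)
        return t ** 2 / (1.0 + t ** 2)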

70 It is easy to see that the optimal $\lambda$ is given by $\lambda_* = \frac{1}{\zeta^2 - (\kappa(\alpha) + \nu)\,\zeta}$, so that $a_* = \frac{(\Sigma_x + \rho I_n)^{-1} \bar{x}}{\zeta^2 - (\kappa(\alpha) + \nu)\,\zeta}$, $b_* = 1$. [sent-165, score-0.03]

71 (14) Notice that the uncertainty in the covariance matrix $\Sigma_x$ leads to the typical, well-known regularization for inverting this matrix. [sent-170, score-0.039]

72 If the choice of $\alpha$ is not feasible, or if $\bar{x} = 0$ (in this case, no $\alpha \in (0,1]$ will be feasible), Eq. [sent-171, score-0.061]

73 Future points $z$ for which $a_*^T z \le b_*$ can then be considered as outliers with respect to the region $\mathcal{Q}$, with worst-case probability of occurrence outside $\mathcal{Q}$ given by $1 - \alpha$. [sent-173, score-0.191]
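
Putting the pieces together, the sketch below implements the linear robust single-class MPM along the lines described above, assuming the closed-form solution $a_* = (\Sigma_x + \rho I_n)^{-1}\bar{x} / (\zeta^2 - (\kappa(\alpha)+\nu)\zeta)$, $b_* = 1$ quoted in sentence 70. The moments are plug-in sample estimates, the uncertainty parameters $\rho$ and $\nu$ are user-supplied, and all function names are hypothetical; this is a minimal illustration, not the authors' (kernelized) implementation.

    import numpy as np

    def kappa(alpha):
        return np.sqrt(alpha / (1.0 - alpha))

    def fit_single_class_mpm(X, alpha=0.9, rho=0.1, nu=0.1):
        # Linear robust single-class MPM (sketch). X: (m, n) training data.
        x_bar = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)
        n = X.shape[1]
        M = Sigma + rho * np.eye(n)                 # regularized covariance
        w = np.linalg.solve(M, x_bar)               # (Sigma + rho I)^{-1} x_bar
        zeta = np.sqrt(x_bar @ w)
        denom = zeta ** 2 - (kappa(alpha) + nu) * zeta
        if denom <= 0:
            raise ValueError("alpha not feasible for this data / uncertainty level")
        a_star = w / denom
        b_star = 1.0
        return a_star, b_star

    def is_outlier(z, a_star, b_star):
        # points with a_*^T z <= b_* fall outside Q and are flagged as novel
        return a_star @ z <= b_star

    # usage with synthetic data
    rng = np.random.default_rng(0)
    X = rng.normal(loc=3.0, size=(200, 2))
    a_star, b_star = fit_single_class_mpm(X, alpha=0.9, rho=0.05, nu=0.05)
    print(is_outlier(np.array([3.0, 3.0]), a_star, b_star),
          is_outlier(np.array([-5.0, -5.0]), a_star, b_star))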


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pin', 0.522), ('mpm', 0.312), ('misclassification', 0.228), ('inf', 0.207), ('aty', 0.196), ('jrn', 0.196), ('novelty', 0.193), ('ey', 0.169), ('jat', 0.163), ('quantile', 0.163), ('covariance', 0.158), ('lanckriet', 0.151), ('yo', 0.142), ('pr', 0.14), ('robust', 0.125), ('atx', 0.113), ('znew', 0.098), ('uncertainty', 0.089), ('hyperplane', 0.086), ('ya', 0.085), ('minimax', 0.085), ('detection', 0.084), ('max', 0.074), ('region', 0.07), ('atz', 0.065), ('jrnxnlm', 0.065), ('classified', 0.065), ('ex', 0.064), ('convex', 0.061), ('classification', 0.061), ('feasible', 0.061), ('norm', 0.057), ('lx', 0.057), ('matrix', 0.055), ('scholkopf', 0.054), ('outside', 0.053), ('jr', 0.052), ('eecs', 0.052), ('gert', 0.052), ('mean', 0.05), ('mahalanobis', 0.048), ('definite', 0.048), ('berkeley', 0.047), ('wish', 0.046), ('sb', 0.046), ('edu', 0.046), ('falling', 0.046), ('estimation', 0.044), ('et', 0.043), ('frobenius', 0.041), ('discriminant', 0.04), ('matrices', 0.04), ('regularization', 0.039), ('specified', 0.038), ('find', 0.038), ('smola', 0.037), ('eventually', 0.037), ('notice', 0.035), ('min', 0.035), ('theorem', 0.035), ('occurrence', 0.034), ('omitted', 0.034), ('probability', 0.034), ('estimates', 0.032), ('optimum', 0.032), ('optimal', 0.03), ('guarantees', 0.029), ('unknown', 0.029), ('zn', 0.028), ('specializes', 0.028), ('invoke', 0.028), ('plugged', 0.028), ('jrnxn', 0.028), ('pat', 0.028), ('fulfilled', 0.028), ('iif', 0.028), ('optimization', 0.028), ('generality', 0.027), ('origin', 0.027), ('centered', 0.026), ('converting', 0.026), ('laurent', 0.026), ('confidence', 0.026), ('semidefinite', 0.026), ('xo', 0.026), ('halfspace', 0.026), ('belonging', 0.025), ('estimating', 0.025), ('ii', 0.025), ('belongs', 0.024), ('nominal', 0.024), ('manipulating', 0.024), ('zt', 0.024), ('compatibility', 0.024), ('leverage', 0.024), ('cone', 0.024), ('fixed', 0.024), ('containing', 0.023), ('ghaoui', 0.023), ('superscript', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 178 nips-2002-Robust Novelty Detection with Single-Class MPM

Author: Laurent E. Ghaoui, Michael I. Jordan, Gert R. Lanckriet

Abstract: In this paper we consider the problem of novelty detection, presenting an algorithm that aims to find a minimal region in input space containing a fraction α of the probability mass underlying a data set. This algorithm- the

2 0.080165423 156 nips-2002-On the Complexity of Learning the Kernel Matrix

Author: Olivier Bousquet, Daniel Herrmann

Abstract: We investigate data based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow to vary the spectrum and keep the eigenvectors fix. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.

3 0.067786582 6 nips-2002-A Formulation for Minimax Probability Machine Regression

Author: Thomas Strohmann, Gregory Z. Grudic

Abstract: We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al.] and uses Mercer Kernels to obtain nonlinear regression models. MPMR is tested on both toy and real world data, verifying the accuracy of the Ω bound, and the efficacy of the regression models. 1

4 0.058450062 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

Author: Ron Meir, Tong Zhang

Abstract: We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases, where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting.

5 0.050774783 161 nips-2002-PAC-Bayes & Margins

Author: John Langford, John Shawe-Taylor

Abstract: unkown-abstract

6 0.049956061 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems

7 0.046745308 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

8 0.04581394 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

9 0.044482518 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

10 0.043951016 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

11 0.043838866 114 nips-2002-Information Regularization with Partially Labeled Data

12 0.043432806 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

13 0.04286842 124 nips-2002-Learning Graphical Models with Mercer Kernels

14 0.042467918 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations

15 0.042225953 142 nips-2002-Maximum Likelihood and the Information Bottleneck

16 0.042050231 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

17 0.04155568 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

18 0.041244753 119 nips-2002-Kernel Dependency Estimation

19 0.039834585 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

20 0.039129984 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.125), (1, -0.048), (2, -0.01), (3, -0.03), (4, 0.006), (5, -0.019), (6, -0.02), (7, 0.027), (8, 0.001), (9, 0.046), (10, 0.063), (11, -0.067), (12, 0.037), (13, -0.065), (14, -0.009), (15, 0.002), (16, -0.003), (17, -0.029), (18, 0.034), (19, 0.024), (20, 0.022), (21, 0.078), (22, 0.054), (23, -0.013), (24, 0.023), (25, -0.069), (26, -0.018), (27, -0.023), (28, 0.047), (29, 0.001), (30, 0.015), (31, 0.026), (32, 0.01), (33, -0.074), (34, 0.012), (35, -0.01), (36, 0.064), (37, 0.057), (38, 0.219), (39, -0.014), (40, 0.05), (41, -0.076), (42, 0.128), (43, 0.0), (44, 0.091), (45, 0.026), (46, 0.118), (47, 0.035), (48, 0.103), (49, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92772949 178 nips-2002-Robust Novelty Detection with Single-Class MPM

Author: Laurent E. Ghaoui, Michael I. Jordan, Gert R. Lanckriet

Abstract: In this paper we consider the problem of novelty detection, presenting an algorithm that aims to find a minimal region in input space containing a fraction α of the probability mass underlying a data set. This algorithm- the

2 0.58119619 6 nips-2002-A Formulation for Minimax Probability Machine Regression

Author: Thomas Strohmann, Gregory Z. Grudic

Abstract: We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al.] and uses Mercer Kernels to obtain nonlinear regression models. MPMR is tested on both toy and real world data, verifying the accuracy of the Ω bound, and the efficacy of the regression models. 1

3 0.45660484 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

Author: Christopher Williams, John S. Shawe-taylor

Abstract: In this paper we analyze the relationships between the eigenvalues of the m x m Gram matrix K for a kernel k(·, ·) corresponding to a sample x1, ..., xm drawn from a density p(x) and the eigenvalues of the corresponding continuous eigenproblem. We bound the differences between the two spectra and provide a performance bound on kernel PCA.

4 0.43791297 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

Author: Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik

Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. Keywords: use and refinement of prior knowledge, support vector machines, linear programming 1

5 0.42386675 185 nips-2002-Speeding up the Parti-Game Algorithm

Author: Maxim Likhachev, Sven Koenig

Abstract: In this paper, we introduce an efficient replanning algorithm for nondeterministic domains, namely what we believe to be the first incremental heuristic minimax search algorithm. We apply it to the dynamic discretization of continuous domains, resulting in an efficient implementation of the parti-game reinforcement-learning algorithm for control in high-dimensional domains.

6 0.419375 114 nips-2002-Information Regularization with Partially Labeled Data

7 0.41584387 111 nips-2002-Independent Components Analysis through Product Density Estimation

8 0.41570592 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking

9 0.4119564 179 nips-2002-Scaling of Probability-Based Optimization Algorithms

10 0.4111028 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems

11 0.39707088 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

12 0.37625167 124 nips-2002-Learning Graphical Models with Mercer Kernels

13 0.35884365 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory

14 0.35704783 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

15 0.35140371 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

16 0.34464443 138 nips-2002-Manifold Parzen Windows

17 0.33465934 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques

18 0.32841009 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex

19 0.32456341 42 nips-2002-Bias-Optimal Incremental Problem Solving

20 0.30894461 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.013), (42, 0.075), (54, 0.142), (55, 0.044), (68, 0.015), (72, 0.388), (74, 0.078), (92, 0.055), (98, 0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7450633 178 nips-2002-Robust Novelty Detection with Single-Class MPM

Author: Laurent E. Ghaoui, Michael I. Jordan, Gert R. Lanckriet

Abstract: In this paper we consider the problem of novelty detection, presenting an algorithm that aims to find a minimal region in input space containing a fraction α of the probability mass underlying a data set. This algorithm- the

2 0.68175489 145 nips-2002-Mismatch String Kernels for SVM Protein Classification

Author: Eleazar Eskin, Jason Weston, William S. Noble, Christina S. Leslie

Abstract: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.

3 0.51833636 2 nips-2002-A Bilinear Model for Sparse Coding

Author: David B. Grimes, Rajesh P. Rao

Abstract: Recent algorithms for sparse coding and independent component analysis (ICA) have demonstrated how localized features can be learned from natural images. However, these approaches do not take image transformations into account. As a result, they produce image codes that are redundant because the same feature is learned at multiple locations. We describe an algorithm for sparse coding based on a bilinear generative model of images. By explicitly modeling the interaction between image features and their transformations, the bilinear approach helps reduce redundancy in the image code and provides a basis for transformationinvariant vision. We present results demonstrating bilinear sparse coding of natural images. We also explore an extension of the model that can capture spatial relationships between the independent features of an object, thereby providing a new framework for parts-based object recognition.

4 0.45526171 53 nips-2002-Clustering with the Fisher Score

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

5 0.4545612 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

Author: Olivier Chapelle, Jason Weston, Bernhard Schölkopf

Abstract: We propose a framework to incorporate unlabeled data in kernel classifiers, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach.

6 0.45274025 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

7 0.45251891 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

8 0.45224246 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

9 0.45222709 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

10 0.45207408 119 nips-2002-Kernel Dependency Estimation

11 0.45120645 190 nips-2002-Stochastic Neighbor Embedding

12 0.45022851 3 nips-2002-A Convergent Form of Approximate Policy Iteration

13 0.45012403 27 nips-2002-An Impossibility Theorem for Clustering

14 0.4498105 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

15 0.44819069 106 nips-2002-Hyperkernels

16 0.44776106 169 nips-2002-Real-Time Particle Filters

17 0.44754511 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

18 0.44748196 10 nips-2002-A Model for Learning Variance Components of Natural Images

19 0.44745988 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

20 0.44720843 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs