nips nips2002 nips2002-161 knowledge-graph by maker-knowledge-mining

161 nips-2002-PAC-Bayes & Margins


Source: pdf

Author: John Langford, John Shawe-Taylor

Abstract: unknown-abstract

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 shorter argument and much tighter than previous margin bounds. [sent-1, score-0.524]

2 There are two mathematical flavors of margin bound dependent upon the weights w_i of the vote and the features x_i that the vote is taken over. [sent-2, score-0.924]

3 Those ([12], [1]) with a bound on Σ_i w_i² and Σ_i x_i² ("l2/l2" bounds). [sent-4, score-0.231]

4 Those ([11], [6]) with a bound on Σ_i |w_i| and max_i |x_i| ("l1/l∞" bounds). [sent-6, score-0.231]

5 [12] and Bartlett [1] by a log(m)² sample complexity factor and much tighter constants (1000 or unstated versus 9 or 18 as suggested by Section 2. [sent-9, score-0.1]

6 In addition, the bound here covers margin errors without weakening the error-free case. [sent-11, score-0.646]

7 Herbrich and Graepel [3] moved significantly towards the approach adopted in our paper, but the methodology adopted meant that their result does not scale well to high dimensional feature spaces as the bound here (and earlier results) do. [sent-12, score-0.354]

8 The layout of our paper is simple - we first show how to construct a stochastic classifier with a good true error bound given a margin, and then construct a margin bound. [sent-13, score-1.273]

9 Let the vote of a voting classifier be given by v_w(x) = w · x = Σ_i w_i x_i. The classifier is given by c(x) = sign(v_w(x)). [sent-20, score-0.911]
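
As a quick illustration of this definition (a minimal sketch, not code from the paper; the weights and examples below are made up), the vote and the induced classifier can be written in a few lines of NumPy:

    import numpy as np

    def vote(w, X):
        # v_w(x) = w . x = sum_i w_i x_i, evaluated for each row x of X
        return X @ w

    def classify(w, X):
        # c(x) = sign(v_w(x)); np.sign returns 0 exactly on the decision boundary
        return np.sign(vote(w, X))

    w = np.array([0.5, -0.2, 0.3])                 # hypothetical weights
    X = np.array([[1.0, 0.0, 1.0],
                  [0.2, 0.9, -0.1],
                  [-1.0, 0.5, 0.0]])               # hypothetical examples
    print(vote(w, X), classify(w, X))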

10 Any margin bound applies to a vector w in N-dimensional space. [sent-28, score-0.613]

11 For every example, we can decompose the example into a portion which is parallel to w and a portion which is perpendicular to w. [sent-29, score-0.297]

12 x_∥ = (v_w(x) / ‖w‖²) w and x_T = x − x_∥. The argument is simple: we exhibit a "prior" over the weight space and a "posterior" over the weight space with an analytical form for the KL-divergence. [sent-30, score-0.128]
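
A minimal numerical sketch of this decomposition (assuming NumPy; the vectors are illustrative only):

    import numpy as np

    def decompose(w, x):
        # x_par = (v_w(x) / ||w||^2) w is the component of x parallel to w;
        # x_perp = x - x_par is the component perpendicular to w.
        x_par = (np.dot(w, x) / np.dot(w, w)) * w
        x_perp = x - x_par
        return x_par, x_perp

    w = np.array([3.0, 4.0])
    x = np.array([1.0, 2.0])
    x_par, x_perp = decompose(w, x)
    print(x_par, x_perp, np.dot(x_par, x_perp))    # the inner product is ~0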

13 The stochastic classifier defined by the posterior has a slightly larger empirical error and a small true error bound. [sent-31, score-0.82]

14 For the next theorem, let F̄(x) = 1 − ∫_{−∞}^{x} (1/√(2π)) e^{−z²/2} dz be the tail probability of a Gaussian with mean 0 and variance 1. [sent-32, score-0.125]
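
For readers who want to evaluate this tail numerically, the standard-normal survival function in SciPy matches the definition above; a small sketch (assuming SciPy is installed):

    from scipy.stats import norm

    def F_bar(x):
        # Tail probability of N(0, 1): F_bar(x) = Pr(Z > x)
        return norm.sf(x)

    def F_bar_inv(eps):
        # Inverse tail: the threshold mu with Pr(Z > mu) = eps
        return norm.isf(eps)

    print(F_bar(1.0))        # ~0.1587
    print(F_bar_inv(0.25))   # ~0.6745, i.e. not very large for a reasonable eps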

15 Also let e_Q(w, γ, ε) = Pr_{(x,y)~D, h~Q(w,γ,ε)} (h(x) ≠ y) be the true error rate of a stochastic classifier with distribution Q(w, γ, ε) dependent on a free parameter ε, the weights w of an averaging classifier, and a margin γ. [sent-33, score-1.398]

16 1 shows that when a margin exists it is always possible to find a "posterior" distribution (in the style of [5]) which introduces only a small amount of additional training error rate. [sent-38, score-0.523]

17 The true error bound for this stochastization of the large-margin classifier is not dependent on the dimensionality except via the margin. [sent-39, score-0.761]

18 Since the Gaussian tail decreases exponentially, the value of F̄⁻¹(ε) is not very large for any reasonable value of ε. [sent-40, score-0.125]

19 25, the bound is less than 1 around m = 100 examples and less than 0. [sent-46, score-0.231]

20 3) that the generalisation error of the original averaging classifier is only a factor 2 or 4 larger than that of the stochastic classifiers considered here. [sent-51, score-1.015]

21 This theorem is robust in the presence of noise and margin errors. [sent-55, score-0.575]

22 Since the PAC-Bayes bound works for any "posterior" Q, we are free to choose Q dependent upon the data in any way. [sent-56, score-0.381]

23 In practice, it may be desirable to follow an approach similar to [5] and allow the data to determine the "right" posterior Q. [sent-57, score-0.08]

24 Using the data rather than the margin γ allows the bound to take into account a fortuitous data distribution and robust behavior in the presence of a "soft margin" (a margin with errors). [sent-58, score-0.995]

25 Here we state a bound which can take into account the distribution of the training set. [sent-61, score-0.231]

26 This theorem demonstrates the flexibility of the technique since it incorporates significantly more data-dependent information into the bound calculation. [sent-64, score-0.485]

27 When applying the bound one would choose μ to make the inequality (1) an equality. [sent-65, score-0.288]

28 Hence, any choice of μ determines ε and hence the overall bound. [sent-66, score-0.041]

29 We then have the freedom to choose μ to optimise the bound. [sent-67, score-0.095]

30 As noted earlier, given a weight vector w, any particular feature vector x decomposes into a portion x_∥ which is parallel to w and a portion x_T which is perpendicular to w. [sent-68, score-0.305]

31 Hence, we can write x = x_∥ e_∥ + x_T e_T, where e_∥ is a unit vector in the direction of w and e_T is a unit vector in the direction of x_T. [sent-69, score-0.122]

32 Note that we may have y x_∥ < 0, if x is misclassified by w. [sent-70, score-0.031]

33 1 For all averaging classifiers c with normalized weights w and for all ε > 0 stochastic error rates: if we choose μ > 0 such that E_{(x,y)~S} F̄(y x_∥ μ / x_T) = ε   (1), then there exists a posterior distribution Q(w, μ, ε) such that [sent-72, score-0.896]

34 … ≥ 1 − δ, where KL(q‖p) = q ln(q/p) + (1 − q) ln((1 − q)/(1 − p)) = the Kullback-Leibler divergence between two coins of bias q < p. [sent-77, score-0.165]

35 The proof uses the PAC-Bayes bound, which states that for all prior distributions P, Pr_{S~D^m} ( ∀Q: KL(ê_Q ‖ e_Q) ≤ (KL(Q‖P) + ln((m+1)/δ)) / m ) ≥ 1 − δ. We choose P = N(0, I), an isotropic Gaussian. [sent-79, score-0.151]
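
To see how a bound of this KL form is used numerically, one can compute the divergence between two coins directly and invert the inequality KL(ê_Q ‖ e_Q) ≤ B by bisection to obtain an explicit upper bound on the true error. This is a generic sketch with made-up numbers, not a routine from the paper:

    import math

    def kl_bernoulli(q, p):
        # KL(q || p) between two coins of bias q and p
        tiny = 1e-12
        q = min(max(q, tiny), 1.0 - tiny)
        p = min(max(p, tiny), 1.0 - tiny)
        return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

    def kl_inverse_upper(q_hat, budget, tol=1e-10):
        # Largest p >= q_hat with KL(q_hat || p) <= budget, found by bisection
        lo, hi = q_hat, 1.0 - 1e-12
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if kl_bernoulli(q_hat, mid) <= budget:
                lo = mid
            else:
                hi = mid
        return lo

    # Hypothetical numbers: empirical stochastic error 0.1 and complexity
    # term (KL(Q||P) + ln((m+1)/delta)) / m = 0.05.
    print(kl_inverse_upper(0.1, 0.05))   # upper bound on the true error e_Q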

36 A choice of the "posterior" Q completes the proof. [sent-80, score-0.041]

37 The Q we choose depends upon the direction w, the margin γ, and the stochastic error ε. [sent-81, score-0.711]

38 In particular, Q equals P in every direction perpendicular to w, and a rectified Gaussian tail in the w direction. The distribution of a rectified Gaussian tail is given by R(μ)(x) = 0 for x < μ and R(μ)(x) = e^{−x²/2} / (√(2π) F̄(μ)) for x ≥ μ. [sent-82, score-0.537]

39 Note that we use the invariance under rotation of N(0, I) here to line up one dimension with w. [sent-86, score-0.105]
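
A rough sketch of how such a stochastic weight vector could be drawn (assuming NumPy/SciPy; this follows the description above, a rectified Gaussian tail along w via inverse-CDF sampling and an isotropic Gaussian in every perpendicular direction, and is not the authors' code):

    import numpy as np
    from scipy.stats import norm

    def sample_posterior(w, mu, rng):
        # Rectified Gaussian tail R(mu) along w, N(0, 1) in all perpendicular directions.
        e_par = w / np.linalg.norm(w)
        u = 1.0 - rng.random()                     # u in (0, 1]
        w_par = norm.isf(u * norm.sf(mu))          # N(0,1) conditioned on being >= mu
        g = rng.standard_normal(len(w))
        g_perp = g - np.dot(g, e_par) * e_par      # strip the component along w
        return w_par * e_par + g_perp

    rng = np.random.default_rng(0)
    w = np.array([1.0, 2.0, 2.0])
    w_bar = sample_posterior(w, mu=1.5, rng=rng)
    print(np.dot(w_bar, w / np.linalg.norm(w)))    # the parallel part is always >= mu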

40 Thus, our choice of posterior implies the theorem if the empirical error rate is ê_Q(w, … [sent-87, score-0.565]

41 Given a point x, our choice of posterior implies that we can decompose the stochastic weight vector as w̄ = w̄_∥ e_∥ + w̄_T e_T + w̄_⊥, where e_∥ is parallel to w, e_T is parallel to x_T, and w̄_⊥ is a residual vector perpendicular to both. [sent-90, score-0.614]

42 By our definition of the stochastic generation, w̄_∥ ~ R(μ) and w̄_T ~ N(0, 1). [sent-91, score-0.221]

43 No error occurs if y(μ x_∥ + w̄_T x_T) > 0. Since w̄_T is drawn from N(0, 1), the probability of this event is Pr( y(μ x_∥ + w̄_T x_T) > 0 ) ≥ 1 − F̄(y x_∥ μ / x_T), and so the empirical error rate of the stochastic classifier is bounded by ê_Q ≤ E_{(x,y)~S} F̄(y x_∥ μ / x_T). [sent-93, score-0.732]
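
Choosing μ so that this empirical expectation equals the target ε (equation (1) above) is a one-dimensional root-finding problem; a sketch on made-up, linearly separable data, assuming NumPy/SciPy:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import brentq

    def empirical_stochastic_error(mu, w, X, y):
        # E_{(x,y)~S} F_bar(y * x_par * mu / x_perp) for unit-norm w
        e_par = w / np.linalg.norm(w)
        x_par = X @ e_par                                  # signed parallel components
        x_perp = np.linalg.norm(X - np.outer(x_par, e_par), axis=1)
        x_perp = np.maximum(x_perp, 1e-12)                 # avoid division by zero
        return np.mean(norm.sf(y * x_par * mu / x_perp))

    def choose_mu(eps, w, X, y, mu_max=1e4):
        # Solve E_S F_bar(y x_par mu / x_perp) = eps for mu > 0 by bracketing
        f = lambda mu: empirical_stochastic_error(mu, w, X, y) - eps
        return brentq(f, 1e-8, mu_max)

    rng = np.random.default_rng(1)
    w = np.array([1.0, 1.0])
    X = rng.normal(size=(200, 2))
    y = np.sign(X @ w)                                     # toy labels, separable by w
    print(choose_mu(0.1, w, X, y))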

44 (sketch) The theorem follows from a relaxation of Theorem 3. [sent-99, score-0.193]

45 In particular, we treat every example with a margin less than γ as an error and use the bounds ‖x_T‖ ≤ 1 and ‖x_∥‖ ≥ γ. [sent-101, score-0.552]

46 In particular, the choice of "prior" is not that arbitrary as the following lemma indicates. [sent-105, score-0.138]

47 Rotational invariance together with the dimension independence imply that for all i, j, x: p_i(x) = p_j(x), which implies that P(x) = ∏_{i=1}^{N} p(x_i) for some function p(·). [sent-109, score-0.207]

48 Applying rotational invariance, we have that P(x) = ψ(‖x‖²) = ∏_{i=1}^{N} p(x_i). This implies log ψ(‖x‖²) = Σ_i log p(x_i). Taking the derivative of this equation with respect to x_i gives 2 x_i ψ′(‖x‖²)/ψ(‖x‖²) = p′(x_i)/p(x_i); equating a function of ‖x‖² with a function of x_i alone forces p′(x_i)/p(x_i) = −A x_i for some constant A, so p(x_i) ∝ exp(−A x_i²/2). [sent-111, score-0.085]

49 _ The constant A in the previous lemma is a free parameter. [sent-114, score-0.097]

50 However, the results do not depend upon the precise value of A so we choose 1 for simplicity. [sent-115, score-0.1]

51 Some freedom in the choice of the "posterior" Q does exist and the results are dependent on this choice. [sent-116, score-0.129]

52 4 Margin Implies Margin Bound There are two methods for constructing a margin bound for the original averaging classifier. [sent-118, score-0.91]

53 The first method is simplest while the second is sometimes significantly tighter. [sent-119, score-0.061]

54 1 Simple Margin Bound First we note a trivial bound arising from a folk theorem and the relationship to our result. [sent-121, score-0.455]

55 1 (Simple Averaging bound) For any stochastic classifier with distribution Q and true error rate e_Q, the averaging classifier c_Q(x) = sign( ∫ h(x) dQ(h) ) has true error rate e(c_Q) ≤ 2 e_Q. Proof. [sent-123, score-1.152]

56 For every example (x,y), every time the averaging classifier errs, the probability of the stochastic classifier erring must be at least 1/2. [sent-124, score-1.099]
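
A toy Monte Carlo check of this factor-of-2 relationship (not an experiment from the paper; the choice of Q here, a Gaussian perturbation of w, is arbitrary and purely illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    n, m, k = 5, 2000, 201                    # dimension, test points, sampled classifiers
    w = rng.normal(size=n)
    X = rng.normal(size=(m, n))
    y = np.sign(X @ w)
    y[rng.random(m) < 0.1] *= -1              # inject 10% label noise

    W = w + 0.5 * rng.normal(size=(k, n))     # h ~ Q: illustrative Gaussian perturbations
    preds = np.sign(X @ W.T)                  # h(x) for every sampled h, shape (m, k)
    e_Q = np.mean(preds != y[:, None])        # Monte Carlo estimate of the stochastic error
    c_Q = np.sign(preds.mean(axis=1))         # averaging classifier c_Q(x) (majority vote)
    e_avg = np.mean(c_Q != y)
    print(e_avg, 2 * e_Q, e_avg <= 2 * e_Q)   # e(c_Q) <= 2 e_Q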

57 _ This result is interesting and of practical use when the empirical error rate of the original averaging classifier is low. [sent-125, score-0.83]

58 Furthermore, we can prove that cQ(x) is the original averaging classifier. [sent-126, score-0.297]

59 For any w̄ that classifies an input x differently from the averaging classifier, there is a unique equiprobable paired weight vector that agrees with the averaging classifier. [sent-133, score-0.746]

60 If v_w(x) ≠ 0, then there exists a nonzero measure of classifier pairs which always agree with the averaging classifier. [sent-135, score-0.736]

61 Condition (1) is met by reversing the sign of w̄_T and noting that either the original random vector or the reversed random vector must agree with the averaging classifier. [sent-136, score-0.451]

62 Condition (2) is met by the randomly drawn classifier w̄ = λw and nearby classifiers for any λ > 0. [sent-137, score-0.56]

63 Since the example is not on the hyperplane, there exists some small sphere of paired classifiers (in the sense of condition (1)). [sent-138, score-0.318]

64 The simple averaging bound is elegant, but it breaks down when the empirical error is large because e(c) ≤ 2 e_Q = 2(ê_Q + Δ_m) ≈ 2 ê_γ(c) + 2 Δ_m, where ê_Q is the empirical error rate of a stochastic classifier and Δ_m goes to zero as m → ∞. [sent-140, score-1.321]

65 Next, we construct a bound of the form e(c_Q) ≤ ê_γ(c) + Δ′_m, where Δ′_m > Δ_m but the empirical term is ê_γ(c) rather than 2 ê_γ(c). [sent-141, score-0.264]

66 2 A (Sometimes) Tighter Bound By altering our choice of μ [sent-143, score-0.041]

67 and our notion of "error" we can construct a bound which holds without randomization. [sent-144, score-0.264]

68 3 For all averaging classifiers c with normalized weights w, for all ε > 0 "extra" error rates and γ > 0 margins: Pr_{S~D^m}( ∀ε, w, γ: KL(ê_γ(c) + ε ‖ e(c) − ε) ≤ ( ln(1/F̄(2F̄⁻¹(ε)/γ)) + ln((m+1)/δ) ) / m ) ≥ 1 − δ, where KL(q‖p) = q ln(q/p) + (1 − q) ln((1 − q)/(1 − p)) = the Kullback-Leibler divergence between two coins of bias q < p. [sent-146, score-0.716]

69 The proof of this statement is strongly related to the proof given in [11] but noticeably simpler. [sent-147, score-0.15]

70 It is also very related to the proof of theorem 2. [sent-148, score-0.249]

71 (sketch) Instead of choosing w̄_∥ so that the empirical error rate is increased by ε, we instead choose w̄_∥ so that the number of margin violations at margin γ/2 is increased by at most ε. [sent-151, score-1.303]

72 This can be done by drawing w̄_∥ from a distribution such as R(2F̄⁻¹(ε)/γ). Applying the PAC-Bayes bound to this we reach a bound on the number of margin violations at γ/2 for the true distribution. [sent-152, score-0.974]

73 Open Problems The bound we have obtained here is considerably tighter than previous bounds for averaging classifiers; in fact it is tight enough to consider applying to real learning problems and using the results in decision making. [sent-155, score-0.752]

74 We expect that there exists some natural theorem which does well in both regimes simultaneously. [sent-160, score-0.249]

75 In order to verify that the margin bound is as tight as possible, it would also be instructive to study lower bounds. [sent-161, score-0.652]

76 Bartlett, "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network," IEEE Transactions on Information Theory, vol. [sent-166, score-0.032]

77 [2] Thomas Cover and Joy Thomas, "Elements of Information Theory," Wiley, New York, 1991. [sent-170, score-0.072]

78 [3] Ralf Herbrich and Thorsten Graepel, "A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs Work," in Advances in Neural Information Processing Systems 13, pages 224-230. [sent-172, score-0.072]

79 Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee, "Boosting the Margin: A new explanation for the effectiveness of voting methods," The Annals of Statistics, 26(5):1651-1686, 1998. [sent-193, score-0.048]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('margin', 0.382), ('classifier', 0.344), ('averaging', 0.297), ('bound', 0.231), ('kl', 0.225), ('theorem', 0.193), ('classifiers', 0.175), ('vw', 0.153), ('tail', 0.125), ('cq', 0.119), ('eq', 0.119), ('stochastic', 0.114), ('qllp', 0.107), ('wli', 0.107), ('wtxt', 0.107), ('pr', 0.107), ('tighter', 0.1), ('lemma', 0.097), ('coins', 0.093), ('vote', 0.093), ('xt', 0.088), ('perpendicular', 0.087), ('bounds', 0.085), ('error', 0.085), ('rectified', 0.085), ('rotational', 0.085), ('sign', 0.082), ('posterior', 0.08), ('violations', 0.079), ('langford', 0.075), ('invariance', 0.072), ('fuformation', 0.072), ('iin', 0.072), ('inp', 0.072), ('pjlll', 0.072), ('qiip', 0.072), ('xii', 0.072), ('yoav', 0.072), ('yxii', 0.072), ('matthias', 0.071), ('seeger', 0.068), ('wt', 0.068), ('portion', 0.065), ('ell', 0.062), ('implies', 0.062), ('empirical', 0.061), ('significantly', 0.061), ('bartlett', 0.058), ('choose', 0.057), ('mcallester', 0.057), ('exists', 0.056), ('proof', 0.056), ('true', 0.051), ('dependent', 0.05), ('tech', 0.05), ('sphere', 0.048), ('voting', 0.048), ('graepel', 0.045), ('parallel', 0.045), ('upon', 0.043), ('weight', 0.043), ('rate', 0.043), ('li', 0.042), ('herbrich', 0.042), ('freund', 0.042), ('argument', 0.042), ('met', 0.041), ('choice', 0.041), ('independence', 0.04), ('xi', 0.04), ('sketch', 0.039), ('paired', 0.039), ('agrees', 0.039), ('tight', 0.039), ('freedom', 0.038), ('isotropic', 0.038), ('divergence', 0.038), ('theorems', 0.037), ('schapire', 0.036), ('decompose', 0.035), ('john', 0.035), ('boosting', 0.034), ('bias', 0.034), ('robert', 0.034), ('ip', 0.033), ('errors', 0.033), ('dimension', 0.033), ('construct', 0.033), ('weights', 0.032), ('gaussian', 0.031), ('proximate', 0.031), ('jebara', 0.031), ('reversing', 0.031), ('folk', 0.031), ('equiprobable', 0.031), ('elle', 0.031), ('misclassified', 0.031), ('iogp', 0.031), ('adopted', 0.031), ('direction', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 161 nips-2002-PAC-Bayes & Margins

Author: John Langford, John Shawe-Taylor

Abstract: unknown-abstract

2 0.20015423 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

Author: empty-author

Abstract: We discuss the problem of ranking k instances with the use of a

3 0.19469313 140 nips-2002-Margin Analysis of the LVQ Algorithm

Author: Koby Crammer, Ran Gilad-bachrach, Amir Navot, Naftali Tishby

Abstract: Prototypes based algorithms are commonly used to reduce the computational complexity of Nearest-Neighbour (NN) classifiers. In this paper we discuss theoretical and algorithmical aspects of such algorithms. On the theory side, we present margin based generalization bounds that suggest that these kinds of classifiers can be more accurate then the 1-NN rule. Furthermore, we derived a training algorithm that selects a good set of prototypes using large margin principles. We also show that the 20 years old Learning Vector Quantization (LVQ) algorithm emerges naturally from our framework. 1

4 0.17322996 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

Author: Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik

Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. Keywords: use and refinement of prior knowledge, support vector machines, linear programming 1

5 0.15726754 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

Author: Ron Meir, Tong Zhang

Abstract: We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases, where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting. 1 Introduction and Motivation The standard approach to Computational Learning Theory is usually formulated within the so-called frequentist approach to Statistics. Within this paradigm one is interested in constructing an estimator, based on a finite sample, which possesses a small loss (generalization error). While many algorithms have been constructed and analyzed within this context, it is not clear how these approaches relate to standard optimality criteria within the frequentist framework. Two classic optimality criteria within the latter approach are the minimax and admissibility criteria, which characterize optimality of estimators in a rigorous and precise fashion [9]. Except in some special cases [12], it is not known whether any of the approaches used within the Learning community lead to optimality in either of the above senses of the word. On the other hand, it is known that under certain regularity conditions, Bayesian estimators lead to either minimax or admissible estimators, and thus to well-defined optimality in the classical (frequentist) sense. In fact, it can be shown that Bayes estimators are essentially the only estimators which can achieve optimality in the above senses [9]. This optimality feature provides strong motivation for the study of Bayesian approaches in a frequentist setting. While Bayesian approaches have been widely studied, there have not been generally applicable bounds in the frequentist framework. Recently, several approaches have attempted to address this problem. In this paper we establish finite sample datadependent bounds for Bayesian mixture methods, which together with the above optimality properties suggest that these approaches should become more widely used. Consider the problem of supervised learning where we attempt to construct an estimator based on a finite sample of pairs of examples S = {(x1 , y1 ), . . . , (xn , yn )}, each drawn independently according to an unknown distribution µ(x, y). Let A be a learning algorithm which, based on the sample S, constructs a hypothesis (estimator) h from some set of hypotheses H. Denoting by (y, h(x)) the instantaneous loss of the hypothesis h, we wish to assess the true loss L(h) = Eµ (y, h(x)) where the expectation is taken with respect to µ. In particular, the objective is to provide data-dependent bounds of the following form. For any h ∈ H and δ ∈ (0, 1), with probability at least 1 − δ, L(h) ≤ Λ(h, S) + ∆(h, S, δ), (1) where Λ(h, S) is some empirical assessment of the true loss, and ∆(h, S, δ) is a complexity term. 
For example, in the classic Vapnik-Chervonenkis framework, Λ(h, S) n is the empirical error (1/n) i=1 (yi , h(xi )) and ∆(h, S, δ) depends on the VCdimension of H but is independent of both the hypothesis h and the sample S. By algorithm and data-dependent bounds we mean bounds where the complexity term depends on both the hypothesis (chosen by the algorithm A) and the sample S. 2 A Decision Theoretic Bayesian Framework Consider a decision theoretic setting where we define the sample dependent loss of an algorithm A by R(µ, A, S) = Eµ (y, A(x, S)). Let θµ be the optimal predictor for y, namely the function minimizing Eµ { (y, φ(x))} over φ. It is clear that the best algorithm A (Bayes algorithm) is the one that always return θµ , assuming µ is known. We are interested in the expected loss of an algorithm averaged over samples S: R(µ, A) = ES R(µ, A, S) = R(µ, A, S)dµ(S), where the expectation is taken with respect to the sample S drawn i.i.d. from the probability measure µ. If we consider a family of measures µ, which possesses some underlying prior distribution π(µ), then we can construct the averaged risk function with respect to the prior as, r(π, A) = Eπ R(µ, A) = where dπ(µ|S) = dµ(S)dπ(µ) dµ(S)dπ(µ) dµ(S)dπ(µ) R(µ, A, S)dπ(µ|S), is the posterior distribution on the µ family, which µ induces a posterior distribution on the sample space as πS = Eπ(µ|S) µ. An algorithm minimizing the Bayes risk r(π, A) is referred to as a Bayes algorithm. In fact, for a given prior, and a given sample S, the optimal algorithm should return the Bayes optimal predictor with respect to the posterior measure πS . For many important practical problems, the optimal Bayes predictor is a linear functional of the underlying probability measure. For example, if the loss function is quadratic, namely (y, A(x)) = (y −A(x))2 , then the optimal Bayes predictor θµ (x) is the conditional mean of y, namely Eµ [y|x]. For binary classification problems, we can let the predictor be the conditional probability θµ (x) = µ(y = 1|x) (the optimal classification decision rule then corresponds to a test of whether θµ (x) > 0.5), which is also a linear functional of µ. Clearly if the Bayes predictor is a linear functional of the probability measure, then the optimal Bayes algorithm with respect to the prior π is given by AB (x, S) = θµ (x)dπ(µ|S) = µ θ (x)dµ(S)dπ(µ) µ µ µ dµ(S)dπ(µ) . (2) In this case, an optimal Bayesian algorithm can be regarded as the predictor constructed by averaging over all predictors with respect to a data-dependent posterior π(µ|S). We refer to such methods as Bayesian mixture methods. While the Bayes estimator AB (x, S) is optimal with respect to the Bayes risk r(π, A), it can be shown, that under appropriate conditions (and an appropriate prior) it is also a minimax and admissible estimator [9]. In general, θµ is unknown. Rather we may have some prior information about possible models for θµ . In view of (2) we consider a hypothesis space H, and an algorithm based on a mixture of hypotheses h ∈ H. This should be contrasted with classical approaches where an algorithm selects a single hypothesis h form H. For simplicity, we consider a countable hypothesis space H = {h1 , h2 , . . .}; the general case will be deferred to the full paper. Let q = {qj }∞ be a probability j=1 vector, namely qj ≥ 0 and j qj = 1, and construct the composite predictor by fq (x) = j qj hj (x). Observe that in general fq (x) may be a great deal more complex that any single hypothesis hj . 
For example, if hj (x) are non-polynomial ridge functions, the composite predictor f corresponds to a two-layer neural network with universal approximation power. We denote by Q the probability distribution defined by q, namely j qj hj = Eh∼Q h. A main feature of this work is the establishment of data-dependent bounds on L(Eh∼Q h), the loss of the Bayes mixture algorithm. There has been a flurry of recent activity concerning data-dependent bounds (a non-exhaustive list includes [2, 3, 5, 11, 13]). In a related vein, McAllester [7] provided a data-dependent bound for the so-called Gibbs algorithm, which selects a hypothesis at random from H based on the posterior distribution π(h|S). Essentially, this result provides a bound on the average error Eh∼Q L(h) rather than a bound on the error of the averaged hypothesis. Later, Langford et al. [6] extended this result to a mixture of classifiers using a margin-based loss function. A more general result can also be obtained using the covering number approach described in [14]. Finally, Herbrich and Graepel [4] showed that under certain conditions the bounds for the Gibbs classifier can be extended to a Bayesian mixture classifier. However, their bound contained an explicit dependence on the dimension (see Thm. 3 in [4]). Although the approach pioneered by McAllester came to be known as PAC-Bayes, this term is somewhat misleading since an optimal Bayesian method (in the decision theoretic framework outline above) does not average over loss functions but rather over hypotheses. In this regard, the learning behavior of a true Bayesian method is not addressed in the PAC-Bayes analysis. In this paper, we would like to narrow the discrepancy by analyzing Bayesian mixture methods, where we consider a predictor that is the average of a family of predictors with respect to a data-dependent posterior distribution. Bayesian mixtures can often be regarded as a good approximation to a true optimal Bayesian method. In fact, we have shown above that they are equivalent for many important practical problems. Therefore the main contribution of the present work is the extension of the above mentioned results in PAC-Bayes analysis to a rather unified setting for Bayesian mixture methods, where different regularization criteria may be incorporated, and their effect on the performance easily assessed. Furthermore, it is also essential that the bounds obtained are dimension-independent, since otherwise they yield useless results when applied to kernel-based methods, which often map the input space into a space of very high dimensionality. Similar results can also be obtained using the covering number analysis in [14]. However the approach presented in the current paper, which relies on the direct computation of the Rademacher complexity, is more direct and gives better bounds. The analysis is also easier to generalize than the corresponding covering number approach. Moreover, our analysis applies directly to other non-Bayesian mixture approaches such as Bagging and Boosting. Before moving to the derivation of our bounds, we formalize our approach. Consider a countable hypothesis space H = {hj }∞ , and a probability distribution {qj } over j=1 ∞ H. Introduce the vector notation k=1 qk hk (x) = q h(x). A learning algorithm within the Bayesian mixture framework uses the sample S to select a distribution Q over H and then constructs a mixture hypothesis fq (x) = q h(x). 
In order to constrain the class of mixtures used in constructing the mixture q h we impose constraints on the mixture vector q. Let g(q) be a non-negative convex function of q and define for any positive A, ΩA = {q ∈ S : g(q) ≤ A} ; FA = fq : fq (x) = q h(x) : q ∈ ΩA , (3) where S denotes the probability simplex. In subsequent sections we will consider different choices for g(q), which essentially acts as a regularization term. Finally, for any mixture q h we define the loss by L(q h) = Eµ (y, (q h)(x)) and the n ˆ empirical loss incurred on the sample by L(q h) = (1/n) i=1 (yi , (q h)(xi )). 3 A Mixture Algorithm with an Entropic Constraint In this section we consider an entropic constraint, which penalizes weights deviating significantly from some prior probability distribution ν = {νj }∞ , which may j=1 incorporate our prior information about he problem. The weights q themselves are chosen by the algorithm based on the data. In particular, in this section we set g(q) to be the Kullback-Leibler divergence of q from ν, g(q) = D(q ν) ; qj log(qj /νj ). D(q ν) = j Let F be a class of real-valued functions, and denote by σi independent Bernoulli random variables assuming the values ±1 with equal probability. We define the data-dependent Rademacher complexity of F as 1 ˆ Rn (F) = Eσ sup n f ∈F n σi f (xi ) |S . i=1 ˆ The expectation of Rn (F) with respect to S will be denoted by Rn (F). We note ˆ n (F) is concentrated around its mean value Rn (F) (e.g., Thm. 8 in [1]). We that R quote a slightly adapted result from [5]. Theorem 1 (Adapted from Theorem 1 in [5]) Let {x1 , x2 , . . . , xn } ∈ X be a sequence of points generated independently at random according to a probability distribution P , and let F be a class of measurable functions from X to R. Furthermore, let φ be a non-negative Lipschitz function with Lipschitz constant κ, such that φ◦f is uniformly bounded by a constant M . Then for all f ∈ F with probability at least 1 − δ Eφ(f (x)) − 1 n n φ(f (xi )) ≤ 4κRn (F) + M i=1 log(1/δ) . 2n An immediate consequence of Theorem 1 is the following. Lemma 3.1 Let the loss function be bounded by M , and assume that it is Lipschitz with constant κ. Then for all q ∈ ΩA with probability at least 1 − δ log(1/δ) . 2n ˆ L(q h) ≤ L(q h) + 4κRn (FA ) + M Next, we bound the empirical Rademacher average of FA using g(q) = D(q ν). Lemma 3.2 The empirical Rademacher complexity of FA is upper bounded as follows: 2A n ˆ Rn (FA ) ≤ 1 n sup j n hj (xi )2 . i=1 Proof: We first recall a few facts from the theory of convex duality [10]. Let p(u) be a convex function over a domain U , and set its dual s(z) = supu∈U u z − p(u) . It is known that s(z) is also convex. Setting u = q and p(q) = j qj log(qj /νj ) we find that s(v) = log j νj ezj . From the definition of s(z) it follows that for any q ∈ S, q z≤ qj log(qj /νj ) + log ν j ez j . j j Since z is arbitrary, we set z = (λ/n) i σi h(xi ) and conclude that for q ∈ ΩA and any λ > 0   n   1 λ 1 sup A + log νj exp σi q h(xi ) ≤ σi hj (xi ) .  n i=1 λ n i q∈ΩA j Taking the expectation with respect to σ, and using 2 Eσ {exp ( i σi ai )} ≤ exp i ai /2 , we have that  λ 1 ˆ νj exp A + Eσ log σi hj (xi ) Rn (FA ) ≤ λ n j ≤ ≤ = i 1 λ A + sup log Eσ exp 1 λ A + sup log exp j j A λ + 2 sup λ 2n j λ2 n2 λ n i σi hj (xi ) the Chernoff bound    (Jensen) i hj (xi )2 2 (Chernoff) hj (xi )2 . i Minimizing the r.h.s. with respect to λ, we obtain the desired result. Combining Lemmas 3.1 and 3.2 yields our basic bound, where κ and M are defined in Lemma 3.1. 
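
As an aside on the quantity defined above, the empirical Rademacher complexity of a finite base class can be estimated by straightforward Monte Carlo over the sign variables σ. The sketch below uses a made-up class of bounded functions and estimates R̂_n of the finite class itself, not of the constrained mixture class F_A analyzed in that paper:

    import numpy as np

    def empirical_rademacher(H_values, n_trials=2000, seed=0):
        # Monte Carlo estimate of E_sigma[ sup_h (1/n) sum_i sigma_i h(x_i) ]
        # for a finite class; H_values[j, i] = h_j(x_i).
        rng = np.random.default_rng(seed)
        k, n = H_values.shape
        total = 0.0
        for _ in range(n_trials):
            sigma = rng.choice([-1.0, 1.0], size=n)
            total += np.max(H_values @ sigma) / n
        return total / n_trials

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 3))                 # 50 hypothetical sample points
    W = rng.normal(size=(10, 3))                 # 10 hypothetical hypotheses
    H_values = np.tanh(W @ X.T)                  # bounded values h_j(x_i)
    print(empirical_rademacher(H_values))
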
Theorem 2 Let S = {(x1 , y1 ), . . . , (xn , yn )} be a sample of i.i.d. points each drawn according to a distribution µ(x, y). Let H be a countable hypothesis class, and set FA to be the class defined in (3) with g(q) = D(q ν). Set ∆H = (1/n)Eµ supj 1−δ n i=1 hj (xi )2 1/2 . Then for any q ∈ ΩA with probability at least ˆ L(q h) ≤ L(q h) + 4κ∆H 2A +M n log(1/δ) . 2n Note that if hj are uniformly bounded, hj ≤ c, then ∆H ≤ c. Theorem 2 holds for a fixed value of A. Using the so-called multiple testing Lemma (e.g. [11]) we obtain: Corollary 3.1 Let the assumptions of Theorem 2 hold, and let {Ai , pi } be a set of positive numbers such that i pi = 1. Then for all Ai and q ∈ ΩAi with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + 4κ∆H 2Ai +M n log(1/pi δ) . 2n Note that the only distinction with Theorem 2 is the extra factor of log pi which is the price paid for the uniformity of the bound. Finally, we present a data-dependent bound of the form (1). Theorem 3 Let the assumptions of Theorem 2 hold. Then for all q ∈ S with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + max(κ∆H , M ) × 130D(q ν) + log(1/δ) . n (4) Proof sketch Pick Ai = 2i and pi = 1/i(i + 1), i = 1, 2, . . . (note that i pi = 1). For each q, let i(q) be the smallest index for which Ai(q) ≥ D(q ν) implying that log(1/pi(q) ) ≤ 2 log log2 (4D(q ν)). A few lines of algebra, to be presented in the full paper, yield the desired result. The results of Theorem 3 can be compared to those derived by McAllester [8] for the randomized Gibbs procedure. In the latter case, the first term on the r.h.s. is ˆ Eh∼Q L(h), namely the average empirical error of the base classifiers h. In our case ˆ the corresponding term is L(Eh∼Q h), namely the empirical error of the average hypothesis. Since Eh∼Q h is potentially much more complex than any single h ∈ H, we expect that the empirical term in (4) is much smaller than the corresponding term in [8]. Moreover, the complexity term we obtain is in fact tighter than the corresponding term in [8] by a logarithmic factor in n (although the logarithmic factor in [8] could probably be eliminated). We thus expect that Bayesian mixture approach advocated here leads to better performance guarantees. Finally, we comment that Theorem 3 can be used to obtain so-called oracle inequalities. In particular, let q∗ be the optimal distribution minimizing L(q h), which can only be computed if the underlying distribution µ(x, y) is known. Consider an ˆ algorithm which, based only on the data, selects a distribution q by minimizing the r.h.s. of (4), with the implicit constants appropriately specified. Then, using standard approaches (e.g. [2]) we can obtain a bound on L(ˆ h) − L(q∗ h). For q lack of space, we defer the derivation of the precise bound to the full paper. 4 General Data-Dependent Bounds for Bayesian Mixtures The Kullback-Leibler divergence is but one way to incorporate prior information. In this section we extend the results to general convex regularization functions g(q). Some possible choices for g(q) besides the Kullback-Leibler divergence are the standard Lp norms q p . In order to proceed along the lines of Section 3, we let s(z) be the convex function associated with g(q), namely s(z) = supq∈ΩA q z − g(q) . Repeating n 1 the arguments of Section 3 we have for any λ > 0 that n i=1 σi q h(xi ) ≤ 1 λ i σi h(xi ) , which implies that λ A+s n 1 ˆ Rn (FA ) ≤ inf λ≥0 λ λ n A + Eσ s σi h(xi ) . 
(5) i n Assume that s(z) is second order differentiable, and that for any h = i=1 σi h(xi ) 1 2 (s(h + ∆h) + s(h − ∆h)) − s(h) ≤ u(∆h). Then, assuming that s(0) = 0, it is easy to show by induction that n n Eσ s (λ/n) i=1 σi h(xi ) ≤ u((λ/n)h(xi )). (6) i=1 In the remainder of the section we focus on the the case of regularization based on the Lp norm. Consider p and q such that 1/q + 1/p = 1, p ∈ (1, ∞), and let p = max(p, 2) and q = min(q, 2). Note that if p ≤ 2 then q ≥ 2, q = p = 2 and if p > 2 1 then q < 2, q = q, p = p. Consider p-norm regularization g(q) = p q p , in which p 1 case s(z) = q z q . The Rademacher averaging result for p-norm regularization q is known in the Geometric theory of Banach spaces (type structure of the Banach space), and it also follows from Khinchtine’s inequality. We show that it can be easily obtained in our framework. In this case, it is easy to see that s(z) = Substituting in (5) we have 1 ˆ Rn (FA ) ≤ inf λ≥0 λ A+ λ n q−1 q where Cq = ((q − 1)/q ) 1/q q 1 q z q q n h(xi ) q q = i=1 implies u(h(x)) ≤ Cq A1/p n1/p 1 n q−1 q h(x) q q . 1/q n h(xi ) q q i=1 . Combining this result with the methods described in Section 3, we establish a bound for regularization based on the Lp norm. Assume that h(xi ) q is finite for all i, and set ∆H,q = E (1/n) n i=1 h(xi ) q q 1/q . Theorem 4 Let the conditions of Theorem 3 hold and set g(q) = (1, ∞). Then for all q ∈ S, with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + max(κ∆H,q , M ) × O q p + n1/p log log( q p 1 p q p p , p ∈ + 3) + log(1/δ) n where O(·) hides a universal constant that depends only on p. 5 Discussion We have introduced and analyzed a class of regularized Bayesian mixture approaches, which construct complex composite estimators by combining hypotheses from some underlying hypothesis class using data-dependent weights. Such weighted averaging approaches have been used extensively within the Bayesian framework, as well as in more recent approaches such as Bagging and Boosting. While Bayesian methods are known, under favorable conditions, to lead to optimal estimators in a frequentist setting, their performance in agnostic settings, where no reliable assumptions can be made concerning the data generating mechanism, has not been well understood. Our data-dependent bounds allow the utilization of Bayesian mixture models in general settings, while at the same time taking advantage of the benefits of the Bayesian approach in terms of incorporation of prior knowledge. The bounds established, being independent of the cardinality of the underlying hypothesis space, can be directly applied to kernel based methods. Acknowledgments We thank Shimon Benjo for helpful discussions. The research of R.M. is partially supported by the fund for promotion of research at the Technion and by the Ollendorff foundation of the Electrical Engineering department at the Technion. References [1] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 224–240, 2001. [2] P.L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002. [3] O. Bousquet and A. Chapelle. Stability and generalization. J. Machine Learning Research, 2:499–526, 2002. [4] R. Herbrich and T. Graepel. A pac-bayesian margin bound for linear classifiers; why svms work. In Advances in Neural Information Processing Systems 13, pages 224–230, Cambridge, MA, 2001. MIT Press. 
[5] V. Koltchinksii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statis., 30(1), 2002. [6] J. Langford, M. Seeger, and N. Megiddo. An improved predictive accuracy bound for averaging classifiers. In Proceeding of the Eighteenth International Conference on Machine Learning, pages 290–297, 2001. [7] D. A. McAllester. Some pac-bayesian theorems. In Proceedings of the eleventh Annual conference on Computational learning theory, pages 230–234, New York, 1998. ACM Press. [8] D. A. McAllester. PAC-bayesian model averaging. In Proceedings of the twelfth Annual conference on Computational learning theory, New York, 1999. ACM Press. [9] C. P. Robert. The Bayesian Choice: A Decision Theoretic Motivation. Springer Verlag, New York, 1994. [10] R.T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, N.J., 1970. [11] J. Shawe-Taylor, P. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE trans. Inf. Theory, 44:1926– 1940, 1998. [12] Y. Yang. Minimax nonparametric classification - part I: rates of convergence. IEEE Trans. Inf. Theory, 45(7):2271–2284, 1999. [13] T. Zhang. Generalization performance of some learning problems in hilbert functional space. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2001. MIT Press. [14] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.

6 0.13452347 156 nips-2002-On the Complexity of Learning the Kernel Matrix

7 0.13135478 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

8 0.12839547 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

9 0.12407067 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

10 0.11418522 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

11 0.11213684 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

12 0.097539723 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

13 0.079476483 137 nips-2002-Location Estimation with a Differential Update Network

14 0.074928716 135 nips-2002-Learning with Multiple Labels

15 0.072629526 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

16 0.070021264 192 nips-2002-Support Vector Machines for Multiple-Instance Learning

17 0.069090776 46 nips-2002-Boosting Density Estimation

18 0.06722676 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

19 0.058983237 195 nips-2002-The Effect of Singularities in a Learning Machine when the True Parameters Do Not Lie on such Singularities

20 0.05756392 181 nips-2002-Self Supervised Boosting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.199), (1, -0.121), (2, -0.009), (3, -0.082), (4, 0.129), (5, -0.019), (6, -0.016), (7, 0.059), (8, -0.052), (9, -0.037), (10, 0.171), (11, -0.245), (12, -0.038), (13, -0.271), (14, -0.116), (15, -0.025), (16, 0.119), (17, -0.079), (18, 0.018), (19, 0.054), (20, 0.005), (21, 0.034), (22, 0.198), (23, 0.043), (24, -0.132), (25, 0.086), (26, 0.109), (27, 0.002), (28, 0.028), (29, -0.101), (30, 0.044), (31, -0.054), (32, -0.168), (33, -0.044), (34, 0.101), (35, 0.036), (36, 0.008), (37, 0.06), (38, -0.021), (39, -0.06), (40, -0.011), (41, 0.064), (42, -0.051), (43, -0.007), (44, -0.083), (45, -0.005), (46, -0.012), (47, 0.045), (48, 0.067), (49, -0.105)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97989774 161 nips-2002-PAC-Bayes & Margins

Author: John Langford, John Shawe-Taylor

Abstract: unknown-abstract

2 0.74213904 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

Author: empty-author

Abstract: We discuss the problem of ranking k instances with the use of a

3 0.73883557 140 nips-2002-Margin Analysis of the LVQ Algorithm

Author: Koby Crammer, Ran Gilad-bachrach, Amir Navot, Naftali Tishby

Abstract: Prototypes based algorithms are commonly used to reduce the computational complexity of Nearest-Neighbour (NN) classifiers. In this paper we discuss theoretical and algorithmical aspects of such algorithms. On the theory side, we present margin based generalization bounds that suggest that these kinds of classifiers can be more accurate then the 1-NN rule. Furthermore, we derived a training algorithm that selects a good set of prototypes using large margin principles. We also show that the 20 years old Learning Vector Quantization (LVQ) algorithm emerges naturally from our framework. 1

4 0.62946558 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

Author: Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik

Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. Keywords: use and refinement of prior knowledge, support vector machines, linear programming 1

5 0.60744429 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

Author: Fei Sha, Lawrence K. Saul, Daniel D. Lee

Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.

6 0.49421459 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

7 0.48632488 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

8 0.40370551 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

9 0.36532262 156 nips-2002-On the Complexity of Learning the Kernel Matrix

10 0.33898747 178 nips-2002-Robust Novelty Detection with Single-Class MPM

11 0.33601537 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

12 0.32061753 195 nips-2002-The Effect of Singularities in a Learning Machine when the True Parameters Do Not Lie on such Singularities

13 0.31012911 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

14 0.30859646 139 nips-2002-Margin-Based Algorithms for Information Filtering

15 0.29981038 137 nips-2002-Location Estimation with a Differential Update Network

16 0.28309605 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

17 0.26531711 192 nips-2002-Support Vector Machines for Multiple-Instance Learning

18 0.26044801 194 nips-2002-The Decision List Machine

19 0.25855011 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

20 0.25244811 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.011), (23, 0.015), (42, 0.053), (54, 0.178), (55, 0.039), (58, 0.262), (67, 0.03), (68, 0.015), (74, 0.075), (92, 0.1), (98, 0.147)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90503162 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation

Author: Ralf Schoknecht, Artur Merke

Abstract: Convergence for iterative reinforcement learning algorithms like TD(O) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows to analyse if a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data. 1

2 0.89288223 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain

Author: R. J. Vogelstein, Francesco Tenore, Ralf Philipp, Miriam S. Adlerstein, David H. Goldberg, Gert Cauwenberghs

Abstract: Address-event representation (AER), originally proposed as a means to communicate sparse neural events between neuromorphic chips, has proven efficient in implementing large-scale networks with arbitrary, configurable synaptic connectivity. In this work, we further extend the functionality of AER to implement arbitrary, configurable synaptic plasticity in the address domain. As proof of concept, we implement a biologically inspired form of spike timing-dependent plasticity (STDP) based on relative timing of events in an AER framework. Experimental results from an analog VLSI integrate-and-fire network demonstrate address domain learning in a task that requires neurons to group correlated inputs.

same-paper 3 0.86591899 161 nips-2002-PAC-Bayes & Margins

Author: John Langford, John Shawe-Taylor

Abstract: unknown-abstract

4 0.74620825 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

Author: Ralf Schoknecht

Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1

5 0.70821255 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray

Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks Statistical Models. Externally, AUTO BAYES has the look and feel of 2 const int n_points as ’nr. of data points’ a compiler. Users specify their model 3 with 0 < n_points; 4 const int n_classes := 3 as ’nr. classes’ of interest in a high-level specification 5 with 0 < n_classes language (as opposed to a program6 with n_classes << n_points; ming language). The figure shows the 7 double phi(1..n_classes) as ’weights’ specification of the mixture of Gaus8 with 1 = sum(I := 1..n_classes, phi(I)); 9 double mu(1..n_classes); sians example used throughout this 9 double sigma(1..n_classes); paper.2 Note the constraint that the 10 int c(1..n_points) as ’class labels’; sum of the class probabilities must 11 c ˜ disc(vec(I := 1..n_classes, phi(I))); equal one (line 8) along with others 12 data double x(1..n_points) as ’data’; (lines 3 and 5) that make optimization 13 x(I) ˜ gauss(mu(c(I)), sigma(c(I))); of the model well-defined. Also note 14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ; the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr with respect to the parameters , , and . Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem. 1 model mog as ’Mixture of Gaussians’; ¡   £  £  £ §¤¢ £ © ¨ ¦ ¥ ©   ¡     ¡ £ £ £ ¨ Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nead-Melder simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code’s comments. Also, n classes has been set to three (line 4), while n points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.3 Schema-based synthesis. 
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian netNclasses works. A first-order logic representation of Bayesian µ σ networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances φ x c of i.i.d. vectors) are represented as functor arguments. discrete gauss Nclasses Since unknown index values can be represented by Npoints implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: § ¥ ¨¦¡ ¡ ¢© Lemma 1. Let be sets of variables over a Bayesian network with . Then descendents and parents hold 4 in the corresponding dependency graph iff the following probability statement holds: £ ¤  ¡ parents B % % 9 C0A@ ! 9  @8 § ¥   ¢   2 ' % % 310  parents    ©¢   £ ¡ !    ' % #!  

6 0.69326723 27 nips-2002-An Impossibility Theorem for Clustering

7 0.69324398 140 nips-2002-Margin Analysis of the LVQ Algorithm

8 0.69113117 45 nips-2002-Boosted Dyadic Kernel Discriminants

9 0.69096762 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

10 0.69068706 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

11 0.68678325 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

12 0.68599403 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

13 0.68566096 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

14 0.68476158 53 nips-2002-Clustering with the Fisher Score

15 0.68221468 10 nips-2002-A Model for Learning Variance Components of Natural Images

16 0.68198961 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

17 0.68008453 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

18 0.68000567 6 nips-2002-A Formulation for Minimax Probability Machine Regression

19 0.67735898 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

20 0.6741035 106 nips-2002-Hyperkernels