nips nips2002 nips2002-45 knowledge-graph by maker-knowledge-mining

45 nips-2002-Boosted Dyadic Kernel Discriminants


Source: pdf

Author: Baback Moghaddam, Gregory Shakhnarovich

Abstract: We introduce a novel learning algorithm for binary classification with hyperplane discriminants based on pairs of training points from opposite classes (dyadic hypercuts). This algorithm is further extended to nonlinear discriminants using kernel functions satisfying Mercer’s conditions. An ensemble of simple dyadic hypercuts is learned incrementally by means of a confidence-rated version of AdaBoost, which provides a sound strategy for searching through the finite set of hypercut hypotheses. In experiments with real-world datasets from the UCI repository, the generalization performance of the hypercut classifiers was found to be comparable to that of SVMs and k-NN classifiers. Furthermore, the computational cost of classification (at run time) was found to be similar to, or better than, that of SVM. Similarly to SVMs, boosted dyadic kernel discriminants tend to maximize the margin (via AdaBoost). In contrast to SVMs, however, we offer an on-line and incremental learning machine for building kernel discriminants whose complexity (number of kernel evaluations) can be directly controlled (traded off for accuracy). 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We introduce a novel learning algorithm for binary classification with hyperplane discriminants based on pairs of training points from opposite classes (dyadic hypercuts). [sent-4, score-0.689]

2 This algorithm is further extended to nonlinear discriminants using kernel functions satisfying Mercer’s conditions. [sent-5, score-0.598]

3 An ensemble of simple dyadic hypercuts is learned incrementally by means of a confidence-rated version of AdaBoost, which provides a sound strategy for searching through the finite set of hypercut hypotheses. [sent-6, score-1.06]

4 In experiments with real-world datasets from the UCI repository, the generalization performance of the hypercut classifiers was found to be comparable to that of SVMs and k-NN classifiers. [sent-7, score-0.296]

5 Similarly to SVMs, boosted dyadic kernel discriminants tend to maximize the margin (via AdaBoost). [sent-9, score-1.043]

6 In contrast to SVMs, however, we offer an on-line and incremental learning machine for building kernel discriminants whose complexity (number of kernel evaluations) can be directly controlled (traded off for accuracy). [sent-10, score-0.675]

7 1 Introduction This paper introduces a novel algorithm for learning complex binary classifiers by superposition of simpler hyperplane-type discriminants. [sent-11, score-0.113]

8 In this algorithm, each of the simple discriminants is based on the projection of a test point onto a vector joining a dyad, defined as a pair of training data points with opposite labels. [sent-12, score-0.642]

9 The learning algorithm itself is based on a real-valued variant of AdaBoost [7], and the hyperplane classifiers use kernels of the type used, e.g., in SVMs. [sent-13, score-0.176]

10 When the concept class consists of linear discriminants (hyperplanes), this amounts to using a hyperplane orthogonal to the vector connecting the points in a dyad. [sent-16, score-0.609]

11 We shall refer to such a classifier as a hypercut. [sent-17, score-0.033]

12 By applying the same notion of linear hypercuts to a nonlinearly transformed feature space obtained by Mercer-type kernels [3], we are able to implement nonlinear kernel discriminants similar in form to SVMs. [sent-18, score-1.169]

13 In each iteration of AdaBoost, the space of all dyadic hypercuts is searched. [sent-19, score-0.694]

14 It can be easily shown that this hypothesis space spans the subspace of the data and that it must include the optimal hyperplane discriminant. [sent-20, score-0.156]

15 This notion is readily extended to non-linear classifiers obtained by kernel transformations, by noting that in the feature space, the optimal discriminant resides in the span of the transformed data. [sent-21, score-0.285]

16 Therefore, for both linear and nonlinear classification, searching the space of dyadic hypercuts forms an efficient strategy for exploring the space of all hypotheses. [sent-22, score-0.849]

17 1 Related work The most general framework to consider is the theory of potential functions for pattern classification [1], in which potential fields of the form H(x) = Σi αi yi K(x, xi) (1) are thresholded to predict classification labels, y = sign(H(x)). [sent-24, score-0.269]
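
To make the potential-function form concrete, here is a minimal illustrative sketch in Python/NumPy (not from the paper; the RBF choice of K and all names are assumptions) of thresholding Eq. (1):

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # One common Mercer kernel; the framework allows any valid K.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def potential(x, X_train, y_train, alpha, kernel=rbf_kernel):
    # H(x) = sum_i alpha_i * y_i * K(x, x_i)   -- Eq. (1)
    return sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y_train, X_train))

def classify(x, X_train, y_train, alpha, kernel=rbf_kernel):
    # y = sign(H(x))
    return np.sign(potential(x, X_train, y_train, alpha, kernel))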

18 This framework subsumes SVMs, which correspond to the simplest case F (α) = α. [sent-26, score-0.029]

19 Generalized linear models [6] can also be shown to be members of this class by considering logistic regression where F (α) becomes the binary entropy function and K is related to the covariance function of a Gaussian process classifier for the GLM’s intermediate variables. [sent-27, score-0.116]

20 In this paper we propose and design classifiers with dyadic discriminants, which have potential functions of the form H(x) = Σt αt [K(x, xp^t) − K(x, xn^t)] (3), where xp^t and xn^t are positively and negatively labeled data points, respectively. [sent-28, score-0.521]

21 The coefficients αt are determined not by minimizing a convex quadratic function J(α) but rather by selecting an optimal classifier in the t-th iteration of AdaBoost. [sent-29, score-0.03]

22 Thus the potential function is constrained to the form of a weighted sum of dyadic hypercuts, or differences of kernel functions. [sent-30, score-0.518]

23 Another way to view this is to think of a pair of opposite-polarity “basis vectors” sharing the same coefficient αt. [sent-31, score-0.111]
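
A hedged sketch of evaluating the dyadic potential of Eq. (3); the function and argument names are illustrative assumptions, not the authors' code. Each coefficient is shared by an opposite-polarity pair (xp, xn):

def dyadic_potential(x, dyads, alphas, kernel):
    # dyads: list of (xp, xn) pairs of positively / negatively labeled training points
    # H(x) = sum_t alpha_t * [K(x, xp^t) - K(x, xn^t)]   -- Eq. (3)
    return sum(a * (kernel(x, xp) - kernel(x, xn))
               for a, (xp, xn) in zip(alphas, dyads))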

24 The most closely related potential function technique to ours is that of SVMs [9], where the classification margin (and thus the bound on generalization) is maximized by a simultaneous optimization with respect to all of the training points. [sent-32, score-0.141]

25 However, there are important differences between SVMs and our iterative hypercut algorithm. [sent-33, score-0.296]

26 In each step of the boosting process, we do not maximize the margin of the resulting strong classifier directly, which makes for a much simpler optimization task. [sent-34, score-0.175]

27 Meanwhile, we are assured that with AdaBoost we tend to maximize (although in an asymptotic sense) the margin of the final classifier [7]. [sent-35, score-0.11]

28 The points in our dyads are not typically located near the decision boundary, as is the case with support vectors. [sent-37, score-0.039]

29 As a result, the final set of “basis vectors” used by the boosted strong classifier can be viewed as a representative subset of the data (i.e. those points needed for classification). [sent-38, score-0.151]

30 With SVMs, by contrast, the support vectors are simply the minimal number of training points needed to build (support) the decision boundary and are almost certainly not “typical” or high-likelihood members of either class. [sent-40, score-0.222]

31 The classification complexity of a kernel-based classifier — the cost of classifying a test point — depends on the number of kernel function evaluations on which the classifier is based. [sent-41, score-0.197]

32 In our boosted hypercut algorithm, however, the number of dyadic “basis vectors”, and therefore of the required kernel evaluations, is determined by the number of iterations of the boosting algorithm and can therefore be controlled. [sent-43, score-0.883]

33 Note that we are not referring here to the complexity of training classifiers, only to their run-time computational cost. [sent-44, score-0.064]

34 2 Methodology Consider a binary classification task where we are given a training set of vectors T = {x1, …, xM}. [sent-45, score-0.11]

35 Let there be Mp samples with label +1 and Mn samples with label −1 so that M = Mp + Mn . [sent-52, score-0.062]

36 Consider a simple linear hyperplane classifier defined by a discriminant function of the form f(x) = w · x + b (4), where sign(f(x)) ∈ {+1, −1} gives the binary classification. [sent-53, score-0.248]

37 Under certain assumptions, Gaussianity in particular, the optimal hyperplane, specified by the projection w∗ and bias b∗ , is easily computed using standard statistical techniques based on class means and sample covariances for linear classifiers. [sent-54, score-0.101]

38 However, in the absence of such assumptions, one must resort to searching for the optimal hyperplane. [sent-55, score-0.068]

39 When searching for w∗, an efficient strategy is to consider only hyperplanes whose surface normal is parallel to the line joining a dyad (xi, xj): wij = (xi − xj)/c, yi ≠ yj, i < j (5), where yi ≠ yj by definition, i < j for uniqueness, and c is a scale factor. [sent-56, score-0.841]

40 The vector wij is parallel to the line segment connecting the points in a dyad. [sent-57, score-0.247]

41 Setting c = ||xi − xj|| makes wij a unit-norm direction vector. [sent-58, score-0.296]

42 The hypothesis space to be searched consists of |{wij}| = Mp Mn hypercuts, each having a free bias parameter bij, which is typically determined by minimizing the weighted classification error (as we shall see in the next section). [sent-59, score-0.301]

43 Each hypothesis is then given by the sign of the discriminant as in (4): hij(x) = sign(wij · x + bij) (6). Let {hij} = {wij, bij} denote the complete set of hypercuts for a given training set. [sent-60, score-1.109]
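
A minimal sketch (assuming NumPy arrays; helper names are hypothetical, not from the paper) of enumerating the Mp Mn unit-norm hypercut directions of Eq. (5) and evaluating the hypothesis of Eq. (6):

import numpy as np

def hypercut_directions(X_pos, X_neg):
    # One unit-norm direction per dyad (xi, xj) with opposite labels, Eq. (5).
    dyads, directions = [], []
    for xi in X_pos:
        for xj in X_neg:
            w = xi - xj
            w = w / np.linalg.norm(w)   # c = ||xi - xj|| makes w unit-norm
            dyads.append((xi, xj))
            directions.append(w)
    return dyads, directions

def hypercut(x, w, b):
    # hij(x) = sign(wij . x + bij)     -- Eq. (6)
    return np.sign(np.dot(w, x) + b)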

44 Strictly speaking, this set is uncountable since bij is continuous and arbitrary. [sent-61, score-0.129]

45 However, since we always select one bias parameter for each hypercut wij, we do in fact end up with only Mp Mn classifiers. [sent-62, score-0.331]

46 2 Although unrelated to our technique, the Relevance Vector machine [8] is another kernel learning algorithm that tends to produce “prototypical” basis vectors in the interior as opposed to the boundary of the distributions. [sent-63, score-0.32]

47 1 AdaBoost The AdaBoost algorithm [4] provides a practical framework for combining a number of weak classifiers into a strong final classifier by means of linear combination and thresholding. [sent-65, score-0.096]

48 AdaBoost works by maintaining over the training set an iteratively evolving distribution (weights) Dt(i) based on the difficulty of classification. [sent-66, score-0.058]

49 Points which are harder to classify receive greater weight. [sent-68, score-0.064]

50 Consequently, a “weak” hypothesis h(x) : x → {+1, −1} will have classification error εt weighted by Dt. [sent-69, score-0.077]

51 In our case, in each iteration t, we select from the complete set of Mp Mn hypercuts {hij} one which minimizes εt. [sent-70, score-0.4]
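
A sketch of the per-round weak-learner search, under the assumption (mine, not the paper's) that the bias is chosen from thresholds midway between sorted projections so as to minimize the Dt-weighted error; it reuses the hypothetical hypercut_directions output from the sketch above:

import numpy as np

def weighted_error(h_vals, y, D):
    # Dt-weighted misclassification error of a binary hypothesis.
    return sum(d for h, yi, d in zip(h_vals, y, D) if h != yi)

def best_hypercut(X, y, D, directions):
    # Search all Mp*Mn hypercut directions; for each, pick the bias bij
    # minimizing the weighted error, and return the best weak hypothesis.
    best = None
    for w in directions:
        proj = np.array([np.dot(w, x) for x in X])
        cuts = np.sort(proj)
        thresholds = (cuts[:-1] + cuts[1:]) / 2.0   # candidate cut points
        for t in thresholds:
            h_vals = np.sign(proj - t)              # bias b = -t
            err = weighted_error(h_vals, y, D)
            if best is None or err < best[0]:
                best = (err, w, -t)
    return best                                      # (epsilon_t, w_t, b_t)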

52 The final classifier is a linear combination of the selected weak classifiers ht and has the form of a weighted “voting” scheme H(x) = sign(Σt=1..T αt ht(x)) (7), where αt = ½ ln((1 − εt)/εt). [sent-72, score-0.469]

53 In [7] a framework was developed where ht(x) can be real-valued (as opposed to binary) and is interpreted as a “confidence-rated prediction.” [sent-73, score-0.177]

54 The sign of ht(x) is the predicted label while the magnitude |ht(x)| is the confidence. [sent-74, score-0.398]

55 For such real-valued classifiers we have αt = ½ ln((1 + rt)/(1 − rt)), where the “correlation” rt satisfies εt = (1 − rt)/2. [sent-75, score-0.188]
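
A hedged sketch of the overall boosting loop in the binary-valued case (the confidence-rated variant would use rt as above); it builds on the hypothetical best_hypercut helper sketched earlier and assumes X is an (M, d) NumPy array with labels y in {+1, −1}:

import numpy as np

def boost_hypercuts(X, y, directions, T):
    M = len(X)
    D = np.ones(M) / M                          # uniform initial distribution
    ensemble = []                               # list of (alpha_t, w_t, b_t)
    for _ in range(T):
        eps, w, b = best_hypercut(X, y, D, directions)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)    # numerical guard
        alpha = 0.5 * np.log((1 - eps) / eps)
        h = np.sign(X @ w + b)
        D = D * np.exp(-alpha * y * h)          # harder points gain weight
        D = D / D.sum()
        ensemble.append((alpha, w, b))
    return ensemble

def strong_classify(x, ensemble):
    # H(x) = sign( sum_t alpha_t ht(x) )        -- Eq. (7)
    return np.sign(sum(a * np.sign(np.dot(w, x) + b) for a, w, b in ensemble))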

56 In the resulting “reproducing kernel Hilbert spaces”, dot products between high-dimensional mappings Φ : X → F are easily evaluated using Mercer kernels k(x, x′) = Φ(x) · Φ(x′) (9). [sent-78, score-0.261]

57 This has the desirable property that any algorithm based on dot products, e.g. [sent-79, score-0.03]

58 our linear hypercut classifier (6), can first nonlinearly transform its inputs (using kernels) and implicitly perform dot-products in the transformed space. [sent-81, score-0.434]

59 The preimage of the linear hyperplane solution back in the input space is thus a nonlinear hypersurface. [sent-82, score-0.198]
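
In feature space the hypercut direction becomes wij = (Φ(xi) − Φ(xj))/c, so projecting a test point requires only two kernel evaluations per dyad. A minimal sketch (illustrative names; taking c as the feature-space distance is one reasonable choice, not necessarily the paper's):

import numpy as np

def kernel_hypercut(x, xi, xj, b, kernel):
    # (Phi(xi) - Phi(xj)) . Phi(x) evaluated via the kernel trick, then thresholded.
    c = np.sqrt(kernel(xi, xi) - 2.0 * kernel(xi, xj) + kernel(xj, xj))  # ||Phi(xi) - Phi(xj)||
    return np.sign((kernel(xi, x) - kernel(xj, x)) / c + b)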


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('discriminants', 0.407), ('hypercuts', 0.37), ('hypercut', 0.296), ('dyadic', 0.294), ('classi', 0.274), ('wij', 0.172), ('adaboost', 0.158), ('ht', 0.141), ('mp', 0.137), ('kernel', 0.134), ('svms', 0.129), ('bij', 0.129), ('mn', 0.123), ('boosted', 0.123), ('er', 0.123), ('hyperplane', 0.113), ('ers', 0.099), ('rt', 0.094), ('yi', 0.091), ('hij', 0.088), ('dt', 0.087), ('sign', 0.085), ('baback', 0.074), ('dyad', 0.074), ('joining', 0.074), ('yj', 0.071), ('cation', 0.069), ('searching', 0.068), ('xi', 0.066), ('gregory', 0.064), ('erences', 0.064), ('evaluations', 0.063), ('kernels', 0.063), ('discriminant', 0.061), ('polarity', 0.059), ('transformed', 0.058), ('xj', 0.058), ('nonlinear', 0.057), ('potential', 0.056), ('margin', 0.053), ('opposite', 0.052), ('nonlinearly', 0.052), ('di', 0.046), ('binary', 0.046), ('xp', 0.045), ('hyperplanes', 0.044), ('coe', 0.044), ('hypothesis', 0.043), ('members', 0.042), ('nal', 0.041), ('superposition', 0.041), ('weak', 0.04), ('mercer', 0.04), ('points', 0.039), ('boundary', 0.038), ('projection', 0.038), ('connecting', 0.036), ('ln', 0.036), ('boosting', 0.036), ('opposed', 0.036), ('bias', 0.035), ('products', 0.034), ('weighted', 0.034), ('shall', 0.033), ('vectors', 0.032), ('prototypical', 0.032), ('resides', 0.032), ('broadway', 0.032), ('glm', 0.032), ('referring', 0.032), ('traded', 0.032), ('maximize', 0.032), ('strategy', 0.032), ('training', 0.032), ('label', 0.031), ('dot', 0.03), ('iteration', 0.03), ('mis', 0.029), ('subsumes', 0.029), ('meanwhile', 0.029), ('mitsubishi', 0.029), ('strong', 0.028), ('xn', 0.028), ('laboratory', 0.028), ('linear', 0.028), ('interior', 0.027), ('searched', 0.027), ('unrelated', 0.027), ('charges', 0.027), ('electrostatic', 0.027), ('simpler', 0.026), ('electric', 0.026), ('evolving', 0.026), ('basis', 0.026), ('concept', 0.025), ('usa', 0.025), ('uniqueness', 0.025), ('negatively', 0.025), ('harder', 0.025), ('assured', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 45 nips-2002-Boosted Dyadic Kernel Discriminants

Author: Baback Moghaddam, Gregory Shakhnarovich

Abstract: We introduce a novel learning algorithm for binary classification with hyperplane discriminants based on pairs of training points from opposite classes (dyadic hypercuts). This algorithm is further extended to nonlinear discriminants using kernel functions satisfying Mercer’s conditions. An ensemble of simple dyadic hypercuts is learned incrementally by means of a confidence-rated version of AdaBoost, which provides a sound strategy for searching through the finite set of hypercut hypotheses. In experiments with real-world datasets from the UCI repository, the generalization performance of the hypercut classifiers was found to be comparable to that of SVMs and k-NN classifiers. Furthermore, the computational cost of classification (at run time) was found to be similar to, or better than, that of SVM. Similarly to SVMs, boosted dyadic kernel discriminants tend to maximize the margin (via AdaBoost). In contrast to SVMs, however, we offer an on-line and incremental learning machine for building kernel discriminants whose complexity (number of kernel evaluations) can be directly controlled (traded off for accuracy). 1

2 0.23522666 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization

Author: Clayton Scott, Robert Nowak

Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.

3 0.22066589 92 nips-2002-FloatBoost Learning for Classification

Author: Stan Z. Li, Zhenqiu Zhang, Heung-yeung Shum, Hongjiang Zhang

Abstract: AdaBoost [3] minimizes an upper error bound which is an exponential function of the margin on the training set [14]. However, the ultimate goal in applications of pattern classification is always minimum error rate. On the other hand, AdaBoost needs an effective procedure for learning weak classifiers, which by itself is difficult especially for high dimensional data. In this paper, we present a novel procedure, called FloatBoost, for learning a better boosted classifier. FloatBoost uses a backtrack mechanism after each iteration of AdaBoost to remove weak classifiers which cause higher error rates. The resulting float-boosted classifier consists of fewer weak classifiers yet achieves lower error rates than AdaBoost in both training and test. We also propose a statistical model for learning weak classifiers, based on a stagewise approximation of the posterior using an overcomplete set of scalar features. Experimental comparisons of FloatBoost and AdaBoost are provided through a difficult classification problem, face detection, where the goal is to learn from training examples a highly nonlinear classifier to differentiate between face and nonface patterns in a high dimensional space. The results clearly demonstrate the promises made by FloatBoost over AdaBoost.

4 0.21837702 120 nips-2002-Kernel Design Using Boosting

Author: Koby Crammer, Joseph Keshet, Yoram Singer

Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel.

5 0.19248277 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

Author: Sepp Hochreiter, Klaus Obermayer

Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1

6 0.16744295 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

7 0.13056242 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

8 0.13051184 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems

9 0.12839383 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

10 0.12307568 156 nips-2002-On the Complexity of Learning the Kernel Matrix

11 0.11819243 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

12 0.11013524 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

13 0.10965604 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

14 0.10423885 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

15 0.1013143 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging

16 0.098847449 145 nips-2002-Mismatch String Kernels for SVM Protein Classification

17 0.09424638 46 nips-2002-Boosting Density Estimation

18 0.093504816 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

19 0.090202764 55 nips-2002-Combining Features for BCI

20 0.089545935 52 nips-2002-Cluster Kernels for Semi-Supervised Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.248), (1, -0.182), (2, 0.148), (3, -0.107), (4, 0.272), (5, -0.06), (6, 0.054), (7, -0.012), (8, 0.04), (9, -0.06), (10, -0.065), (11, 0.051), (12, 0.017), (13, 0.036), (14, 0.116), (15, -0.011), (16, -0.038), (17, 0.054), (18, 0.024), (19, 0.029), (20, -0.033), (21, 0.063), (22, -0.029), (23, -0.074), (24, 0.028), (25, 0.098), (26, -0.045), (27, 0.036), (28, -0.011), (29, -0.14), (30, -0.031), (31, 0.092), (32, 0.021), (33, 0.142), (34, 0.133), (35, -0.127), (36, -0.044), (37, 0.011), (38, 0.003), (39, -0.065), (40, -0.037), (41, -0.091), (42, -0.022), (43, -0.05), (44, 0.019), (45, 0.028), (46, -0.143), (47, -0.125), (48, 0.082), (49, -0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9409644 45 nips-2002-Boosted Dyadic Kernel Discriminants

Author: Baback Moghaddam, Gregory Shakhnarovich

Abstract: We introduce a novel learning algorithm for binary classification with hyperplane discriminants based on pairs of training points from opposite classes (dyadic hypercuts). This algorithm is further extended to nonlinear discriminants using kernel functions satisfying Mercer’s conditions. An ensemble of simple dyadic hypercuts is learned incrementally by means of a confidence-rated version of AdaBoost, which provides a sound strategy for searching through the finite set of hypercut hypotheses. In experiments with real-world datasets from the UCI repository, the generalization performance of the hypercut classifiers was found to be comparable to that of SVMs and k-NN classifiers. Furthermore, the computational cost of classification (at run time) was found to be similar to, or better than, that of SVM. Similarly to SVMs, boosted dyadic kernel discriminants tend to maximize the margin (via AdaBoost). In contrast to SVMs, however, we offer an on-line and incremental learning machine for building kernel discriminants whose complexity (number of kernel evaluations) can be directly controlled (traded off for accuracy). 1

2 0.74247986 92 nips-2002-FloatBoost Learning for Classification

Author: Stan Z. Li, Zhenqiu Zhang, Heung-yeung Shum, Hongjiang Zhang

Abstract: AdaBoost [3] minimizes an upper error bound which is an exponential function of the margin on the training set [14]. However, the ultimate goal in applications of pattern classification is always minimum error rate. On the other hand, AdaBoost needs an effective procedure for learning weak classifiers, which by itself is difficult especially for high dimensional data. In this paper, we present a novel procedure, called FloatBoost, for learning a better boosted classifier. FloatBoost uses a backtrack mechanism after each iteration of AdaBoost to remove weak classifiers which cause higher error rates. The resulting float-boosted classifier consists of fewer weak classifiers yet achieves lower error rates than AdaBoost in both training and test. We also propose a statistical model for learning weak classifiers, based on a stagewise approximation of the posterior using an overcomplete set of scalar features. Experimental comparisons of FloatBoost and AdaBoost are provided through a difficult classification problem, face detection, where the goal is to learn from training examples a highly nonlinear classifier to differentiate between face and nonface patterns in a high dimensional space. The results clearly demonstrate the promises made by FloatBoost over AdaBoost.

3 0.72702026 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization

Author: Clayton Scott, Robert Nowak

Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.

4 0.67448163 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

Author: Sariel Har-Peled, Dan Roth, Dav Zimak

Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.

5 0.63251323 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems

Author: Sepp Hochreiter, Michael C. Mozer, Klaus Obermayer

Abstract: We introduce a family of classifiers based on a physical analogy to an electrostatic system of charged conductors. The family, called Coulomb classifiers, includes the two best-known support-vector machines (SVMs), the ν–SVM and the C–SVM. In the electrostatics analogy, a training example corresponds to a charged conductor at a given location in space, the classification function corresponds to the electrostatic potential function, and the training objective function corresponds to the Coulomb energy. The electrostatic framework provides not only a novel interpretation of existing algorithms and their interrelationships, but it suggests a variety of new methods for SVMs including kernels that bridge the gap between polynomial and radial-basis functions, objective functions that do not require positive-definite kernels, regularization techniques that allow for the construction of an optimal classifier in Minkowski space. Based on the framework, we propose novel SVMs and perform simulation studies to show that they are comparable or superior to standard SVMs. The experiments include classification tasks on data which are represented in terms of their pairwise proximities, where a Coulomb Classifier outperformed standard SVMs. 1

6 0.60369712 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

7 0.58877087 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging

8 0.57808548 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

9 0.57591289 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study

10 0.57040811 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

11 0.52506411 120 nips-2002-Kernel Design Using Boosting

12 0.51756793 6 nips-2002-A Formulation for Minimax Probability Machine Regression

13 0.50985777 167 nips-2002-Rational Kernels

14 0.49508783 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

15 0.4891127 55 nips-2002-Combining Features for BCI

16 0.44742817 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

17 0.44417739 67 nips-2002-Discriminative Binaural Sound Localization

18 0.442267 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations

19 0.43708017 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis

20 0.42500582 145 nips-2002-Mismatch String Kernels for SVM Protein Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.024), (11, 0.047), (23, 0.017), (42, 0.065), (54, 0.126), (55, 0.025), (68, 0.028), (74, 0.06), (79, 0.23), (92, 0.104), (98, 0.173)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93018109 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory

Author: Kenneth J. Malmberg, René Zeelenberg, Richard M. Shiffrin

Abstract: The benzodiazepine Midazolam causes dense, but temporary, anterograde amnesia, similar to that produced by hippocampal damage. Does the action of Midazolam on the hippocampus cause less storage, or less accurate storage, of information in episodic long-term memory? We used a simple variant of the REM model [18] to fit data collected by Hirshman, Fisher, Henthorn, Arndt, and Passannante [9] on the effects of Midazolam, study time, and normative word frequency on both yes-no and remember-know recognition memory. That a simple strength model fit well was contrary to the expectations of Hirshman et al. More important, within the Bayesian-based REM modeling framework, the data were consistent with the view that Midazolam causes less accurate storage, rather than less storage, of information in episodic memory.

2 0.87112236 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs

Author: Gianluca Pollastri, Pierre Baldi, Alessandro Vullo, Paolo Frasconi

Abstract: We develop and test new machine learning methods for the prediction of topological representations of protein structures in the form of coarse- or fine-grained contact or distance maps that are translation and rotation invariant. The methods are based on generalized input-output hidden Markov models (GIOHMMs) and generalized recursive neural networks (GRNNs). The methods are used to predict topology directly in the fine-grained case and, in the coarse-grained case, indirectly by first learning how to score candidate graphs and then using the scoring function to search the space of possible configurations. Computer simulations show that the predictors achieve state-of-the-art performance. 1
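For context, the contact maps referred to above are adjacency matrices over residues, conventionally obtained by thresholding pairwise Cα distances; a minimal sketch follows (the 8 Å cutoff and the random toy coordinates are illustrative assumptions, not the paper's data or settings).

```python
import numpy as np

def contact_map(coords, cutoff=8.0):
    """Binary contact map from an (n, 3) array of C-alpha coordinates:
    entry (i, j) is 1 iff residues i and j are within `cutoff` angstroms."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return (dist <= cutoff).astype(np.uint8)

# toy usage with a fake 50-residue chain (random walk in 3D)
rng = np.random.default_rng(3)
coords = np.cumsum(rng.normal(scale=2.0, size=(50, 3)), axis=0)
C = contact_map(coords, cutoff=8.0)
print(C.shape, C.sum())   # symmetric 50x50 matrix; diagonal is all ones
```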

3 0.83653814 69 nips-2002-Discriminative Learning for Label Sequences via Boosting

Author: Yasemin Altun, Thomas Hofmann, Mark Johnson

Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both conceptually and computationally. We also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Application areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. 1

same-paper 4 0.83313042 45 nips-2002-Boosted Dyadic Kernel Discriminants

Author: Baback Moghaddam, Gregory Shakhnarovich

Abstract: We introduce a novel learning algorithm for binary classification with hyperplane discriminants based on pairs of training points from opposite classes (dyadic hypercuts). This algorithm is further extended to nonlinear discriminants using kernel functions satisfying Mercer’s conditions. An ensemble of simple dyadic hypercuts is learned incrementally by means of a confidence-rated version of AdaBoost, which provides a sound strategy for searching through the finite set of hypercut hypotheses. In experiments with real-world datasets from the UCI repository, the generalization performance of the hypercut classifiers was found to be comparable to that of SVMs and k-NN classifiers. Furthermore, the computational cost of classification (at run time) was found to be similar to, or better than, that of SVM. Similarly to SVMs, boosted dyadic kernel discriminants tend to maximize the margin (via AdaBoost). In contrast to SVMs, however, we offer an on-line and incremental learning machine for building kernel discriminants whose complexity (number of kernel evaluations) can be directly controlled (traded off for accuracy). 1
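A schematic sketch consistent with the description above, assuming the weak hypothesis compares two kernel evaluations, h(x) = sign(K(x, x⁺) − K(x, x⁻)), and using plain discrete AdaBoost in place of the confidence-rated variant; the paper's actual hypothesis parameterization and weighting may differ.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * ((a - b) ** 2).sum(-1))

def best_dyad(X, y, w, gamma=0.5):
    """Greedy search over dyads (x_pos, x_neg): pick the pair whose
    hypothesis h(x) = sign(k(x, x_pos) - k(x, x_neg)) has the smallest
    weighted error under the current boosting weights w."""
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    best = None
    for i in pos:
        for j in neg:
            h = np.sign(rbf(X, X[i], gamma) - rbf(X, X[j], gamma))
            h[h == 0] = 1
            err = w[h != y].sum()
            if best is None or err < best[0]:
                best = (err, i, j, h)
    return best

def boost_dyads(X, y, rounds=10, gamma=0.5):
    """Discrete-AdaBoost loop over dyadic kernel stumps (a simplified
    stand-in for the confidence-rated version described above)."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        err, i, j, h = best_dyad(X, y, w, gamma)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        ensemble.append((alpha, i, j))
        w *= np.exp(-alpha * y * h)     # reweight misclassified points up
        w /= w.sum()
    return ensemble

def predict(ensemble, X, x, gamma=0.5):
    s = sum(a * np.sign(rbf(x[None], X[i], gamma)[0] - rbf(x[None], X[j], gamma)[0])
            for a, i, j in ensemble)
    return 1 if s >= 0 else -1

# toy usage: two Gaussian classes
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.r_[-np.ones(30), np.ones(30)]
ens = boost_dyads(X, y, rounds=5)
print(predict(ens, X, np.array([1.2, 0.8])))  # likely +1
```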

5 0.70851552 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray

Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTOBAYES system, which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1
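The EM algorithms for clustering mentioned at the end are the canonical target; for reference, a hand-written minimal EM fit of a 1-D Gaussian mixture is sketched below. This is not output of the described system — the initialization and fixed iteration count are arbitrary choices here.

```python
import numpy as np

def em_gmm_1d(x, k=3, iters=100, seed=0):
    """Plain EM for a 1-D mixture of k Gaussians: returns weights phi,
    means mu and standard deviations sigma (maximum-likelihood fit)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    phi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)     # init means at data points
    sigma = np.full(k, x.std() + 1e-6)
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = p(class j | x_i)
        d = (x[:, None] - mu[None, :]) / sigma[None, :]
        logp = -0.5 * d ** 2 - np.log(sigma)[None, :] + np.log(phi)[None, :]
        logp -= logp.max(axis=1, keepdims=True)   # stabilize before exp
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reestimate parameters from responsibilities
        nk = r.sum(axis=0) + 1e-12
        phi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk) + 1e-6
    return phi, mu, sigma

# toy usage: three well-separated components
rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(m, 0.5, 200) for m in (-4.0, 0.0, 4.0)])
phi, mu, sigma = em_gmm_1d(x, k=3)
print(np.sort(mu))   # roughly [-4, 0, 4]
```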

6 0.7046299 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

7 0.69479281 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages

8 0.6930325 41 nips-2002-Bayesian Monte Carlo

9 0.69282639 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

10 0.68782669 53 nips-2002-Clustering with the Fisher Score

11 0.68620813 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization

12 0.68560028 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions

13 0.68557334 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

14 0.68473691 110 nips-2002-Incremental Gaussian Processes

15 0.68386543 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

16 0.6822691 46 nips-2002-Boosting Density Estimation

17 0.68051374 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

18 0.68021291 6 nips-2002-A Formulation for Minimax Probability Machine Regression

19 0.67935467 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

20 0.67856902 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs