nips nips2003 nips2003-178 knowledge-graph by maker-knowledge-mining

178 nips-2003-Sparse Greedy Minimax Probability Machine Classification


Source: pdf

Author: Thomas R. Strohmann, Andrei Belitski, Gregory Z. Grudic, Dennis DeCoste

Abstract: The Minimax Probability Machine Classification (MPMC) framework [Lanckriet et al., 2002] builds classifiers by minimizing the maximum probability of misclassification, and gives direct estimates of the probabilistic accuracy bound Ω. The only assumptions that MPMC makes is that good estimates of means and covariance matrices of the classes exist. However, as with Support Vector Machines, MPMC is computationally expensive and requires extensive cross validation experiments to choose kernels and kernel parameters that give good performance. In this paper we address the computational cost of MPMC by proposing an algorithm that constructs nonlinear sparse MPMC (SMPMC) models by incrementally adding basis functions (i.e. kernels) one at a time – greedily selecting the next one that maximizes the accuracy bound Ω. SMPMC automatically chooses both kernel parameters and feature weights without using computationally expensive cross validation. Therefore the SMPMC algorithm simultaneously addresses the problem of kernel selection and feature selection (i.e. feature weighting), based solely on maximizing the accuracy bound Ω. Experimental results indicate that we can obtain reliable bounds Ω, as well as test set accuracies that are comparable to state of the art classification algorithms.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 , 2002] builds classifiers by minimizing the maximum probability of misclassification, and gives direct estimates of the probabilistic accuracy bound Ω. [sent-13, score-0.127]

2 However, as with Support Vector Machines, MPMC is computationally expensive and requires extensive cross validation experiments to choose kernels and kernel parameters that give good performance. [sent-15, score-0.228]

3 In this paper we address the computational cost of MPMC by proposing an algorithm that constructs nonlinear sparse MPMC (SMPMC) models by incrementally adding basis functions (i. [sent-16, score-0.222]

4 kernels) one at a time – greedily selecting the next one that maximizes the accuracy bound Ω. [sent-18, score-0.105]

5 SMPMC automatically chooses both kernel parameters and feature weights without using computationally expensive cross validation. [sent-19, score-0.19]

6 Therefore the SMPMC algorithm simultaneously addresses the problem of kernel selection and feature selection (i. [sent-20, score-0.211]

7 feature weighting), based solely on maximizing the accuracy bound Ω. [sent-22, score-0.126]

8 Experimental results indicate that we can obtain reliable bounds Ω, as well as test set accuracies that are comparable to state of the art classification algorithms. [sent-23, score-0.063]

9 1 Introduction The goal of a binary classifier is to maximize the probability that unseen test data will be classified correctly. [sent-24, score-0.06]

10 Assuming that the test data is generated from the same probability distribution as the training data, it is possible to derive specific probability bounds for the case that the decision boundary is a hyperplane. [sent-25, score-0.118]

11 From (1) we note that the only required relevant information of the underlying probability distribution for each class is its mean and covariance matrix. [sent-28, score-0.045]

12 No other estimates and/or assumptions are needed, which implies that the obtained bound (which we refer to as Ω) is essentially distribution free, i. [sent-29, score-0.06]

13 The goal of this paper is to propose a kernel based MPMC algorithm that directly addresses these computational issues. [sent-33, score-0.056]

14 Towards this end, we propose a sparse greedy MPMC (SMPMC) algorithm that efficiently builds classifiers, while at the same time maintaining the distribution free probability bound of MPM type algorithms. [sent-34, score-0.234]

15 To achieve this goal, we propose to use an iterative algorithm which adds basis functions (i. [sent-35, score-0.11]

16 We are considering basis functions that are induced by Mercer kernels, i. [sent-38, score-0.11]

17 functions of the following form f (z) = Kγ (z, zi ) (where zi is an input vector of the training data). [sent-40, score-0.113]

18 Bases are added in a greedy way: we select the particular zi that maximizes the MPMC objective Ω. [sent-41, score-0.084]

19 Furthermore, SMPMC chooses optimal kernel parameters that maximize this metric (hence the subscript γ in Kγ ), including automatically weighting input features by γj ≥ 0 for each kernel added, such that zi = (γ1 z1 , γ2 z2 , . [sent-42, score-0.199]

20 The proposed SMPMC algorithm automatically selects kernels and re-weights features (i. [sent-46, score-0.081]

21 does feature selection) for each newly added basis function, by minimizing the error bound (i. [sent-48, score-0.202]

22 Thus the large computational cost of cross validation (typically used by SVM and MPMC) is avoided. [sent-51, score-0.084]

23 2 describes the proposed sparse greedy MPMC algorithm (SMPMC); and Sections 2. [sent-54, score-0.104]

24 4 show how we can use sparse MPMC to determine optimal kernel parameters. [sent-56, score-0.122]

25 In section 3 we compare our results to the ones described in the original MPMC paper (see [4]), showing the probability bounds and the test set accuracies for different binary classification problems. [sent-57, score-0.103]

26 2 Classification model In this section we develop a sparse version of the Minimax Probability Machine for binary classification. [sent-63, score-0.082]

27 We show that besides a significant reduction in computational cost, the SMPMC algorithm allows us to do automated kernel and feature selection. [sent-64, score-0.119]

28 The goal of MPMC is to find a decision boundary $H(a, b) = \{z \mid a^T z = b\}$ such that the minimum probability $\Omega_H$ of classifying future data correctly is maximized. [sent-68, score-0.038]

29 Exploiting a theorem from Marshall and Olkin [1], it is possible to rewrite (2) as a closed form expression: $\Omega_H = \frac{1}{1 + m^2}$ (3), where $m = \min_a \sqrt{a^T \Sigma_x a} + \sqrt{a^T \Sigma_y a}$ s.t. [sent-70, score-0.043]

30 $a^T(\bar{x} - \bar{y}) = 1$ (4). The optimal hyperplane parameter $a_*$ is the vector that minimizes (4). [sent-72, score-0.058]

31 The hyperplane parameter $b_*$ can then be computed as $b_* = a_*^T \bar{x} - \frac{\sqrt{a_*^T \Sigma_x a_*}}{m}$ (5). A new data point $z_{new}$ is classified according to $\mathrm{sign}(a_*^T z_{new} - b_*)$; if this yields +1, $z_{new}$ is classified as belonging to class x, otherwise it is classified as belonging to class y. [sent-73, score-0.223]
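
The following is a minimal sketch (not the authors' implementation) of how the bound (3), the offset (5), and the sign decision rule above can be computed once an optimal direction a_* and empirical class statistics are available; all variable names are illustrative.

```python
# Minimal sketch of equations (3), (5) and the sign decision rule.
# Assumes a_star was already obtained by some solver and that mean_x, cov_x,
# mean_y, cov_y are empirical estimates of the class means and covariances.
import numpy as np

def mpmc_bound_and_offset(a_star, mean_x, cov_x, mean_y, cov_y):
    m = np.sqrt(a_star @ cov_x @ a_star) + np.sqrt(a_star @ cov_y @ a_star)
    omega = 1.0 / (1.0 + m ** 2)                                      # equation (3)
    b_star = a_star @ mean_x - np.sqrt(a_star @ cov_x @ a_star) / m   # equation (5)
    return omega, b_star

def classify(a_star, b_star, z_new):
    """sign(a_*' z - b_*): +1 -> class x, -1 -> class y."""
    return np.sign(z_new @ a_star - b_star)
```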

32 The models obtained from the kernelized MPMC, however, use all of the training examples (see [4]), i. [sent-76, score-0.055]

33 the decision hyperplane will look like: $\sum_{i=1}^{N_x} a_i^{(x)} K(x_i, z) + \sum_{i=1}^{N_y} a_i^{(y)} K(y_i, z) = b$ (6), where in general all $a_i^{(x)}, a_i^{(y)} \neq 0$. [sent-78, score-0.162]

34 This brings up the question whether one can construct sparse models for the MPMC where most of the coefficients $a_i^{(x)}$ or $a_i^{(y)}$ are zero. [sent-79, score-0.118]

35 In this paper we propose to do this by starting with an initially ”empty” model and then adding basis functions one by one. [sent-80, score-0.14]

36 As we will see shortly, this approach speeds up both learning and evaluation time while still maintaining the distribution free probability bound of the MPMC. [sent-81, score-0.102]

37 Before we outline the algorithm we introduce some notation: $N = N_x + N_y$ is the total number of training examples; $\epsilon^{(k)} = (\epsilon_1, \ldots$ [sent-82, score-0.055]

38 $\ldots, \epsilon_N)^T \in \mathbb{R}^N$ is the output of the model after adding the kth basis function; $a^{(k)}$ denotes the MPMC hyperplane coefficients when adding the kth basis function; $b^{(k)}$ denotes the MPMC hyperplane offset when adding the kth basis function; $K_v = (K_v(v, x_1), \ldots$ [sent-88, score-0.602]

39 $\ldots, K_v(v, y_{N_y}))^T$ is the basis function evaluated on all training examples (the empirical map); $K_x^v = (K_v(v, x_1), \ldots$ [sent-94, score-0.159]

40 $\ldots, K_v(v, x_{N_x}))^T$ is the basis function evaluated only on the positive examples; $K_y^v = (K_v(v, y_1), \ldots$ [sent-97, score-0.038]

41 $\ldots, K_v(v, y_{N_y}))^T$ is the basis function evaluated only on the negative examples. Note that $\epsilon^{(k)}$ is a vector of real numbers (the distances of the training data to the hyperplane before applying the sign function). [sent-100, score-0.13]

42 $v \in \mathbb{R}^d$ is the training vector generating the $k$th basis function $K_v^{(k)}$. [sent-101, score-0.121]

43 We will simply write $K^{(k)}$, $K_x^{(k)}$, $K_y^{(k)}$ for the kth basis function. [sent-102, score-0.132]

44 For the first basis we are solving the one dimensional MPMC: $m = \min_a \sqrt{a\,\sigma^2_{K_x^{(1)}}\,a} + \sqrt{a\,\sigma^2_{K_y^{(1)}}\,a}$ s.t. [sent-105, score-0.179]

45 $a(\bar{K}_x^{(1)} - \bar{K}_y^{(1)}) = 1$ (7), where $\bar{K}_x^{(1)}$ and $\sigma^2_{K_x^{(1)}}$ are the mean and variance of the vector $K_x^{(1)}$ (which is the first basis function evaluated on all positive training examples). [sent-107, score-0.138]
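
Because the constraint in (7) pins down the single coefficient a, the one-dimensional problem has a closed-form solution (the text refers to it as equation (8), which is not reproduced in the extracted sentences). A small sketch under that reading, with the offset computed by analogy with equation (5); the array names are illustrative.

```python
# One-dimensional MPMC for the first basis function (equation (7)).
# Kx1, Ky1: the candidate basis function evaluated on the positive and
# negative training examples, respectively (1-D NumPy arrays).
import numpy as np

def mpmc_1d(Kx1, Ky1):
    a = 1.0 / (Kx1.mean() - Ky1.mean())          # constraint a*(mean_x - mean_y) = 1
    m = abs(a) * (Kx1.std() + Ky1.std())         # sqrt(a^2 var_x) + sqrt(a^2 var_y)
    omega = 1.0 / (1.0 + m ** 2)
    b = a * Kx1.mean() - abs(a) * Kx1.std() / m  # offset, by analogy with equation (5)
    return a, b, omega
```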

46 We set up the two dimensional classification problem: $x^{(k+1)} = [\,\epsilon_x^{(k)},\, K_x^{(k+1)}\,] \in \mathbb{R}^{N_x \times 2}$, $y^{(k+1)} = [\,\epsilon_y^{(k)},\, K_y^{(k+1)}\,] \in \mathbb{R}^{N_y \times 2}$ (10), and solve the following optimization problem: $m = \min_a \sqrt{a^T \Sigma_{x^{(k+1)}}\, a} + \sqrt{a^T \Sigma_{y^{(k+1)}}\, a}$ s.t. $a^T(\bar{x}^{(k+1)} - \bar{y}^{(k+1)}) = 1$ (11). [sent-109, score-0.109]

47 Another benefit is that we only have to solve one- and two-dimensional MPMC problems. [sent-116, score-0.07]

48 As seen in (8) the one dimensional solution is trivial to compute. [sent-117, score-0.07]

49 An analysis of the two dimensional problem shows that it can be reduced to the problem of finding the roots of a fourth order polynomial. [sent-118, score-0.07]
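
The quartic-root solution mentioned above is not reproduced in the extracted sentences; since the problem in (10)-(11) is only two-dimensional, a generic constrained minimizer also works. The sketch below takes that route (an illustrative assumption, not the paper's solver; the names are made up for the example).

```python
# Numerical sketch of the two-dimensional MPMC step of equations (10)-(11).
# prev_out_x / prev_out_y: current model output eps^(k) on the two classes;
# new_basis_x / new_basis_y: the candidate basis function K^(k+1) on the two classes.
import numpy as np
from scipy.optimize import minimize

def mpmc_2d(prev_out_x, new_basis_x, prev_out_y, new_basis_y):
    X2 = np.column_stack([prev_out_x, new_basis_x])   # x^(k+1), shape (Nx, 2)
    Y2 = np.column_stack([prev_out_y, new_basis_y])   # y^(k+1), shape (Ny, 2)
    mx, my = X2.mean(axis=0), Y2.mean(axis=0)
    Sx, Sy = np.cov(X2, rowvar=False), np.cov(Y2, rowvar=False)
    diff = mx - my

    def m_of(a):
        return np.sqrt(a @ Sx @ a) + np.sqrt(a @ Sy @ a)

    a0 = diff / (diff @ diff)                         # feasible start: a0' diff = 1
    res = minimize(m_of, a0, method="SLSQP",
                   constraints={"type": "eq", "fun": lambda a: a @ diff - 1.0})
    a, m = res.x, m_of(res.x)
    b = a @ mx - np.sqrt(a @ Sx @ a) / m              # offset as in equation (5)
    return a, b, 1.0 / (1.0 + m ** 2)                 # coefficients, offset, Omega
```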

50 In the standard MPMC algorithm (see [4]), however, the solution a for equation (4) has N dimensions and can therefore only be found by expensive numerical methods. [sent-122, score-0.047]

51 It may seem that the values of $\Omega = 1/(1 + m^2)$ which we obtain from (11) are not true for the whole model, since we are considering only two dimensional problems and not all of the k + 1 dimensions we have added so far through our basis functions. [sent-123, score-0.195]

52 But it turns out that the "local" bound (from the 2D MPMC) is indeed equal to the "global" bound (when considering all k + 1 dimensions). [sent-124, score-0.12]

53 Let $\epsilon^{(k)} = c_0 + c_1 K^{(1)} + \ldots + c_k K^{(k)}$ be the sparse MPMC model at the kth iteration ($k \geq 1$) and let $a_1^{(k+1)}, a_2^{(k+1)}, b^{(k+1)}$ be the solution of the two dimensional MPMC: $\epsilon^{(k+1)} = a_1^{(k+1)}\,\epsilon^{(k)} + a_2^{(k+1)}\,K^{(k+1)} - b^{(k+1)}$. [sent-128, score-0.237]

54 Then the values of Ω for the two dimensional MPMC and for the k + 1 dimensional MPMC are the same. [sent-129, score-0.14]
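
For readability, here is a display-form restatement of the theorem as it can be pieced together from the fragments above; the exact symbols and indexing are a reconstruction, not a quotation.

```latex
\textbf{Theorem 1 (restated).} Let
\[
  \epsilon^{(k)} = c_0 + c_1 K^{(1)} + \dots + c_k K^{(k)}, \qquad k \ge 1,
\]
be the sparse MPMC model at the $k$th iteration, and let
$(a_1^{(k+1)}, a_2^{(k+1)}, b^{(k+1)})$ solve the two-dimensional MPMC
\[
  \epsilon^{(k+1)} = a_1^{(k+1)}\,\epsilon^{(k)} + a_2^{(k+1)}\,K^{(k+1)} - b^{(k+1)}.
\]
Then the value of $\Omega$ of this two-dimensional MPMC equals the value of $\Omega$
of the $(k+1)$-dimensional MPMC built on $K^{(1)}, \dots, K^{(k+1)}$.
```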

55 3 Selection of bases and Gaussian Kernel widths In our experiments we are using the Gaussian kernel: $K_\sigma(u, v) = \exp\!\left(-\frac{\|u - v\|_2^2}{2\sigma^2}\right)$ (14), where $\sigma$ is the so-called kernel width. [sent-131, score-0.164]
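
A small helper for equation (14) and for the empirical-map columns $K_v$ introduced in the notation above (a sketch with illustrative names):

```python
# Gaussian kernel of equation (14) and the induced basis column K_v.
import numpy as np

def gaussian_kernel(u, v, sigma):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def basis_column(v, Z, sigma):
    """Evaluate K_sigma(v, .) on every row of Z (the empirical map K_v)."""
    return np.exp(-np.sum((Z - v) ** 2, axis=1) / (2.0 * sigma ** 2))
```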

56 As mentioned before, one typically has to choose σ manually or determine it by cross validation (see [4]). [sent-132, score-0.068]

57 The SMPMC algorithm greedily selects a basis function – out of a randomly chosen candidate set – to maximize Ω which is equivalent to minimizing the value of m in (7) and (11). [sent-133, score-0.144]
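
As an illustration of this greedy step, here is a sketch of the first iteration under simplifying assumptions (fixed kernel width, illustrative names): draw a handful of candidate training points, evaluate each induced basis function, and keep the one whose one-dimensional MPMC gives the largest Ω, i.e. the smallest m.

```python
# Greedy selection of the first basis function from a random candidate set.
import numpy as np

def omega_1d(kx, ky):
    m = (kx.std() + ky.std()) / abs(kx.mean() - ky.mean())   # from equation (7)
    return 1.0 / (1.0 + m ** 2)

def select_first_basis(X, Y, sigma, n_candidates=5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    data = np.vstack([X, Y])
    idx = rng.choice(len(data), size=n_candidates, replace=False)
    best_v, best_omega = None, -np.inf
    for v in data[idx]:
        kx = np.exp(-np.sum((X - v) ** 2, axis=1) / (2.0 * sigma ** 2))
        ky = np.exp(-np.sum((Y - v) ** 2, axis=1) / (2.0 * sigma ** 2))
        omega = omega_1d(kx, ky)
        if omega > best_omega:
            best_v, best_omega = v, omega
    return best_v, best_omega
```

Subsequent iterations select candidates the same way, but score them with the two-dimensional problem (10)-(11) instead of the one-dimensional one.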

58 $a(\bar{K}_x^{(1)} - \bar{K}_y^{(1)}) = 1$ (16); note that – even though we did not state it explicitly – the statistics $\sigma^2_{K_x^{(1)}}$, $\sigma^2_{K_y^{(1)}}$, $\bar{K}_x^{(1)}$, and $\bar{K}_y^{(1)}$ (and consequently the coefficient $a$) all depend on the kernel parameter $\gamma$. [sent-136, score-0.056]

59 The two dimensional problem that has to be solved for all subsequent iterations $k \geq 2$ turns into the following optimization problem for $\gamma$: $\min_\gamma m(\gamma) = \min_\gamma \min_a \sqrt{a^T \Sigma_{x^{(k+1)}}\, a} + \sqrt{a^T \Sigma_{y^{(k+1)}}\, a}$ s.t. [sent-137, score-0.131]

60 $a^T(\bar{x}^{(k+1)} - \bar{y}^{(k+1)}) = 1$ (17). Again, $\bar{x}^{(k+1)}$, $\bar{y}^{(k+1)}$, $\Sigma_{x^{(k+1)}}$, and $\Sigma_{y^{(k+1)}}$ all depend on the kernel parameter $\gamma$, and from these four statistics we can compute the minimizer $a \in \mathbb{R}^2$ analytically. [sent-139, score-0.056]
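
A sketch of the width search for the first basis, i.e. the case of equation (16). The parameterization $K_\gamma(u,v)=\exp(-\gamma\|u-v\|^2)$ (so $\gamma \equiv 1/(2\sigma^2)$) and the search bounds are assumptions, and the closed-form one-dimensional inner solution stands in for the analytic minimizer used in the paper.

```python
# Choose the kernel parameter gamma for a fixed basis point v by minimizing m(gamma).
import numpy as np
from scipy.optimize import minimize_scalar

def best_width(v, X, Y, log_gamma_bounds=(-6.0, 6.0)):
    def m_of(log_gamma):
        gamma = np.exp(log_gamma)
        kx = np.exp(-gamma * np.sum((X - v) ** 2, axis=1))
        ky = np.exp(-gamma * np.sum((Y - v) ** 2, axis=1))
        return (kx.std() + ky.std()) / (abs(kx.mean() - ky.mean()) + 1e-12)

    res = minimize_scalar(m_of, bounds=log_gamma_bounds, method="bounded")
    gamma, m = np.exp(res.x), res.fun
    return gamma, 1.0 / (1.0 + m ** 2)   # best width and its bound Omega
```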

61 4 Feature selection For doing feature selection with Gaussian kernels one has to replace the uniform kernel width $\gamma$ with a $d$-dimensional vector $\gamma$ of kernel weightings: $K_\gamma(u, v) = \exp\!\left(-\sum_{l=1}^{d} \gamma_l (u_l - v_l)^2\right)$ (18), with $\gamma_l \geq 0$, $l = 1, \ldots$ [sent-141, score-0.404]

62 $\ldots, d$. Note that the optimization problems (16) and (17) for the one- respectively two-dimensional MPMC are now $d$-dimensional instead of just one-dimensional. [sent-144, score-0.157]
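
A one-line helper for the feature-weighted kernel of equation (18); here gamma is a vector of non-negative per-feature weights (a sketch, not the authors' code). In the same spirit, the scalar width search sketched earlier would become a d-dimensional search over gamma, e.g. with a generic bound-constrained optimizer.

```python
# Feature-weighted Gaussian kernel of equation (18).
import numpy as np

def weighted_gaussian_kernel(u, v, gamma):
    """gamma: 1-D array of non-negative per-feature weights."""
    return np.exp(-np.sum(gamma * (u - v) ** 2))
```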

63 The data sets were randomly divided into 90% training data and 10% test data and the results were averaged over 50 runs for each of the five problems (see table 1). [sent-148, score-0.054]

64 In all the experiments listed in table 1 we used the feature selection algorithm (with the exception of Sonar where width selection was used) and had a candidate set of size 5, i. [sent-149, score-0.217]

65 at each iteration the best basis out of 5 randomly chosen candidates was selected. [sent-151, score-0.156]

66 Note that for all of the data sets SMPMC uses significantly fewer basis functions than MPMC does, which directly translates into a correspondingly smaller evaluation cost. [sent-153, score-0.055]

67 The differences in training cost are shown in table 2. [sent-154, score-0.05]

68 The total training time for standard MPMC takes into account the 50-fold cross validation and 10 candidates for the kernel parameter. [sent-155, score-0.213]

69 We observe that for all of the five data sets the training cost of sparse MPMC is only a fraction of the one for standard MPMC. [sent-156, score-0.116]

70 The two plots in figure 1 show what typical learning curves for sparse MPMC look like. [sent-157, score-0.066]

71 As the number of basis functions increases, both the bound Ω and the test set accuracy start to stabilize. [Table 1: Bound Ω, test set accuracy (TSA), and number of bases (B) for sparse and standard MPMC (Lanckriet et al.).] [sent-158, score-0.34]

72 [Table 2: training time in seconds for Matlab implementations of SMPMC and standard MPMC (Lanckriet et al.).] [sent-199, score-0.106]

73 [Table 2 columns: dataset, number of training examples, SMPMC training time (one optimization and total), and standard MPMC total training time; the first row is Twonorm with 270 training examples.] [sent-200, score-0.106]

74 The stabilization point usually occurs earlier when one does full feature selection (a γ weight for each input dimension) instead of kernel width selection (one uniform γ for all dimensions). [sent-216, score-0.232]

75 We also experimented with different sizes for the candidate set. [sent-217, score-0.041]

76 The overall behavior is that the test set accuracy as well as the Ω value converge earlier for larger candidate sets (but note that a larger candidate set also increases the computational cost per iteration). [sent-219, score-0.147]

77 As seen in figure 1, feature selection usually gives better results in terms of the bound Ω and the test set accuracy. [sent-220, score-0.176]

78 Furthermore, a feature selection algorithm should indicate which features are relevant and which are not. [sent-221, score-0.116]

79 We set up an experiment for the Twonorm data (which has 20 input features) where we added 20 additional noisy features that were not related to the output. [sent-222, score-0.038]

80 The results are shown in figure 3 and demonstrate that the feature selection algorithm obtained from SMPMC is able to distinguish between relevant and irrelevant features. [sent-223, score-0.096]

81 4 Conclusion & future work This paper introduces a new algorithm (Sparse Minimax Probability Machine Classification - SMPMC) for building sparse classification models that provide a lower bound on the probability of classifying future data correctly. [sent-224, score-0.164]

82 We have shown that the method of iteratively adding basis functions has significant computational advantages over the standard MPMC, while it still maintains the distribution free probability bound Ω. [sent-225, score-0.256]

83 Future research on sparse greedy MPMC will focus on establishing a theoretical framework for a stopping criterion, when adding more basis functions (kernels) will not significantly reduce error rates, and may lead to overfitting. [sent-227, score-0.244]

84 Also, experiments have so far focused on using Gaussian kernels as basis functions. [sent-228, score-0.133]

85 From experience with other kernel algorithms, it is known that other types of kernels (polynomial, tanh) can yield better results for certain applications. [sent-229, score-0.102]

86 [Figure 1: Bound Ω and test set accuracy (TSA) for width selection (WS) and feature selection (FS); x-axis: number of basis functions.] [sent-247, score-0.029]

87 [Figure 2: Accuracy and bound for the Diabetes data set using 1, 5, or 10 basis candidates per iteration; x-axis: number of basis functions.] [sent-267, score-0.029]

88 Again, the Ω bound is a true lower bound on the test set accuracy. [sent-268, score-0.14]

89 [Figure 3: Average feature weighting for the Twonorm data set over 50 test runs; x-axis: feature index i.] [sent-277, score-0.118]

90 The first 20 features are the original inputs, the last 20 features are additional noisy inputs. Other types of basis functions are also worth investigating. [sent-278, score-0.15]

91 [7] uses boosting to construct a suitable kernel matrix iteratively. [sent-280, score-0.056]

92 An interesting open question is how this approach relates to sparse greedy MPMC. [sent-281, score-0.104]

93 Optimal inequalities in probability theory: A convex optimization approach. [sent-291, score-0.041]

94 Appendix: Proof of Theorem 1 We have to show that the values of m are equal for the two dimensional MPMC and the k + 1 dimensional MPMC. [sent-345, score-0.14]

95 The relevant statistics of the two-dimensional problem are $\sigma^2_{\epsilon_x^{(k)}} = \sum_{i=1}^{k}\sum_{j=1}^{k} c_i c_j\, \mathrm{Cov}(K_x^{(i)}, K_x^{(j)})$ and $\sigma_{\epsilon_x^{(k)} K_x^{(k+1)}} = \mathrm{Cov}(c_0 + c_1 K_x^{(1)} + \ldots + c_k K_x^{(k)},\, K_x^{(k+1)})$. [sent-353, score-0.042]

96 For the k + 1 dimensional MPMC let us first determine the k + 1 coefficients: $\epsilon_x^{(k+1)} = a_1^{(k+1)}\,(c_0 + c_1 K_x^{(1)} + \ldots$ [sent-357, score-0.07]

97 $\ldots + c_k K_x^{(k)}) + a_2^{(k+1)} K_x^{(k+1)} - b^{(k+1)} = a_1^{(k+1)} c_1 K_x^{(1)} + \ldots$ [sent-360, score-0.042]

98 Equation (21) collects these terms into the quadratic form of the coefficient vector $(a_1^{(k+1)} c_1, \ldots, a_1^{(k+1)} c_k,\, a_2^{(k+1)})$ with the $(k+1)\times(k+1)$ covariance matrix of $(K_x^{(1)}, \ldots, K_x^{(k+1)})$, whose entries are the $\sigma_{K_x^{(i)} K_x^{(j)}}$. [sent-387, score-0.042]

99 σK (k+1) K (k) σ 2 (k+1) a2 a2 Kx x x x x Multiplying out (21) and substituting according to the equations in (20) yields exactly expression (19) (which is the aT Σx a term of the two dimensional MPM). [sent-390, score-0.07]

100 Since this equivalence holds likewise for the $a^T \Sigma_y a$ term in m, we have shown that m (and therefore Ω) is equal for the two dimensional and the k + 1 dimensional MPMC. [sent-391, score-0.14]
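
The variance identity behind this proof is easy to check numerically. Below is a small, self-contained illustration on synthetic data (all quantities are arbitrary and the names are made up for the example): the 2x2 quadratic form on $(\epsilon_x^{(k)}, K_x^{(k+1)})$ matches the $(k+1)$-dimensional quadratic form with the expanded coefficients $(a_1 c_1, \ldots, a_1 c_k, a_2)$.

```python
# Numerical illustration of the Appendix identity: the a' Sigma_x a term of the
# two-dimensional MPMC equals the corresponding (k+1)-dimensional quadratic form.
import numpy as np

rng = np.random.default_rng(0)
Nx, k = 200, 4
Kx = rng.normal(size=(Nx, k))                  # basis columns K_x^(1..k) on class x
Kx_new = rng.normal(size=Nx)                   # new basis column K_x^(k+1)
c0, c = 1.5, rng.normal(size=k)
eps_x = c0 + Kx @ c                            # sparse model output eps_x^(k)
a1, a2 = 0.7, -0.3                             # an arbitrary 2-D coefficient pair

# "Local" 2-D quadratic form on (eps_x, K_x^(k+1))
S2 = np.cov(np.column_stack([eps_x, Kx_new]), rowvar=False)
q_local = np.array([a1, a2]) @ S2 @ np.array([a1, a2])

# "Global" (k+1)-D quadratic form with expanded coefficients (a1*c_1, ..., a1*c_k, a2)
Sfull = np.cov(np.column_stack([Kx, Kx_new]), rowvar=False)
coeffs = np.concatenate([a1 * c, [a2]])
q_global = coeffs @ Sfull @ coeffs

assert np.allclose(q_local, q_global)
print(q_local, q_global)
```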


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mpmc', 0.671), ('kx', 0.532), ('smpmc', 0.284), ('ky', 0.143), ('tsa', 0.119), ('kv', 0.106), ('lanckriet', 0.106), ('basis', 0.087), ('minimax', 0.075), ('dimensional', 0.07), ('sparse', 0.066), ('bound', 0.06), ('twonorm', 0.06), ('selection', 0.059), ('hyperplane', 0.058), ('kernel', 0.056), ('candidates', 0.055), ('ws', 0.055), ('classi', 0.051), ('sonar', 0.047), ('fs', 0.047), ('cov', 0.047), ('kernels', 0.046), ('kth', 0.045), ('boulder', 0.045), ('colorado', 0.045), ('znew', 0.045), ('ck', 0.042), ('candidate', 0.041), ('marshall', 0.039), ('mpm', 0.039), ('diabetes', 0.039), ('greedy', 0.038), ('cross', 0.038), ('feature', 0.037), ('training', 0.034), ('ionosphere', 0.033), ('validation', 0.03), ('adding', 0.03), ('bhattacharyya', 0.03), ('crc', 0.03), ('grudic', 0.03), ('olkin', 0.03), ('popescu', 0.03), ('yny', 0.03), ('accuracy', 0.029), ('zi', 0.028), ('bases', 0.028), ('accuracies', 0.027), ('expensive', 0.027), ('automated', 0.026), ('decoste', 0.026), ('xnx', 0.026), ('ai', 0.026), ('probability', 0.024), ('weighting', 0.024), ('looks', 0.024), ('pima', 0.024), ('functions', 0.023), ('nx', 0.022), ('breast', 0.022), ('min', 0.022), ('rewrite', 0.021), ('width', 0.021), ('et', 0.021), ('cation', 0.021), ('examples', 0.021), ('covariance', 0.021), ('ghaoui', 0.021), ('dimensions', 0.02), ('test', 0.02), ('cancer', 0.02), ('crammer', 0.02), ('features', 0.02), ('mercer', 0.019), ('appendix', 0.018), ('coef', 0.018), ('added', 0.018), ('free', 0.018), ('optimization', 0.017), ('computationally', 0.017), ('implementations', 0.017), ('evaluated', 0.017), ('cost', 0.016), ('greedily', 0.016), ('binary', 0.016), ('bounds', 0.016), ('automatically', 0.015), ('belonging', 0.015), ('inf', 0.015), ('iteration', 0.014), ('builds', 0.014), ('empty', 0.014), ('dietterich', 0.014), ('rd', 0.014), ('cients', 0.014), ('svm', 0.014), ('maintains', 0.014), ('extensive', 0.014), ('classifying', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification

Author: Thomas R. Strohmann, Andrei Belitski, Gregory Z. Grudic, Dennis DeCoste

Abstract: The Minimax Probability Machine Classification (MPMC) framework [Lanckriet et al., 2002] builds classifiers by minimizing the maximum probability of misclassification, and gives direct estimates of the probabilistic accuracy bound Ω. The only assumptions that MPMC makes is that good estimates of means and covariance matrices of the classes exist. However, as with Support Vector Machines, MPMC is computationally expensive and requires extensive cross validation experiments to choose kernels and kernel parameters that give good performance. In this paper we address the computational cost of MPMC by proposing an algorithm that constructs nonlinear sparse MPMC (SMPMC) models by incrementally adding basis functions (i.e. kernels) one at a time – greedily selecting the next one that maximizes the accuracy bound Ω. SMPMC automatically chooses both kernel parameters and feature weights without using computationally expensive cross validation. Therefore the SMPMC algorithm simultaneously addresses the problem of kernel selection and feature selection (i.e. feature weighting), based solely on maximizing the accuracy bound Ω. Experimental results indicate that we can obtain reliable bounds Ω, as well as test set accuracies that are comparable to state of the art classification algorithms.

2 0.048471294 1 nips-2003-1-norm Support Vector Machines

Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie

Abstract: The standard 2-norm SVM is known for its good performance in twoclass classi£cation. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an ef£cient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM. 1

3 0.048222441 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

Author: Jan Eichhorn, Andreas Tolias, Alexander Zien, Malte Kuss, Jason Weston, Nikos Logothetis, Bernhard Schölkopf, Carl E. Rasmussen

Abstract: We report and compare the performance of different learning algorithms based on data from cortical recordings. The task is to predict the orientation of visual stimuli from the activity of a population of simultaneously recorded neurons. We compare several ways of improving the coding of the input (i.e., the spike data) as well as of the output (i.e., the orientation), and report the results obtained using different kernel algorithms. 1

4 0.046828002 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

Author: Kenji Fukumizu, Francis R. Bach, Michael I. Jordan

Abstract: We propose a novel method of dimensionality reduction for supervised learning. Given a regression or classification problem in which we wish to predict a variable Y from an explanatory vector X, we treat the problem of dimensionality reduction as that of finding a low-dimensional “effective subspace” of X which retains the statistical relationship between X and Y . We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem, we characterize the notion of conditional independence using covariance operators on reproducing kernel Hilbert spaces; this allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y . 1

5 0.046492029 128 nips-2003-Minimax Embeddings

Author: Matthew Brand

Abstract: Spectral methods for nonlinear dimensionality reduction (NLDR) impose a neighborhood graph on point data and compute eigenfunctions of a quadratic form generated from the graph. We introduce a more general and more robust formulation of NLDR based on the singular value decomposition (SVD). In this framework, most spectral NLDR principles can be recovered by taking a subset of the constraints in a quadratic form built from local nullspaces on the manifold. The minimax formulation also opens up an interesting class of methods in which the graph is “decorated” with information at the vertices, offering discrete or continuous maps, reduced computational complexity, and immunity to some solution instabilities of eigenfunction approaches. Apropos, we show almost all NLDR methods based on eigenvalue decompositions (EVD) have a solution instability that increases faster than problem size. This pathology can be observed (and corrected via the minimax formulation) in problems as small as N < 100 points. 1 Nonlinear dimensionality reduction (NLDR) . Spectral NLDR methods are graph embedding problems where a set of N points X = [x1 , · · · , xN ] ∈ RD×N sampled from a low-dimensional manifold in a ambient space RD is reparameterized by imposing a neighborhood graph G on X and embedding the graph with minimal distortion in a “parameterization” space Rd , d < D. Typically the graph is sparse and local, with edges connecting points to their immediate neighbors. The embedding must keep these edges short or preserve their length (for isometry) or angles (for conformality). The graph-embedding problem was first introduced as a least-squares problem by Tutte [1], and as an eigenvalue problem by Fiedler [2]. The use of sparse graphs to generate metrics for least-squares problems has been studied intensely in the following three decades (see [3]). Modern NLDR methods use graph constraints to generate a metric in a space of embeddings RN . Eigenvalue decomposition (EVD) gives the directions of least or greatest variance under this metric. Typically a subset of d extremal eigenvectors gives the embedding of N points in Rd parameterization space. This includes the IsoMap family [4], the locally linear embedding (LLE) family [5,6], and Laplacian methods [7,8]. Using similar methods, the Automatic Alignment [6] and Charting [9] algorithms embed local subspaces instead of points, and by combining subspace projections thus obtain continuous maps between RD and Rd . This paper introduces a general algebraic framework for computing optimal embeddings directly from graph constraints. The aforementioned methods can can be recovered as special cases. The framework also suggests some new methods with very attractive properties, including continuous maps, reduced computational complexity, and control over the degree of conformality/isometry in the desired map. It also eliminates a solution instability that is intrinsic to EVD-based approaches. A perturbational analysis quantifies the instability. 2 Minimax theorem for graph embeddings We begin with neighborhood graph specified by a nondiagonal weighted adjacency matrix M ∈ RN×N that has the data-reproducing property XM = X (this can be relaxed to XM ≈ X in practice). The graph-embedding and NLDR literatures offer various constructions of M, each appropriate to different sets of assumptions about the original embedding and its sampling X (e.g., isometry, local linearity, noiseless samples, regular sampling, etc.). 
Typically Mi j = 0 if points i, j are nearby on the intrinsic manifold and |Mi j | is small or zero otherwise. Each point is taken to be a linear or convex combination of its neighbors, and thus M specifies manifold connectivity in the sense that any nondegenerate embedding Y that satisfies YM ≈ Y with small residual YM − Y F will preserve this connectivity and the structure of local neighborhoods. For example, in barycentric embeddings, each point is the average of its neighbors and thus Mi j = 1/k if vertex i is connected to vertex j (of degree k). We will also consider three optional constraints on the embedding : 1. A null-space restriction, where the solution must be outside to the column-space of C ∈ RN×M , M < N. For example, it is common to stipulate that the solution Y be centered, i.e., YC = 0 for C = 1, the constant vector. 2. A basis restriction, where the solution must be a linear combination of the rows of basis Z ∈ RK×N , K ≤ N. This can be thought of as information placed at the vertices of the graph that serves as example inputs for a target NLDR function. We will use this to construct dimension-reducing radial basis function networks. 3. A metric Σ ∈ RN×N that determines how error is distributed over the points. For example, it might be important that boundary points have less error. We assume that Σ is symmetric positive definite and has factorization Σ = AA (e.g., A could be a Cholesky factor of Σ). In most settings, the optional matrices will default to the identity matrix. In this context, we define the per-dimension embedding error of row-vector yi ∈ rows(Y) to be . EM (yi ) = max yi ∈range(Z),, K∈RM×N (yi (M + CD) − yi )A yi A (1) where D is a matrix constructed by an adversary to maximize the error. The optimizing yi is a vector inside the subspace spanned by the rows of Z and outside the subspace spanned by the columns of C, for which the reconstruction residual yi M−yi has smallest norm w.r.t. the metric Σ. The following theorem identifies the optimal embedding Y for any choice of M, Z, C, Σ: Minimax solution: Let Q ∈ SK×P be a column-orthonormal basis of the null-space of the rows of ZC, with P = K − rank(C). Let B ∈ RP×P be a square factor satisfying B B = Q ZΣZ Q, e.g., a Cholesky factor (or the “R” factor in QR-decomposition of (Q ZA) ). Compute the left singular vectors U ∈ SN×N of Udiag(s)V = B− Q Z(I − M)A, with . singular values s = [s1 , · · · , sP ] ordered s1 ≤ s2 ≤ · · · ≤ s p . Using the leading columns U1:d of U, set Y = U1:d B− Q Z. Theorem 1. Y is the optimal (minimax) embedding in Rd with error [s1 , · · · , sd ] 2 : . Y = U1:d B− Q Z = arg min ∑ EM (yi )2 with EM (yi ) = si . Y∈Rd×N y ∈rows(Y) i (2) Appendix A develops the proof and other error measures that are minimized. Local NLDR techniques are easily expressed in this framework. When Z = A = I, C = [], and M reproduces X through linear combinations with M 1 = 1, we recover LLE [5]. When Z = I, C = [], I − M is the normalized graph Laplacian, and A is a diagonal matrix of vertex degrees, we recover Laplacian eigenmaps [7]. When further Z = X we recover locally preserving projections [8]. 3 Analysis and generalization of charting The minimax construction of charting [9] takes some development, but offers an interesting insight into the above-mentioned methods. Recall that charting first solves for a set of local affine subspace axes S1 ∈ RD×d , S2 , · · · at offsets µ1 ∈ RD , µ2 , · · · that best cover the data and vary smoothly over the manifold. 
Each subspace offers a chart—a local parameterization of the data by projection onto the local axes. Charting then constructs a weighted mixture of affine projections that merges the charts into a global parameterization. If the data manifold is curved, each projection will assign a point a slightly different embedding, so the error is measured as the variance of these proposed embeddings about their mean. This maximizes consistency and tends to produce isometric embeddings; [9] discusses ways to explicitly optimize the isometry of the embedding. Under the assumption of isometry, the charting error is equivalent to the sumsquared displacements of an embedded point relative to its immediate neighbors (summed over all neighborhoods). To construct the same error criteria in the minimax setting, let xi−k , · · · , xi , · · · , xi+k denote points in the ith neighborhood and let the columns of Vi ∈ R(2k+1)×d be an orthonormal basis of rows of the local parameterization Si [xi−k , · · · , xi , · · · , xi+k ]. Then a nonzero reparameterization will satisfy [yi−k , · · · , yi , · · · , yi+k ]Vi Vi = [yi−k , · · · , yi , · · · , yi+k ] if and only if it preserves the relative position of the points in the local parameterization. Conversely, any relative displacements of the points are isolated by the formula [yi−k , · · · , yi , · · · , yi+k ](I − Vi Vi ). Minimizing the Frobenius norm of this expression is thus equivalent to minimizing the local error in charting. We sum these constraints over all neighborhoods to obtain the constraint matrix M = I − ∑i Fi (I − Vi Vi )Fi , where (Fi )k j = 1 iff the jth point of the ith neighborhood is the kth point of the dataset. Because Vi Vi and (I − Vi Vi ) are complementary, it follows that the error criterion of any local NLDR method (e.g., LLE, Laplacian eigenmaps, etc.) must measure the projection of the embedding onto some subspace of (I − Vi Vi ). To construct a continuous map, charting uses an overcomplete radial basis function (RBF) representation Z = [z(x1 ), z(x2 ), · · · z(xN )], where z(x) is a vector that stacks z1 (x), z2 (x), etc., and pm (x) . Km (x − µm ) , zm (x) = 1 ∑m pm (x) −1 . pm (x) = N (x|µm , Σm ) ∝ e−(x−µm ) Σm (x−µm )/2 (3) (4) and Km is any local linear dimensionality reducer, typically Sm itself. Each column of Z contains many “views” of the same point that are combined to give its low-dimensional embedding. Finally, we set C = 1, which forces the embedding of the full data to be centered. Applying the minimax solution to these constraints yields the RBF network mixing ma. trix, f (x) = U1:d B− Q z(x). Theorem 1 guarantees that the resulting embedding is leastsquares optimal w.r.t. Z, M, C, A at the datapoints f (xi ), and because f (·) is an affine transform of z(·) it smoothly interpolates the embedding between points. There are some interesting variants: Kernel embeddings of the twisted swiss roll generalized EVD minimax SVD UR corner detail LL corner detail Fig. 1. Minimax and generalized EVD solution for kernel eigenmap of a non-developable swiss roll. Points are connected into a grid which ideally should be regular. The EVD solution shows substantial degradation. Insets detail corners where the EVD solution crosses itself repeatedly. The border compression is characteristic of Laplacian constraints. 
One-shot charting: If we set the local dimensionality reducers to the identity matrix (all Km = I), then the minimax method jointly optimizes the local dimensionality reduction to charts and the global coordination of the charts (under any choice of M). This requires that rows(Z) ≤ N for a fully determined solution. Discrete isometric charting: If Z = I then we directly obtain a discrete isometric embedding of the data, rather than a continuous map, making this a local equivalent of IsoMap. Reduced basis charting: Let Z be constructed using just a small number of kernels randomly placed on the data manifold, such that rows(Z) N. Then the size of the SVD problem is substantially reduced. 4 Numerical advantage of minimax method Note that the minimax method projects the constraint matrix M into a subspace derived from C and Z and decomposes it there. This suppresses unwanted degrees of freedom (DOFs) admitted by the problem constraints, for example the trivial R0 embedding where all points are mapped to a single point yi = N −1/2 . The R0 embedding serves as a translational DOF in the solution. LLE- and eigenmap-based methods construct M to have a constant null-space so that the translational DOF will be isolated in the EVD as null eigenvalue paired to a constant eigenvector, which is then discarded. However, section 4.1 shows that this construction makes the EVD increasingly unstable as problem size grows and/or the data becomes increasing amenable to low-residual embeddings, ultimately causing solution collapse. As the next paragraph demonstrates, the problem is exacerbated when embedding w.r.t. a basis Z (via the equivalent generalized eigenproblem), partly because the eigenvector associated with the unwanted DOF can have arbitrary structure. In all cases the problem can be averted by using the minimax formulation with C = 1 to suppress the DOF. A 2D plane was embedded in 3D with a curl, a twist, and 2.5% Gaussian noise, then regularly sampled at 900 points. We computed a kernelized Laplacian eigenmap using 70 random points as RBF centers, i.e., a continous map using M derived from the graph Laplacian and Z constructed as above. The map was computed both via the minimax (SVD) method and via the equivalent generalized eigenproblem, where the translational degree of freedom must be removed by discarding an eigenvector from the solution. The two solutions are algebraically equivalent in every other regard. A variety of eigensolvers were tried; we took −5 excess energy x 10 Eigen spectrum compared to minimax spectrum 15 10 5 0 −5 Eigen spectrum compared to minimax spectrum 2 15 deviation excess energy x 10 10 5 100 200 eigenvalue Error in null embedding −5 x 10 0 −2 −4 −6 −8 0 100 −5 eigenvalue Error in null embedding 200 100 200 300 400 500 point 600 700 800 900 Fig. 2. Excess energy in the eigenspectrum indicates that the translational DOF has contam2 inated many eigenvectors. If the EVD had successfully isolated the unwanted DOF, then its 0 remaining eigenvalues should be identical to those derived from the minimax solution. The −2 −4 graph at left shows the difference in the eigenspectra. The graph at right shows the EVD −6 solution’s deviation from the translational vector y0 = 1 · N −1/2 ≈ .03333. If the numer−8 ics were100 200 the line would be flat, but in practice the deviation is significant enough perfect 300 400 500 600 700 800 900 point (roughly 1% of the diameter of the embedding) to noticably perturb points in figure 1. deviation x 10 the best result. 
Figure 1 shows that the EVD solution exhibits many defects, particularly a folding-over of the manifold at the top and bottom edges and at the corners. Figure 2 shows that the noisiness of the EVD solution is due largely to mutual contamination of numerically unstable eigenvectors. 4.1 Numerical instability of eigen-methods The following theorem uses tools of matrix perturbation theory to show that as the problem size increases, the desired and unwanted eigenvectors become increasingly wobbly and gradually contaminate each other, leading to degraded solutions. More precisely, the low-order eigenvalues are ill-conditioned and exhibit multiplicities that may be true (due to noiseless samples from low-curvature manifolds) or false (due to numerical noise). Although in many cases some post-hoc algebra can “filter” the unwanted components out of the contaminated eigensolution, it is not hard to construct cases where the eigenvectors cannot be cleanly separated. The minimax formulation is immune to this problem because it explicitly suppresses the gratuitous component(s) before matrix decomposition. Theorem 2. For any finite numerical precision, as the number of points N increases, the Frobenius norm of numerical noise in the null eigenvector v0 can grow as O(N 3/2 ), and the eigenvalue problem can approach a false multiplicity at a rate as fast as O(N 3/2 ), at which point the eigenvectors of interest—embedding and translational—are mutually contaminated and/or have an indeterminate eigenvalue ordering. Please see appendix B for the proof. This theorem essentially lower-bounds an upperbound on error; examples can be constructed in which the problem is worse. For example, it can be shown analytically that when embedding points drawn from the simple curve xi = [a, cos πa] , a ∈ [0, 1] with K = 2 neighbors, instabilities cannot be bounded better than O(N 5/2 ); empirically we see eigenvector mixing with N < 100 points and we see it grow at the rate ≈ O(N 4 )—in many different eigensolvers. At very large scales, more pernicious instabilities set in. E.g., by N = 20000 points, the solution begins to fold over. Although algebraic multiplicity and instability of the eigenproblem is conceptually a minor oversight in the algorithmic realizations of eigenfunction embeddings, as theorem 2 shows, the consequences are eventually fatal. 5 Summary One of the most appealing aspects of the spectral NLDR literature is that algorithms are usually motivated from analyses of linear operators on smooth differentiable manifolds, e.g., [7]. Understandably, these analysis rely on assumptions (e.g., smoothness or isometry or noiseless sampling) that make it difficult to predict what algorithmic realizations will do when real, noisy data violates these assumptions. The minimax embedding theorem provides a complete algebraic characterization of this discrete NLDR problem, and provides a solution that recovers numerically robustified versions of almost all known algorithms. It offers a principled way of constructing new algorithms with clear optimality properties and good numerical conditioning—notably the construction of a continuous NLDR map (an RBF network) in a one-shot optimization ( SVD ). We have also shown how to cast several local NLDR principles in this framework, and upgrade these methods to give continuous maps. 
Working in the opposite direction, we sketched the minimax formulation of isometric charting and showed that its constraint matrix contains a superset of all the algebraic constraints used in local NLDR techniques. References 1. W.T. Tutte. How to draw a graph. Proc. London Mathematical Society, 13:743–768, 1963. 2. Miroslav Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czech. Math. Journal, 25:619–633, 1975. 3. Fan R.K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997. 4. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22 2000. 5. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000. 6. Yee Whye Teh and Sam T. Roweis. Automatic alignment of hidden representations. In Proc. NIPS-15, 2003. 7. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. volume 14 of Advances in Neural Information Processing Systems, 2002. 8. Xiafei He and Partha Niyogi. Locality preserving projections. Technical Report TR-2002-09, University of Chicago Computer Science, October 2002. 9. Matthew Brand. Charting a manifold. volume 15 of Advances in Neural Information Processing Systems, 2003. 10. G.W. Stewart and Ji-Guang Sun. Matrix perturbation theory. Academic Press, 1990. A Proof of minimax embedding theorem (1) The burden of this proof is carried by supporting lemmas, below. To emphasize the proof strategy, we give the proof first; supporting lemmas follow. Proof. Setting yi = li Z, we will solve for li ∈ columns(L). Writing the error in terms of li , EM (li ) = max K∈RM×N li Z(I − M − CK)A li Z(I − M)A − li ZCKA = max . M×N li ZA li ZA K∈R (5) The term li ZCKA produces infinite error unless li ZC = 0, so we accept this as a constraint and seek li Z(I − M)A min . (6) li ZA li ZC=0 By lemma 1, that orthogonality is satisfied by solving the problem in the space orthogonal . to ZC; the basis for this space is given by columns of Q = null((ZC) ). By lemma 2, the denominator of the error specifies the metric in solution space to be ZAA Z ; when the problem is projected into the space orthogonal to ZC it becomes Q (ZAA Z )Q. Nesting the “orthogonally-constrained-SVD” construction of lemma 1 inside the “SVD-under-a-metric” lemma 2, we obtain a solution that uses the correct metric in the orthogonal space: B B = Q ZAA Z Q − Udiag(s)V = B (7) {Q(Z(I − M)A)} (8) L = QB−1 U (9) where braces indicate the nesting of lemmas. By the “best-projection” lemma (#3), if we order the singular values by ascending magnitude, L1:d = arg min J∈RN×d ∑ji ∈cols(J) ( j Z(I − M)A / j )2 ZΣZ The proof is completed by making the substitutions L Z → Y and x A → x Σ = AA ), and leaving off the final square root operation to obtain (Y )1:d = arg min ∑ji ∈cols(J) j (I − M) Σ / j J∈RN×d (10) Σ (for 2 Σ . (11) Lemma 1. Orthogonally constrained SVD: The left singular vectors L of matrix M under . SVD the constraint U C = 0 are calculated as Q = null(C ), Udiag(s)V ← Q M, L = QU. Proof. First observe that L is orthogonal to C: By definition, the null-space basis satisfies Q C = 0, thus L C = U Q C = 0. Let J be an orthonormal basis for C, with J J = I and Q J = 0. 
Then Ldiag(s)V = QQ M = (I − JJ )M, the orthogonal projector of C applied to M, proving that the SVD captures the component of M that is orthogonal to C. Lemma 2. SVD with respect to a metric: The vectors li ∈ L, vi ∈ V that diagonalize matrix M with respect to positive definite column-space metric Σ are calculated as B B ← Σ, SVD . Udiag(s)V ← B− M, L = B−1 U satisfy li M / li Σ = si and extremize this form for the extremal singular values smin , smax . Proof. By construction, L and V diagonalize M: L MV = (B−1 U) MV = U (B− M)V = diag(s) (12) B− and diag(s)V = M. Forming the gram matrices of both sides of the last line, we obtain the identity Vdiag(s)2 V = M B−1 B− M = M Σ−1 M, which demonstrates that si ∈ s are the singular values of M w.r.t. column-space metric Σ. Finally, L is orthonormal w.r.t. the metric Σ, because L 2 = L ΣL = U B− B BB−1 U = I. Consequently, Σ l M / l Σ = l M /1 = si vi and by the Courant-Hilbert theorem, smax = max l M / l Σ ; = si . smin = min l M / l Σ . l l (13) (14) Lemma 3. Best projection: Taking L and s from lemma 2, let the columns of L and elements of s be sorted so that s1 ≥ s2 ≥ · · · ≥ sN . Then for any dimensionality 1 ≤ d ≤ N, . L1:d = [l1 , · · · , ld ] = arg max J M (J ΣJ)−1 (15) J∈RN×d = arg max F (16) ∑ji ∈cols(J) ( j M / j Σ )2 (17) J∈RN×d |J ΣJ=I = arg max J∈RN×d J M with the optimum value of all right hand sides being (∑d s2 )1/2 . If the sort order is rei=1 i versed, the minimum of this form is obtained. Proof. By the Eckart-Young-Mirsky theorem, if U MV = diag(s) with singular values . sorted in descending order, then U1:d = [u1 , · · · , ud ] = arg maxU∈SN×d U M F . We first extend this to a non-orthonogonal basis J under a Mahalonobis norm: maxJ∈RN×d J M (J J)−1 = maxU∈SN×d U M F (18) because J M 2 (J J)−1 = trace(M J(J J)−1 J M) = trace(M JJ+ (JJ+ ) M) = (JJ+ )M 2 = UU M 2 = U M 2 since JJ+ is a (symmetric) orthogonal proF F F jector having binary eigenvalues λ ∈ {0, 1} and therefore it is the gram of an thin orthogonal matrix. We then impose a metric Σ on the column-space of J to obtain the first criterion (equation 15), which asks what maximizes variance in J M while minimizing the norm of J w.r.t. metric Σ. Here it suffices to substitute in the leading (resp., trailing) columns of L and verify that the norm is maximized (resp., minimized). Expanding, L1:d M 2 ΣL )−1 = trace((L1:d M) (L1:d ΣL1:d )−1 (L1:d M)) = (L 1:d 1:d trace((L1:d M) I(L1:d M)) = trace((diag(s1:d )V1:d ) (diag(s1:d )V1:d )) = s1:d 2 . Again, by the Eckart-Young-Mirsky theorem, these are the maximal variance-preserving projections, so the first criterion is indeed maximized by setting J to the columns in L corresponding to the largest values in s. Criterion #2 restates the first criterion with the set of candidates for J restricted to (the hyperelliptical manifold of) matrices that reduce the metric on the norm to the identity matrix (thereby recovering the Frobenius norm). Criterion #3 criterion merely expands the above trace by individual singular values. Note that the numerator and denominator can have different metrics because they are norms in different spaces, possibly of different dimension. Finally, that the trailing d eigenvectors minimize these criteria follows directly from the fact that leading N − d singular values account for the maximal part of the variance. B Proof of instability theorem (2) Proof. When generated from a sparse graph with average degree K, weighted connectivity matrix W is sparse and has O(NK) entries. 
Since the graph vertices represent samples from a smooth manifold, increasing the sampling density N does not change the distribution of magnitudes in W. Consider a perturbation of the nonzero values in W, e.g., W → W + E due to numerical noise E created by finite machine precision. By the weak law of large √ numbers, the Frobenius norm of the sparse perturbation grows as E F ∼ O( N). However the t th -smallest nonzero eigenvalue λt (W) grows as λt (W) = vt Wvt ∼ O(N −1 ), because elements of corresponding eigenvector vt grow as O(N −1/2 ) and only K of those elements are multiplied by nonzero values to form each element of Wvt . In sum, the perturbation E F grows while the eigenvalue λt (W) shrinks. In linear embedding algorithms, . the eigengap of interest is λgap = λ1 − λ0 . The tail eigenvalue λ0 = 0 by construction but it is possible that λ0 > 0 with numerical error, thus λgap ≤ λ1 . Combining these facts, the ratio between the perturbation and the eigengap grows as E F /λgap ∼ O(N 3/2 ) or faster. Now consider the shifted eigenproblem I − W with leading (maximal) eigenvalues 1 − λ0 ≥ 1 − λ1 ≥ · · · and unchanged eigenvectors. From matrix perturbation the. ory [10, thm. V.2.8], when W is perturbed to W = W + E, the change in the lead√ ing eigenvalue from 1 − λ0 to 1 − λ0 is bounded as |λ0 − λ0 | ≤ 2 E F and similarly √ √ 1 − λ1 ≤ 1 − λ1 + 2 E F . Thus λgap ≥ λgap − 2 E F . Since E F /λgap ∼ O(N 3/2 ), the right hand side of the gap bound goes negative at a supralinear rate, implying that the eigenvalue ordering eventually becomes unstable with the possibility of the first and second eigenvalue/vector pairs being swapped. Mutual contamination of the eigenvectors happens well before: Under general (dense) conditions, the change in the eigenvector v0 is bounded E F as v0 − v0 ≤ |λ −λ4 |−√2 E [10, thm. V.2.8]. (This bound is often tight enough to serve F 0 1 as a good approximation.) Specializing this to the sparse embedding matrix, we find that √ √ O( N) O( N) the bound weakens to v0 − 1 · N −1/2 ∼ O(N −1 )−O(√N) > O(N −1 ) = O(N 3/2 ).

6 0.043698896 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

7 0.042116754 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

8 0.039964058 112 nips-2003-Learning to Find Pre-Images

9 0.039853327 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation

10 0.038246341 73 nips-2003-Feature Selection in Clustering Problems

11 0.037661765 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

12 0.036044125 113 nips-2003-Learning with Local and Global Consistency

13 0.034122027 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees

14 0.034007624 124 nips-2003-Max-Margin Markov Networks

15 0.033834856 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression

16 0.032965939 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels

17 0.032530624 188 nips-2003-Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects

18 0.03239353 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification

19 0.031494018 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

20 0.030733075 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.11), (1, -0.04), (2, -0.01), (3, -0.06), (4, 0.01), (5, 0.035), (6, -0.019), (7, -0.032), (8, -0.006), (9, 0.032), (10, 0.016), (11, -0.014), (12, -0.011), (13, -0.022), (14, 0.013), (15, 0.027), (16, -0.012), (17, 0.051), (18, 0.052), (19, -0.008), (20, 0.028), (21, -0.01), (22, 0.003), (23, 0.016), (24, -0.026), (25, 0.014), (26, -0.023), (27, 0.066), (28, -0.037), (29, -0.022), (30, -0.003), (31, 0.051), (32, -0.006), (33, 0.025), (34, -0.047), (35, -0.081), (36, 0.007), (37, 0.025), (38, 0.009), (39, 0.0), (40, 0.055), (41, 0.012), (42, 0.063), (43, -0.061), (44, -0.077), (45, 0.047), (46, -0.01), (47, -0.129), (48, -0.054), (49, 0.101)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87202054 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification

Author: Thomas R. Strohmann, Andrei Belitski, Gregory Z. Grudic, Dennis DeCoste

Abstract: The Minimax Probability Machine Classification (MPMC) framework [Lanckriet et al., 2002] builds classifiers by minimizing the maximum probability of misclassification, and gives direct estimates of the probabilistic accuracy bound Ω. The only assumptions that MPMC makes is that good estimates of means and covariance matrices of the classes exist. However, as with Support Vector Machines, MPMC is computationally expensive and requires extensive cross validation experiments to choose kernels and kernel parameters that give good performance. In this paper we address the computational cost of MPMC by proposing an algorithm that constructs nonlinear sparse MPMC (SMPMC) models by incrementally adding basis functions (i.e. kernels) one at a time – greedily selecting the next one that maximizes the accuracy bound Ω. SMPMC automatically chooses both kernel parameters and feature weights without using computationally expensive cross validation. Therefore the SMPMC algorithm simultaneously addresses the problem of kernel selection and feature selection (i.e. feature weighting), based solely on maximizing the accuracy bound Ω. Experimental results indicate that we can obtain reliable bounds Ω, as well as test set accuracies that are comparable to state of the art classification algorithms.

2 0.62175876 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

Author: Kenji Fukumizu, Francis R. Bach, Michael I. Jordan

Abstract: We propose a novel method of dimensionality reduction for supervised learning. Given a regression or classification problem in which we wish to predict a variable Y from an explanatory vector X, we treat the problem of dimensionality reduction as that of finding a low-dimensional “effective subspace” of X which retains the statistical relationship between X and Y . We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem, we characterize the notion of conditional independence using covariance operators on reproducing kernel Hilbert spaces; this allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y . 1

3 0.52216095 99 nips-2003-Kernels for Structured Natural Language Data

Author: Jun Suzuki, Yutaka Sasaki, Eisaku Maeda

Abstract: This paper devises a novel kernel function for structured natural language data. In the field of Natural Language Processing, feature extraction consists of the following two steps: (1) syntactically and semantically analyzing raw data, i.e., character strings, then representing the results as discrete structures, such as parse trees and dependency graphs with part-of-speech tags; (2) creating (possibly high-dimensional) numerical feature vectors from the discrete structures. The new kernels, called Hierarchical Directed Acyclic Graph (HDAG) kernels, directly accept DAGs whose nodes can contain DAGs. HDAG data structures are needed to fully reflect the syntactic and semantic structures that natural language data inherently have. In this paper, we define the kernel function and show how it permits efficient calculation. Experiments demonstrate that the proposed kernels are superior to existing kernel functions, e.g., sequence kernels, tree kernels, and bag-of-words kernels. 1

4 0.51967376 181 nips-2003-Statistical Debugging of Sampled Programs

Author: Alice X. Zheng, Michael I. Jordan, Ben Liblit, Alex Aiken

Abstract: We present a novel strategy for automatically debugging programs given sampled data from thousands of actual user runs. Our goal is to pinpoint those features that are most correlated with crashes. This is accomplished by maximizing an appropriately defined utility function. It has analogies with intuitive debugging heuristics, and, as we demonstrate, is able to deal with various types of bugs that occur in real programs. 1

5 0.51746142 1 nips-2003-1-norm Support Vector Machines

Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie

Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, and hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM.
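The 1-norm SVM can be written as a linear program. The sketch below solves that program directly for a single value of the regularization parameter with scipy's linprog, rather than tracing the whole solution path as the paper's algorithm does; the variable names and the parameter lam are illustrative.

import numpy as np
from scipy.optimize import linprog

def fit_l1_svm(X, y, lam=1.0):
    # 1-norm SVM as a linear program (a direct solve, not the path algorithm).
    # X is (n, d), y is in {-1, +1}. Variable order: w (d), b (1), xi (n), u (d),
    # where u >= |w| linearizes the L1 penalty. Objective: sum(xi) + lam * sum(u).
    n, d = X.shape
    c = np.concatenate([np.zeros(d + 1), np.ones(n), lam * np.ones(d)])
    # margin constraints y_i (x_i . w + b) >= 1 - xi_i, rewritten in <= form
    A_margin = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n), np.zeros((n, d))])
    # w - u <= 0 and -w - u <= 0
    A_pos = np.hstack([np.eye(d), np.zeros((d, 1 + n)), -np.eye(d)])
    A_neg = np.hstack([-np.eye(d), np.zeros((d, 1 + n)), -np.eye(d)])
    A_ub = np.vstack([A_margin, A_pos, A_neg])
    b_ub = np.concatenate([-np.ones(n), np.zeros(2 * d)])
    bounds = [(None, None)] * (d + 1) + [(0, None)] * (n + d)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w, b = res.x[:d], res.x[d]
    return w, b  # predict new points with sign(X_new @ w + b)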

6 0.50976533 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

7 0.47935805 187 nips-2003-Training a Quantum Neural Network

8 0.4750568 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees

9 0.44139701 188 nips-2003-Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects

10 0.44110233 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification

11 0.43461043 112 nips-2003-Learning to Find Pre-Images

12 0.4107106 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

13 0.40833241 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

14 0.39428934 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

15 0.39310384 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class

16 0.38824844 118 nips-2003-Link Prediction in Relational Data

17 0.38804349 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots

18 0.37903777 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification

19 0.3725985 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

20 0.36753306 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (11, 0.019), (30, 0.011), (35, 0.059), (53, 0.081), (71, 0.05), (76, 0.499), (85, 0.058), (91, 0.064), (99, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95173383 155 nips-2003-Perspectives on Sparse Bayesian Learning

Author: Jason Palmer, Bhaskar D. Rao, David P. Wipf

Abstract: Recently, relevance vector machines (RVMs) have been fashioned from a sparse Bayesian learning (SBL) framework to perform supervised learning using a weight prior that encourages sparsity of representation. The methodology incorporates an additional set of hyperparameters governing the prior, one for each weight, and then adopts a specific approximation to the full marginalization over all weights and hyperparameters. Despite its empirical success, however, no rigorous motivation for this particular approximation is currently available. To address this issue, we demonstrate that SBL can be recast as the application of a rigorous variational approximation to the full model by expressing the prior in a dual form. This formulation obviates the necessity of assuming any hyperpriors and leads to natural, intuitive explanations of why sparsity is achieved in practice.
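For context, a sketch of the standard SBL/RVM update loop that the paper analyzes, written here for linear regression with a design matrix Phi and a fixed noise variance; the dual-form variational argument itself is not reproduced.

import numpy as np

def sbl_regression(Phi, t, noise_var=0.01, n_iter=100):
    # Sparse Bayesian learning with one precision hyperparameter per weight.
    # Phi is (n, m), t is (n,). Returns posterior mean weights and hyperparameters.
    n, m = Phi.shape
    alpha = np.ones(m)
    for _ in range(n_iter):
        # Gaussian posterior over weights given the current hyperparameters
        Sigma = np.linalg.inv(np.diag(alpha) + Phi.T @ Phi / noise_var)
        mu = Sigma @ Phi.T @ t / noise_var
        # evidence-based re-estimation; large alpha_i drives weight i to ~0
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (mu ** 2 + 1e-12)
    return mu, alpha

Weights whose alpha grows very large can be pruned from the model, which is exactly the sparsity behavior the paper seeks to explain.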

2 0.93629891 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation

Author: Chen Yanover, Yair Weiss

Abstract: Loopy belief propagation (BP) has been successfully used in a number of difficult graphical models to find the most probable configuration of the hidden variables. In applications ranging from protein folding to image analysis, one would like to find not just the best configuration but rather the top M. While this problem has been solved using the junction tree formalism, in many real-world problems the clique size in the junction tree is prohibitively large. In this work we address the problem of finding the M best configurations when exact inference is impossible. We start by developing a new exact inference algorithm for calculating the best configurations that uses only max-marginals. For approximate inference, we replace the max-marginals with the beliefs calculated using max-product BP and generalized BP. We show empirically that the algorithm can accurately and rapidly approximate the M best configurations in graphs with hundreds of variables.
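As a reference point for the max-marginals this builds on, here is a minimal max-product pass (in the log domain) on a chain-structured model. It returns per-node max-marginals, from which the single best configuration follows by coordinate-wise argmax when there are no ties; the paper's contribution, extracting the top M configurations and doing so approximately on loopy graphs, is not reproduced here.

import numpy as np

def chain_max_marginals(unary, pairwise):
    # unary:    (T, K) log potentials for T nodes with K states each
    # pairwise: (T-1, K, K) log potentials coupling neighboring nodes
    T, K = unary.shape
    fwd = np.zeros((T, K))   # max-messages passed left to right
    bwd = np.zeros((T, K))   # max-messages passed right to left
    for t in range(1, T):
        fwd[t] = np.max(fwd[t - 1][:, None] + unary[t - 1][:, None] + pairwise[t - 1], axis=0)
    for t in range(T - 2, -1, -1):
        bwd[t] = np.max(bwd[t + 1][None, :] + unary[t + 1][None, :] + pairwise[t], axis=1)
    max_marginals = unary + fwd + bwd
    map_config = np.argmax(max_marginals, axis=1)  # MAP when max-marginals are tie-free
    return max_marginals, map_config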

3 0.91669148 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

Author: Jan Eichhorn, Andreas Tolias, Alexander Zien, Malte Kuss, Jason Weston, Nikos Logothetis, Bernhard Schölkopf, Carl E. Rasmussen

Abstract: We report and compare the performance of different learning algorithms based on data from cortical recordings. The task is to predict the orientation of visual stimuli from the activity of a population of simultaneously recorded neurons. We compare several ways of improving the coding of the input (i.e., the spike data) as well as of the output (i.e., the orientation), and report the results obtained using different kernel algorithms.
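One simple way to set up such a decoder, sketched with scikit-learn: encode the 180-degree-periodic orientation as a point on the circle, fit kernel ridge regression from spike-count vectors to the two coordinates, and decode with arctan2. The circular encoding and the kernel ridge choice are illustrative assumptions, not necessarily the codings and algorithms compared in the paper.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

def fit_orientation_decoder(spike_counts, theta_deg, alpha=1.0, gamma=0.1):
    # Orientation repeats every 180 degrees, hence the doubling before the
    # cosine/sine encoding; spike_counts is (n_trials, n_neurons).
    rad = np.deg2rad(2.0 * theta_deg)
    targets = np.column_stack([np.cos(rad), np.sin(rad)])
    model = KernelRidge(kernel="rbf", alpha=alpha, gamma=gamma)
    model.fit(spike_counts, targets)
    return model

def predict_orientation(model, spike_counts):
    # Decode back to an angle in [0, 180) degrees.
    c, s = model.predict(spike_counts).T
    return (np.rad2deg(np.arctan2(s, c)) / 2.0) % 180.0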

same-paper 4 0.89203256 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification

Author: Thomas R. Strohmann, Andrei Belitski, Gregory Z. Grudic, Dennis DeCoste

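For reference, a minimal sketch of how the distribution-free bound can be evaluated for a fixed linear decision direction a from estimated class means and covariances; the formula follows Lanckriet et al., 2002, and the full sparse greedy construction is not reproduced here.

import numpy as np

def omega_bound(a, mu1, mu2, cov1, cov2):
    # Worst-case accuracy bound for the hyperplane direction a, assuming a is
    # oriented so that a @ (mu1 - mu2) > 0:
    #   kappa = a'(mu1 - mu2) / (sqrt(a' cov1 a) + sqrt(a' cov2 a))
    #   Omega = kappa^2 / (1 + kappa^2)
    kappa = a @ (mu1 - mu2) / (np.sqrt(a @ cov1 @ a) + np.sqrt(a @ cov2 @ a))
    return kappa ** 2 / (1.0 + kappa ** 2)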

5 0.59623575 112 nips-2003-Learning to Find Pre-Images

Author: Jason Weston, Bernhard Schölkopf, Gökhan H. Bakir

Abstract: We consider the problem of reconstructing patterns from a feature map. Learning algorithms using kernels to operate in a reproducing kernel Hilbert space (RKHS) express their solutions in terms of input points mapped into the RKHS. We introduce a technique based on kernel principal component analysis and regression to reconstruct corresponding patterns in the input space (aka pre-images) and review its performance in several applications requiring the construction of pre-images. The introduced technique avoids difficult and/or unstable numerical optimization, is easy to implement and, unlike previous methods, permits the computation of pre-images in discrete input spaces.
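A rough sketch of the general recipe with scikit-learn: project training points with kernel PCA, then learn a regression from the kernel-space coordinates back to input space, so that new feature-space points can be mapped to approximate pre-images without iterative optimization. The specific components used here (RBF kernel PCA and ridge regression) are illustrative assumptions rather than the paper's exact estimator.

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import Ridge

def learn_preimage_map(X, n_components=10, gamma=0.1, ridge_alpha=1.0):
    # Kernel PCA coordinates of the training data, plus a multi-output ridge
    # regression mapping those coordinates back to the original inputs.
    kpca = KernelPCA(n_components=n_components, kernel="rbf", gamma=gamma)
    Z = kpca.fit_transform(X)
    reg = Ridge(alpha=ridge_alpha).fit(Z, X)
    return kpca, reg

# usage, e.g. for denoising: project noisy points onto the leading components
# and map them back to input space
# kpca, reg = learn_preimage_map(X_train)
# X_denoised = reg.predict(kpca.transform(X_noisy))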

6 0.56501395 189 nips-2003-Tree-structured Approximations by Expectation Propagation

7 0.55873764 176 nips-2003-Sequential Bayesian Kernel Regression

8 0.55390787 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

9 0.53127569 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

10 0.5115878 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels

11 0.51113915 122 nips-2003-Margin Maximizing Loss Functions

12 0.50937271 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

13 0.50554419 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

14 0.50232029 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels

15 0.50224274 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

16 0.50220054 152 nips-2003-Pairwise Clustering and Graphical Models

17 0.50024122 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression

18 0.49560952 107 nips-2003-Learning Spectral Clustering

19 0.49530783 17 nips-2003-A Sampled Texture Prior for Image Super-Resolution

20 0.49316406 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds