nips nips2007 nips2007-11 knowledge-graph by maker-knowledge-mining

11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators


Source: pdf

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: This paper explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. [sent-8, score-0.161]

2 It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. [sent-9, score-0.461]

3 A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. [sent-10, score-0.244]

4 This analysis is related to Support Vector Machines by means of a margin transformation. [sent-11, score-0.261]

5 The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. [sent-12, score-0.273]

6 1 Introduction The quest for efficient machine learning techniques which (a) have favorable generalization capacities, (b) are flexible for adaptation to a specific task, and (c) are cheap to implement is a pervasive theme in the literature, see e.g. [sent-13, score-0.059]

7 This paper introduces a novel concept for designing a learning algorithm, namely the Maximal Average Margin (MAM) principle. [sent-16, score-0.027]

8 It closely resembles the classical notion of maximal margin that underlies perceptrons, Support Vector Machines (SVMs) and boosting algorithms. [sent-17, score-0.433]

9 It, however, optimizes the average margin of points to the (hypothesis) hyperplane, instead of the worst-case margin as is traditional. [sent-20, score-0.56]

10 The full margin distribution was studied earlier in e.g. [sent-21, score-0.085]

11 In [10], the principle was already shown to underlie the approach of mincuts for transductive inference over a weighted undirected graph. [sent-26, score-0.085]

12 Further, consider the model class consisting of all models with a bounded average margin (or classes with a fixed Rademacher complexity, as we will indicate later on). [sent-27, score-0.339]

13 The set of such classes is clearly nested, enabling structural risk minimization [8]. [sent-28, score-0.111]

14 On a practical level, we show how the optimality principle can be used for designing a computationally fast approach to (large-scale) classification and ordinal regression tasks, much along the same lines. [sent-29, score-0.335]

15 It becomes clear that this result enables researchers on Parzen windows to benefit directly from recent advances in kernel machines, two fields which have evolved mostly separately. [sent-49, score-0.159]

16 It must be emphasized that the resulting learning rules were already studied in different forms and motivated by asymptotic and geometric arguments, as e.g. [sent-50, score-0.032]

17 the Parzen window classifier [4], the ’simple classifier’ as in [12] chap. [sent-52, score-0.083]

18 1, probabilistic neural networks [15]; in this paper we show how an (empirical) risk-based optimality criterion underlies this approach. [sent-53, score-0.153]

19 A number of experiments confirm the use of the resulting cheap learning rules for providing a reasonable (baseline) performance in a small time-window. [sent-54, score-0.034]

20 The next section illustrates the principle of maximal average margin for classification problems. [sent-69, score-0.504]

21 Section 3 investigates the close relationship with Rademacher complexities, Section 4 develops the maximal average margin principle for ordinal regression, and Section 5 reports experimental results of application of the MAM to classification and ordinal regression tasks. [sent-70, score-0.828]

22 2 Maximal Average Margin for Classifiers, 2.1 The Linear Case. Let the class of hypotheses be defined as $\mathcal{H} = \{ f(\cdot): \mathbb{R}^d \to \mathbb{R},\ w \in \mathbb{R}^d \mid \forall x \in \mathbb{R}^d: f(x) = w^T x,\ \|w\|_2 = 1 \}$. (1) [sent-72, score-0.026]

23 Consequently, the signed distance of a sample $(X, Y)$ to the hyperplane $w^T x = 0$, or the margin $M(w) \in \mathbb{R}$, can be defined as $M(w) = \frac{Y(w^T X)}{\|w\|_2}$. [sent-73, score-0.261]

24 We instead focus on the first moment of the margin distribution. [sent-75, score-0.261]

25 Maximizing the expected (average) margin follows from solving $M^* = \max_w E\left[\frac{Y(w^T X)}{\|w\|_2}\right] = \max_{f \in \mathcal{H}} E[Y f(X)]$. (3) [sent-76, score-0.292]

26 Remark that the non-separable case does not require slack variables. [sent-77, score-0.031]

27 The empirical counterpart becomes $\hat{M} = \max_w \frac{1}{n}\sum_{i=1}^n \frac{Y_i(w^T X_i)}{\|w\|_2}$, (4) which can be written as the constrained convex problem $\min_w -\frac{1}{n}\sum_{i=1}^n Y_i(w^T X_i)$ s.t. $w^T w \le 1$. [sent-78, score-0.104]

28 The Lagrangian with multiplier $\lambda \ge 0$ becomes $L(w, \lambda) = -\frac{1}{n}\sum_{i=1}^n Y_i(w^T X_i) + \frac{\lambda}{2}(w^T w - 1)$. [sent-81, score-0.072]

29 By switching the minimax problem to a maximin problem (application of Slater's condition), the first-order condition for optimality $\frac{\partial L(w,\lambda)}{\partial w} = 0$ gives $w_n = \frac{1}{\lambda n}\sum_{i=1}^n Y_i X_i = \frac{1}{\lambda n} X^T y$, (5) where $w_n \in \mathbb{R}^d$ denotes the optimum of (4). [sent-82, score-0.729]

30 The corresponding parameter $\lambda$ can be found by substituting (5) in the constraint $w^T w = 1$, or $\lambda = \frac{1}{n}\sqrt{y^T X X^T y}$, since $\left\|\frac{1}{n}\sum_{i=1}^n Y_i X_i\right\|_2 = \frac{1}{n}\sqrt{y^T X X^T y}$ and the optimum obviously occurs when $w^T w = 1$. [sent-83, score-0.043]
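The closed-form solution (5) makes the linear MAM classifier essentially a two-line estimator. Below is a minimal sketch in Python/NumPy of fitting and predicting with it, assuming labels in {-1, +1}; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def mam_linear_fit(X, y):
    """Linear MAM solution (5): w is proportional to X^T y, rescaled so that w^T w = 1."""
    w = X.T @ y                      # X^T y is proportional to w_n; normalising recovers the unit-norm solution
    nrm = np.linalg.norm(w)
    return w / nrm if nrm > 0 else w

def mam_linear_predict(w, X):
    """Predict labels in {-1, +1} from the sign of the latent value w^T x."""
    return np.sign(X @ w)

# toy usage on two Gaussian clouds
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w = mam_linear_fit(X, y)
print("training accuracy:", (mam_linear_predict(w, X) == y).mean())
```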

31 It becomes clear that the above derivations remain valid as n → ∞, resulting in the following theorem. [sent-84, score-0.031]

32 Theorem 1 (Explicit Actual Optimum for the MAMC) The function $f(x) = w^T x$ in $\mathcal{H}$ maximizing the expected margin satisfies $w^* = \arg\max_w E\left[\frac{Y(w^T X)}{\|w\|_2}\right] = \frac{1}{\lambda} E[XY]$, (6) where $\lambda$ is a normalization constant such that $\|w^*\|_2 = 1$. [sent-85, score-0.318]

33 2.2 Kernel-based Classifier and Parzen Window. It becomes straightforward to recast the resulting classifier as a kernel classifier by mapping the input data samples $X$ into a feature space $\varphi: \mathbb{R}^d \to \mathbb{R}^{d_\varphi}$, where $d_\varphi$ is possibly infinite. [sent-87, score-0.152]

34 Specifically, $w_n^T \varphi(X) = \frac{1}{\lambda n}\sum_{i=1}^n Y_i K(X_i, X)$, (7) where $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is defined as the inner product such that $\varphi(X)^T \varphi(X') = K(X, X')$ for any $X, X'$. [sent-91, score-0.256]

35 Conversely, any function $K$ corresponds with the inner product of a valid map $\varphi$ if the function $K$ is positive definite. [sent-92, score-0.026]

36 As previously, the term $\lambda$ becomes $\lambda = \frac{1}{n}\sqrt{y^T \Omega y}$, with kernel matrix $\Omega \in \mathbb{R}^{n \times n}$ where $\Omega_{ij} = K(X_i, X_j)$ for all $i, j = 1, \dots, n$. [sent-93, score-0.127]

37 Now the class of positive definite Mercer kernels can be used as they induce a proper mapping ϕ. [sent-97, score-0.026]

38 A classical choice is the use of a linear kernel (or $K(X, X') = X^T X'$), a polynomial kernel of degree $p \in \mathbb{N}_0$ (or $K(X, X') = (X^T X' + b)^p$), an RBF kernel (or $K(X, X') = \exp(-\|X - X'\|_2^2 / \sigma)$), or a dedicated kernel for a specific application (e.g. [sent-98, score-0.506]
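Combining (7) with one of these kernel choices gives a classifier that requires no optimization at all, only kernel evaluations. A minimal sketch, assuming the RBF convention exp(-||x - x'||^2 / sigma) quoted above and the normalization lambda = (1/n) sqrt(y^T Omega y); names are illustrative, not the authors' code.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / sigma), the RBF convention used in the text."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma)

def mam_kernel_decision(X_train, y_train, X_test, kernel=rbf_kernel):
    """Latent value w_n^T phi(x) = (1/(lambda n)) sum_i Y_i K(X_i, x), as in (7),
    with lambda = (1/n) sqrt(y^T Omega y) so that the solution has unit norm."""
    n = len(y_train)
    Omega = kernel(X_train, X_train)                 # kernel matrix Omega_ij = K(X_i, X_j)
    lam = np.sqrt(y_train @ Omega @ y_train) / n
    return (y_train @ kernel(X_train, X_test)) / (lam * n)

# predicted labels are the sign of the latent value:
# y_hat = np.sign(mam_kernel_decision(X_train, y_train, X_test))
```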

39 Panel (a) of the figure depicts an example of a nonlinear classifier based on the well-known Ripley dataset, and the contour lines score the 'certainty of prediction' as explained in the next section. [sent-104, score-0.087]

40 The expression (7) is similar (proportional) to the classical Parzen window for classification, but differs in the use of a positive definite (Mercer) kernel $K$ instead of the pdf $\kappa\left(\frac{X - \cdot}{h}\right)$ with bandwidth $h > 0$, and in the form of the denominator. [sent-105, score-0.231]

41 The classical motivation of statistical kernel estimators is based on asymptotic theory in low dimensions (i. [sent-106, score-0.212]

42 The novel element of the above result is the derivation of a clear (both theoretical and empirical) optimality principle underlying the rule, as opposed to the asymptotic results of [4] and the geometric motivations in [12, 15]. [sent-114, score-0.193]

43 As a direct byproduct, it becomes straightforward to extend the Parzen window classifier with an additional intercept term or other parametric parts, or towards additive (structured) models as in [9]. [sent-115, score-0.139]

44 The empirical Rademacher complexity is then defined [8, 1] as $\hat{R}_n(\mathcal{H}) = E_\sigma\left[\sup_{f \in \mathcal{H}} \frac{2}{n}\sum_{i=1}^n \sigma_i f(X_i) \,\middle|\, X_1, \dots, X_n\right]$, (9) [sent-119, score-0.082]

45 where the expectation is taken over the choice of the binary vector $\sigma = (\sigma_1, \dots, \sigma_n)$. [sent-122, score-0.029]

46 It is observed that the empirical Rademacher complexity defines a natural complexity measure to study the maximal average margin classifier, as the definitions of the empirical Rademacher complexity and of the maximal average margin closely resemble each other (see also [8]). [sent-126, score-1.042]

47 The following result was given in [1], Lemma 22, but we give an alternative proof by exploiting the structure of the optimal estimate explicitly. [sent-127, score-0.027]

48 $\hat{R}_n(\mathcal{H}) \le \frac{2}{n}\sqrt{\mathrm{tr}(\Omega)}$. (10) Proof: The proof goes along the same lines as the classical bound on the empirical Rademacher complexity for kernel machines outlined in [1], Lemma 22. [sent-132, score-0.32]

49 Specifically, once a vector $\sigma \in \{-1, 1\}^n$ is fixed, it is immediately seen that $\max_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i)$ equals the solution as in (7), or $\max_w \sum_{i=1}^n \sigma_i (w^T \varphi(X_i)) = \frac{\sigma^T \Omega \sigma}{\sqrt{\sigma^T \Omega \sigma}} = \sqrt{\sigma^T \Omega \sigma}$. [sent-133, score-0.094]

50 Remark that in the case of a kernel with constant trace (as e.g. [sent-136, score-0.127]

51 in the case of the RBF kernel where $\mathrm{tr}(\Omega) = n$), it follows from this result that also the (expected) Rademacher complexity satisfies $E[\hat{R}_n(\mathcal{H})] \le \frac{2}{n}\sqrt{\mathrm{tr}(\Omega)}$. [sent-138, score-0.136]

52 In general, one has that $E[K(X, X)]$ equals the trace of the integral operator $T_K$ defined on $L_2(P_X)$ as $T_K(f) = \int K(X, Y) f(X)\, dP_X(X)$, as in [1]. [sent-139, score-0.075]
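Since both quantities only involve the kernel matrix, the empirical Rademacher complexity and its trace bound are cheap to evaluate in practice. A small sketch, assuming the 2/n scaling of definition (9) as reconstructed above:

```python
import numpy as np

def empirical_rademacher(Omega, n_draws=2000, seed=0):
    """Monte Carlo estimate of (2/n) E_sigma[ sqrt(sigma^T Omega sigma) ],
    the empirical Rademacher complexity of H for a given kernel matrix Omega."""
    rng = np.random.default_rng(seed)
    n = Omega.shape[0]
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n))
    quad = np.einsum('ki,ij,kj->k', sigmas, Omega, sigmas)   # sigma^T Omega sigma for each draw
    return 2.0 * np.sqrt(np.maximum(quad, 0.0)).mean() / n

def rademacher_trace_bound(Omega):
    """Closed-form bound (2/n) sqrt(tr(Omega)); for an RBF kernel tr(Omega) = n, giving 2/sqrt(n)."""
    n = Omega.shape[0]
    return 2.0 * np.sqrt(np.trace(Omega)) / n
```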

53 Application of McDiarmid's inequality to the variable $Z = \sup_{f \in \mathcal{H}} \left( E[Y(w^T \varphi(X))] - \frac{1}{n}\sum_{i=1}^n Y_i(w^T \varphi(X_i)) \right)$ gives the following result, as in [8, 1]. [sent-140, score-0.087]

54 Lemma 2 (Deviation Inequality) Let $0 < B_\varphi < \infty$ be a fixed constant such that $\sup_z \|\varphi(z)\|_2 = \sup_z \sqrt{K(z, z)} \le B_\varphi$, so that $|w^T \varphi(z)| \le B_\varphi$, and let $\delta \in \mathbb{R}_0^+$ be fixed. [sent-141, score-0.152]

55 Then, with probability exceeding $1 - \delta$, one has for any $w \in \mathbb{R}^d$ that $E[Y(w^T \varphi(X))] \ge \frac{1}{n}\sum_{i=1}^n Y_i(w^T \varphi(X_i)) - \hat{R}_n(\mathcal{H}) - 3 B_\varphi \sqrt{\frac{\ln(2/\delta)}{2n}}$. (12) [sent-142, score-0.137]

56 Therefore it follows that one maximizes the expected margin by maximizing the empirical average margin, while controlling the empirical Rademacher complexity by choice of the model class (kernel). [sent-143, score-0.504]

57 It is now illustrated how one can obtain a practical upper bound on the 'certainty of prediction' using $f(x) = w_n^T x$. [sent-145, score-0.23]

58 (Theorem 2, Occurrence of Mistakes) Given an i.i.d. sample $D_n = \{(X_i, Y_i)\}_{i=1}^n$, a constant $B_\varphi \in \mathbb{R}$ such that $\sup_z \sqrt{K(z, z)} \le B_\varphi$, and a fixed $\delta \in \mathbb{R}_0^+$. [sent-149, score-0.076]

59 Then, with probability exceeding $1 - \delta$, one has for all $w \in \mathbb{R}^d$ that $P\left(Y(w^T \varphi(X)) \le 0\right) \le \frac{B_\varphi - E[Y(w^T \varphi(X))]}{B_\varphi} \le 1 - \frac{\sqrt{y^T \Omega y}}{n B_\varphi} + \frac{\hat{R}_n(\mathcal{H}) + 3 B_\varphi \sqrt{\frac{\ln(2/\delta)}{2n}}}{B_\varphi}$. (13) [sent-150, score-0.137]

60 Proof: The proof follows directly from application of Markov's inequality to the positive random variable $B_\varphi - Y(w^T \varphi(X))$, with expectation $B_\varphi - E[Y(w^T \varphi(X))]$, estimated accurately by the sample average as in the previous theorem. [sent-151, score-0.137]

61 More generally, one obtains with probability exceeding $1 - \delta$ that for any $w \in \mathbb{R}^d$ and for any $\rho$ such that $-B_\varphi < \rho < B_\varphi$, $P\left(Y(w^T \varphi(X)) \le -\rho\right) \le \frac{B_\varphi}{B_\varphi + \rho} - \frac{\sqrt{y^T \Omega y}}{n(B_\varphi + \rho)} + \frac{\hat{R}_n(\mathcal{H})}{B_\varphi + \rho} + \frac{3 B_\varphi}{B_\varphi + \rho}\sqrt{\frac{\ln(2/\delta)}{2n}}$. (14) [sent-152, score-0.245]

62 This results in a practical assessment of the 'certainty' of a prediction as follows. [sent-153, score-0.046]

63 Therefore $P(Y(w_n^T \varphi(x)) \le 0) = P(Y(w_n^T \varphi(x)) = \dots$ [remainder of the sentence lost to figure residue: class prediction plot for class 1 and class 2]. [sent-155, score-0.098]

64 The contour lines represent the estimate of the certainty of prediction ('scores') as derived in Theorem 2 for the MAM classifier in (a), and as in Corollary 1 for the case of SVMs with g(z) = min(1, max(-1, z)), where |z| < 1 corresponds to the inner part of the margin of the SVM, in (b). [sent-184, score-0.558]

65 While the contours in (a) give an overall score of the predictions, the scores given in (b) focus on the margin of the SVM. [sent-185, score-0.29]

66 When asserting this for a number nv ∈ N of samples X ∼ PX with nv → ∞, a misprediction would occur less than δnv times. [sent-189, score-0.199]

67 In this sense, one can use the latent variable wT ϕ(x∗ ) as an indication of how ’certain’ the prediction is. [sent-190, score-0.046]

68 Panel (a) of the figure gives an example of the MAM classifier, together with the level plots indicating the certainty of prediction. [sent-192, score-0.198]
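A minimal sketch of how the latent value can be converted into such a certainty score, assuming the reconstructed form of the bound (14) above; the helper below is illustrative rather than the authors' implementation.

```python
import numpy as np

def misprediction_bound(rho, y_train, Omega, B_phi, R_hat, delta):
    """Upper bound on P(Y * w^T phi(X) <= -rho), following the reconstructed (14):
    (B_phi - sqrt(y^T Omega y)/n + R_hat + 3 B_phi sqrt(ln(2/delta)/(2n))) / (B_phi + rho)."""
    n = len(y_train)
    emp_margin = np.sqrt(y_train @ Omega @ y_train) / n          # empirical average margin of w_n
    dev = 3.0 * B_phi * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return np.clip((B_phi - emp_margin + R_hat + dev) / (B_phi + rho), 0.0, 1.0)

# certainty score for a test point x*: plug in rho = |w_n^T phi(x*)| (the latent value from (7));
# 1 - misprediction_bound(...) plays the role of the contour values in the figure.
```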

69 Remark however that the described ’certainty of prediction’ statement differs from a conditional statement of the risk given as P (Y (wT ϕ(X)) < 0 | X = x∗ ). [sent-193, score-0.129]

70 The essential difference with the probabilistic estimates based on the density estimates resulting from the Parzen window estimator is that results become independent of the data dimension, as one avoids estimating the joint distribution. [sent-194, score-0.083]

71 Then, a transformation of the random variable $Y(w^T X)$ can be fruitful, using a monotonically increasing function $g: \mathbb{R} \to \mathbb{R}$ with a constant $B'_\varphi \ll B_\varphi$ such that $|g(z)| \le B'_\varphi$ and $g(0) = 0$. [sent-197, score-0.037]

72 In the choice of a proper transformation, two counteracting effects should be traded off properly. [sent-198, score-0.029]

73 First, a small choice of $B'_\varphi$ improves the bound, as e.g. [sent-199, score-0.029]

74 On the other hand, such a transformation would make the expected value E[g(Y (wT ϕ(X)))] smaller than E[Y (wT ϕ(X))]. [sent-202, score-0.037]

75 Modifying Theorem 2 gives Corollary 1 (Occurrence of Mistakes, bis): Given i.i.d. [sent-203, score-0.031]

76 Similarly to the previous section, Corollary 1 can be used to score the certainty of prediction by considering for each $X = x^*$ the value of $g(w^T x^*)$ and $g(-w^T x^*)$. [sent-212, score-0.278]

77 Panel (b) of the figure gives an example by considering the clipping transformation $g(z) = \min(1, \max(-1, z)) \in [-1, 1]$, such that $B'_\varphi = 1$. [sent-214, score-0.068]

78 Note that this a-priori choice of the function g is not dependent on the (empirical) optimality criterion at hand. [sent-216, score-0.105]

79 3 Soft-margin SVMs and MAM classifiers. Besides the margin-based mechanism, the MAM classifier shares other properties with the soft-margin maximal margin classifier (SVM) as well. [sent-218, score-0.381]

80 Applying this function to the MAM formulation (4), one obtains for a $C > 0$ the problem $\max_w -\sum_{i=1}^n \left[1 - Y_i(w^T \varphi(X_i))\right]_+$ s.t. [sent-220, score-0.031]

81 Thus, omission of the slack constraints ξi ≥ 0 in the SVM formulation results in the Parzen window classifier. [sent-237, score-0.083]

82 4 Maximal Average Margin for Ordinal Regression Along the same lines as [6], the maximal average margin principle can be applied to ordinal regression tasks. [sent-238, score-0.651]

83 The $w \in \mathbb{R}^d$ maximizing $P\left(I\left(w^T(\varphi(X) - \varphi(X'))(Y - Y') > 0\right)\right)$ can be found by solving for the maximal average margin between pairs as follows: $M^* = \max_w E\left[\operatorname{sign}(Y - Y')\, w^T(\varphi(X) - \varphi(X'))\right]$. [sent-243, score-0.476]

84 Given i.i.d. samples $\{(X_i, Y_i)\}_{i=1}^n$, empirical risk minimization is obtained by solving $\min_w -\frac{1}{n}\sum_{i,j} \operatorname{sign}(Y_j - Y_i)\, w^T(\varphi(X_j) - \varphi(X_i))$ s.t. [sent-247, score-0.153]

85 (20) The Lagrangian with multiplier $\lambda \ge 0$ becomes $L(w, \lambda) = -\frac{1}{n}\sum_{i,j} \operatorname{sign}(Y_j - Y_i)\, w^T(\varphi(X_j) - \varphi(X_i)) + \frac{\lambda}{2}(w^T w - 1)$. [sent-250, score-0.072]

86 Let $D_y \in \{-1, 0, 1\}^{n' \times n}$ be such that $D_{y,ki} = 1$ and $D_{y,kj} = -1$ if the $k$-th couple equals $(i, j)$. [sent-252, score-0.044]

87 Then, by switching the minimax problem to a maximin problem, the first-order condition for optimality $\frac{\partial L(w, \lambda)}{\partial w} = 0$ gives the expression. [sent-253, score-0.226]
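In kernel form the pairwise sum collapses, since each feature map phi(X_j) enters with coefficient c_j = sum_i sign(Y_j - Y_i), so the latent ordinal score of a new point x is proportional to sum_j c_j K(X_j, x). A sketch under that assumption (the collapse follows from the stated first-order condition but is not spelled out in the extracted text; names are illustrative):

```python
import numpy as np

def mam_ordinal_coefficients(y):
    """c_j = sum_i sign(y_j - y_i): how many training points x_j outranks minus how many outrank it.
    Computed in O(n log n) with a sort instead of the O(n^2) double sum."""
    y = np.asarray(y, dtype=float)
    y_sorted = np.sort(y)
    smaller = np.searchsorted(y_sorted, y, side="left")            # #{i : y_i < y_j}
    larger = len(y) - np.searchsorted(y_sorted, y, side="right")   # #{i : y_i > y_j}
    return (smaller - larger).astype(float)

def mam_ordinal_score(X_train, y_train, X_test, kernel):
    """Unnormalised latent ordering score proportional to w^T phi(x); the unit-norm
    rescaling by lambda does not change the induced ranking of test points.
    `kernel(A, B)` is any function returning the matrix K(A_i, B_j), e.g. an RBF kernel."""
    c = mam_ordinal_coefficients(y_train)
    return c @ kernel(X_train, X_test)        # each prediction costs O(n) kernel evaluations
```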


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wt', 0.502), ('rademacher', 0.302), ('mam', 0.29), ('margin', 0.261), ('wn', 0.23), ('parzen', 0.224), ('certainty', 0.167), ('rd', 0.145), ('rn', 0.137), ('yi', 0.133), ('pxy', 0.129), ('maximal', 0.12), ('xi', 0.115), ('er', 0.113), ('ordinal', 0.111), ('exceeding', 0.108), ('kernel', 0.096), ('classi', 0.093), ('nv', 0.087), ('principle', 0.085), ('window', 0.083), ('complexities', 0.077), ('risk', 0.077), ('supz', 0.076), ('lg', 0.076), ('optimality', 0.076), ('machines', 0.063), ('contourlines', 0.058), ('fwo', 0.058), ('leuven', 0.058), ('maximin', 0.058), ('moor', 0.058), ('pelckmans', 0.058), ('classical', 0.052), ('ripley', 0.051), ('svms', 0.05), ('prediction', 0.046), ('suykens', 0.046), ('remark', 0.044), ('rbf', 0.044), ('equals', 0.044), ('sign', 0.043), ('optimum', 0.043), ('dn', 0.043), ('empirical', 0.042), ('application', 0.041), ('yt', 0.041), ('multiplier', 0.041), ('tr', 0.04), ('complexity', 0.04), ('grants', 0.039), ('svm', 0.039), ('mistakes', 0.039), ('average', 0.038), ('transformation', 0.037), ('xj', 0.037), ('lemma', 0.036), ('regression', 0.036), ('tk', 0.036), ('corollary', 0.036), ('cheap', 0.034), ('mercer', 0.034), ('minimization', 0.034), ('estimators', 0.032), ('asymptotic', 0.032), ('px', 0.032), ('windows', 0.032), ('max', 0.031), ('trace', 0.031), ('minimax', 0.031), ('becomes', 0.031), ('inequality', 0.031), ('gives', 0.031), ('switching', 0.03), ('occurrence', 0.03), ('ln', 0.029), ('choice', 0.029), ('lagrangian', 0.029), ('score', 0.029), ('designing', 0.027), ('proof', 0.027), ('ers', 0.027), ('inner', 0.026), ('statement', 0.026), ('maximizing', 0.026), ('class', 0.026), ('investigates', 0.025), ('asserting', 0.025), ('belgium', 0.025), ('dpx', 0.025), ('federal', 0.025), ('goa', 0.025), ('intercept', 0.025), ('johan', 0.025), ('maxf', 0.025), ('maxw', 0.025), ('perceptrons', 0.025), ('quest', 0.025), ('recast', 0.025), ('supf', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

2 0.19081676 40 nips-2007-Bundle Methods for Machine Learning

Author: Quoc V. Le, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ ) steps to precision for general convex problems and in O(log(1/ )) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach. 1

3 0.16378103 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1

4 0.13699286 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1

5 0.12859844 203 nips-2007-The rat as particle filter

Author: Aaron C. Courville, Nathaniel D. Daw

Abstract: Although theorists have interpreted classical conditioning as a laboratory model of Bayesian belief updating, a recent reanalysis showed that the key features that theoretical models capture about learning are artifacts of averaging over subjects. Rather than learning smoothly to asymptote (reflecting, according to Bayesian models, the gradual tradeoff from prior to posterior as data accumulate), subjects learn suddenly and their predictions fluctuate perpetually. We suggest that abrupt and unstable learning can be modeled by assuming subjects are conducting inference using sequential Monte Carlo sampling with a small number of samples — one, in our simulations. Ensemble behavior resembles exact Bayesian models since, as in particle filters, it averages over many samples. Further, the model is capable of exhibiting sophisticated behaviors like retrospective revaluation at the ensemble level, even given minimally sophisticated individuals that do not track uncertainty in their beliefs over trials. 1

6 0.1173546 118 nips-2007-Learning with Transformation Invariant Kernels

7 0.10344908 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

8 0.098343112 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

9 0.09761592 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

10 0.095739096 134 nips-2007-Multi-Task Learning via Conic Programming

11 0.091997646 62 nips-2007-Convex Learning with Invariances

12 0.090786643 32 nips-2007-Bayesian Co-Training

13 0.090130515 160 nips-2007-Random Features for Large-Scale Kernel Machines

14 0.089754574 166 nips-2007-Regularized Boost for Semi-Supervised Learning

15 0.086918809 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

16 0.084167033 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

17 0.083962724 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

18 0.079982623 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

19 0.07953684 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

20 0.075798713 187 nips-2007-Structured Learning with Approximate Inference


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.221), (1, -0.003), (2, -0.143), (3, 0.204), (4, -0.017), (5, 0.06), (6, 0.115), (7, -0.084), (8, -0.039), (9, 0.084), (10, 0.116), (11, -0.108), (12, -0.02), (13, 0.056), (14, 0.019), (15, -0.04), (16, 0.032), (17, 0.014), (18, 0.021), (19, 0.056), (20, 0.102), (21, -0.044), (22, 0.043), (23, 0.016), (24, 0.035), (25, -0.032), (26, -0.282), (27, 0.055), (28, -0.008), (29, -0.066), (30, -0.009), (31, 0.014), (32, -0.192), (33, -0.031), (34, 0.016), (35, 0.091), (36, 0.082), (37, -0.068), (38, 0.048), (39, 0.041), (40, 0.124), (41, 0.063), (42, -0.084), (43, 0.09), (44, 0.056), (45, -0.074), (46, 0.009), (47, -0.111), (48, 0.147), (49, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96776015 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

2 0.75024527 40 nips-2007-Bundle Methods for Machine Learning

Author: Quoc V. Le, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ ) steps to precision for general convex problems and in O(log(1/ )) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach. 1

3 0.67210543 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1

4 0.55508095 118 nips-2007-Learning with Transformation Invariant Kernels

Author: Christian Walder, Olivier Chapelle

Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1

5 0.53772277 62 nips-2007-Convex Learning with Invariances

Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola

Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1

6 0.53413492 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

7 0.51435047 159 nips-2007-Progressive mixture rules are deviation suboptimal

8 0.50983709 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

9 0.49970677 24 nips-2007-An Analysis of Inference with the Universum

10 0.47593015 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

11 0.4755941 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

12 0.45269483 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

13 0.44146037 203 nips-2007-The rat as particle filter

14 0.43029803 32 nips-2007-Bayesian Co-Training

15 0.42960748 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

16 0.41819635 187 nips-2007-Structured Learning with Approximate Inference

17 0.40545338 69 nips-2007-Discriminative Batch Mode Active Learning

18 0.39950609 166 nips-2007-Regularized Boost for Semi-Supervised Learning

19 0.37434945 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

20 0.37274319 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.027), (13, 0.03), (16, 0.011), (21, 0.075), (34, 0.033), (35, 0.036), (47, 0.094), (49, 0.4), (83, 0.14), (85, 0.012), (90, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89486176 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

Author: Moulines Eric, Francis R. Bach, Zaïd Harchaoui

Abstract: We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. Asymptotic null distributions under null hypothesis are derived, and consistency against fixed alternatives is assessed. Finally, experimental evidence of the performance of the proposed approach on both artificial and real datasets is provided. 1

2 0.87514704 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang

Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n2 ) to O(np/m), and improves computation time to O(np2 /m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.

same-paper 3 0.82258558 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

4 0.81977892 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks

Author: Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, Santiago Fernández

Abstract: In online handwriting recognition the trajectory of the pen is recorded during writing. Although the trajectory provides a compact and complete representation of the written output, it is hard to transcribe directly, because each letter is spread over many pen locations. Most recognition systems therefore employ sophisticated preprocessing techniques to put the inputs into a more localised form. However these techniques require considerable human effort, and are specific to particular languages and alphabets. This paper describes a system capable of directly transcribing raw online handwriting data. The system consists of an advanced recurrent neural network with an output layer designed for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained online database, we record excellent results using either raw or preprocessed data, well outperforming a state-of-the-art HMM based system in both cases. 1

5 0.54146636 7 nips-2007-A Kernel Statistical Test of Independence

Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola

Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.

6 0.51535642 43 nips-2007-Catching Change-points with Lasso

7 0.51489735 108 nips-2007-Kernel Measures of Conditional Dependence

8 0.5013231 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

9 0.50008702 40 nips-2007-Bundle Methods for Machine Learning

10 0.49527085 185 nips-2007-Stable Dual Dynamic Programming

11 0.4870522 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

12 0.48179331 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

13 0.47711438 179 nips-2007-SpAM: Sparse Additive Models

14 0.47709227 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

15 0.47663063 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

16 0.4764955 200 nips-2007-The Tradeoffs of Large Scale Learning

17 0.47485229 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

18 0.47274917 24 nips-2007-An Analysis of Inference with the Universum

19 0.47209734 156 nips-2007-Predictive Matrix-Variate t Models

20 0.47111517 186 nips-2007-Statistical Analysis of Semi-Supervised Regression