nips nips2007 nips2007-10 knowledge-graph by maker-knowledge-mining

10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning


Source: pdf

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas from random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if, on removal of properly chosen O(log n) points, the data becomes linearly separable, then it is called almost separable. The second contribution is a sampling based algorithm, motivated by randomized algorithms, which solves an SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper investigates the application of randomized algorithms for large scale SVM learning. [sent-9, score-0.192]

2 The second contribution is a sampling based algorithm, motivated by randomized algorithms, which solves an SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. [sent-11, score-0.488]

3 Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. [sent-13, score-0.358]

4 1 Introduction. Consider a training dataset D = {(xi, yi)}, i = 1, ..., n. [sent-15, score-0.247]

5 Here yi ∈ {+1, −1}, where xi ∈ Rd are data points and yi specify the class labels. [sent-18, score-0.479]

6 The SVM formulation for classification, which will be called C-SVM, for determining {w, b} is given by [1] C-SVM-1: Minimize over (w, b, ξ) the objective (1/2)||w||² + C Σᵢ ξi, subject to yi(w · xi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, ..., n. [sent-20, score-0.319]
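As a concrete illustration of the C-SVM-1 objective above, the following is a minimal sketch that minimizes the equivalent hinge-loss form (1/2)||w||² + C Σᵢ max(0, 1 − yi(w · xi + b)) by subgradient descent; this is only an illustrative stand-in for the off-the-shelf solvers used in the paper, and the step-size schedule is an assumption.

```python
import numpy as np

def csvm_primal_subgradient(X, y, C=1.0, n_iters=1000, lr0=0.1):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(1, n_iters + 1):
        lr = lr0 / np.sqrt(t)                       # assumed decaying step size
        viol = y * (X @ w + b) < 1                  # points with non-zero hinge loss
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

At the optimum the weight vector takes the dual form w = Σ αi yi xi quoted in the sentence that follows, with the nonzero αi identifying the support vectors.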

7 At optimality, w is given by w = Σ_{i: αi > 0} αi yi xi, with 0 ≤ αi ≤ C. Consider the set S = {xi | αi > 0}; the elements of this set are called the support vectors. [sent-23, score-0.154]

8 Every AOP has a combinatorial dimension associated with it; the combinatorial dimension captures the notion of the number of free variables for that AOP. [sent-31, score-0.284]

9 An AOP can be solved by a randomized algorithm by selecting subsets of size greater than the combinatorial dimension of the problem [2]. [sent-32, score-0.365]

10 For SVM, ∆ is the combinatorial dimension of the problem; by iterating over subsets of size greater than ∆, with the subsets chosen by random sampling, the problem can be solved efficiently [3, 4]; this algorithm was called RandSVM by the authors. [sent-33, score-0.296]

11 A priori the value of ∆ is not known, but for linearly separable classification problems the following holds: 2 ≤ ∆ ≤ d + 1. [sent-34, score-0.355]

12 When the problem is not linearly separable, the authors use the reduced convex hull formulation [5] to come up with an estimate of the combinatorial dimension; this estimate is not precisely characterized and is much higher than d (see footnote 1). [sent-36, score-0.238]

13 This work overcomes the above problems using ideas from random projections [6, 7] and randomized algorithms [8, 9, 2, 10]. [sent-39, score-0.23]

14 The main contribution is, using ideas from random projections, the conjecture that if RandSVM is solved using ∆ equal to O(log n), then the solution obtained is close to optimal with high probability (Theorem 3), in particular for almost separable datasets. [sent-41, score-0.552]

15 Almost separable datasets are those which become linearly separable when a small number of properly chosen data points are deleted from them. [sent-42, score-0.752]

16 The second contribution is an algorithm which, using ideas from randomized algorithms for Linear Programming (LP), solves the SVM problem by using samples of size linear in ∆. [sent-43, score-0.228]

17 2 A NEW RANDOMIZED ALGORITHM FOR CLASSIFICATION This section uses results from random projections, and randomized algorithms for linear programming, to develop a new algorithm for learning large scale SVM problems. [sent-45, score-0.227]

18 In Section 2.1, we discuss the case of linearly separable data and estimate the number of support vectors required such that the margin is preserved with high probability, and show that this number is much smaller than the data dimension d, using ideas from random projections. [sent-47, score-1.135]

19 In Section 2.2, we look at how the analysis applies to almost separable data and present the main result of the paper (Theorem 2). [sent-49, score-0.332]

20 In Section 2.3, we present the randomized algorithm for SVM learning. [sent-53, score-0.144]

21 2.1 Linearly separable data. We start by determining the dimension k of the target space such that, on performing a random projection onto that space, Euclidean distances and dot products are preserved. [sent-55, score-0.522]

22 The appendix contains a few results from random projections which will be used in this section. [sent-56, score-0.242]

23 (Footnote 1: details of this calculation are present in the supplementary material. Footnote 2: presented in the supplementary material.) For a linearly separable dataset D = {(xi, yi), i = 1, ..., n}, with xi ∈ Rd and yi ∈ {+1, −1}: [sent-57, score-0.62]

24 The C-SVM formulation is the same as C-SVM-1 with ξi = 0, i = 1, ..., n. [sent-60, score-0.319]

25 By dividing all the constraints by ||w||, the problem can be reformulated as follows. C-SVM-2a: Maximize over (ŵ, b̂, l) the margin l, subject to yi(ŵ · xi + b̂) ≥ l, i = 1, ..., n. [sent-64, score-0.318]

26 Here l is the margin induced by the separating hyperplane (ŵ, b̂); that is, it is the distance between the two supporting hyperplanes h1: yi(w · xi + b) = 1 and h2: yi(w · xi + b) = −1. [sent-68, score-0.834]

27 First, for any given value of k, we show how the margin changes as a function of k when the data points are projected onto a k dimensional subspace and the problem is solved there. [sent-70, score-0.642]

28 From this, we determine the value k (k << d) which will preserve the margin with a very high probability. [sent-71, score-0.283]

29 In a k dimensional subspace, there are at most k + 1 support vectors. [sent-72, score-0.22]

30 Using the idea of orthogonal extensions (definition appears later in this section), we prove that when the problem is solved in the original space, using an estimate of k + 1 on the number of support vectors, the margin is preserved with a very high probability. [sent-73, score-0.687]

31 Let w′ and xi′ be the projections of w∗ and xi, i = 1, ..., n respectively, onto a k dimensional subspace (as in Lemma 2, Appendix A). [sent-80, score-0.232]

32 The classification problem in the projected space, with the dataset being D′ = {(xi′, yi), i = 1, ..., n}, xi′ ∈ Rk, yi ∈ {+1, −1}, can be written as follows. [sent-81, score-0.289]

33 C-SVM-2b: Maximize over (w′, b̂′, l′) the margin l′, subject to yi(w′ · xi′ + b̂′) ≥ l′, i = 1, ..., n. [sent-84, score-0.554]

34 The following lemma predicts, for a given value of γ, the k such that the margin is preserved with a high probability upon projection. [sent-88, score-0.533]

35 The dimension k for which C-SVM-2b can be solved with the optimal margin obtained close to the optimal margin for the original problem is given by the following lemma. [sent-89, score-0.601]

36 If ||xi|| ≤ L, i = 1, ..., n, and k ≥ (8/γ²)(1 + (1 + L²)/(2l∗))² log(4n/δ), with 0 < γ < 1, 0 < δ < 1, then the following bound holds on the optimal margin lP obtained by solving the problem C-SVM-2b: P(lP ≥ l∗(1 − γ)) ≥ 1 − δ. Proof: [sent-96, score-0.424]

37 From Corollary 1 of Lemma 2 (Appendix A), we have w∗ · xi − (ε/2)(1 + L²) ≤ w′ · xi′ ≤ w∗ · xi + (ε/2)(1 + L²), which holds with probability at least 1 − 4e^(−ε²k/8), for some ε > 0. [sent-97, score-0.543]

38 Then the following holds with probability at least 1 − 2e^(−ε²k/8): w′ · xi′ + b∗ ≥ w∗ · xi − (ε/2)(1 + L²) + b∗ ≥ l∗ − (ε/2)(1 + L²); dividing the above by ||w′||, we have (w′ · xi′ + b∗)/||w′|| ≥ (l∗ − (ε/2)(1 + L²))/||w′||. [sent-99, score-0.391]

39 The above analysis guarantees that by projecting onto a k dimensional space, there exists at least one hyperplane (we/||we||, be/||we||) which guarantees a margin of l∗(1 − γ), where γ ≤ ε(1 + (1 + L²)/(2l∗))  (1), with probability at least 1 − 4n·e^(−ε²k/8). [sent-104, score-0.659]

40 The margin lP obtained by solving the problem C-SVM-2b can only be better than this. [sent-105, score-0.29]

41 So the value of k is given by requiring 4n·e^(−γ²k / (8(1 + (1 + L²)/(2l∗))²)) ≤ δ, which yields k ≥ (8(1 + (1 + L²)/(2l∗))²/γ²) log(4n/δ)  (2). As seen above, by randomly projecting the points onto a k dimensional subspace, the margin is preserved with a high probability. [sent-106, score-0.739]
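A small sketch that evaluates the bound in equation (2), with the κn correction used later for almost separable data (Theorem 2 in Section 2.2); the numeric inputs in the example are purely illustrative, not values reported in the paper.

```python
import math

def projection_dimension(n, gamma, delta, L, l_star, kappa=0.0):
    """k >= 8*(1 + (1 + L^2)/(2*l*))^2 / gamma^2 * log(4n/delta),
    plus kappa*n for the almost separable case."""
    base = (8.0 * (1.0 + (1.0 + L**2) / (2.0 * l_star)) ** 2 / gamma ** 2) * math.log(4.0 * n / delta)
    return math.ceil(base + kappa * n)

# Illustrative call: one million points, 20% margin distortion, 5% failure probability.
print(projection_dimension(n=10**6, gamma=0.2, delta=0.05, L=1.0, l_star=1.0))
```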

42 We use Theorem 1 to determine an estimate on the number of support vectors such that the margin is preserved with a high probability when the problem is solved in the original space. [sent-109, score-0.667]

43 The theorem is based on the following fact: in a k dimensional space, the number of support vectors is upper bounded by k + 1. [sent-111, score-0.379]

44 We show that this k + 1 can be used as an estimate of the number of support vectors in the original space such that the solution obtained preserves the margin with a high probability. [sent-112, score-0.472]

45 Definition: An orthogonal extension of a (k−1)-dimensional flat hp = (wp, b), where wp = (w1, ..., wk) (a (k−1)-dimensional flat is a (k−1)-dimensional affine space), [sent-114, score-0.359]

46 in a subspace Sk of dimension k, to a (d−1)-dimensional hyperplane h = (w, b) in d-dimensional space, is defined as follows. [sent-117, score-0.337]

47 Let xi′ = R^T xi and x̂i = R xi′, i = 1, ..., n, where R is the random projection matrix. [sent-120, score-0.744]

48 Let wp = (w1, ..., wk) be the optimal hyperplane classifier with margin lP for the projected points x1′, ..., xn′. [sent-123, score-0.564]

49 If (wp, b) is a separator with margin lp for the projected points, then its orthogonal extension (w, b) is a separator with margin lp for the original points; that is, if yi(wp · xi′ + b) ≥ l, i = 1, ..., n, then yi(w · xi + b) ≥ l, i = 1, ..., n. [sent-135, score-1.235]

50 An important point to note, which will be required when extending orthogonal extensions to nonlinear kernels, is that dot products between the points are preserved upon these orthogonal projections, that is, x̂i^T x̂j = xi′^T xj′. [sent-141, score-0.758]
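The definition of an orthogonal extension is truncated in the extraction above, so the following sketch reflects one reading that is consistent with the stated properties: take R with orthonormal columns spanning Sk, project with xi′ = R^T xi, and lift a projected-space separator wp back as w = R·wp. The construction and variable names here are assumptions made for illustration, not a quotation of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 50, 10, 20

# Assumed construction: S_k is spanned by the orthonormal columns of R (R^T R = I_k).
R, _ = np.linalg.qr(rng.standard_normal((d, k)))
X = rng.standard_normal((n, d))                    # original points in R^d
Xp = X @ R                                         # projected coordinates, x_i' = R^T x_i (row-wise)

wp = rng.standard_normal(k)                        # a separator found in the projected space
b = 0.5
w = R @ wp                                         # its orthogonal extension back to R^d

# The extension reproduces the projected-space margins: w . x_i = wp . x_i',
# so a separator for the projected points separates the original points with the same margin.
print(np.allclose(X @ w + b, Xp @ wp + b))
# Pairwise dot products of the lifted points R x_i' match those of the projected coordinates.
print(np.allclose((X @ R) @ (X @ R).T, (X @ R @ R.T) @ (X @ R @ R.T).T))
```

Because R^T R = I, the lift changes neither the margins nor the pairwise dot products of the projected points.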

51 Given k ≥ (8/γ²)(1 + (1 + L²)/(2l∗))² log(4n/δ) and n training points with maximum norm L in d dimensional space, separable by a hyperplane with margin l∗, there exists a subset of k′ training points x1, ..., xk′ [sent-145, score-1.091]

52 with k′ ≤ k, and a hyperplane h satisfying the following conditions: [sent-148, score-0.231]

53 1. h has margin at least l∗(1 − γ) with probability at least 1 − δ. [sent-149, score-0.329]

54 2. x1, ..., xk′ are the only training points which lie either on h1 or on h2. Proof: [sent-153, score-0.198]

55 Let w∗, b∗ denote the normal and offset of a separating hyperplane with margin l∗, that is, yi(w∗ · xi + b∗) ≥ l∗ for all xi and ||w∗|| = 1. [sent-154, score-0.872]

56 Project w∗, x1, ..., xn onto a k dimensional space and let w′, z1, ..., zn be the (suitably scaled) projections. [sent-158, score-0.15]

57 By Theorem 1, yi(w′ · zi + b∗) ≥ l∗(1 − γ) holds for all zi with probability at least 1 − δ. [sent-165, score-0.27]

58 Let h be the orthogonal extension of (w′, b∗) to the full d dimensional space. [sent-166, score-0.223]

59 Then h has margin at least l ∗ (1 − γ), as required. [sent-167, score-0.29]

60 To prove the second part, consider the projected training points which lie on (w′, b∗) (that is, they lie on either of the two sandwiching hyperplanes). [sent-169, score-0.289]

61 Clearly, these will be the only points which lie on the orthogonal extension h, by definition. [sent-171, score-0.228]

62 From the above analysis, it is seen that if k << d, then we can estimate that the number of support vectors is k + 1, and the algorithm RandSVM would take on average O(k log n) iterations to solve the problem [3, 4]. [sent-172, score-0.245]

63 2.2 Almost separable data. In this section, we look at how the above analysis can be applied to almost separable datasets. [sent-174, score-0.602]

64 We call a dataset almost separable if, by removing a fraction κ = O(log n / n) of the points, the dataset becomes linearly separable. [sent-175, score-0.637]

65 The C-SVM formulation when the data is not linearly separable (and almost separable) was given in C-SVM-1. [sent-176, score-0.189]

66 The following theorem proves a result for almost separable data similar to the one proved in Claim 1 for separable data. [sent-180, score-0.675]

67 Subject to: yi(w · xi + b) ≥ l − ξi, ξi ≥ 0, i = 1, ..., n. [sent-181, score-0.277]

68 Given k ≥ (8/γ²)(1 + (1 + L²)/(2l∗))² log(4n/δ) + κn, with l∗ being the margin at optimality, l the lower bound on l∗ as in the Generalized Optimal Hyperplane formulation, and κ = O(log n / n), there exists a subset of k′ training points x1, ..., xk′ [sent-185, score-0.522]

69 with k′ ≤ k, and a hyperplane h satisfying the following conditions: [sent-188, score-0.231]

70 1. h has margin at least l(1 − γ) with probability at least 1 − δ. [sent-189, score-0.329]

71 2. At most (8/γ²)(1 + (1 + L²)/(2l∗))² log(4n/δ) points lie on the planes h1 or h2. [sent-190, score-0.178]

72 3. x1, ..., xk′ are the only points which define the hyperplane h, that is, they are the support vectors of h. [sent-194, score-0.465]

73 Let the optimal solution for the generalized optimal hyperplane formulation be (w ∗ , b∗ , ξ ∗ ). [sent-196, score-0.265]

74 w∗ = Σ_{i: αi > 0} αi yi xi, and l∗ = 1/||w∗||, as mentioned before. [sent-197, score-0.277]

75 The set of support vectors can be split into two disjoint sets: SV1 = {xi : αi > 0 and ξi∗ = 0} (unbounded SVs), and SV2 = {xi : αi > 0 and ξi∗ > 0} (bounded SVs). [sent-198, score-0.189]

76 Then the dataset becomes linearly separable with margin l∗ . [sent-200, score-0.688]

77 When all the points in SV2 are added back, at most all these points are added to the set of support vectors and the margin does not change. [sent-202, score-0.594]

78 The margin not changing is guaranteed by the fact that for proving the conditions 1 and 2, we have assumed the worst possible margin, and any value lower than this would violate the constraints of the problem. [sent-203, score-0.251]

79 Let the points be projected onto a random k dimensional subspace as before. [sent-206, score-0.426]

80 The orthogonal extensions can be considered as a projection from the k dimensional space to the Φ-space, such that the kernel function values are preserved. [sent-208, score-0.356]

81 This section presents another randomized algorithm which only requires that the sample size be greater than the number of support vectors. [sent-212, score-0.247]

82 Let xi be a violator (xi is a non-sampled point such that yi(w^T xi + b) < 1). [sent-222, score-0.429]

83 Since SVM is an instance of an AOP, solving the problem with the set of constraints SV ∪ {xi} can only result in an increase (decrease) of the objective function of the primal (dual). [sent-223, score-0.152]
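A hedged sketch of the working-set idea described in the last few sentences: train on the current support vectors plus a fresh random sample, then check the non-sampled points for violations yi(w^T xi + b) < 1 and fold violators back in until none remain. The sample size, stopping rule, and the use of scikit-learn's linear SVC as the inner solver are assumptions; the paper instead wraps existing solvers such as SVMLight.

```python
import numpy as np
from sklearn.svm import SVC

def sampled_svm(X, y, sample_size=200, C=1.0, max_rounds=50, seed=0):
    """Working-set sketch: train on (current SVs + fresh random sample),
    stop when no non-sampled point violates y_i (w.x_i + b) >= 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sv = np.array([], dtype=int)                        # global indices of current support vectors
    for _ in range(max_rounds):
        sample = rng.choice(n, size=min(sample_size, n), replace=False)
        work = np.unique(np.concatenate([sv, sample]))  # assumes both classes appear in `work`
        clf = SVC(kernel="linear", C=C).fit(X[work], y[work])
        w, b = clf.coef_.ravel(), clf.intercept_[0]
        sv = work[clf.support_]                         # map solver-local SV indices back to global ones
        rest = np.setdiff1d(np.arange(n), work)
        violators = rest[y[rest] * (X[rest] @ w + b) < 1]
        if violators.size == 0:                         # no violators outside the working set: done
            break
        sv = np.unique(np.concatenate([sv, violators[:sample_size]]))  # fold some violators back in
    return clf
```

Any SVM solver can play the role of SVC here, which is the sense in which the approach acts as a small wrapper able to scale up any solver.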

84 This can be handled only by solving for k as a function of ε, where ε is the maximum allowed distortion in the L2 norms of the vectors upon projection. [sent-226, score-0.154]

85 The checkerboard dataset consists of vectors in a 2 dimensional space. [sent-271, score-0.285]

86 The points are labelled as follows: if (x1 % 2 = x2 % 2), then the point is labelled negative, else the point is labelled positive. [sent-274, score-0.197]
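A sketch of generating such a checkerboard dataset; taking the integer part of each coordinate before the parity test and the extent of the board are assumptions, since the extracted rule does not show them.

```python
import numpy as np

def make_checkerboard(n_points, extent=4, seed=0):
    """Uniform 2-D points labelled by cell parity: negative if floor(x1)%2 == floor(x2)%2."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, extent, size=(n_points, 2))
    same_parity = (np.floor(X[:, 0]) % 2) == (np.floor(X[:, 1]) % 2)
    y = np.where(same_parity, -1, +1)      # negative when parities match, positive otherwise
    return X, y

X, y = make_checkerboard(10_000)
```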

87 For each of the synthetic datasets, a training set of 1,000,000 points and a test set of 10,000 points was generated. [sent-275, score-0.237]

88 From now on, the smaller training set will have a subscript of 1 and the larger training set will have a subscript of 2, for example, ringnorm1 and ringnorm2 . [sent-277, score-0.142]

89 Real world dataset: The RCV1 dataset consists of 804,414 documents, with each document consisting of 47,236 features. [sent-278, score-0.164]

90 For linearly separable datasets, k was set to (16 log(4n/δ))/ε². [sent-287, score-0.355]

91 Discussion of results: Table 1, which has the timing and classification accuracy comparisons, shows that RandSVM can scale up SVM solvers for very large datasets. [sent-289, score-0.16]

92 Using just a small wrapper around the solvers, RandSVM has scaled up SVMLight so that its performance is comparable to that of state of the art solvers such as SVMPerf and SVMLin. [sent-290, score-0.153]

93 Furthermore, it is clear, from the experiments on the synthetic datasets, that the execution times taken for training with 10^5 examples and 10^6 examples are not too far apart; this is a clear indication that the execution time does not increase rapidly with the increase in the dataset size. [sent-292, score-0.293]

94 Since the classification accuracies obtained by using RandSVM and the baseline solvers are very close, it is clear that Theorem 2 holds in practice. [sent-294, score-0.16]

95 4 Further Research. It is clear from the experimental evaluations that randomized algorithms can be used to scale up SVM solvers to large scale classification problems. [sent-295, score-0.352]

96 If an estimate of the number of support vectors is obtained, then algorithm RandSVM-1 can also be used for other SVM learning problems, as they are usually instances of an AOP. [sent-296, score-0.189]

97 Appendix A: Some Results from Random Projections. Here we review a few lemmas from random projections [7]. [sent-298, score-0.17]

98 The following lemma discusses how the L2 norm of a vector is preserved when it is projected on a random subspace. [sent-299, score-0.338]

99 Let u′ = R^T u / √k and v′ = R^T v / √k be the projections of u and v to Rk via a random matrix R whose entries are chosen independently from N(0, 1) or U(−1, 1). [sent-306, score-0.336]
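A small sketch of the projection in this lemma: multiply by a random matrix with independent N(0, 1) entries and rescale by 1/√k (the exact scaling is garbled in the extraction, so this standard normalization is an assumption), then check empirically that norms and dot products are approximately preserved.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 1000, 100

u = rng.standard_normal(d)
v = rng.standard_normal(d)

R = rng.standard_normal((d, k))        # entries i.i.d. N(0, 1)
u_p = R.T @ u / np.sqrt(k)             # assumed Johnson-Lindenstrauss-style scaling
v_p = R.T @ v / np.sqrt(k)

print(np.linalg.norm(u), np.linalg.norm(u_p))   # L2 norm approximately preserved
print(u @ v, u_p @ v_p)                         # dot product approximately preserved
```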

100 A random sampling technique for training support vector machines. [sent-323, score-0.178]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('randsvm', 0.416), ('sv', 0.315), ('separable', 0.27), ('margin', 0.251), ('aop', 0.234), ('svm', 0.176), ('hyperplane', 0.163), ('preserved', 0.156), ('xi', 0.152), ('randomized', 0.144), ('wp', 0.136), ('projections', 0.135), ('yi', 0.125), ('dimensional', 0.117), ('solvers', 0.112), ('orthogonal', 0.106), ('support', 0.103), ('lp', 0.098), ('rd', 0.09), ('vectors', 0.086), ('linearly', 0.085), ('libsvm', 0.083), ('ku', 0.083), ('projected', 0.082), ('dataset', 0.082), ('ccat', 0.078), ('points', 0.077), ('svmlight', 0.077), ('combinatorial', 0.076), ('theorem', 0.073), ('appendix', 0.072), ('svmperf', 0.068), ('kernels', 0.066), ('dimension', 0.066), ('lemma', 0.065), ('subspace', 0.065), ('execution', 0.064), ('violators', 0.062), ('almost', 0.062), ('projection', 0.061), ('log', 0.056), ('corollary', 0.053), ('aximize', 0.052), ('balcazar', 0.052), ('inimize', 0.052), ('nonsampled', 0.052), ('osamu', 0.052), ('ramesh', 0.052), ('randomsubset', 0.052), ('svmlearn', 0.052), ('svmlin', 0.052), ('ideas', 0.051), ('datasets', 0.05), ('onto', 0.05), ('holds', 0.048), ('dot', 0.048), ('scale', 0.048), ('lie', 0.045), ('dai', 0.045), ('svs', 0.045), ('classi', 0.043), ('wk', 0.043), ('synthetic', 0.043), ('formulation', 0.042), ('extensions', 0.042), ('solver', 0.042), ('products', 0.042), ('hyperplanes', 0.041), ('reformulated', 0.041), ('art', 0.041), ('labelled', 0.04), ('training', 0.04), ('subsets', 0.04), ('solving', 0.039), ('least', 0.039), ('solved', 0.039), ('jose', 0.036), ('rij', 0.036), ('separator', 0.036), ('xk', 0.036), ('hull', 0.035), ('indian', 0.035), ('random', 0.035), ('contribution', 0.033), ('violates', 0.033), ('let', 0.033), ('yang', 0.032), ('satisfying', 0.032), ('high', 0.032), ('geometry', 0.031), ('subscript', 0.031), ('kernel', 0.03), ('automation', 0.03), ('optimal', 0.03), ('optimality', 0.029), ('zi', 0.029), ('euclidean', 0.029), ('supplementary', 0.029), ('separating', 0.029), ('upon', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas from random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if, on removal of properly chosen O(log n) points, the data becomes linearly separable, then it is called almost separable. The second contribution is a sampling based algorithm, motivated by randomized algorithms, which solves an SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver.

2 0.16378103 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

3 0.13445736 118 nips-2007-Learning with Transformation Invariant Kernels

Author: Christian Walder, Olivier Chapelle

Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1

4 0.13310128 160 nips-2007-Random Features for Large-Scale Kernel Machines

Author: Ali Rahimi, Benjamin Recht

Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1

5 0.11836845 62 nips-2007-Convex Learning with Invariances

Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola

Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1

6 0.11421918 161 nips-2007-Random Projections for Manifold Learning

7 0.10937785 187 nips-2007-Structured Learning with Approximate Inference

8 0.1025399 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

9 0.10080915 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

10 0.095924973 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

11 0.094853319 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

12 0.089592673 112 nips-2007-Learning Monotonic Transformations for Classification

13 0.085812964 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration

14 0.083308786 116 nips-2007-Learning the structure of manifolds using random projections

15 0.080259062 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

16 0.080142379 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling

17 0.077027448 40 nips-2007-Bundle Methods for Machine Learning

18 0.076609083 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l \alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections

19 0.073373333 32 nips-2007-Bayesian Co-Training

20 0.07248725 193 nips-2007-The Distribution Family of Similarity Distances


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.225), (1, 0.013), (2, -0.158), (3, 0.168), (4, -0.057), (5, -0.049), (6, 0.094), (7, 0.053), (8, -0.053), (9, 0.126), (10, 0.052), (11, -0.032), (12, -0.028), (13, 0.102), (14, 0.048), (15, -0.043), (16, 0.036), (17, 0.006), (18, -0.014), (19, 0.058), (20, 0.072), (21, -0.041), (22, 0.001), (23, 0.074), (24, -0.068), (25, -0.135), (26, -0.222), (27, 0.035), (28, -0.003), (29, -0.041), (30, 0.098), (31, 0.028), (32, -0.047), (33, -0.075), (34, -0.061), (35, -0.021), (36, -0.044), (37, -0.14), (38, -0.072), (39, 0.055), (40, 0.004), (41, -0.116), (42, -0.026), (43, -0.011), (44, -0.027), (45, -0.011), (46, -0.072), (47, -0.01), (48, 0.112), (49, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95467961 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas from random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if, on removal of properly chosen O(log n) points, the data becomes linearly separable, then it is called almost separable. The second contribution is a sampling based algorithm, motivated by randomized algorithms, which solves an SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver.

2 0.74450701 118 nips-2007-Learning with Transformation Invariant Kernels

Author: Christian Walder, Olivier Chapelle

Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1

3 0.73919988 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

4 0.68088007 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang

Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n2 ) to O(np/m), and improves computation time to O(np2 /m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.

5 0.66584188 62 nips-2007-Convex Learning with Invariances

Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola

Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1

6 0.66210479 24 nips-2007-An Analysis of Inference with the Universum

7 0.62134385 160 nips-2007-Random Features for Large-Scale Kernel Machines

8 0.54674023 112 nips-2007-Learning Monotonic Transformations for Classification

9 0.53327256 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

10 0.53107232 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

11 0.53045356 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

12 0.52804399 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

13 0.51151496 40 nips-2007-Bundle Methods for Machine Learning

14 0.49015725 187 nips-2007-Structured Learning with Approximate Inference

15 0.48068142 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling

16 0.47445083 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

17 0.46789533 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

18 0.44168907 32 nips-2007-Bayesian Co-Training

19 0.43349871 16 nips-2007-A learning framework for nearest neighbor search

20 0.4129287 109 nips-2007-Kernels on Attributed Pointsets with Applications


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.021), (13, 0.081), (16, 0.014), (19, 0.018), (21, 0.087), (31, 0.031), (35, 0.051), (47, 0.093), (49, 0.024), (77, 0.24), (83, 0.163), (85, 0.021), (87, 0.012), (90, 0.056)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78272575 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas from random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if, on removal of properly chosen O(log n) points, the data becomes linearly separable, then it is called almost separable. The second contribution is a sampling based algorithm, motivated by randomized algorithms, which solves an SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver.

2 0.67638415 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1

3 0.67585075 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

4 0.67578518 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu

Abstract: We consider the problem of Support Vector Machine transduction, which involves a combinatorial problem with exponential computational complexity in the number of unlabeled examples. Although several studies are devoted to Transductive SVM, they suffer either from the high computation complexity or from the solutions of local optimum. To address this problem, we propose solving Transductive SVM via a convex relaxation, which converts the NP-hard problem to a semi-definite programming. Compared with the other SDP relaxation for Transductive SVM, the proposed algorithm is computationally more efficient with the number of free parameters reduced from O(n2 ) to O(n) where n is the number of examples. Empirical study with several benchmark data sets shows the promising performance of the proposed algorithm in comparison with other state-of-the-art implementations of Transductive SVM. 1

5 0.67283303 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1

6 0.67266744 84 nips-2007-Expectation Maximization and Posterior Constraints

7 0.67208511 24 nips-2007-An Analysis of Inference with the Universum

8 0.67034757 134 nips-2007-Multi-Task Learning via Conic Programming

9 0.66800445 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

10 0.6673336 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

11 0.66723371 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

12 0.66707605 102 nips-2007-Incremental Natural Actor-Critic Algorithms

13 0.66561395 86 nips-2007-Exponential Family Predictive Representations of State

14 0.66491568 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

15 0.66369033 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

16 0.66268522 156 nips-2007-Predictive Matrix-Variate t Models

17 0.66246182 16 nips-2007-A learning framework for nearest neighbor search

18 0.66245461 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

19 0.66198087 40 nips-2007-Bundle Methods for Machine Learning

20 0.66159642 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering