nips nips2007 nips2007-152 knowledge-graph by maker-knowledge-mining

152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers


Source: pdf

Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang

Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n^2) to O(np/m), and improves computation time to O(np^2/m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui & Edward Y. Chang∗, Google Research, Beijing, China. Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. [sent-2, score-0.131]

2 To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. [sent-3, score-0.5]

3 Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. [sent-4, score-0.122]

4 PSVM reduces the memory requirement from O(n^2) to O(np/m), and improves computation time to O(np^2/m). [sent-5, score-0.143]

5 Given a set of training data X = {(x_i, y_i) | x_i ∈ R^d}_{i=1}^{n}, where x_i is an observation vector, y_i ∈ {−1, 1} is the class label of x_i, and n is the size of X, we apply SVMs on X to train a binary classifier. [sent-11, score-0.16]

6 1 − y_i(w^T φ(x_i) + b) ≤ ξ_i, ξ_i > 0, where w is a weighting vector, b is a threshold, C a regularization hyperparameter, and φ(·) a basis function which maps x_i to an RKHS. [sent-15, score-0.057]
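
The soft-margin objective that goes with this constraint is not included in the excerpt; for reference, in its standard form it reads as follows (a sketch of the conventional formulation, written with the usual ξ_i ≥ 0, not a verbatim quote of the paper):

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_{i}
\qquad \text{s.t.}\;\; 1 - y_{i}\left(w^{\top}\phi(x_{i}) + b\right) \le \xi_{i},\;\; \xi_{i} \ge 0,\;\; i = 1,\dots,n.
```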

7 min D(α) subject to 0 ≤ α ≤ C, y^T α = 0, (2) where [Q]_{ij} = y_i y_j φ^T(x_i) φ(x_j), and α ∈ R^n is the vector of Lagrangian multipliers (dual variables). [sent-21, score-0.05]
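
For reference, the dual objective D(α) in (2) takes the standard form below (a sketch under the usual convention; the paper's exact typesetting is not quoted here):

```latex
\min_{\alpha}\; D(\alpha) \;=\; \frac{1}{2}\,\alpha^{\top} Q\,\alpha \;-\; \mathbf{1}^{\top}\alpha
\qquad \text{s.t.}\;\; 0 \le \alpha \le C,\;\; y^{\top}\alpha = 0 .
```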

8 SVMs utilize the kernel trick by specifying a kernel function to define the inner-product K(xi , xj ) = φT (xi )φ(xj ). [sent-25, score-0.068]

9 When the given kernel function K is psd (positive semidefinite), the dual problem D(α) is a convex Quadratic Programming (QP) problem with linear constraints, which can be solved via the Interior-Point method (IPM) (Mehrotra, 1992). [sent-27, score-0.062]

10 Both the computational and memory bottlenecks of SVM training lie in the IPM solver for the dual formulation of SVMs in (2). [sent-28, score-0.18]

11 The computational cost is O(n^3) and the memory usage O(n^2). [sent-31, score-0.082]

12 In this work, we propose a parallel SVM algorithm (PSVM) to reduce memory use and to parallelize both data loading and computation. [sent-32, score-0.305]

13 Given n training instances each with d dimensions, PSVM first loads the training data in a round-robin fashion onto m machines. [sent-33, score-0.205]
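
One natural reading of "round-robin" is to assign instance i to machine i mod m; the helper below is purely illustrative, since the exact assignment used by PSVM is not specified in the excerpt:

```python
def round_robin_partition(n, m):
    # I_c: indices of the training instances loaded onto machine c
    return [list(range(c, n, m)) for c in range(m)]

print(round_robin_partition(10, 3))   # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```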

14 Next, PSVM performs a parallel row-based Incomplete Cholesky Factorization (ICF) on the loaded data. [sent-35, score-0.189]

15 At the end of parallel ICF, each machine stores only a fraction of the factorized matrix, which takes up space of O(np/m), where p is the column dimension of the factorized matrix (p is typically set to √n, as in Table 2). [sent-36, score-0.297]

16 PSVM reduces the memory use of IPM from O(n^2) to O(np/m), where p/m is much smaller than n. [sent-38, score-0.082]
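
To make the O(n^2) to O(np/m) reduction concrete, a back-of-the-envelope calculation assuming 8-byte doubles and p = √n (the dataset size matches the RCV set used later; the machine count is illustrative):

```python
import math

def kernel_matrix_bytes(n):
    # full n x n kernel matrix at 8 bytes per entry
    return n * n * 8

def psvm_bytes_per_machine(n, m, p=None):
    # each machine holds roughly (n/m) rows of the n x p factor H
    p = p if p is not None else int(math.sqrt(n))
    return (n // m) * p * 8

n, m = 800_000, 100
print(kernel_matrix_bytes(n) / 2**40)        # ~4.7 TiB for the full kernel matrix
print(psvm_bytes_per_machine(n, m) / 2**20)  # ~55 MiB of H per machine
```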

17 PSVM then performs parallel IPM to solve the quadratic optimization problem in (2). [sent-39, score-0.161]

18 This work’s main contributions are: (1) PSVM achieves memory reduction and computation speedup via a parallel ICF algorithm and parallel IPM. [sent-44, score-0.707]

19 (3) We have implemented PSVM on our parallel computing infrastructures. [sent-47, score-0.161]

20 PSVM effectively speeds up training time for large-scale tasks while maintaining high training accuracy. [sent-48, score-0.092]

21 PSVM is a practical, parallel approximate implementation to speed up SVM training on today’s distributed computing infrastructures for dealing with Web-scale problems. [sent-49, score-0.207]

22 Approaches such as (Joachims, 2006) can be more effective when memory is not a constraint or kernels are not used. [sent-53, score-0.082]

23 ) 2 PSVM Algorithm The key step of PSVM is parallel ICF (PICF). [sent-60, score-0.161]

24 Traditional column-based ICF (Fine & Scheinberg, 2001; Bach & Jordan, 2005) can reduce computational cost, but the initial memory requirement is O(np), and hence it is not practical for very large data sets. [sent-61, score-0.108]

25 PSVM devises parallel row-based ICF (PICF) as its initial step, which loads training instances onto parallel machines and performs factorization simultaneously on these machines. [sent-62, score-0.677]

26 Once PICF has loaded the n training instances distributedly onto m machines and reduced the size of the kernel matrix through factorization, IPM can be solved on parallel machines simultaneously. [sent-63, score-0.576]

27 3: end for 4: k ← 0; H ← 0; v ← the fraction of the diagonal vector of Q that resides in the local machine. [sent-79, score-0.128]

28 (v(i), i ∈ I_m, can be obtained from x_i.) 5: Initialize the master to be machine 0. [sent-80, score-0.174]

29 6: while k < p do 7: Each machine c ∈ M selects its local pivot value, which is the largest element in v: lpv_{k,c} = max_{i∈I_c} v(i), [sent-81, score-0.209]

30 and records the local pivot index, the row index corresponding to lpv_{k,c}: lpi_{k,c} = arg max_{i∈I_c} v(i). [sent-82, score-0.23]

31 The master selects the largest local pivot value as the global pivot value, gpv_k = max_{c∈M} lpv_{k,c}, and records i_k, the row index corresponding to the global pivot value. [sent-84, score-0.944]

32 The master broadcasts gpv_k and i_k. [sent-86, score-0.36]

33 The master broadcasts the pivot instance xik and the pivot row H(ik , :). [sent-89, score-0.668]

34 (Only the first k + 1 values of the pivot row need to be broadcast, since the remainder are zeros.) [sent-90, score-0.23]
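
Steps 7 to 10 amount to a local arg-max on each machine followed by a global reduction on the master. A minimal Python sketch of that selection (the per-machine index sets I_c and all names here are illustrative, not the PSVM API):

```python
def select_global_pivot(v, index_sets):
    """v: diagonal of Q (gathered in one list for illustration);
    index_sets[c]: the row indices I_c owned by machine c."""
    local_pivots = []
    for I_c in index_sets:
        lpi = max(I_c, key=lambda i: v[i])   # local pivot index lpi_{k,c}
        local_pivots.append((v[lpi], lpi))   # (lpv_{k,c}, lpi_{k,c}) sent to the master
    gpv, i_k = max(local_pivots)             # master keeps the largest local pivot
    return gpv, i_k                          # gpv_k and i_k are then broadcast

# Example: 6 rows spread round-robin over 2 machines
v = [0.9, 0.3, 1.7, 0.2, 0.5, 1.1]
print(select_global_pivot(v, [[0, 2, 4], [1, 3, 5]]))   # -> (1.7, 2)
```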

35 Our row-based parallel ICF (PICF) works as follows: let vector v be the diagonal of Q and suppose the pivots (the largest diagonal values) are {i_1, i_2, . . . , i_k}. [sent-97, score-0.231]

36 The k-th iteration of ICF computes three equations:
H(i_k, k) = sqrt(v(i_k)),                                                        (3)
H(J_k, k) = ( Q(J_k, i_k) − Σ_{j=1}^{k−1} H(J_k, j) H(i_k, j) ) / H(i_k, k),     (4)
v(J_k) = v(J_k) − H(J_k, k)^2,                                                   (5)
where J_k denotes the complement of {i_1, i_2, . . . , i_k}. [sent-100, score-0.109]
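
A minimal serial sketch of this recursion (greedy pivoting on the residual diagonal, stopping after p columns); it is a single-machine illustration of equations (3) to (5) using standard pivoted incomplete Cholesky, not the distributed PSVM code:

```python
import numpy as np

def icf(Q, p, tol=1e-8):
    """Incomplete Cholesky factorization: returns H (n x p) with H @ H.T ~= Q."""
    n = Q.shape[0]
    H = np.zeros((n, p))
    v = np.diag(Q).astype(float)             # running diagonal of the residual
    pivots = []
    for k in range(p):
        i_k = int(np.argmax(v))              # pivot: largest remaining diagonal entry
        if v[i_k] < tol:
            return H[:, :k]                  # residual is numerically zero; stop early
        pivots.append(i_k)
        H[i_k, k] = np.sqrt(v[i_k])          # eq. (3)
        J = np.setdiff1d(np.arange(n), pivots)
        H[J, k] = (Q[J, i_k] - H[J, :k] @ H[i_k, :k]) / H[i_k, k]   # eq. (4)
        v[J] -= H[J, k] ** 2                 # eq. (5)
        v[i_k] = 0.0                         # pivot row is finished
    return H

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Q = X @ X.T                                  # a small psd "kernel" matrix
H = icf(Q, p=10)
print(np.linalg.norm(Q - H @ H.T))           # small approximation error
```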

37 Following (Golub & Loan, 1996), a parallelized ICF algorithm can be obtained by constraining the parallelized Cholesky Factorization algorithm, iterating at most p times. [sent-106, score-0.12]

38 However, in the proposed algorithm (Golub & Loan, 1996), matrix H is distributed by columns in a round-robin way on m machines (hence we call it column-based parallelized ICF). [sent-107, score-0.254]

39 Therefore, each machine must be able to store a local copy of the training data. [sent-111, score-0.094]

40 The calculation of pivot selection, the summation of local inner product results, the column calculation in (4), and the vector update in (5) must be performed on a single machine. [sent-115, score-0.279]

41 Our row-based approach starts by initializing variables and loading training data onto m machines in a round-robin fashion (Steps 1 to 5). [sent-117, score-0.272]

42 In the main loop, PICF performs five tasks in each iteration k: • Distributedly find a pivot, which is the largest value in the diagonal v of matrix Q (steps 7 to 10). [sent-121, score-0.05]

43 Notice that PICF computes only needed elements in Q from training data, and it does not store Q. [sent-122, score-0.069]

44 • Set the machine where the pivot resides as the master (step 11). [sent-123, score-0.382]

45 • The master then broadcasts the pivot instance xik and the pivot row H(ik , :) (step 13). [sent-125, score-0.668]

46 At the end of the algorithm, H is stored distributedly on m machines, ready for parallel IPM (presented in the next section). [sent-127, score-0.296]

47 PICF enjoys three advantages: parallel memory use (O(np/m)), parallel computation (O(p^2 n/m)), and low communication overhead (O(p^2 log(m))). [sent-128, score-0.669]

48 In particular, the communication overhead's share of the entire computation time shrinks as the problem size grows. [sent-129, score-0.155]

49 This pattern permits a larger problem to be solved on more machines to take advantage of parallel memory use and computation. [sent-131, score-0.411]

50 The computation bottleneck is the matrix inverse, which takes place on Σ when solving ν in (8) and x in (10). [sent-137, score-0.097]

51 Therefore, the bottleneck of the Newton step can be sped up from O(n^3) to O(p^2 n), and be parallelized to O(p^2 n/m). [sent-139, score-0.096]

52 Distributed Data Loading: To minimize both storage and communication cost, PIPM stores data distributedly as follows: • Distribute matrix data. [sent-140, score-0.245]

53 An interesting observation is that parallelizing Σ^{-1}z (or Σ^{-1}y) is simpler than parallelizing Σ^{-1}. [sent-149, score-0.14]

54 Let us explain how parallelizing Σ^{-1}z works; parallelizing Σ^{-1}y follows suit. [sent-150, score-0.14]

55 The results are sent to the master (which can be a randomly picked machine for all PIPM iterations) to aggregate into t1 for the next step. [sent-160, score-0.139]

56 Compute D^{-1}Ht_2: all machines have a copy of t_2, and can compute D^{-1}Ht_2 locally to solve for Σ^{-1}z. [sent-170, score-0.215]
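
These steps are consistent with a Sherman-Morrison-Woodbury (SMW) solve. Assuming, for this sketch only, that Σ has the low-rank-plus-diagonal form Σ = D + H H^T with D diagonal (the excerpt does not quote the paper's exact definition of D), computing Σ^{-1}z reduces to a small p × p system:

```python
import numpy as np

def smw_solve(D_diag, H, z):
    """Solve (D + H @ H.T) x = z via Sherman-Morrison-Woodbury.
    D_diag: length-n diagonal of D; H: n x p factor; z: length-n vector."""
    Dinv_z = z / D_diag                                       # D^{-1} z (local, elementwise)
    t1 = H.T @ Dinv_z                                         # aggregated on the master (p-vector)
    GGT = np.eye(H.shape[1]) + H.T @ (H / D_diag[:, None])    # I + H^T D^{-1} H (p x p)
    t2 = np.linalg.solve(GGT, t1)                             # small solve on the master
    return Dinv_z - (H @ t2) / D_diag                         # D^{-1} z - D^{-1} H t2

# sanity check against a direct dense solve
rng = np.random.default_rng(1)
n, p = 200, 15
H = rng.normal(size=(n, p))
D_diag = rng.uniform(1.0, 2.0, size=n)
z = rng.normal(size=n)
x = smw_solve(D_diag, H, z)
print(np.allclose((np.diag(D_diag) + H @ H.T) @ x, z))        # True
```

In the distributed setting described above, the rows of H and the entries of D and z are partitioned across machines, so t1 is formed by summing per-machine partial products, which matches the aggregation into t1 mentioned in the excerpt.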

57 Computing b and Writing Back: When the IPM iteration stops, we have the value of α and hence the classification function f(x) = Σ_{i=1}^{N_s} α_i y_i k(s_i, x) + b, where N_s is the number of support vectors and the s_i are the support vectors. [sent-174, score-0.088]

58 According to the SVM model, given a support vector s, we obtain one of the two results for f (s): f (s) = +1, if ys = +1, or f (s) = −1, if ys = −1. [sent-176, score-0.103]

59 In practice, we can select M, say 1,000, support vectors and compute the average of the b values in parallel using MapReduce (Dean & Ghemawat, 2004). [sent-177, score-0.194]
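
Given f(s) = y_s for a support vector s, each sampled support vector yields one estimate b_s = y_s − Σ_i α_i y_i k(s_i, s), and the estimates are averaged. A minimal serial sketch (the RBF kernel and the random sampling of M support vectors are illustrative choices, not the PSVM implementation):

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(a - b) ** 2)

def estimate_b(alpha, y, sv, M=1000, kernel=rbf, seed=0):
    """alpha, y, sv: multipliers, labels, and feature vectors of the support vectors."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(sv), size=min(M, len(sv)), replace=False)
    b_values = []
    for j in picks:
        # f(sv[j]) should equal y[j], so b = y[j] - sum_i alpha_i y_i k(s_i, sv[j])
        decision = sum(alpha[i] * y[i] * kernel(sv[i], sv[j]) for i in range(len(sv)))
        b_values.append(y[j] - decision)
    return float(np.mean(b_values))   # PSVM computes this average with MapReduce

# toy usage with three "support vectors" in 2-D (purely illustrative numbers)
sv = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(estimate_b(alpha=np.array([0.5, 0.5, 1.0]), y=np.array([1, -1, 1]), sv=sv, M=3))
```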

60 3 Experiments We conducted experiments on PSVM to evaluate its 1) class-prediction accuracy, 2) scalability on large datasets, and 3) overheads. [sent-178, score-0.049]

61 The experiments were conducted on up to 500 machines in our data center. [sent-179, score-0.168]

62 Not all machines are identically configured; however, each machine is configured with a CPU faster than 2GHz and memory larger than 4GBytes. [sent-180, score-0.25]

63 Class-prediction Accuracy: PSVM employs PICF to approximate an n × n kernel matrix Q with an n × p matrix H. [sent-220, score-0.086]

64 Moreover, CVM's training time has been shown to be unpredictable by (Loosli & Canu, 2006), since the training time is sensitive to the selection of stopping criteria and hyper-parameters. [sent-236, score-0.092]

65 Table 2 reports the speedup of PSVM on up to m = 500 machines. [sent-240, score-0.268]

66 Since a single machine cannot store the factorized matrix H in its local memory when the dataset is large, we cannot obtain the running time of PSVM on one machine. [sent-241, score-0.076]

67 We thus used 10 machines as the baseline to measure the speedup of using more than 10 machines. [sent-242, score-0.436]

68 To quantify speedup, we assumed that the speedup of using 10 machines is 10, compared to using one machine. [sent-243, score-0.436]

69 This assumption is reasonable for our experiments, since PSVM does enjoy linear speedup when the number of machines is up to 30. [sent-244, score-0.436]
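
Under that baseline convention, the reported speedup on m machines is simply the 10-machine runtime scaled by 10 and divided by the m-machine runtime; the one-liner below illustrates it with hypothetical timings:

```python
def normalized_speedup(t_10, t_m):
    # speedup relative to one (hypothetical) machine, assuming 10 machines give a speedup of exactly 10
    return 10.0 * t_10 / t_m

print(normalized_speedup(t_10=1000.0, t_m=220.0))   # ~45.5, for some larger machine count
```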

70 Table 2: Speedup with p set to √n, for 10, 30, 50, 100, 150, 200, 250, and 500 machines; LIBSVM training time is reported on the last row for reference. [sent-245, score-0.067]

71 The speedup reported in the table is the average of three runs with standard deviation provided in brackets. [sent-268, score-0.268]

72 The observed variance in speedup was caused by the variance of machine loads, as all machines were shared with other tasks running in our data centers. [sent-269, score-0.436]

73 Figures 1(a), (b) and (c) plot the speedup of Image, CoverType, and RCV, respectively. [sent-271, score-0.268]

74 All datasets enjoy a linear speedup when the number of machines is moderate. [sent-272, score-0.436]

75 For instance, PSVM achieves linear speedup on RCV when running on up to around 100 machines. [sent-273, score-0.268]

76 This result led to our examination of the overheads of PSVM, presented next. [sent-276, score-0.08]

77 Figure 1: Speedup and overheads of three datasets: Image (200k), Covertype (500k), and RCV (800k). Panels (a)-(c) show speedup, (d)-(f) show overhead, and (g)-(i) show the overhead fraction. [sent-277, score-1.422]

78 Overheads: PSVM cannot achieve linear speedup when the number of machines continues to increase beyond a data-size-dependent threshold. [sent-279, score-0.436]

79 This is expected due to communication and synchronization overheads. [sent-280, score-0.136]

80 Synchronization overhead is incurred when the master machine waits for task completion on the slowest machine. [sent-282, score-0.297]

81 (The master could wait forever if a child machine fails.) [sent-283, score-0.139]

82 The running time consists of three parts: computation (Comp), communication (Comm), and synchronization (Sync). [sent-285, score-0.171]

83 Figures 1(d), (e) and (f) show how Comm and Sync overheads influence the speedup curves. [sent-286, score-0.348]

84 In the figures, the computation-only line (Comp) is drawn on top; it approaches the linear speedup line. [sent-287, score-0.303]

85 Computation speedup can become sublinear when adding machines beyond a threshold. [sent-288, score-0.436]

86 This is because of the computation bottleneck of the unparallelizable step 12 in Algorithm 1, whose computation time is O(p^2). [sent-289, score-0.154]

87 When m is small, this bottleneck is insignificant in the total computation time. [sent-290, score-0.071]

88 According to Amdahl's law, however, even a small fraction of unparallelizable computation can cap the speedup. [sent-291, score-0.131]
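
For reference, Amdahl's law in its usual form: if a fraction s of the work is unparallelizable, the speedup on m machines is bounded as shown below, approaching 1/s as m grows.

```latex
\text{speedup}(m) \;\le\; \frac{1}{\,s + (1 - s)/m\,}\;, \qquad \lim_{m \to \infty} \text{speedup}(m) \;\le\; \frac{1}{s}.
```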

89 Therefore, more machines (larger m) can be employed for larger datasets (larger n) to gain speedup. [sent-293, score-0.168]

90 When communication overhead or synchronization overhead is accounted for (the Comp + Comm line and the Comp + Comm + Sync line), the speedup deteriorates. [sent-294, score-0.72]

91 Between the two overheads, the synchronization overhead does not impact speedup as much as the communication overhead does. [sent-295, score-0.72]

92 The synchronization overhead maintains about the same percentage when m increases, whereas the percentage of communication overhead grows with m. [sent-297, score-0.452]

93 As analyzed earlier, the communication overhead is O(p^2 log(m)), growing sub-linearly with m. [sent-299, score-0.23]

94 But since the computation time per node decreases as m increases, the fraction of the communication overhead grows with m. [sent-300, score-0.313]

95 4 Conclusion In this paper, we have shown how SVMs can be parallelized to achieve scalable performance. [sent-302, score-0.06]

96 PSVM loads training data distributedly onto parallel machines, reducing the memory requirement through approximate factorization of the kernel matrix. [sent-303, score-0.582]

97 PSVM solves IPM in parallel by cleverly arranging computation order. [sent-304, score-0.196]

98 Comments on the core vector machines: Fast svm training on very large data sets (Technical Report). [sent-398, score-0.139]

99 Sequential minimal optimization: A fast algorithm for training support vector machines (Technical Report MSR-TR-98-14). [sent-406, score-0.269]

100 Core vector machines: Fast svm training on very large data sets. [sent-415, score-0.139]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('psvm', 0.595), ('icf', 0.28), ('speedup', 0.268), ('ipm', 0.241), ('pivot', 0.209), ('picf', 0.193), ('machines', 0.168), ('parallel', 0.161), ('overhead', 0.158), ('master', 0.139), ('rcv', 0.129), ('distributedly', 0.113), ('ik', 0.109), ('covertype', 0.102), ('libsvm', 0.092), ('memory', 0.082), ('comm', 0.08), ('overheads', 0.08), ('pipm', 0.08), ('communication', 0.072), ('svm', 0.071), ('parallelizing', 0.07), ('loads', 0.07), ('jk', 0.07), ('broadcasts', 0.064), ('ggt', 0.064), ('sync', 0.064), ('synchronization', 0.064), ('svms', 0.063), ('parallelized', 0.06), ('comp', 0.056), ('ic', 0.056), ('mehrotra', 0.056), ('hh', 0.051), ('na', 0.05), ('factorization', 0.05), ('scalability', 0.049), ('joachims', 0.049), ('ghemawat', 0.048), ('gpvk', 0.048), ('unparallelizable', 0.048), ('fraction', 0.048), ('training', 0.046), ('scheinberg', 0.042), ('cvm', 0.042), ('hk', 0.039), ('fine', 0.036), ('golub', 0.036), ('loading', 0.036), ('bottleneck', 0.036), ('xi', 0.035), ('computation', 0.035), ('kernel', 0.034), ('resides', 0.034), ('tsang', 0.034), ('stores', 0.034), ('support', 0.033), ('graf', 0.032), ('mangasarian', 0.032), ('mapreduce', 0.032), ('smw', 0.032), ('vec', 0.031), ('chang', 0.031), ('chu', 0.03), ('vapnik', 0.028), ('dual', 0.028), ('dean', 0.028), ('loaded', 0.028), ('canu', 0.028), ('loosli', 0.028), ('factorized', 0.027), ('matrix', 0.026), ('requirement', 0.026), ('vishwanathan', 0.026), ('parallelize', 0.026), ('calculates', 0.026), ('xik', 0.026), ('loan', 0.026), ('lin', 0.025), ('copy', 0.025), ('calculation', 0.024), ('figures', 0.024), ('ys', 0.024), ('distribute', 0.024), ('bottlenecks', 0.024), ('diagonal', 0.024), ('store', 0.023), ('diag', 0.023), ('yt', 0.023), ('broadcast', 0.023), ('newton', 0.022), ('yi', 0.022), ('fashion', 0.022), ('stored', 0.022), ('locally', 0.022), ('vector', 0.022), ('row', 0.021), ('gured', 0.021), ('instances', 0.021), ('algorithmic', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang

Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n^2) to O(np/m), and improves computation time to O(np^2/m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.

2 0.076435529 160 nips-2007-Random Features for Large-Scale Kernel Machines

Author: Ali Rahimi, Benjamin Recht

Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1

3 0.067883141 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray

Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n^2) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 10^14, many orders of magnitude beyond the previous state of the art. 1

4 0.060695063 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1

5 0.060113899 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion

Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or "topic" model – using distributed computation, where each of the processors only sees a fraction of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.

6 0.051724177 134 nips-2007-Multi-Task Learning via Conic Programming

7 0.049873605 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

8 0.04909654 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

9 0.044698592 99 nips-2007-Hierarchical Penalization

10 0.043184072 69 nips-2007-Discriminative Batch Mode Active Learning

11 0.039664362 62 nips-2007-Convex Learning with Invariances

12 0.038998347 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines

13 0.037970539 118 nips-2007-Learning with Transformation Invariant Kernels

14 0.037748039 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

15 0.03655795 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

16 0.036454402 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

17 0.034773186 112 nips-2007-Learning Monotonic Transformations for Classification

18 0.033597484 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

19 0.033495016 40 nips-2007-Bundle Methods for Machine Learning

20 0.033358246 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.114), (1, 0.008), (2, -0.054), (3, 0.041), (4, -0.019), (5, -0.02), (6, 0.039), (7, -0.008), (8, -0.028), (9, 0.072), (10, 0.032), (11, -0.023), (12, 0.019), (13, 0.05), (14, -0.037), (15, 0.004), (16, -0.036), (17, -0.059), (18, 0.015), (19, 0.007), (20, 0.024), (21, -0.015), (22, 0.01), (23, 0.059), (24, -0.0), (25, -0.051), (26, -0.079), (27, 0.068), (28, -0.072), (29, -0.026), (30, 0.052), (31, -0.07), (32, -0.001), (33, -0.062), (34, -0.035), (35, 0.019), (36, 0.078), (37, 0.013), (38, -0.013), (39, 0.051), (40, 0.008), (41, 0.022), (42, 0.085), (43, -0.051), (44, -0.044), (45, 0.009), (46, -0.067), (47, 0.016), (48, 0.046), (49, -0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91806757 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang

Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n^2) to O(np/m), and improves computation time to O(np^2/m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.

2 0.63980883 160 nips-2007-Random Features for Large-Scale Kernel Machines

Author: Ali Rahimi, Benjamin Recht

Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1

3 0.62415928 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1

4 0.56308782 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

Author: Ronny Luss, Alexandre D'aspremont

Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.

5 0.54911929 118 nips-2007-Learning with Transformation Invariant Kernels

Author: Christian Walder, Olivier Chapelle

Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1

6 0.54543805 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

7 0.52394158 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

8 0.52261406 62 nips-2007-Convex Learning with Invariances

9 0.52014303 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

10 0.50938404 24 nips-2007-An Analysis of Inference with the Universum

11 0.50517553 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

12 0.49756014 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines

13 0.44927916 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI

14 0.4471783 112 nips-2007-Learning Monotonic Transformations for Classification

15 0.44302565 109 nips-2007-Kernels on Attributed Pointsets with Applications

16 0.42057079 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

17 0.41391671 40 nips-2007-Bundle Methods for Machine Learning

18 0.40078354 70 nips-2007-Discriminative K-means for Clustering

19 0.39424711 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

20 0.38686743 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.035), (11, 0.012), (13, 0.044), (16, 0.017), (21, 0.036), (31, 0.017), (34, 0.014), (35, 0.041), (47, 0.07), (49, 0.447), (83, 0.088), (85, 0.027), (87, 0.017), (90, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.80604017 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

Author: Moulines Eric, Francis R. Bach, Zaïd Harchaoui

Abstract: We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. Asymptotic null distributions under null hypothesis are derived, and consistency against fixed alternatives is assessed. Finally, experimental evidence of the performance of the proposed approach on both artificial and real datasets is provided. 1

same-paper 2 0.79698855 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang

Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n2 ) to O(np/m), and improves computation time to O(np2 /m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.

3 0.73539644 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks

Author: Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, Santiago Fernández

Abstract: In online handwriting recognition the trajectory of the pen is recorded during writing. Although the trajectory provides a compact and complete representation of the written output, it is hard to transcribe directly, because each letter is spread over many pen locations. Most recognition systems therefore employ sophisticated preprocessing techniques to put the inputs into a more localised form. However these techniques require considerable human effort, and are specific to particular languages and alphabets. This paper describes a system capable of directly transcribing raw online handwriting data. The system consists of an advanced recurrent neural network with an output layer designed for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained online database, we record excellent results using either raw or preprocessed data, well outperforming a state-of-the-art HMM based system in both cases. 1

4 0.71103793 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

5 0.41055435 7 nips-2007-A Kernel Statistical Test of Independence

Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola

Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.

6 0.39159387 108 nips-2007-Kernel Measures of Conditional Dependence

7 0.37031078 43 nips-2007-Catching Change-points with Lasso

8 0.36410457 185 nips-2007-Stable Dual Dynamic Programming

9 0.36355281 40 nips-2007-Bundle Methods for Machine Learning

10 0.35966188 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

11 0.35250843 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

12 0.34538233 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

13 0.33732927 200 nips-2007-The Tradeoffs of Large Scale Learning

14 0.33721778 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation

15 0.33649975 24 nips-2007-An Analysis of Inference with the Universum

16 0.33610907 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

17 0.33501273 49 nips-2007-Colored Maximum Variance Unfolding

18 0.3349289 179 nips-2007-SpAM: Sparse Additive Models

19 0.33443445 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

20 0.33200395 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models