nips nips2007 nips2007-160 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ali Rahimi, Benjamin Recht
Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user-specified shift-invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines.
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. [sent-5, score-0.71]
2 The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user-specified shift-invariant kernel. [sent-6, score-0.272]
3 1 Introduction Kernel machines such as the Support Vector Machine are attractive because they can approximate any function or decision boundary arbitrarily well with enough training data. [sent-8, score-0.192]
4 Unfortunately, methods that operate on the kernel matrix (Gram matrix) of the data scale poorly with the size of the training dataset. [sent-9, score-0.299]
5 On the other hand, linear machines can be trained very quickly on large datasets when the dimensionality of the data is small [1, 2, 3]. [sent-11, score-0.164]
6 One way to take advantage of these linear training algorithms for training nonlinear machines is to approximately factor the kernel matrix and to treat the columns of the factor matrix as features in a linear machine (see for example [4]). [sent-12, score-0.574]
7 Instead, we propose to factor the kernel function itself. [sent-13, score-0.234]
8 This factorization does not depend on the data, and allows us to convert the training and evaluation of a kernel machine into the corresponding operations of a linear machine by mapping data into a relatively low-dimensional randomized feature space. [sent-14, score-0.591]
9 Our experiments show that these random features, combined with very simple linear learning techniques, compete favorably in speed and accuracy with state-of-the-art kernel-based classification and regression algorithms, including those that factor the kernel matrix. [sent-15, score-0.347]
10 The kernel trick is a simple way to generate features for algorithms that depend only on the inner product between pairs of input points. [sent-16, score-0.531]
11 It relies on the observation that any positive definite function k(x, y) with x, y ∈ Rd defines an inner product and a lifting φ so that the inner product between lifted datapoints can be quickly computed as ⟨φ(x), φ(y)⟩ = k(x, y). [sent-17, score-0.413]
12 The cost of this convenience is that the algorithm accesses the data only through evaluations of k(x, y), or through the kernel matrix consisting of k applied to all pairs of datapoints. [sent-18, score-0.234]
13 Thus, we can simply transform the input with z, and then apply fast linear learning methods to approximate the answer of the corresponding nonlinear kernel machine. [sent-22, score-0.404]
14 In what follows, we show how to construct feature spaces that uniformly approximate popular shift-invariant kernels k(x − y) to within ε with only D = O(d ε^{-2} log(1/ε²)) dimensions, and empirically show that excellent regression and classification performance can be obtained for even smaller D. [sent-23, score-0.255]
15 In addition to giving us access to extremely fast learning algorithms, these randomized feature maps also provide a way to quickly evaluate the machine. [sent-24, score-0.361]
16 With the kernel trick, evaluating the machine at a test point x requires computing f(x) = Σ_{i=1}^{N} c_i k(x_i, x), which requires O(Nd) operations and retaining much of the dataset unless the machine is very sparse. [sent-25, score-0.321]
17 On the other hand, after learning a hyperplane w, a linear machine can be evaluated by simply computing f(x) = w⊤z(x), which, with the randomized feature maps presented here, requires only O(D + d) operations and storage. [sent-27, score-0.299]
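As a rough numerical illustration of the two evaluation costs above (our own sketch, not code from the paper; the Gaussian kernel, the sizes, and all variable names are assumptions), the kernel expansion must touch every retained training point, while the random-feature machine only needs w and the projection parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, D = 10_000, 20, 500                                     # hypothetical sizes
X, c = rng.normal(size=(N, d)), rng.normal(size=N)            # retained points and their weights
W, b = rng.normal(size=(D, d)), rng.uniform(0, 2 * np.pi, D)  # random Fourier parameters
w = rng.normal(size=D)                                        # weights of a trained linear machine

def f_kernel(x, gamma=0.5):
    # Kernel expansion f(x) = sum_i c_i k(x_i, x): touches all N points, O(N d) work.
    return c @ np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

def f_linear(x):
    # Random-feature machine f(x) = w^T z(x): only the D-dimensional feature map is needed.
    z = np.sqrt(2.0 / D) * np.cos(W @ x + b)
    return w @ z

x_test = rng.normal(size=d)
print(f_kernel(x_test), f_linear(x_test))  # the two machines here are unrelated; this only contrasts cost
```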
18 We demonstrate two randomized feature maps for approximating shift invariant kernels. [sent-28, score-0.308]
19 Our first randomized map, presented in Section 3, consists of sinusoids randomly drawn from the Fourier transform of the kernel function we seek to approximate. [sent-29, score-0.53]
20 Our second randomized map, presented in Section 4, partitions the input space using randomly shifted grids at randomly chosen resolutions. [sent-31, score-0.49]
21 This mapping is not smooth, but leverages the proximity between input points, and is well-suited for approximating kernels that depend on the L1 distance between datapoints. [sent-32, score-0.148]
22 Our experiments in Section 5 demonstrate that combining these randomized maps with simple linear learning algorithms competes favorably with state-of-the-art training algorithms in a variety of regression and classification scenarios. [sent-33, score-0.332]
23 2 Related Work The most popular methods for large-scale kernel machines are decomposition methods for solving Support Vector Machines (SVM). [sent-34, score-0.39]
24 These methods iteratively update a subset of the kernel machine’s coefficients using coordinate ascent until KKT conditions are satisfied to within a tolerance [5, 6]. [sent-35, score-0.234]
25 To extend learning with kernel machines to these scales, several approximation schemes have been proposed for speeding up operations involving the kernel matrix. [sent-37, score-0.642]
26 The evaluation of the kernel function can be sped up using linear random projections [7]. [sent-38, score-0.279]
27 Throwing away individual entries [7] or entire rows [8, 9, 10] of the kernel matrix lowers the storage and computational cost of operating on the kernel matrix. [sent-39, score-0.468]
28 These approximations either preserve the separability of the data [8], or produce good low-rank or sparse approximations of the true kernel matrix [7, 9]. [sent-40, score-0.234]
29 Fast nearest neighbor lookup with KD-Trees has been used to approximate multiplication with the kernel matrix, and in turn, a variety of other operations [12]. [sent-43, score-0.281]
30 The feature map we present in Section 4 is reminiscent of KD-trees in that it partitions the input space using multi-resolution axis-aligned grids similar to those developed in [13] for embedding linear assignment problems. [sent-44, score-0.324]
31 3 Random Fourier Features Our first set of random features projects data points onto a randomly chosen line, and then passes the resulting scalar through a sinusoid (see Figure 1 and Algorithm 1). [sent-45, score-0.213]
32 The random lines are drawn from a distribution so as to guarantee that the inner product of two transformed points approximates the desired shift-invariant kernel. [sent-46, score-0.301]
33 A continuous kernel k(x, y) = k(x − y) on Rd is positive definite if and only if k(δ) is the Fourier transform of a non-negative measure. [sent-48, score-0.297]
34 Each component of the feature map z(x) projects x onto a random direction ω drawn from the Fourier transform p(ω) of k(∆), and wraps this line onto the unit circle in R2 . [sent-50, score-0.262]
35 After transforming two points x and y in this way, their inner product is an unbiased estimator of k(x, y). [sent-51, score-0.222]
36 If the kernel k(δ) is properly scaled, Bochner’s theorem guarantees that its Fourier transform p(ω) is a proper probability distribution. [sent-54, score-0.334]
37 To obtain a real-valued random feature for k, note that both the probability distribution p(ω) and the kernel k(∆) are real, so the integrand e^{jω⊤(x−y)} may be replaced with cos(ω⊤(x − y)). [sent-56, score-0.421]
38 Defining z_ω(x) = [cos(ω⊤x) sin(ω⊤x)]⊤ gives a real-valued mapping that satisfies the condition E[z_ω(x)⊤ z_ω(y)] = k(x, y), since z_ω(x)⊤ z_ω(y) = cos(ω⊤(x − y)). [sent-57, score-0.13]
39 The inner product of points featurized by the 2D-dimensional random feature z, z(x)⊤z(y) = (1/D) Σ_{j=1}^{D} z_{ω_j}(x)⊤ z_{ω_j}(y), is a sample average of the z_{ω_j}(x)⊤ z_{ω_j}(y) and is therefore a lower-variance approximation to the expectation (2). [sent-60, score-0.272]
40 Since z_ω(x)⊤ z_ω(y) is bounded between −1 and 1, for a fixed pair of points x and y, Hoeffding’s inequality guarantees exponentially fast convergence in D between z(x)⊤z(y) and k(x, y): Pr[|z(x)⊤z(y) − k(x, y)| ≥ ε] ≤ 2 exp(−Dε²/2). [sent-61, score-0.19]
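A quick Monte Carlo check of this pointwise convergence, assuming a Gaussian kernel with bandwidth σ (a sketch of ours, not the paper's code): the gap between z(x)⊤z(y) and k(x, y) shrinks as D grows.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 5, 1.0
x, y = rng.normal(size=d), rng.normal(size=d)
k_true = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))  # Gaussian kernel k(x - y)

for D in (10, 100, 1000, 10000):
    # For this kernel the Fourier transform p(omega) is Gaussian with covariance I / sigma^2.
    W = rng.normal(scale=1.0 / sigma, size=(D, d))
    zx = np.concatenate([np.cos(W @ x), np.sin(W @ x)]) / np.sqrt(D)
    zy = np.concatenate([np.cos(W @ y), np.sin(W @ y)]) / np.sqrt(D)
    print(D, abs(zx @ zy - k_true))  # error decays roughly like 1/sqrt(D)
```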
41 Building on this observation, a much stronger assertion can be proven for every pair of points in the input space simultaneously: Claim 1 (Uniform convergence of Fourier features). [sent-62, score-0.11]
42 Then, for the mapping z defined in Algorithm 1, we have Pr[ sup_{x,y∈M} |z(x)⊤z(y) − k(x, y)| ≥ ε ] ≤ 2^8 (σ_p diam(M)/ε)² exp(−Dε²/(4(d + 2))), where σ_p² ≡ E_p[ω⊤ω] is the second moment of the Fourier transform of k. [sent-64, score-0.188]
43 This result is then extended to the entire space using the fact that the feature map is smooth with high probability. [sent-67, score-0.116]
44 It quantifies the curvature of the kernel at the origin. [sent-70, score-0.234]
45 Require: A positive definite shift-invariant kernel k(x, y) = k(x − y). [sent-73, score-0.234]
46 Ensure: A randomized feature map z(x): R^d → R^{2D} so that z(x)⊤z(y) ≈ k(x − y). [sent-74, score-0.267]
47 1 Compute the Fourier transform p of the kernel k: p(ω) = (1/2π) ∫ e^{−jω⊤∆} k(∆) d∆. [sent-75, score-0.297]
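A minimal sketch of this construction as a reusable feature map (our own illustration; the sampler interface and all names are assumptions, and the Gaussian example relies on the standard fact that the Fourier transform of a Gaussian is again Gaussian):

```python
import numpy as np

def random_fourier_map(sample_omega, D, rng):
    # sample_omega(D, rng) is assumed to return a D x d matrix of draws from p(omega),
    # the Fourier transform of the desired shift-invariant kernel.
    W = sample_omega(D, rng)

    def z(X):
        proj = np.atleast_2d(X) @ W.T                                # project onto random directions
        return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)  # map into R^{2D}

    return z

rng = np.random.default_rng(0)
# Gaussian kernel exp(-||delta||^2 / 2) in 3 dimensions: p(omega) = N(0, I).
z = random_fourier_map(lambda D, r: r.normal(size=(D, 3)), D=2000, rng=rng)
x, y = rng.normal(size=3), rng.normal(size=3)
print((z(x) @ z(y).T).item(), np.exp(-np.sum((x - y) ** 2) / 2))  # approximately equal
```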
48 Random Binning Features Our second random map partitions the input space using randomly shifted grids at randomly chosen resolutions and assigns to an input point a binary bit string that corresponds to the bin in which it falls (see Figure 2 and Algorithm 2). [sent-78, score-0.626]
49 The grids are constructed so that the probability that two points x and y are assigned to the same bin is proportional to k(x, y). [sent-79, score-0.232]
50 The inner product between a pair of transformed points is proportional to the number of times the two points are binned together, and is therefore an unbiased estimate of k(x, y). [sent-80, score-0.382]
51 (left) The algorithm repeatedly partitions the input space using a randomly shifted grid at a randomly chosen resolution and assigns to each point x the bit string z(x) associated with the bin to which it is assigned. [sent-82, score-0.471]
52 (right) The binary adjacency matrix that describes this partitioning has z(x_i)⊤z(x_j) in its ijth entry and is an unbiased estimate of the kernel matrix. [sent-83, score-0.282]
53 We first describe a randomized mapping to approximate the “hat” kernel k_hat(x, y; δ) = max(0, 1 − |x − y|/δ) on a compact subset of R × R, then show how to construct mappings for more general separable multi-dimensional kernels. [sent-84, score-0.632]
54 Partition the real number line with a grid of pitch δ, and shift this grid randomly by an amount u drawn uniformly at random from [0, δ]. [sent-85, score-0.534]
55 This grid partitions the real number line into intervals [u + nδ, u + (n + 1)δ] for all integers n. [sent-86, score-0.198]
56 The probability that two points x and y fall in the same bin in this grid is max(0, 1 − |x − y|/δ) [13]. [sent-87, score-0.257]
57 In other words, if we number the bins of the grid so that a point x falls in bin x̂ = ⌊(x − u)/δ⌋ and y falls in bin ŷ = ⌊(y − u)/δ⌋, then Pr_u[x̂ = ŷ | δ] = k_hat(x, y; δ). [sent-88, score-0.604]
58 If we encode x̂ as a binary indicator vector z(x) over the bins, then z(x)⊤z(y) = 1 if x and y fall in the same bin and zero otherwise, so Pr_u[z(x)⊤z(y) = 1 | δ] = E_u[z(x)⊤z(y) | δ] = k_hat(x, y; δ). [sent-89, score-0.258]
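A small simulation of this collision probability (our own check, with an arbitrary pitch δ): the empirical frequency with which x and y land in the same randomly shifted bin matches k_hat(x, y; δ).

```python
import numpy as np

rng = np.random.default_rng(2)
x, y, delta = 0.3, 0.9, 1.5
u = rng.uniform(0.0, delta, size=200_000)                  # random shifts of the grid
same_bin = np.floor((x - u) / delta) == np.floor((y - u) / delta)
k_hat = max(0.0, 1.0 - abs(x - y) / delta)                 # hat kernel value, here 0.6
print(same_bin.mean(), k_hat)                              # agree up to Monte Carlo error
```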
59 Now consider shift-invariant kernels that can be written as convex combinations of hat kernels on a compact subset of R × R: k(x, y) = ∫_0^∞ k_hat(x, y; δ) p(δ) dδ. [sent-91, score-0.406]
60 If the pitch δ of the grid is sampled from p, z again gives a random map for k because E_{δ,u}[z(x)⊤z(y)] = E_δ[E_u[z(x)⊤z(y) | δ]] = E_δ[k_hat(x, y; δ)] = k(x, y). [sent-92, score-0.317]
61 That is, if the pitch δ of the grid is sampled from p, and the shift u is drawn uniformly from [0, δ] the probability that x and y are binned together is k(x, y). [sent-93, score-0.411]
62 Random maps for separable multivariate shift-invariant kernels of the form k(x − y) = Π_{m=1}^{d} k_m(|x^m − y^m|) (such as the multivariate Laplacian kernel) can be constructed in a similar way if each k_m can be written as a convex combination of hat kernels. [sent-97, score-0.354]
63 We apply the above binning process over each dimension of Rd independently. [sent-98, score-0.154]
64 The probability that x^m and y^m are binned together in dimension m is k_m(|x^m − y^m|). [sent-99, score-0.208]
65 Since the binning process is independent across dimensions, the probability that x and y are binned together in every dimension is Π_{m=1}^{d} k_m(|x^m − y^m|) = k(x − y). [sent-100, score-0.318]
66 In this multivariate case, z(x) encodes the integer vector [x̂^1, ···, x̂^d] corresponding to each bin of the d-dimensional grid as a binary indicator vector. [sent-101, score-0.216]
67 Since there are never more bins than training points, this ensures no overflow is possible. [sent-103, score-0.121]
68 We can again reduce the variance of the estimator z(x)⊤z(y) by concatenating P random binning functions z_p into a larger list of features z and scaling by √(1/P). [sent-104, score-0.335]
69 The inner product z(x)⊤z(y) = (1/P) Σ_{p=1}^{P} z_p(x)⊤ z_p(y) is the average of P independent estimates z_p(x)⊤ z_p(y) and therefore has lower variance. [sent-105, score-0.309]
70 Since each z_p(x)⊤ z_p(y) is binary, Hoeffding’s inequality guarantees that, for a fixed pair of points x and y, z(x)⊤z(y) converges exponentially quickly to k(x, y) as a function of P. [sent-106, score-0.155]
71 The proof of the claim (see the appendix) partitions M × M into a few small rectangular cells over which k(x, y) does not change much and z(x) and z(y) are constant. [sent-111, score-0.206]
72 A kernel function k(x, y) = Π_{m=1}^{d} k_m(|x^m − y^m|), so that p_m(∆) ≡ ∆ k̈_m(∆) is a probability distribution on ∆ ≥ 0. [sent-115, score-0.4]
73 Ensure: A randomized feature map z(x) so that z(x)⊤z(y) ≈ k(x − y). [sent-116, score-0.267]
74 for p = 1, ..., P do: Draw grid parameters δ, u ∈ R^d with pitch δ^m ∼ p_m and shift u^m drawn from the uniform distribution on [0, δ^m]. [sent-120, score-0.342]
75 Let z_p return the coordinate of the bin containing x as a binary indicator vector z_p(x) ≡ hash(⌊(x^1 − u^1)/δ^1⌋, ···, ⌊(x^d − u^d)/δ^d⌋). [sent-121, score-0.195]
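A sketch of this procedure for the multivariate Laplacian kernel exp(−‖x − y‖₁/λ), for which p_m works out to a Gamma(2, λ) density; rather than materializing the hashed indicator vectors, this estimate counts bin collisions directly, which equals the inner product z(x)⊤z(y) of Algorithm 2 under the √(1/P) scaling (our own illustration, with assumed names):

```python
import numpy as np

def binning_kernel_estimate(x, y, lam, P, rng):
    # Estimate exp(-||x - y||_1 / lam) by counting how often x and y share a bin
    # across P random multi-resolution grids (the collision rate equals z(x)^T z(y)).
    d, hits = len(x), 0
    for _ in range(P):
        delta = rng.gamma(shape=2.0, scale=lam, size=d)   # pitch delta^m ~ p_m = Gamma(2, lam)
        u = delta * rng.uniform(size=d)                   # shift u^m ~ Uniform[0, delta^m]
        same = np.floor((x - u) / delta) == np.floor((y - u) / delta)
        hits += np.all(same)                              # binned together in every dimension
    return hits / P

rng = np.random.default_rng(3)
x, y, lam = rng.normal(size=4), rng.normal(size=4), 2.0
print(binning_kernel_estimate(x, y, lam, P=50_000, rng=rng),
      np.exp(-np.sum(np.abs(x - y)) / lam))               # close for large P
```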
76 5 Experiments The experiments summarized in Table 1 show that ridge regression with our random features is a fast way to approximate the training of supervised kernel machines. [sent-123, score-0.62]
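A compact end-to-end sketch of this pipeline (random Fourier features followed by closed-form ridge regression); the toy data, bandwidth, and regularizer below are placeholders of ours, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, D, sigma, lam = 2000, 10, 300, 1.0, 1e-3

X = rng.normal(size=(n, d))                               # toy regression data
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Random Fourier features (cosine-with-random-offset variant) for a Gaussian kernel.
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Ridge regression: solve (Z^T Z + lam * I) w = Z^T y.
w = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)

X_test = rng.normal(size=(500, d))
Z_test = np.sqrt(2.0 / D) * np.cos(X_test @ W.T + b)
rmse = np.sqrt(np.mean((Z_test @ w - np.sin(X_test[:, 0])) ** 2))
print(rmse)                                               # should be small on this toy problem
```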
77 We focus our comparisons against the Core Vector Machine [14] because it was shown in [14] to be both faster and more accurate than other known approaches for training kernel machines, including, in most cases, random sampling of datapoints [8]. [sent-124, score-0.394]
78 [Table 1, datasets] CPU regression: 6,500 instances, 21 dims; Census regression: 18,000 instances, 119 dims; Adult classification: 32,000 instances, 123 dims; Forest Cover classification: 522,000 instances, 54 dims; KDDCUP99 (see footnote) classification: 4,900,000 instances, 127 dims. [sent-132, score-0.976]
79 [Table 1, methods; the numeric error and timing columns are fragmented in this extraction] Fourier+LS, Exact SVM, ASVM, SVMTorch, and SVM+sampling, with reported training times ranging from under a second to minutes. [sent-151, score-0.571]
80 Table 1: Comparison of testing error and training time between ridge regression with random features, Core Vector Machine, and various state-of-the-art exact methods reported in the literature. [sent-154, score-0.263]
81 For classification tasks, the percent of testing points incorrectly predicted is reported, and for regression tasks, the RMS error normalized by the norm of the ground truth. [sent-155, score-0.141]
82 On the Forest dataset, using random binning, doubling the dataset size reduces testing error by up to 40% (left). [sent-162, score-0.117]
83 Despite its simplicity, ridge regression with random features is faster than, and provides competitive accuracy with, alternative methods. [sent-167, score-0.249]
84 It also produces very compact functions because only w and a set of O(D) random vectors or a hash-table of partitions need to be retained. [sent-168, score-0.189]
85 On the other hand, random binning features perform better on memorization tasks (those for which the standard SVM requires many support vectors), because they explicitly preserve locality in the input space. [sent-170, score-0.347]
86 6 Conclusion We have presented randomized features whose inner products uniformly approximate many popular kernels. [sent-174, score-0.388]
87 We showed empirically that providing these features as input to a standard linear learning algorithm produces results that are competitive with state-of-the-art large-scale kernel machines in accuracy, training time, and evaluation time. [sent-175, score-0.544]
88 It is worth noting that hybrids of Fourier features and Binning features can be constructed by concatenating these features. [sent-176, score-0.219]
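Concretely, assuming z_fourier and z_binning are vector-valued feature maps like those sketched earlier (the names are ours, and both are assumed to return 1-D feature vectors for a single input point), such a hybrid is just a scaled concatenation, and its inner product averages the two kernel approximations:

```python
import numpy as np

def z_hybrid(x, z_fourier, z_binning):
    # z_hybrid(x)^T z_hybrid(y) = (z_f(x)^T z_f(y) + z_b(x)^T z_b(y)) / 2,
    # i.e. an equal-weight average of the two kernel estimates.
    return np.concatenate([z_fourier(x), z_binning(x)]) / np.sqrt(2.0)
```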
89 While we have focused on regression and classification, our features can be applied to accelerate other kernel methods, including semi-supervised and unsupervised learning algorithms. [sent-177, score-0.422]
90 In all of these cases, a significant computational speed-up can be achieved by first computing random features and then applying the associated linear technique. [sent-178, score-0.128]
91 Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. [sent-201, score-0.264]
92 Efficient kernel machines using the improved fast Gauss transform. [sent-244, score-0.433]
93 M_∆ is compact and has diameter at most twice diam(M), so we can find an ε-net that covers M_∆ using at most T = (4 diam(M)/r)^d balls of radius r [17]. [sent-297, score-0.698]
94 The union bound followed by Hoeffding’s inequality applied to the anchors in the ε-net gives Pr[∪_{i=1}^{T} |f(∆_i)| ≥ ε/8] ≤ 2T exp(−Dε²/2). [sent-303, score-0.118]
95 Let δpm be the pitch of the pth grid along the mth dimension. [sent-310, score-0.269]
96 Each grid has at most ⌈diam(M)/δ_{pm}⌉ bins, and P overlapping grids produce at most N_m = Σ_{g=1}^{P} ⌈diam(M)/δ_{gm}⌉ ≤ P + diam(M) Σ_{p=1}^{P} 1/δ_{pm} partitions along the mth dimension. [sent-311, score-0.342]
97 Denote by d_{mi} the width of the ith cell along the mth dimension and observe that Σ_{i=1}^{N_m} d_{mi} ≤ diam(M). [sent-315, score-0.212]
98 We further subdivide these cells into smaller rectangles of some small width r to ensure that the kernel k varies very little over each of these cells. [sent-316, score-0.32]
99 This results in at most Σ_{i=1}^{N_m} ⌈d_{mi}/r⌉ ≤ N_m + diam(M)/r small one-dimensional cells over each dimension. [sent-317, score-0.112]
100 The condition |z(x, y) − k(x, y)| ≤ ε on M × M holds if |z(x_i, y_i) − k(x_i, y_i)| ≤ ε − L_k r d and z(x) is constant throughout each rectangle. [sent-319, score-0.118]
wordName wordTfidf (topN-words)
[('diam', 0.529), ('kernel', 0.234), ('fourier', 0.23), ('mins', 0.219), ('secs', 0.176), ('binning', 0.154), ('randomized', 0.151), ('khat', 0.151), ('lf', 0.131), ('machines', 0.127), ('dims', 0.126), ('rd', 0.118), ('grid', 0.109), ('bin', 0.107), ('pitch', 0.1), ('pr', 0.092), ('inner', 0.092), ('cos', 0.089), ('km', 0.089), ('partitions', 0.089), ('zp', 0.088), ('cvm', 0.088), ('nm', 0.086), ('grids', 0.084), ('features', 0.083), ('claim', 0.081), ('pm', 0.077), ('dmi', 0.076), ('binned', 0.075), ('kernels', 0.072), ('fast', 0.072), ('svm', 0.068), ('regression', 0.068), ('training', 0.065), ('hoeffding', 0.064), ('map', 0.063), ('transform', 0.063), ('lk', 0.062), ('lifting', 0.06), ('mth', 0.06), ('ej', 0.06), ('hat', 0.056), ('bins', 0.056), ('shift', 0.056), ('compact', 0.055), ('feature', 0.053), ('forest', 0.053), ('concatenating', 0.053), ('ridge', 0.053), ('bochner', 0.05), ('hrs', 0.05), ('rectangles', 0.05), ('datapoints', 0.05), ('libsvm', 0.048), ('maps', 0.048), ('unbiased', 0.048), ('core', 0.048), ('operations', 0.047), ('laplacian', 0.047), ('trick', 0.046), ('random', 0.045), ('xi', 0.044), ('exp', 0.044), ('pru', 0.044), ('randomly', 0.044), ('xm', 0.044), ('transformed', 0.044), ('shifted', 0.043), ('classi', 0.042), ('instances', 0.042), ('diameter', 0.042), ('points', 0.041), ('mapping', 0.041), ('product', 0.041), ('balls', 0.04), ('inequality', 0.04), ('sup', 0.04), ('dataset', 0.04), ('xj', 0.038), ('drawn', 0.038), ('svmlight', 0.037), ('accelerate', 0.037), ('seattle', 0.037), ('quickly', 0.037), ('guarantees', 0.037), ('falls', 0.037), ('sin', 0.037), ('cells', 0.036), ('ls', 0.035), ('input', 0.035), ('union', 0.034), ('assertion', 0.034), ('centers', 0.033), ('uniformly', 0.033), ('twice', 0.032), ('testing', 0.032), ('support', 0.03), ('eu', 0.03), ('appendix', 0.03), ('popular', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 160 nips-2007-Random Features for Large-Scale Kernel Machines
Author: Ali Rahimi, Benjamin Recht
Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1
2 0.19241922 77 nips-2007-Efficient Inference for Distributions on Permutations
Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas
Abstract: Permutations are ubiquitous in many real world problems, such as voting, rankings and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact representations such as graphical models cannot efficiently capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent such distributions compactly. We present Kronecker conditioning, a general and efficient approach for maintaining these distributions directly in the Fourier domain. Low order Fourier-based approximations can lead to functions that do not correspond to valid distributions. To address this problem, we present an efficient quadratic program defined directly in the Fourier domain to project the approximation onto a relaxed form of the marginal polytope. We demonstrate the effectiveness of our approach on a real camera-based multi-people tracking setting. 1
3 0.13310128 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan
Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1
4 0.12719476 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
5 0.12391949 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
Author: Ronny Luss, Alexandre D'aspremont
Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.
6 0.10423196 35 nips-2007-Bayesian binning beats approximate alternatives: estimating peri-stimulus time histograms
7 0.091495991 109 nips-2007-Kernels on Attributed Pointsets with Applications
8 0.090130515 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
9 0.080937169 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
10 0.076588131 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
11 0.076445378 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
12 0.076435529 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
13 0.072516441 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
14 0.072135985 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l \alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections
15 0.071324594 134 nips-2007-Multi-Task Learning via Conic Programming
16 0.069763243 116 nips-2007-Learning the structure of manifolds using random projections
17 0.068810582 7 nips-2007-A Kernel Statistical Test of Independence
18 0.068093516 112 nips-2007-Learning Monotonic Transformations for Classification
19 0.065661773 16 nips-2007-A learning framework for nearest neighbor search
20 0.062878646 62 nips-2007-Convex Learning with Invariances
topicId topicWeight
[(0, -0.222), (1, 0.038), (2, -0.097), (3, 0.12), (4, -0.037), (5, 0.012), (6, 0.022), (7, 0.036), (8, -0.098), (9, 0.117), (10, 0.041), (11, -0.022), (12, 0.074), (13, 0.032), (14, -0.058), (15, -0.139), (16, 0.057), (17, -0.001), (18, -0.014), (19, 0.129), (20, 0.06), (21, 0.038), (22, 0.067), (23, 0.15), (24, 0.014), (25, -0.104), (26, -0.062), (27, 0.025), (28, -0.013), (29, -0.019), (30, 0.08), (31, 0.1), (32, 0.193), (33, 0.038), (34, -0.003), (35, 0.072), (36, 0.135), (37, -0.003), (38, -0.119), (39, -0.003), (40, 0.019), (41, -0.211), (42, 0.216), (43, -0.126), (44, -0.166), (45, 0.123), (46, -0.15), (47, -0.041), (48, -0.024), (49, -0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.93218595 160 nips-2007-Random Features for Large-Scale Kernel Machines
Author: Ali Rahimi, Benjamin Recht
Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1
2 0.74156356 77 nips-2007-Efficient Inference for Distributions on Permutations
Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas
Abstract: Permutations are ubiquitous in many real world problems, such as voting, rankings and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact representations such as graphical models cannot efficiently capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent such distributions compactly. We present Kronecker conditioning, a general and efficient approach for maintaining these distributions directly in the Fourier domain. Low order Fourier-based approximations can lead to functions that do not correspond to valid distributions. To address this problem, we present an efficient quadratic program defined directly in the Fourier domain to project the approximation onto a relaxed form of the marginal polytope. We demonstrate the effectiveness of our approach on a real camera-based multi-people tracking setting. 1
3 0.59363467 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang
Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n2 ) to O(np/m), and improves computation time to O(np2 /m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.
4 0.57809603 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
5 0.54096574 109 nips-2007-Kernels on Attributed Pointsets with Applications
Author: Mehul Parsana, Sourangshu Bhattacharya, Chiru Bhattacharya, K. Ramakrishnan
Abstract: This paper introduces kernels on attributed pointsets, which are sets of vectors embedded in an euclidean space. The embedding gives the notion of neighborhood, which is used to define positive semidefinite kernels on pointsets. Two novel kernels on neighborhoods are proposed, one evaluating the attribute similarity and the other evaluating shape similarity. Shape similarity function is motivated from spectral graph matching techniques. The kernels are tested on three real life applications: face recognition, photo album tagging, and shot annotation in video sequences, with encouraging results. 1
6 0.53388584 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
7 0.5311386 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
8 0.47410482 35 nips-2007-Bayesian binning beats approximate alternatives: estimating peri-stimulus time histograms
9 0.44666752 16 nips-2007-A learning framework for nearest neighbor search
10 0.43712902 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
11 0.41214439 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
12 0.37723827 24 nips-2007-An Analysis of Inference with the Universum
13 0.37559715 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
14 0.36875474 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection
16 0.35846162 112 nips-2007-Learning Monotonic Transformations for Classification
17 0.34241045 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
18 0.34217355 62 nips-2007-Convex Learning with Invariances
19 0.32370567 134 nips-2007-Multi-Task Learning via Conic Programming
20 0.30403838 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
topicId topicWeight
[(5, 0.032), (13, 0.044), (16, 0.019), (19, 0.011), (21, 0.051), (31, 0.018), (34, 0.017), (35, 0.458), (47, 0.054), (49, 0.02), (83, 0.127), (85, 0.022), (87, 0.02), (90, 0.048)]
simIndex simValue paperId paperTitle
1 0.85963905 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
Author: Richard Turner, Maneesh Sahani
Abstract: Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10−1 s); glottal pulses (∼ 10−2 s); and formants ( 10−3 s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscienceinspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task. 1
same-paper 2 0.85621905 160 nips-2007-Random Features for Large-Scale Kernel Machines
Author: Ali Rahimi, Benjamin Recht
Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1
3 0.83740509 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
Author: Anton Chechetka, Carlos Guestrin
Abstract: We present the first truly polynomial algorithm for PAC-learning the structure of bounded-treewidth junction trees – an attractive subclass of probabilistic graphical models that permits both the compact representation of probability distributions and efficient exact inference. For a constant treewidth, our algorithm has polynomial time and sample complexity. If a junction tree with sufficiently strong intraclique dependencies exists, we provide strong theoretical guarantees in terms of KL divergence of the result from the true distribution. We also present a lazy extension of our approach that leads to very significant speed ups in practice, and demonstrate the viability of our method empirically, on several real world datasets. One of our key new theoretical insights is a method for bounding the conditional mutual information of arbitrarily large sets of variables with only polynomially many mutual information computations on fixed-size subsets of variables, if the underlying distribution can be approximated by a bounded-treewidth junction tree. 1
4 0.813182 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
Author: Andreas Argyriou, Massimiliano Pontil, Yiming Ying, Charles A. Micchelli
Abstract: Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. In addition, we provide a necessary and sufficient condition for convexity of the regularizer. We analyze concrete examples of the framework, which are equivalent to regularization with Lp matrix norms. Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on state of the art statistical performance. 1
5 0.54644704 16 nips-2007-A learning framework for nearest neighbor search
Author: Lawrence Cayton, Sanjoy Dasgupta
Abstract: Can we leverage learning techniques to build a fast nearest-neighbor (ANN) retrieval data structure? We present a general learning framework for the NN problem in which sample queries are used to learn the parameters of a data structure that minimize the retrieval time and/or the miss rate. We explore the potential of this novel framework through two popular NN data structures: KD-trees and the rectilinear structures employed by locality sensitive hashing. We derive a generalization theory for these data structure classes and present simple learning algorithms for both. Experimental results reveal that learning often improves on the already strong performance of these data structures. 1
6 0.52347857 96 nips-2007-Heterogeneous Component Analysis
7 0.513237 175 nips-2007-Semi-Supervised Multitask Learning
8 0.51003754 134 nips-2007-Multi-Task Learning via Conic Programming
9 0.49728465 156 nips-2007-Predictive Matrix-Variate t Models
10 0.49334756 49 nips-2007-Colored Maximum Variance Unfolding
11 0.48005676 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
12 0.47866935 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
13 0.47845009 63 nips-2007-Convex Relaxations of Latent Variable Training
14 0.47711968 116 nips-2007-Learning the structure of manifolds using random projections
15 0.47459534 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
16 0.47420695 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
17 0.47307643 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking
18 0.47190952 70 nips-2007-Discriminative K-means for Clustering
19 0.46609709 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
20 0.46133852 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels