nips nips2013 nips2013-156 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Corinna Cortes, Marius Kloft, Mehryar Mohri
Abstract: We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks.
Reference: text
sentIndex sentText sentNum sentScore
1 We use the notion of local Rademacher complexity to design new algorithms for learning kernels. [sent-6, score-0.268]
2 Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. [sent-7, score-0.162]
3 We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. [sent-8, score-0.645]
4 For such algorithms, the features are provided intrinsically via the choice of a positive-semi-definite symmetric kernel function, which can be interpreted as a similarity measure in a high-dimensional Hilbert space. [sent-11, score-0.297]
5 In the standard setting of these algorithms, the choice of the kernel is left to the user. [sent-12, score-0.297]
6 In the last decade or so, a number of algorithms and theoretical results have been given for a wider setting known as that of learning kernels or multiple kernel learning (MKL) (e. [sent-14, score-0.569]
7 That setting, instead of requiring the user to take the risk of specifying a particular kernel function, only asks the user to provide a family of kernels. [sent-17, score-0.377]
8 Both tasks of selecting the kernel out of that family of kernels and choosing a hypothesis based on that kernel are then left to the learning algorithm. [sent-18, score-0.93]
9 One of the most useful data-dependent complexity measures used in the theoretical analysis and design of learning kernel algorithms is the notion of Rademacher complexity (e. [sent-19, score-0.567]
10 These generalization bounds provide a strong theoretical foundation for a family of learning kernel algorithms based on a non-negative linear combination of base kernels. [sent-23, score-0.429]
11 Most of these algorithms, whether for binary classification or multi-class classification, are based on controlling the trace of the combined kernel matrix. [sent-24, score-0.352]
12 This paper seeks to use a finer notion of complexity for the design of algorithms for learning kernels: the notion of local Rademacher complexity [11, 12]. [sent-25, score-0.408]
13 The notion of local Rademacher complexity is precisely based on this idea by considering Rademacher averages of smaller subsets of the hypothesis set. [sent-27, score-0.273]
14 We show how the notion of local Rademacher complexity can be used to guide the design of new algorithms for learning kernels. [sent-33, score-0.268]
15 For kernel-based hypotheses, the local Rademacher complexity can be both upper- and lower-bounded in terms of the tail sum of the eigenvalues of the kernel matrix [13]. [sent-34, score-0.7]
16 This motivates the introduction of two natural families of hypotheses based on non-negative combinations of base kernels with kernels constrained by a tail sum of the eigenvalues. [sent-35, score-0.736]
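The tail-sum characterization above can be made concrete numerically. The sketch below is not the paper's code: it uses the empirical eigenvalues of the normalized Gram matrix K/n as a stand-in for the eigenvalues of the kernel operator, and the bound sqrt((2/n) Σ_j min(r, λ_j)) cited from [13]; the data are hypothetical.

```python
import numpy as np

# A sketch, not the paper's code: evaluate the tail-sum upper bound
# sqrt((2/n) * sum_j min(r, lambda_j)) on the local Rademacher complexity
# of a kernel class [13], with empirical eigenvalues of K/n as stand-ins.

def local_complexity_bound(K, r):
    """Tail-sum upper bound on the local Rademacher complexity at radius r."""
    n = K.shape[0]
    lam = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)  # guard round-off negatives
    return np.sqrt(2.0 / n * np.minimum(r, lam).sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
K = X @ X.T                        # a rank-3 Gram matrix (hypothetical data)
# The bound is nondecreasing in the radius r ...
assert local_complexity_bound(K, 0.01) <= local_complexity_bound(K, 0.1)
# ... and for very large r it reduces to the global, trace-based quantity.
assert local_complexity_bound(K, 1e12) <= np.sqrt(2.0 / 100 * np.trace(K) / 100) + 1e-9
```

Small radii r cut off the large eigenvalues, which is exactly how localization improves on the trace-only bound.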
17 We study and compare both families of hypotheses and derive learning kernel algorithms based on both. [sent-36, score-0.39]
18 We show how that problem can be solved using optimization solutions for existing learning kernel algorithms. [sent-38, score-0.297]
19 In Section 2, we present some background on the notion of local Rademacher complexity by summarizing the main results relevant to our theoretical analysis and the design of our algorithms. [sent-42, score-0.244]
20 Section 3 describes and analyzes two new kernel learning algorithms, as just discussed. [sent-43, score-0.297]
21 2 Background on local Rademacher complexity In this section, we present an introduction to local Rademacher complexities and related properties. [sent-46, score-0.267]
22 The empirical Rademacher complexity is defined as Rn(G) := E_σ[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(x_i) ]. Generalization bounds based on the notion of Rademacher complexity are standard [7]. [sent-73, score-0.168]
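As a minimal illustration of the quantity just discussed, the Monte Carlo sketch below estimates the empirical Rademacher complexity of a small finite class; the class {g, −g} and the sample are hypothetical, not from the paper.

```python
import numpy as np

# A minimal sketch (hypothetical class and sample): Monte Carlo estimate of
# E_sigma[ max_g (1/n) sum_i sigma_i g(x_i) ] over a finite function class,
# where the sigma_i are i.i.d. Rademacher (+/-1) signs.

def empirical_rademacher(values, n_draws=2000, seed=0):
    """`values[g]` holds (g(x_1), ..., g(x_n)) for each function g in the class."""
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # For each sign draw, take the sup (here an exact max) over the finite class.
    return (sigma @ values.T / n).max(axis=1).mean()

g = np.linspace(-1.0, 1.0, 30)        # one function evaluated on 30 points
values = np.stack([g, -g])            # the symmetric class {g, -g}
rad = empirical_rademacher(values)
assert 0.0 <= rad <= 1.0              # bounded by max_i |g(x_i)| = 1
```

For this symmetric class the supremum equals |(1/n) Σ_i σ_i g(x_i)|, so the estimate shrinks at roughly the O(1/√n) rate mentioned below.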
23 In particular, for the empirical risk minimization (ERM) hypothesis ĝn, for any δ > 0, the following bound holds with probability at least 1 − δ: E[ĝn] − E[g*] ≤ 2 sup_{g∈G} (E[g] − Ê[g]) ≤ 4 Rn(G) + √(2 log(2/δ) / n). [sent-74, score-0.248]
24 (1) Rn(G) is of the order O(1/√n) for various classes used in practice, including when F is a kernel class with bounded trace and when the loss ℓ is Lipschitz. [sent-75, score-0.424]
25 Because a smaller class Gn has smaller Rademacher complexity Rn(Gn), the bound (2) can be sharper than (1). [sent-85, score-0.145]
26 But how can we find a small class Gn that is just large enough to contain gn ? [sent-87, score-0.145]
27 For any r > 0, the local Rademacher complexity of G is defined as Rn(G; r) := Rn({g ∈ G : E[g²] ≤ r}). [sent-94, score-0.148]
28 If the local Rademacher complexity is known, it can be used to compare ĝn with g*, as E[ĝn] − E[g*] can be bounded in terms of the fixed point of the Rademacher complexity of F, up to constants and O(1/n) terms. [sent-95, score-0.334]
29 But, while the global Rademacher complexity is generally of the order O(1/√n) at best, its local counterpart can converge at rates up to O(1/n). [sent-96, score-0.148]
30 2 Kernel classes The local Rademacher complexity for kernel classes can be accurately described and shown to admit a simple expression in terms of the eigenvalues of the kernel [13] (cf. [sent-99, score-0.928]
31 Let k be a Mercer kernel with corresponding feature map φk and reproducing kernel Hilbert space Hk. [sent-103, score-0.594]
32 Let k(x, x̃) = Σ_{j=1}^∞ λj φj(x)⊤ φj(x̃) be its eigenvalue decomposition, where (λj)_{j=1}^∞ is the sequence of eigenvalues arranged in descending order. [sent-104, score-0.169]
33 In view of (3), the local Rademacher complexity for kernel classes is determined by the tail sum of the eigenvalues. [sent-110, score-0.628]
34 A core idea of the proof is to optimize over the “cut-off point” θ of the tail sum of the eigenvalues in the bound. [sent-111, score-0.255]
35 But, when Σ_{j>θ} λj = O(exp(−θ)), as in the case of Gaussian kernels [14], then O(min_θ (θr + exp(−θ))) = O(r log(1/r)). [sent-115, score-0.248]
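This rate can be checked numerically: setting the derivative of θ ↦ θr + exp(−θ) to zero gives θ* = log(1/r), hence a minimum value of r(log(1/r) + 1) = O(r log(1/r)). A small sketch (not from the paper):

```python
import numpy as np

# A sketch (not from the paper): verify that the minimum over the cut-off
# point theta of  theta*r + exp(-theta)  equals r*(log(1/r) + 1), which is
# the O(r log(1/r)) rate quoted for exponentially decaying tail sums.

def tail_bound_min(r):
    """Minimize theta*r + exp(-theta) over a fine grid of cut-off points."""
    thetas = np.linspace(0.0, 50.0, 200001)
    return np.min(thetas * r + np.exp(-thetas))

for r in [1e-1, 1e-2, 1e-3]:
    closed_form = r * (np.log(1.0 / r) + 1.0)
    assert abs(tail_bound_min(r) - closed_form) < 1e-6 * closed_form + 1e-9
```

The grid minimum matches the closed form to high accuracy for each radius r tested.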
36 3 Algorithms In this section, we will use the properties of the local Rademacher complexity just discussed to devise a novel family of algorithms for learning kernels. [sent-118, score-0.2]
37 1 Motivation and analysis Most learning kernel algorithms are based on a family of hypotheses defined by a kernel kμ = Σ_{m=1}^M μm km that is a non-negative linear combination of M base kernels. [sent-120, score-1.099]
It is known that the Rademacher complexity of H can be upper-bounded in terms of the trace of the combined kernel. [sent-123, score-0.13]
Thus, most existing algorithms for learning kernels [1, 4, 6] add the following constraint to restrict H: Tr(kμ) ≤ 1. [sent-124, score-0.296]
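A minimal sketch of this trace-constrained combination, assuming the base kernels are given as Gram matrices (the weight vector and data here are hypothetical):

```python
import numpy as np

# A sketch (hypothetical weights and data): form the non-negative combination
# k_mu = sum_m mu_m K_m and enforce the trace constraint Tr(k_mu) <= 1 of (4)
# by rescaling the combined Gram matrix.

def combine_kernels(kernels, mu):
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0.0), "MKL weights must be non-negative"
    K_mu = sum(w * K for w, K in zip(mu, kernels))
    return K_mu / np.trace(K_mu)   # rescale so the trace constraint is tight

rng = np.random.default_rng(0)
Xs = [rng.standard_normal((20, d)) for d in (2, 3)]
kernels = [Xm @ Xm.T for Xm in Xs]          # two PSD base kernels
K_mu = combine_kernels(kernels, mu=[0.3, 0.7])
assert np.isclose(np.trace(K_mu), 1.0)
```

The combined matrix stays positive semi-definite because non-negative combinations of PSD matrices are PSD.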
40 (4) As we saw in the previous section, however, the tail sum of the eigenvalues of the kernel, rather than its trace, determines the local Rademacher complexity. [sent-125, score-0.328]
Since the local Rademacher complexity can lead to tighter generalization bounds than the global Rademacher complexity, this motivates us to consider the following hypothesis class for learning kernels: H1 := { f_{w,kμ} ∈ H : Σ_{j>θ} λj(kμ) ≤ 1 }. [sent-126, score-0.296]
The tail sum Σ_{j>θ} λj(k), however, is concave, since it can be expressed as the difference of the trace and the sum of the θ largest eigenvalues, the latter being a convex function. [sent-130, score-0.168]
Nevertheless, the following upper bound holds. Denoting μ̃m := μm / ‖μ‖1, Σ_{m=1}^M Σ_{j>θ} μm λj(km) = ‖μ‖1 Σ_{m=1}^M μ̃m Σ_{j>θ} λj(km) ≤ Σ_{j>θ} λj(‖μ‖1 Σ_{m=1}^M μ̃m km) = Σ_{j>θ} λj(kμ), (5) where the equality holds by linearity and the inequality by the concavity just discussed. [sent-131, score-0.741]
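The concavity argument behind (5) can be sanity-checked numerically: since the tail sum is concave and positively homogeneous, it is superadditive on the PSD cone. A sketch with random PSD matrices (not the paper's data):

```python
import numpy as np

# Numerical sanity check: the tail sum is concave and positively homogeneous,
# hence superadditive on the PSD cone:
#   tail(mu1*K1 + mu2*K2) >= mu1*tail(K1) + mu2*tail(K2)   for mu1, mu2 >= 0.

def tail_sum(K, theta):
    lam = np.linalg.eigvalsh(K)        # eigenvalues in ascending order
    return lam[:-theta].sum()          # drop the theta largest

rng = np.random.default_rng(1)
theta = 2
for _ in range(20):
    A = rng.standard_normal((15, 6))
    B = rng.standard_normal((15, 6))
    K1, K2 = A @ A.T, B @ B.T          # random PSD matrices
    mu1, mu2 = rng.uniform(0.1, 2.0, size=2)
    lhs = tail_sum(mu1 * K1 + mu2 * K2, theta)
    rhs = mu1 * tail_sum(K1, theta) + mu2 * tail_sum(K2, theta)
    assert lhs >= rhs - 1e-8
```

Superadditivity follows because the sum of the θ largest eigenvalues (a Ky Fan norm) is subadditive while the trace is linear.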
This suggests the alternative class H2 := { f_{w,kμ} ∈ H : Σ_{m=1}^M μm Σ_{j>θ} λj(km) ≤ 1 }. The class H2 is convex because it is the restriction of the convex class H via a linear inequality constraint. [sent-133, score-0.17]
45 Thus, in the next section, we will consider both hypothesis sets and introduce two distinct learning kernel algorithms, each based on one of these families. [sent-157, score-0.357]
2 Convex optimization algorithm The simpler algorithm performs regularized empirical risk minimization based on the convex class H2. [sent-159, score-0.137]
Note that by a renormalization of the kernels k1, …, kM according to k̃m := (Σ_{j>θ} λj(km))^{-1} km and k̃μ = Σ_{m=1}^M μm k̃m, we can simply rewrite H2 as H̃2 := { f_{w,k̃μ} = (x ↦ ⟨w, φ_{k̃μ}(x)⟩) : ‖w‖_{H_{k̃μ}} ≤ Λ, μ ⪰ 0, ‖μ‖1 ≤ 1 }, (6) which is the commonly studied hypothesis class in multiple kernel learning. [sent-160, score-0.248] [sent-163, score-1.074] [sent-164, score-0.391]
Of course, in practice, we replace the kernel k by its empirical version, the kernel matrix K = (k(xi, xj))_{i,j=1}^n, and consider λ1, …, λn as the eigenvalues of the kernel matrix and not of the kernel itself. [sent-165, score-0.594] [sent-168, score-0.704]
for m = 1, …, M, normalize the kernel matrices according to Km := (Σ_{j>θ} λj(Km))^{-1} Km; [sent-177, score-0.297]
Note that the tail sum can be computed in O(n²θ) for each kernel, because it suffices to compute the θ largest eigenvalues and the trace: Σ_{j>θ} λj(Km) = Tr(Km) − Σ_{j=1}^θ λj(Km). [sent-179, score-0.552]
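The tail-sum computation via the trace identity, and the normalization of the base kernels, can be sketched in plain NumPy (a sketch, not the paper's implementation; the downstream ℓp-norm MKL solver is omitted, and for brevity the full spectrum is computed even though only the θ largest eigenvalues are needed):

```python
import numpy as np

# A sketch in plain NumPy (the downstream l_p-norm MKL solver is not
# reproduced): the tail sum via the trace identity, and the rescaling of
# each base kernel so that its tail sum equals 1.

def tail_sum(K, theta):
    """sum_{j>theta} lambda_j(K) = Tr(K) - sum of the theta largest eigenvalues."""
    lam = np.linalg.eigvalsh(K)                    # ascending order
    top = lam[-theta:].sum() if theta > 0 else 0.0
    return np.trace(K) - top

def normalize_kernels(kernels, theta):
    """Rescale each base kernel so that its tail sum equals 1."""
    return [K / tail_sum(K, theta) for K in kernels]

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
K = X @ X.T                                        # hypothetical Gram matrix
assert np.isclose(tail_sum(K, 0), np.trace(K))     # theta = 0 recovers the trace
Kn = normalize_kernels([K], theta=2)[0]
assert np.isclose(tail_sum(Kn, 2), 1.0)
```

In practice a partial eigensolver (e.g. a Lanczos routine) would compute only the θ largest eigenvalues, giving the O(n²θ) cost stated above.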
(9) A very similar optimality expression has been used in the context of the group Lasso and ℓp-norm multiple kernel learning by [3]. [sent-212, score-0.361]
55 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: input: kernel matrix K = (k(xi , xj ))n i,j=1 and labels y1 , . [sent-218, score-0.297]
initialize μ1, …, μM; while optimality conditions are not satisfied within tolerance ε do: SVM training: compute a new α by solving the SVM problem (10); eigenvalue computation: compute eigenvalues u1, … [sent-224, score-0.142]
It should be clear that in the same way we can replace the familiar ℓp-regularization used in learning kernel algorithms [3] for p ≥ 1 with ℓp-regularization in terms of the tail eigenvalues. [sent-230, score-0.463]
Assume that the kernels are uniformly bounded (for all m, ‖k̃m‖∞ < ∞) and uncorrelated. [sent-240, score-0.248]
Then, the local Rademacher complexity of H̃2 can be bounded as follows: [sent-241, score-0.148]
Note that we show the result under the assumption of uncorrelated kernels only for simplicity of presentation. [sent-246, score-0.248]
More generally, a similar result holds for correlated kernels and arbitrary p ≥ 1 (cf. [sent-247, score-0.248]
Subsequently, we can derive the following bound on the excess risk from Theorem 6, using a result of [11] (presented as Theorem 8 in the supplemental material). [sent-249, score-0.171]
Assume that for all m, there exists d such that λj(k̃m) ≤ d j^{−α} for some α > 1 (this is a common assumption and, for example, met for finite rank kernels and Gaussian kernels [14]). [sent-252, score-0.496]
Figure 2. LEFT: AUC as a function of the sample size n (100, 250, 1,000) for l1, l2, unif, conv, and dc. [sent-257, score-0.81]
CENTER: for each kernel, the average kernel weight and single-kernel AUC. [sent-266, score-0.297]
RIGHT: for each kernel Km, the tail sum Σ_{j>θ} λj as a function of the eigenvalue cut-off point θ. [sent-267, score-0.474]
67 5 Experiments In this section, we report the results of experiments with the two algorithms we introduced, which we will denote by conv and dc in short. [sent-272, score-0.418]
68 We also measure the performance of the uniform kernel combination, denoted by unif, which has frequently been shown to achieve competitive performances [22]. [sent-274, score-0.297]
The SVM based on the uniform combination of these 5 kernels was found to have the highest overall performance among 19 promoter prediction programs [24]; it therefore constitutes a strong baseline. [sent-281, score-0.35]
70 All kernel matrices Km were normalized such that Tr(Km ) = n for all m, prior to the experiment. [sent-283, score-0.297]
71 For both conv and dc, we experiment with `1 - and `2 -norms. [sent-285, score-0.258]
72 For all sample sizes investigated, conv and dc yield the highest AUCs. [sent-298, score-0.392]
73 To further investigate, we compare the average kernel weights µ output by the compared algorithms (for n = 100). [sent-300, score-0.321]
They are shown in Figure 2 (center), where, below each kernel, we also report its AUC when training an SVM on that single kernel alone. [sent-301, score-0.297]
75 We observe that l1 focuses on the TSS kernel using the TSS signal, which has the second highest AUC among the kernels (85. [sent-302, score-0.569]
A similar order of kernel importance is determined by l2, which however distributes the weights more broadly. Table 1: The training split (sp) fraction, dataset size (n), and multi-class accuracies shown with ±1 standard error. [sent-306, score-0.297]
77 The performance results for MKL and conv correspond to the best values obtained using either `1 -norm or `2 -norm regularization. [sent-307, score-0.258]
Table 1 header: sp, n, unif, MKL, conv, θ; datasets: plant, nonpl, psortPos, psortNeg, protein. [sent-308, score-0.476]
In contrast, conv and dc distribute their weight only over the TSS, Promoter, and 1st Exon kernels, which are also the kernels with the highest predictive accuracies. [sent-344, score-0.64]
80 The considerably weaker kernels Angle and Energ are discarded. [sent-345, score-0.248]
This can be explained by means of Figure 2 (right), where we show the tail sum of each kernel as a function of the cut-off point θ. [sent-347, score-0.442]
We observe that Angle and Energ have only moderately large first and second eigenvalues, which is why they hardly profit from conv or dc. [sent-348, score-0.258]
The Promo and Exon kernels, however, which are discarded by l1, have large first (and also second) eigenvalues, which is why they are promoted by conv or dc. [sent-349, score-0.258]
Indeed, the model selection determines the optimal cut-off, for both conv and dc, at θ = 1. [sent-350, score-0.258]
2 Multi-class Experiments We next carried out a series of experiments with the conv algorithm in the multi-class classification setting, which has repeatedly been shown to be amenable to MKL [26, 27]. [sent-352, score-0.258]
2, the conv problem can be solved by simply re-normalizing the kernels by the tail sum of their eigenvalues and making use of any ℓp-norm MKL solver. [sent-354, score-0.761]
For both conv and ufo, we experiment with both ℓ1 and ℓ2 regularization and report the best performance achieved in each case. [sent-358, score-0.351]
88 For plant, psortPos, and psortNeg, the results show that conv leads to a consistent improvement in a difficult multi-class setting, although we cannot attest to their significance due to the insufficient size of the data sets. [sent-368, score-0.258]
89 6 Conclusion We showed how the notion of local Rademacher complexity can be used to derive new algorithms for learning kernels by using a regularization based on the tail sum of the eigenvalues of the kernels. [sent-370, score-0.74]
90 We introduced two natural hypothesis sets based on that regularization, discussed their relationships, and showed how they can be used to design an algorithm based on a convex optimization and one based on solving a DC-programming problem. [sent-371, score-0.142]
91 Finally, our analysis based on local Rademacher complexity could be used as the basis for the design of new learning kernel algorithms. [sent-374, score-0.476]
92 Jordan, “Multiple kernel learning, conic duality, and the SMO algorithm,” in Proc. [sent-389, score-0.297]
93 Zien, “`p -norm multiple kernel learning,” Journal of Machine Learning Research, vol. [sent-400, score-0.297]
94 Jordan, “Learning the kernel matrix with semi-definite programming,” JMLR, vol. [sent-410, score-0.297]
Schölkopf, “Large scale multiple kernel learning,” Journal of Machine Learning Research, vol. [sent-428, score-0.297]
96 Campbell, “Generalization bounds for learning the kernel problem,” in COLT, 2009. [sent-449, score-0.325]
97 Blanchard, “On the convergence rate of `p -norm multiple kernel learning,” Journal of Machine Learning Research, vol. [sent-508, score-0.297]
98 Suzuki, “Unifying framework for fast learning rate of non-sparse multiple kernel learning,” in Advances in Neural Information Processing Systems 24, pp. [sent-520, score-0.297]
99 Jie, “Ultra-fast optimization algorithm for sparse multi kernel learning,” in Proceedings of the 28th International Conference on Machine Learning, 2011. [sent-555, score-0.297]
100 Ong, “Multiclass multiple kernel learning,” in ICML 24, pp. [sent-559, score-0.297]
wordName wordTfidf (topN-words)
[('rademacher', 0.443), ('km', 0.358), ('kernel', 0.297), ('conv', 0.258), ('kernels', 0.248), ('tss', 0.244), ('mkl', 0.176), ('kwkhk', 0.155), ('energ', 0.133), ('tail', 0.119), ('gn', 0.111), ('dc', 0.11), ('eigenvalues', 0.11), ('exon', 0.089), ('kloft', 0.089), ('promo', 0.089), ('sonnenburg', 0.078), ('promoter', 0.078), ('zien', 0.078), ('complexity', 0.075), ('unif', 0.074), ('bn', 0.073), ('local', 0.073), ('mendelson', 0.072), ('tsch', 0.072), ('hypotheses', 0.069), ('auc', 0.069), ('mercer', 0.067), ('psortneg', 0.067), ('psortpos', 0.067), ('ufo', 0.067), ('notion', 0.065), ('uj', 0.064), ('hypothesis', 0.06), ('angle', 0.059), ('trace', 0.055), ('dca', 0.054), ('aucs', 0.054), ('erm', 0.054), ('koltchinskii', 0.054), ('svm', 0.054), ('risk', 0.052), ('convex', 0.051), ('hw', 0.051), ('cortes', 0.048), ('transcription', 0.048), ('plant', 0.048), ('street', 0.048), ('mohri', 0.046), ('complexities', 0.046), ('sharper', 0.045), ('nonpl', 0.044), ('proteinfold', 0.044), ('shogun', 0.044), ('subdifferential', 0.043), ('supplemental', 0.041), ('toolbox', 0.039), ('rangarajan', 0.039), ('corinna', 0.039), ('classes', 0.038), ('un', 0.037), ('courant', 0.036), ('hv', 0.036), ('concave', 0.036), ('bartlett', 0.035), ('class', 0.034), ('classi', 0.034), ('yuille', 0.032), ('vapnik', 0.032), ('bioinformatics', 0.032), ('eigenvalue', 0.032), ('design', 0.031), ('ny', 0.029), ('lanckriet', 0.029), ('material', 0.029), ('family', 0.028), ('ex', 0.028), ('bounds', 0.028), ('rn', 0.028), ('protein', 0.027), ('descending', 0.027), ('tr', 0.027), ('base', 0.026), ('report', 0.026), ('vm', 0.026), ('accessible', 0.026), ('mum', 0.026), ('sum', 0.026), ('rm', 0.026), ('generalization', 0.026), ('proposition', 0.025), ('sp', 0.025), ('bound', 0.025), ('algorithms', 0.024), ('highest', 0.024), ('sch', 0.024), ('excess', 0.024), ('constraint', 0.024), ('biologically', 0.023), ('familiar', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
Author: Corinna Cortes, Marius Kloft, Mehryar Mohri
Abstract: We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks. 1
2 0.15817037 289 nips-2013-Scalable kernels for graphs with continuous attributes
Author: Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, Karsten Borgwardt
Abstract: While graphs with continuous node attributes arise in many applications, stateof-the-art graph kernels for comparing continuous-attributed graphs suffer from a high runtime complexity. For instance, the popular shortest path kernel scales as O(n4 ), where n is the number of nodes. In this paper, we present a class of graph kernels with computational complexity O(n2 (m + log n + δ 2 + d)), where δ is the graph diameter, m is the number of edges, and d is the dimension of the node attributes. Due to the sparsity and small diameter of real-world graphs, these kernels typically scale comfortably to large graphs. In our experiments, the presented kernels outperform state-of-the-art kernels in terms of speed and accuracy on classification benchmark datasets. 1
3 0.12935074 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
Author: Le Song, Bo Dai
Abstract: Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernels embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. 1
4 0.12119702 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1
5 0.10778508 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
Author: Qichao Que, Mikhail Belkin
Abstract: q We address the problem of estimating the ratio p where p is a density function and q is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1
6 0.10739779 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
7 0.10440312 171 nips-2013-Learning with Noisy Labels
8 0.09349066 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
9 0.093119107 173 nips-2013-Least Informative Dimensions
10 0.090936139 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
11 0.088557638 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
12 0.083899789 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
13 0.081064373 9 nips-2013-A Kernel Test for Three-Variable Interactions
14 0.075090095 75 nips-2013-Convex Two-Layer Modeling
15 0.068249658 224 nips-2013-On the Sample Complexity of Subspace Learning
16 0.067979693 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
17 0.065952666 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
18 0.059117708 251 nips-2013-Predicting Parameters in Deep Learning
19 0.059106871 137 nips-2013-High-Dimensional Gaussian Process Bandits
20 0.058861457 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
topicId topicWeight
[(0, 0.163), (1, 0.063), (2, 0.047), (3, 0.017), (4, 0.045), (5, 0.022), (6, -0.02), (7, 0.004), (8, -0.081), (9, 0.049), (10, -0.041), (11, -0.054), (12, -0.065), (13, -0.112), (14, 0.223), (15, 0.071), (16, -0.006), (17, 0.098), (18, -0.022), (19, -0.1), (20, -0.002), (21, -0.015), (22, 0.08), (23, 0.042), (24, -0.087), (25, -0.003), (26, 0.028), (27, -0.042), (28, 0.086), (29, 0.001), (30, 0.012), (31, 0.029), (32, -0.018), (33, -0.022), (34, -0.006), (35, -0.062), (36, 0.009), (37, -0.016), (38, 0.029), (39, -0.047), (40, 0.02), (41, 0.001), (42, 0.07), (43, 0.02), (44, -0.029), (45, 0.067), (46, 0.03), (47, -0.005), (48, 0.066), (49, -0.117)]
simIndex simValue paperId paperTitle
same-paper 1 0.95497406 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
Author: Corinna Cortes, Marius Kloft, Mehryar Mohri
Abstract: We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks. 1
2 0.79119116 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
Author: Qichao Que, Mikhail Belkin
Abstract: q We address the problem of estimating the ratio p where p is a density function and q is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1
3 0.74015564 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
Author: Xinhua Zhang, Wee Sun Lee, Yee Whye Teh
Abstract: Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are compactly encoded as linear functionals whose value are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. 1
4 0.73832506 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
Author: Michel Besserve, Nikos K. Logothetis, Bernhard Schölkopf
Abstract: Many applications require the analysis of complex interactions between time series. These interactions can be non-linear and involve vector valued as well as complex data structures such as graphs or strings. Here we provide a general framework for the statistical analysis of these dependencies when random variables are sampled from stationary time-series of arbitrary objects. To achieve this goal, we study the properties of the Kernel Cross-Spectral Density (KCSD) operator induced by positive definite kernels on arbitrary input domains. This framework enables us to develop an independence test between time series, as well as a similarity measure to compare different types of coupling. The performance of our test is compared to the HSIC test using i.i.d. assumptions, showing improvements in terms of detection errors, as well as the suitability of this approach for testing dependency in complex dynamical systems. This similarity measure enables us to identify different types of interactions in electrophysiological neural time series. 1
5 0.73811859 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice.
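The block-averaging idea in this abstract can be sketched numerically: split the two samples into blocks, compute a quadratic-time unbiased MMD estimate on each block, and average the per-block statistics. The Gaussian kernel, bandwidth, and block size below are illustrative choices for a sketch, not the paper's experimental settings.

```python
import numpy as np

def gaussian_kernel(a, b, h=1.0):
    """Gaussian (RBF) kernel matrix between two sample arrays."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * h * h))

def mmd2_u(x, y, h=1.0):
    """Unbiased squared-MMD estimate on one block (quadratic in the block size)."""
    m = len(x)
    kxx, kyy = gaussian_kernel(x, x, h), gaussian_kernel(y, y, h)
    kxy = gaussian_kernel(x, y, h)
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2.0 * kxy.mean())

def b_test_statistic(x, y, block_size):
    """Average the per-block MMD estimates; block_size trades power for compute."""
    n = (len(x) // block_size) * block_size
    blocks = [(x[i:i + block_size], y[i:i + block_size])
              for i in range(0, n, block_size)]
    return float(np.mean([mmd2_u(bx, by, h=1.0) for bx, by in blocks]))
```

Small blocks give the linear-time pairwise regime, one big block recovers the quadratic-time U-statistic; the B-test interpolates between them.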
6 0.7115773 76 nips-2013-Correlated random features for fast semi-supervised learning
7 0.68145841 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
8 0.65218043 289 nips-2013-Scalable kernels for graphs with continuous attributes
9 0.65122539 9 nips-2013-A Kernel Test for Three-Variable Interactions
10 0.61450148 173 nips-2013-Least Informative Dimensions
11 0.55424225 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel
12 0.54269022 327 nips-2013-The Randomized Dependence Coefficient
13 0.515939 171 nips-2013-Learning with Noisy Labels
14 0.51593411 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
15 0.51282555 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
16 0.50282609 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
17 0.4824141 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
18 0.4629561 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
19 0.45681676 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting
20 0.44761163 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
topicId topicWeight
[(2, 0.017), (6, 0.301), (16, 0.031), (33, 0.144), (34, 0.068), (41, 0.025), (49, 0.038), (56, 0.131), (70, 0.023), (85, 0.033), (89, 0.03), (93, 0.035), (95, 0.024), (99, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.73495054 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
Author: Corinna Cortes, Marius Kloft, Mehryar Mohri
Abstract: We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks.
2 0.68565553 354 nips-2013-When in Doubt, SWAP: High-Dimensional Sparse Recovery from Correlated Measurements
Author: Divyanshu Vats, Richard Baraniuk
Abstract: We consider the problem of accurately estimating a high-dimensional sparse vector using a small number of linear measurements that are contaminated by noise. It is well known that standard computationally tractable sparse recovery algorithms, such as the Lasso, OMP, and their various extensions, perform poorly when the measurement matrix contains highly correlated columns. We develop a simple greedy algorithm, called SWAP, that iteratively swaps variables until a desired loss function cannot be decreased any further. SWAP is surprisingly effective in handling measurement matrices with high correlations. We prove that SWAP can easily be used as a wrapper around standard sparse recovery algorithms for improved performance. We theoretically quantify the statistical guarantees of SWAP and complement our analysis with numerical results on synthetic and real data.
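A minimal sketch of the swapping idea the abstract describes, applied to a least-squares loss: iteratively exchange one in-support variable for one out-of-support variable whenever that strictly decreases the loss, and stop when no swap helps. The initialization and single-best-swap policy here are illustrative simplifications, not the paper's exact SWAP procedure or its wrapper usage around Lasso/OMP.

```python
import numpy as np
from itertools import product

def ls_loss(X, y, support):
    """Least-squares loss after refitting on the given support columns."""
    Xs = X[:, sorted(support)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(np.sum((y - Xs @ beta) ** 2))

def swap(X, y, support):
    """Greedy SWAP-style refinement: take the best single in/out exchange
    while it strictly decreases the loss."""
    support = set(support)
    improved = True
    while improved:
        improved = False
        best_loss, best_support = ls_loss(X, y, support), None
        outside = set(range(X.shape[1])) - support
        for i, j in product(list(support), outside):
            cand = (support - {i}) | {j}
            loss = ls_loss(X, y, cand)
            if loss < best_loss - 1e-12:
                best_loss, best_support, improved = loss, cand, True
        if improved:
            support = best_support
    return sorted(support)
```

Each pass costs one refit per (in, out) pair, which is why swapping is practical as a post-processing wrapper around a cheaper initial estimate.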
3 0.68341571 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
Author: Michiel Hermans, Benjamin Schrauwen
Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1
Author: Yuchen Zhang, John Duchi, Michael Jordan, Martin J. Wainwright
Abstract: We establish lower bounds on minimax risks for distributed statistical estimation under a communication budget. Such lower bounds reveal the minimum amount of communication required by any procedure to achieve the centralized minimax-optimal rates for statistical estimation. We study two classes of protocols: one in which machines send messages independently, and a second allowing for interactive communication. We establish lower bounds for several problems, including various types of location models, as well as for parameter estimation in regression models.
5 0.58314472 355 nips-2013-Which Space Partitioning Tree to Use for Search?
Author: Parikshit Ram, Alexander Gray
Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. 
We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors influencing BSP-tree search performance; knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-first branch-and-bound algorithm. Algorithm 1 presents a simplified version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-first tree traversal to a specified depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2).

Algorithm 1: BSP-tree search
  Input: BSP-tree T on set S, query q, desired depth l
  Output: candidate neighbor p
  current tree depth lc ← 0; current tree node Tc ← T
  while lc < l do
    if ⟨Tc.w, q⟩ + Tc.b ≤ 0 then Tc ← Tc.left child
    else Tc ← Tc.right child
    end if
    increment depth lc ← lc + 1
  end while
  p ← argmin_{r ∈ Tc ∩ S} ||q − r||

[Figure 1: Binary space-partitioning trees; panels (a) kd-tree, (b) RP-tree, (c) MM-tree.]

We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization.
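Algorithm 1 is short enough to state as executable code. The sketch below is a minimal rendering of the defeatist traversal; the `Node` field layout is our own illustrative choice, not the paper's implementation.

```python
import numpy as np

class Node:
    """A BSP-tree node. Internal nodes carry a split (w, b) and two children;
    every node carries the subset of S falling in its region (`points`)."""
    def __init__(self, points=None, w=None, b=0.0, left=None, right=None):
        self.points, self.w, self.b = points, w, b
        self.left, self.right = left, right

def defeatist_search(root, q, depth):
    """Algorithm 1: greedily follow the side of each hyperplane containing q
    down to `depth` (no backtracking), then scan the node reached for the
    closest candidate."""
    node, lc = root, 0
    while lc < depth and node.left is not None:
        node = node.left if float(np.dot(node.w, q)) + node.b <= 0 else node.right
        lc += 1
    cand = np.asarray(node.points, dtype=float)
    return cand[np.argmin(np.linalg.norm(cand - q, axis=1))]
```

Because the traversal never backtracks, the runtime is the depth plus the size of the candidate set scanned at the stopping node.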
Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance¹. These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearest-neighbor search and vector quantization in Section 2, we discuss our following contributions:
• We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor

  2^{l/2} exp(−l/(2β)) / ((n/2^l)^{1/O(d)} − 2),

where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worst-case performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d: increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well.
• In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T.
• We present the quantization performance guarantee of a large margin BSP tree in Section 4.
These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5.

2 Search and vector quantization

Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSP-tree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). The unsupervised large margin splits are usually performed using max-margin clustering techniques [9].

Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable, leading to trivial worst-case runtime guarantees. (Footnote 1: This intuitive connection is widely believed but never rigorously established, to the best of our knowledge.)
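The splitting rules above are simple to state in code. Below is a hedged sketch of a single random-projection split with a median threshold; the published RP-tree rule also perturbs the threshold and falls back to a distance-based split in some cases, which this sketch omits.

```python
import numpy as np

def rp_split(points, rng):
    """One simplified RP-tree split: project the points onto a random unit
    direction and split the region at the median projection value."""
    points = np.asarray(points, dtype=float)
    w = rng.standard_normal(points.shape[1])
    w /= np.linalg.norm(w)           # random unit direction
    proj = points @ w
    b = -np.median(proj)             # hyperplane: <w, x> + b = 0
    left = points[proj + b <= 0]
    right = points[proj + b > 0]
    return (w, b), left, right
```

A kd-tree split is the same computation with `w` fixed to a coordinate axis, and a PA-tree split uses the top eigenvector of the node's covariance matrix instead of a random direction.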
On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSP-tree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2^l and the search runtime is O(l + n/2^l), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder).

Vector Quantization. Recent work by Verma et al. (2009) [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k ≪ n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region.
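The depth-versus-candidate-set tradeoff above (candidate set size n/2^l, cost O(l + n/2^l)) is plain arithmetic; a small sketch, with function names of our own choosing:

```python
import math

def depth_for_candidates(n, m):
    """For a balanced BSP-tree on n points, pick the traversal depth l so that
    the node reached holds roughly m candidates (candidate set size n / 2^l)."""
    return max(0, math.ceil(math.log2(n) - math.log2(m)))

def search_cost(n, l):
    """Defeatist search cost: l hyperplane tests plus a scan of n / 2^l candidates."""
    return l + n / 2 ** l
```

Deeper traversal means fewer candidates to scan but, as the later error bounds show, weaker worst-case guarantees.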
The quantization error of the region A is then given by

  VS(A) = (1/|A ∩ S|) Σ_{x ∈ A∩S} ||x − µ(A)||²,   (1)

and the average quantization error of a disjoint partition of region A into Al and Ar is given by:

  VS({Al, Ar}) = (|Al ∩ S| VS(Al) + |Ar ∩ S| VS(Ar)) / |A ∩ S|.   (2)

Tree-based structured vector quantization is used for efficient vector quantization: a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoint regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]):

Definition 2.1. For a set S ⊂ RD, a region A partitioned into two disjoint regions {Al, Ar}, and a data-dependent quantity β > 1, the quantization error improvement is characterized by:

  VS({Al, Ar}) < (1 − 1/β) VS(A).   (3)

The quantization performance depends inversely on the data-dependent quantity β: lower β implies better quantization. We present the definition of β for different BSP-trees in Table 1. For the PA-tree, β depends on the ratio of the sum of the eigenvalues of the covariance matrix of data (A ∩ S) to the principal eigenvalue. The improvement rate β for the RP-tree depends on the covariance dimension of the data in the node A (β = O(dc)) [7], which roughly corresponds to the lowest dimensionality of an affine plane that captures most of the data covariance. The 2M-tree does not have an explicit β but it has the optimal theoretical improvement rate for a single partition because the 2-means clustering objective is equal to |Al|V(Al) + |Ar|V(Ar) and minimizing this objective maximizes β. The 2-means problem is NP-hard and an approximate solution is used in practice.

Table 1: β for various trees. λ1, ..., λD are the sorted eigenvalues of the covariance matrix of A ∩ S in descending order, and dc < D is the covariance dimension of A ∩ S. The results for the PA-tree and the 2M-tree are due to Verma et al. (2009) [2]. The PA-tree result can be improved to O(ϱ) from O(ϱ²) with an additional assumption [2]. The RP-tree result is in Freund et al. (2007) [14], which also has the precise definition of dc. We establish the result for the MM-tree in Section 4; γ is the margin size of the large margin partition. No such guarantee for kd-trees is known to us.

  Tree      | Definition of β
  PA-tree   | O(ϱ²), with ϱ = Σ_{i=1}^D λi / λ1
  RP-tree   | O(dc)
  kd-tree   | × (no guarantee known)
  2M-tree   | optimal (smallest possible)
  MM-tree*  | O(ρ), with ρ = Σ_{i=1}^D λi / γ²

These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as max_{x,y ∈ A∩S} ||x − y||² ≤ η VS(A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as "points that are much farther away from the mean than the typical distance-from-mean". In this situation, an alternate type of partition is used to remove these outliers that are farther away from the mean than expected.
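Equations (1)-(3) are easy to check numerically. The sketch below computes V_S for a region and for a split, and backs out the empirical improvement rate β of Definition 2.1; the helper names are ours, not the paper's.

```python
import numpy as np

def quantization_error(points):
    """V_S(A): mean squared distance of a region's points to their centroid (Eq. 1)."""
    pts = np.asarray(points, dtype=float)
    mu = pts.mean(axis=0)
    return float(np.mean(np.sum((pts - mu) ** 2, axis=1)))

def split_error(left, right):
    """V_S({Al, Ar}): size-weighted average of the two child errors (Eq. 2)."""
    nl, nr = len(left), len(right)
    return (nl * quantization_error(left) + nr * quantization_error(right)) / (nl + nr)

def improvement_rate(points, left, right):
    """Empirical beta from Definition 2.1: V({Al, Ar}) = (1 - 1/beta) V(A)."""
    v, vs = quantization_error(points), split_error(left, right)
    return 1.0 / (1.0 - vs / v)
```

A split that separates the data well drives the weighted child error far below the parent error, which shows up as a small β.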
3 Approximation guarantees for BSP-tree search

In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantified in two ways: (i) distance error and (ii) rank of the candidate neighbor. We present guarantees for both notions of search error². For a query q, a set of points S, and a neighbor candidate p ∈ S, the distance error is ε(q) = ||q − p|| / min_{r∈S} ||q − r|| − 1, and the rank is τ(q) = |{r ∈ S : ||q − r|| < ||q − p||}| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2^l)). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth is l = ⌈log2 n − log2 m⌉. We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results:

Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1, A2} implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|.

For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω ≪ 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O(l + ((1 + ω)/2)^l n). For the 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree.

Definition 3.2.
Let B_{ℓ2}(p, ∆) = {r ∈ S : ||p − r|| < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the ℓ2 metric. The expansion constant of (S, ℓ2) is defined as the smallest c ≥ 2 such that |B_{ℓ2}(p, 2∆)| ≤ c |B_{ℓ2}(p, ∆)| for all p ∈ S and all ∆ > 0.

Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2^{O(d)} where d is the doubling dimension of the set S with respect to the ℓ2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2^D)). Equipped with these definitions, we have the following guarantee for Algorithm 1:

Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = (1/(2n²)) Σ_{x,y∈S} ||x − y||², the BSP-tree T built on S, and a query q ∈ RD, with the following conditions:
(C1) Let (A ∩ (S ∪ {q}), ℓ2) have an expansion constant at most c̃ for any convex set A ⊂ RD.
(C2) Let T be complete till a depth L < log2(n) / (1 − log2(1 − ω)) with ω-balanced splits.
(C3) Let β* correspond to the worst quantization error improvement rate over all splits in T.
(C4) For any node A in the tree T, let max_{x,y∈A∩S} ||x − y||² ≤ η VS(A) for a fixed η ≥ 8.
For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by

  ||q − p|| ≤ du = √(2ηψ) · (2α)^{l/2} · exp(−l/(2β*)) / ((n/(2α)^l)^{1/log2 c̃} − 2).   (4)

(Footnote 2: The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application.)

Now η is fixed, and ψ is fixed for a dataset S.
Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β*). Moreover, for a particular tree with β* ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate neighbor distance is inversely proportional to (n/(2α)^l)^{1/log2 c̃}, implying deteriorating bounds du with increasing c̃. Since log2 c̃ ∼ O(d), larger intrinsic dimensionality implies worse guarantees as expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result:

Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T on S, VS(A) ≤ ψ (2/(1 − ω))^l exp(−l/β*).

This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A).

Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = max_{x,y∈A∩S} ||x − y||, let d = min_{x∈A∩S} ||q − x||, and let B_{ℓ2}(q, ∆) = {x ∈ A ∩ (S ∪ {q}) : ||q − x|| < ∆}. Then, by Definition 3.2 and condition C1,

  |B_{ℓ2}(q, D + d)| ≤ c̃^{log2((D+d)/d)} |B_{ℓ2}(q, d)| = c̃^{log2((D+d)/d)} ≤ c̃^{log2((D+2d)/d)},

where the equality follows from the fact that B_{ℓ2}(q, d) = {q}. Now |B_{ℓ2}(q, D + d)| ≥ m. Using this above gives us m^{1/log2 c̃} ≤ (D/d) + 2. By condition C2, m^{1/log2 c̃} > 2. Hence we have d ≤ D/(m^{1/log2 c̃} − 2). By construction and condition C4, D ≤ √(η VS(A)). Now m ≥ n/(2α)^l.
Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1.

Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows:

Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error ε(q) is bounded as ε(q) ≤ (du/d*_q) − 1, and the rank τ(q) is bounded as τ(q) ≤ c̃^{log2(du/d*_q)}, where d*_q = min_{r∈S} ||q − r||.

Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : ||q − r|| < du}. By definition, τ(q) ≤ |R| + 1. Let B_{ℓ2}(q, ∆) = {x ∈ (S ∪ {q}) : ||q − x|| < ∆}. Since B_{ℓ2}(q, du) contains q and R, and q ∉ S, |B_{ℓ2}(q, du)| = |R| + 1 ≥ τ(q). From Definition 3.2 and Condition C1, |B_{ℓ2}(q, du)| ≤ c̃^{log2(du/d*_q)} |B_{ℓ2}(q, d*_q)|. Using the fact that |B_{ℓ2}(q, d*_q)| = |{q}| = 1 gives us the upper bound on τ(q).

The upper bounds on both forms of search error are directly proportional to du. Hence, the BSP-tree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β* and c̃. While our result bounds the worst-case performance of Algorithm 1, an average case performance guarantee on the distance error is given by E_q[ε(q)] ≤ du E_q[1/d*_q] − 1, and on the rank by E_q[τ(q)] ≤ c̃^{log2 du} E_q[c̃^{−log2 d*_q}], since the expectation is over the queries q and du does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent.

Dependence of the nearest-neighbor search error on the partition margins.
The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d*_q of any query q, of which we have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d*_q is greater than the size of the margin. We characterize the influence of the margin size with the following result:

Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T. Then the distance error is bounded as ε(q) ≤ du/γ − 1, and the rank is bounded as τ(q) ≤ c̃^{log2(du/γ)}.

This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees for Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree.

Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD, ((A ∩ (S ∪ {q})), ℓ2) has an expansion constant at most c̃. A bounded c̃ implies that no subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), ℓ2) also has an expansion constant at most c̃ (since (S ∪ {q}) is contained in its convex hull). However, if (S ∪ {q}, ℓ2) has an expansion constant c̃, this does not imply that the data lying within any convex set has an expansion constant at most c̃. Hence a bounded expansion constant assumption for (A ∩ (S ∪ {q}), ℓ2) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, ℓ2)³. Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q.
Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. Condition C4 implies that the squared data diameter of any node A (max_{x,y∈A∩S} ||x − y||²) is within a constant factor of its quantization error VS(A). This refers to the assumption that the node A contains no outliers as described in Section 2, so that only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-trees. For two different kinds of hyperplane splits, if the alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worst-case quantization performance (β*). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this⁴.

Empirical validation. We examine our theoretical results with 4 datasets: OPTDIGITS (D = 64, n = 3823, 1797 queries), TINY IMAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), IMAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (Δ²_S(A) > η VS(A)) can be computationally expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees.
The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees is presented in Figure 2. Trees with missing data points for higher depth levels (for example, the kd-tree in Figure 2(a) and the 2M-tree in Figures 2(b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree is significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms the PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to the 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)^l/2^l) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete 2M-trees beyond depth 4 and 6 in Figures 2(b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming it in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average).

(Footnote 3: A subset of a growth-restricted metric space (S, ℓ2) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, ℓ2) are growth-restricted.)
Footnote 4: We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that, for any fixed value of $\eta$, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. Moreover, with $\eta \ge 8$, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition.

Figure 2: Performance of BSP-trees with increasing traversal depth on (a) OPTDIGITS, (b) TINYIMAGES, (c) MNIST and (d) IMAGES. The top row corresponds to the quantization performance of existing trees, and the bottom row presents the nearest-neighbor error (in terms of the mean rank $\tau$ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color).

4 Large margin BSP-tree

We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition, and empirical results indicate that its quantization performance is comparable to that of the 2M-tree and PA-tree. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane $(w^*, b^*)$ from the following optimization problem (see footnote 5), given a set of points $S = \{x_1, x_2, \dots, x_m\} \subset \mathbb{R}^D$:

$$\min_{w, b, \xi_i} \quad \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^m \xi_i \qquad (5)$$

$$\text{s.t.} \quad |\langle w, x_i \rangle + b| \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i = 1, \dots, m, \qquad (6)$$

$$-\omega m \le \sum_{i=1}^m \operatorname{sgn}(\langle w, x_i \rangle + b) \le \omega m. \qquad (7)$$

MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an $\omega$-balanced split.
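The paper's experiments rely on an efficient cutting-plane MMC solver [16]. Purely as an illustrative sketch (not the authors' implementation, and all function names are ours), the following toy alternating heuristic approximates problem (5)-(7): it fits a max-margin hyperplane to current labels with a crude subgradient SVM, relabels points by the sign of their score, and shifts the threshold to a quantile whenever a plain sign split would violate the balance constraint (7).

```python
import numpy as np

def fit_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Crude subgradient descent on the primal hinge loss; a stand-in for a
    proper SVM solver, adequate only for this illustration."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1.0          # margin violators
        w -= lr * (w - C * (y[viol, None] * X[viol]).sum(axis=0))
        b -= lr * (-C * y[viol].sum())
    return w, b

def max_margin_split(X, omega=0.2, iters=10):
    """Alternating heuristic for 2-cluster MMC: initialize labels from the
    principal direction, then alternate SVM fitting and sign relabeling,
    clipping the threshold into the omega-balanced quantile range."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    y = np.where(Xc @ Vt[0] > 0, 1, -1)       # principal-direction init
    w = b = None
    for _ in range(iters):
        w, b = fit_linear_svm(X, y)
        scores = X @ w + b
        # Constraint (7) allows the positive count in [m(1-w)/2, m(1+w)/2];
        # clip the threshold 0 into the corresponding quantile interval.
        lo, hi = np.quantile(scores, [(1 - omega) / 2, (1 + omega) / 2])
        y_new = np.where(scores > np.clip(0.0, lo, hi), 1, -1)
        if np.array_equal(y_new, y):
            break
        y = y_new
    return y, w, b
```

On two well-separated Gaussian blobs this recovers the large-margin split between them; on hard instances such alternating schemes can get stuck in local optima, which is why [16] solves the problem with cutting planes instead.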
The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree:

Theorem 4.1. Given a set of points $S \subset \mathbb{R}^D$ and a region $A$ containing $m$ points, consider an $\omega$-balanced max-margin split $(w, b)$ of the region $A$ into $\{A_l, A_r\}$ with at most $\alpha m$ support vectors and a split margin of size $\gamma = 1/\|w\|$. Then the quantization error improvement is given by

$$V_S(\{A_l, A_r\}) \le \left(1 - \frac{\gamma^2 (1-\alpha)^2}{\sum_{i=1}^D \lambda_i} \cdot \frac{1-\omega}{1+\omega}\right) V_S(A), \qquad (8)$$

where $\lambda_1, \dots, \lambda_D$ are the eigenvalues of the covariance matrix of $A \cap S$.

The result indicates that larger margin sizes (large $\gamma$ values) and a smaller number of support vectors (small $\alpha$) imply better quantization performance. A larger $\omega$ implies a smaller improvement, but $\omega$ is generally restricted algorithmically in MMC. If $\gamma = O(\sqrt{\lambda_1})$, then this rate matches the best possible quantization performance of the PA-tree (Table 1). To prove this result, we assume that we have a feasible solution to the MMC problem, and we use the following result:

Proposition 4.1. [7, Lemma 15] Given a set $S$, for any partition $\{A_1, A_2\}$ of a set $A$,

$$V_S(A) - V_S(\{A_1, A_2\}) = \frac{|A_1 \cap S|\,|A_2 \cap S|}{|A \cap S|^2}\,\|\mu(A_1) - \mu(A_2)\|^2, \qquad (9)$$

where $\mu(A)$ is the centroid of the points in the region $A$. This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition.

Footnote 5: This is an equivalent formulation [16] of the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels $y_i$ and optimizes over them. We consider this form of the problem since it makes our analysis easier to follow.

Proof of Theorem 4.1. For a feasible solution $(w, b, \xi_i|_{i=1,\dots,m})$ to the MMC problem,

$$\sum_{i=1}^m |\langle w, x_i \rangle + b| \ge m - \sum_{i=1}^m \xi_i.$$
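Two identities used in the proof can be sanity-checked numerically with a small numpy sketch of our own (taking $V_S$ as the mean squared distance to the centroid): Proposition 4.1 holds exactly for an arbitrary partition, and $V_S(A) = \operatorname{tr}(\operatorname{Cov}(A \cap S)) = \sum_i \lambda_i$ when the covariance is normalized by the point count.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 3))  # correlated point set
A1, A2 = A[:15], A[15:]                                 # an arbitrary partition
m, m1, m2 = len(A), len(A1), len(A2)

def vs(points):
    """V_S: mean squared distance of a point set to its own centroid."""
    return np.mean(np.sum((points - points.mean(axis=0)) ** 2, axis=1))

# Proposition 4.1, with V_S({A1, A2}) pooling the children's errors
# as (m1 * V_S(A1) + m2 * V_S(A2)) / m:
lhs = vs(A) - (m1 * vs(A1) + m2 * vs(A2)) / m
rhs = (m1 * m2 / m**2) * np.sum((A1.mean(axis=0) - A2.mean(axis=0)) ** 2)
print(np.isclose(lhs, rhs))   # True, up to float rounding

# Trace identity from the proof's last step: V_S(A) = tr(Cov) = sum of eigenvalues,
# with the covariance normalized by |A ∩ S| (bias=True), not by m - 1.
cov = np.cov(A, rowvar=False, bias=True)
print(np.isclose(vs(A), np.trace(cov)),
      np.isclose(vs(A), np.linalg.eigvalsh(cov).sum()))   # True True
```

Since Proposition 4.1 is an exact identity, the check holds for any partition, not just max-margin ones; the bound (8) comes from lower-bounding the right-hand side of (9) for an $\omega$-balanced large-margin split.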
Let $\tilde{x}_i = \langle w, x_i \rangle + b$, let $m_p = |\{i : \tilde{x}_i > 0\}|$ and $m_n = |\{i : \tilde{x}_i \le 0\}|$, and let $\tilde{\mu}_p = (\sum_{i:\tilde{x}_i > 0} \tilde{x}_i)/m_p$ and $\tilde{\mu}_n = (\sum_{i:\tilde{x}_i \le 0} \tilde{x}_i)/m_n$. Then $m_p \tilde{\mu}_p - m_n \tilde{\mu}_n \ge m - \sum_i \xi_i$.

Without loss of generality, we assume that $m_p \ge m_n$. Then the balance constraint (Equation (7)) tells us that $m_p \le m(1+\omega)/2$ and $m_n \ge m(1-\omega)/2$. Then $\tilde{\mu}_p - \tilde{\mu}_n + \omega(\tilde{\mu}_p + \tilde{\mu}_n) \ge 2 - \frac{2}{m}\sum_i \xi_i$. Since $\tilde{\mu}_p > 0$ and $\tilde{\mu}_n \le 0$, $|\tilde{\mu}_p + \tilde{\mu}_n| \le (\tilde{\mu}_p - \tilde{\mu}_n)$. Hence $(1+\omega)(\tilde{\mu}_p - \tilde{\mu}_n) \ge 2 - \frac{2}{m}\sum_i \xi_i$. For an unsupervised split, the data is always separable since there is no misclassification, which implies that $\xi_i \le 1\ \forall i$. Hence,

$$\tilde{\mu}_p - \tilde{\mu}_n \ge \left(2 - \frac{2}{m}\,|\{i : \xi_i > 0\}|\right) \Big/ (1+\omega) \ge 2\,\frac{1-\alpha}{1+\omega}, \qquad (10)$$

since the term $|\{i : \xi_i > 0\}|$ corresponds to the number of support vectors in the solution. The Cauchy-Schwarz inequality implies that $\|\mu(A_l) - \mu(A_r)\| \ge |\langle w, \mu(A_l) - \mu(A_r)\rangle|/\|w\| = (\tilde{\mu}_p - \tilde{\mu}_n)\gamma$, since $\tilde{\mu}_n = \langle w, \mu(A_l)\rangle + b$ and $\tilde{\mu}_p = \langle w, \mu(A_r)\rangle + b$. From Equation (10), we can say that $\|\mu(A_l) - \mu(A_r)\|^2 \ge 4\gamma^2(1-\alpha)^2/(1+\omega)^2$. Also, for $\omega$-balanced splits, $|A_l|\,|A_r| \ge (1-\omega^2)m^2/4$. Combining these into Equation (9) from Proposition 4.1, we have

$$V_S(A) - V_S(\{A_l, A_r\}) \ge (1-\omega^2)\,\gamma^2 \left(\frac{1-\alpha}{1+\omega}\right)^2 = \gamma^2 (1-\alpha)^2\,\frac{1-\omega}{1+\omega}. \qquad (11)$$

Let $\operatorname{Cov}(A \cap S)$ be the covariance matrix of the data contained in region $A$ and $\lambda_1, \dots, \lambda_D$ be the eigenvalues of $\operatorname{Cov}(A \cap S)$. Then we have

$$V_S(A) = \frac{1}{|A \cap S|} \sum_{x \in A \cap S} \|x - \mu(A)\|^2 = \operatorname{tr}(\operatorname{Cov}(A \cap S)) = \sum_{i=1}^D \lambda_i.$$

Dividing Equation (11) by $V_S(A)$ gives the statement of the theorem.

5 Conclusions and future directions

Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees, as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low $\beta^*$ in Corollary 3.1) and large partition margins (large $\gamma$ in Corollary 3.2).
The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes $\beta^*$, while the MM-tree explicitly maximizes $\gamma$ (which also implies a smaller $\beta^*$ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general-dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic: for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one-dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently, for improved search performance.

This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. Plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21].
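Returning to the one-dimensional large-margin refinement suggested above, a sketch (function names are ours): project the points onto a chosen direction, and in $O(n \log n)$ place the threshold in the largest gap among the split positions that satisfy the $\omega$-balance constraint.

```python
import numpy as np

def large_margin_1d_split(X, direction, omega=0.3):
    """Project points on a chosen direction and place the split threshold in
    the largest gap among the omega-balanced candidate positions.
    Returns (threshold, margin)."""
    proj = np.sort(X @ direction)
    n = len(proj)
    # Only split positions leaving between (1 - omega)/2 and (1 + omega)/2
    # of the points on the left satisfy the balance constraint.
    lo = int(np.ceil(n * (1 - omega) / 2))
    hi = int(np.floor(n * (1 + omega) / 2))
    gaps = proj[lo:hi] - proj[lo - 1:hi - 1]
    k = int(np.argmax(gaps))
    threshold = 0.5 * (proj[lo - 1 + k] + proj[lo + k])
    return threshold, gaps[k] / 2.0

rng = np.random.default_rng(0)
# Two clusters along the x-axis with a clear gap between 1 and 3.
X = np.vstack([rng.uniform(0, 1, size=(60, 2)),
               rng.uniform(3, 4, size=(60, 2))])
t, margin = large_margin_1d_split(X, np.array([1.0, 0.0]))
print(1.0 < t < 3.0)   # True: the threshold lands inside the gap
```

The chosen direction (coordinate axis, principal eigenvector, or anything else) is an input here; the point is that the margin maximization itself reduces to a sort and a scan once the data is one-dimensional.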
Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH-based search schemes.

References

[1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 1977.
[2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009.
[3] R. F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991.
[4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[5] K. Fukunaga and P. M. Narendra. A Branch-and-Bound Algorithm for Computing k-Nearest Neighbors. IEEE Transactions on Computers, 1975.
[6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[7] S. Dasgupta and Y. Freund. Random Projection Trees and Low Dimensional Manifolds. In Proceedings of the ACM Symposium on Theory of Computing, 2008.
[8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012.
[9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. In Advances in Neural Information Processing Systems, 2005.
[10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the ACM Symposium on Theory of Computing, 1998.
[11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms.
In Advances in Neural Information Processing Systems, 2005.
[12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013.
[13] J. He, S. Kumar, and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012.
[14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. In Advances in Neural Information Processing Systems, 2007.
[15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of the ACM Symposium on Theory of Computing, 2002.
[16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008.
[17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007.
[19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. In Advances in Neural Information Processing Systems, 2008.
[20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011.
[22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.