nips nips2007 nips2007-118 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper considers kernels invariant to translation, rotation and dilation. [sent-6, score-0.226]
2 We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. [sent-9, score-0.528]
3 Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. [sent-16, score-0.081]
4 For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). [sent-27, score-0.228]
5 1 Introduction Recent years have seen widespread application of reproducing kernel Hilbert space (r.k.h.s. [sent-31, score-0.211]
6 ) based methods to machine learning problems (Schölkopf & Smola, 2002). [sent-35, score-0.064]
7 As a result, kernel methods have been analysed to considerable depth. [sent-36, score-0.175]
8 In spite of this, the aspects which we presently investigate seem to have received insufficient attention, at least within the machine learning community. [sent-37, score-0.122]
9 The first is transformation invariance of the kernel, a topic touched on in (Fleuret & Sahbi, 2003). [sent-38, score-0.124]
10 Note we do not mean by this the local invariance (or insensitivity) of an algorithm to application-specific transformations which should not affect the class label, such as one-pixel image translations (see e.g. [sent-39, score-0.094]
11 Rather we are referring to global invariance to transformations, in the way that radial kernels (i.e. [sent-42, score-0.301]
12 those of the form k(x, y) = φ(‖x − y‖)) are invariant to translations. [sent-44, score-0.074]
13 In Sections 2 and 3 we introduce the more general concept of transformation scaledness, focusing on translation, dilation and orthonormal transformations. [sent-45, score-0.275]
14 case in Section 4, giving novel elementary derivations of some key results, most notably a c.p.d. representer theorem. [sent-56, score-0.081]
15 thin-plate kernel which we discuss in Section 5, is not only richly non-linear, but enjoys a duality between the length-scale parameter and the regularisation parameter of Tikhonov regularised solutions such as the s.v.m. [sent-72, score-0.448]
16 In Section 7 we compare the resulting classifier (which has only a regularisation parameter) to that of the s.v.m. [sent-75, score-0.231]
17 with Gaussian kernel (which has an additional length scale parameter). [sent-78, score-0.175]
18 Let T be a bijection on X and F a Hilbert space of functions on some non-empty set X such that f → f ◦ T is a bijection on F. [sent-83, score-0.184]
19 For any Θ : F → R and T such that f → f ◦ T is a bijection of F, if the left-hand side is unique then arg min_{f∈F} Θ(f) = (arg min_{f_T∈F} Θ(f_T ◦ T)) ◦ T. Proof. [sent-89, score-0.188]
20 Let f∗ = arg min_{f∈F} Θ(f) and f_T∗ = arg min_{f_T∈F} Θ(f_T ◦ T). [sent-90, score-0.142]
21 But since f → f ◦ T is a bijection on F, we also have ∀g ∈ F, Θ(f_T∗ ◦ T) ≤ Θ(g). [sent-92, score-0.092]
22 If F is T-scaled and the left-hand side is unique then arg min_{f∈F} (‖f‖²_F + Σ_i L_i(f(x_i))) = (arg min_{f∈F} (‖f‖²_F / g_T(F) + Σ_i L_i(f(T x_i)))) ◦ T. [sent-100, score-0.170]
23 with linear hinge loss for L_i(t) = max(0, 1 − y_i t), and kernel ridge regression for L_i(t) = (y_i − t)². [sent-105, score-0.175]
24 Let W_s, T_a and O_A be the dilation, translation and orthonormal transformations R^d → R^d defined for s ∈ R \ {0}, a ∈ R^d and orthonormal A : R^d → R^d by W_s x = sx, T_a x = x + a and O_A x = Ax respectively. [sent-109, score-0.164]
25 on similarly dilated input patterns but with a regularisation parameter adjusted according to Corollary 2. [sent-120, score-0.231]
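To make the dilation/regularisation duality of the corollary concrete, here is a minimal numerical sketch (our illustration, not code from the paper) using plain ridge regression, whose squared weight norm is the norm of the W_s-scaled linear kernel: training on inputs dilated by s with the regularisation parameter rescaled by s² reproduces the original predictor after composing with the dilation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))           # training inputs
y = rng.normal(size=50)                # training targets
s, lam = 2.5, 0.1                      # dilation factor and regularisation weight

def ridge(X, y, lam):
    """Minimise lam * ||w||^2 + sum_i (y_i - w.x_i)^2 in closed form."""
    return np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)

w_orig = ridge(X, y, lam)              # original problem
w_dila = ridge(s * X, y, lam * s**2)   # dilated inputs, rescaled regulariser

X_test = rng.normal(size=(5, 3))
print(np.allclose(X_test @ w_orig, (s * X_test) @ w_dila))  # True: identical predictions
```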
26 with a particular kernel, as we have just seen, it is easy to demonstrate for the more general Tikhonov regularisation setting with any function norm satisfying our definition of transformation scaledness. [sent-125, score-0.390]
27 3 Transformation Scaled Reproducing Kernel Hilbert Spaces We now derive the necessary and sufficient conditions for a reproducing kernel (r. [sent-126, score-0.211]
28 (3), which we prove in the accompanying technical report (Walder & Chapelle, 2007). [sent-157, score-0.088]
29 It is now easy to see that, for example, the homogeneous polynomial kernel k(x, y) = ⟨x, y⟩^p corresponds to a W_s-scaled r.k.h.s. [sent-158, score-0.219]
30 Hence when the homogeneous polynomial kernel is used with the hard-margin s.v.m. [sent-163, score-0.219]
31 algorithm, the result is invariant to multiplicative scaling of the training and test data. [sent-166, score-0.074]
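A quick empirical check of this invariance, as a sketch only (it assumes scikit-learn is available, and a very large C stands in for the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))
y = np.sign(X[:, 0] * X[:, 1] * X[:, 2] + 0.3)   # separable by a cubic polynomial plus bias
X_test = rng.normal(size=(20, 4))
s = 7.0                                           # multiplicative scaling of all the data

def fit_predict(X_tr, y_tr, X_te):
    # Homogeneous polynomial kernel <x, y>^3: gamma=1, coef0=0; huge C approximates the hard margin.
    clf = SVC(kernel="poly", degree=3, gamma=1.0, coef0=0.0, C=1e8)
    return clf.fit(X_tr, y_tr).predict(X_te)

pred = fit_predict(X, y, X_test)
pred_scaled = fit_predict(s * X, y, s * X_test)   # scale training and test data together
print(np.array_equal(pred, pred_scaled))          # expected: True
```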
32 's with radial kernels that are also W_s-scaled for all s ≠ 0. [sent-178, score-0.246]
33 Which we prove in the accompanying technical report (Walder & Chapelle, 2007). [sent-182, score-0.088]
34 Indeed the widely used Gaussian kernel satisfies both of the above invariances. [sent-195, score-0.175]
35 But if we now also assume that H is W_s-scaled for all s ≠ 0 — this time with arbitrary g_{W_s}(H) — then k(x, y) = g_{W_s}(H) k(W_s x, W_s y) = g_{W_{|s|}}(H) φ_OT(|s| ‖x − y‖), so that letting r = ‖x − y‖ we have that φ_OT(r) = g_{W_{|s|}}(H) φ_OT(|s| r) and hence by Lemma 3. [sent-196, score-0.101]
36 For the corresponding Gram matrix K = [a, b d^p; b d^p, a] to be positive semi-definite we require 0 ≤ det(K) = a² − b² d^{2p}, but for arbitrary d > 0 and a < ∞, this implies b = 0. [sent-200, score-0.165]
37 kernel functions with the stated properties, such as the thin-plate kernel. [sent-204, score-0.175]
38 We discuss this case in detail in Section 5, after the following particularly elementary and in part novel introduction to c.p.d. [sent-205, score-0.081]
39 kernel functions – these are given by the following Definition 4. [sent-212, score-0.175]
40 A continuous function φ : X × X → R is conditionally positive definite with respect to (w.r.t. [sent-214, score-0.074]
41 kernels which corresponds to our definition when P = {1} (e.g. [sent-239, score-0.152]
42 (Schölkopf & Smola, 2002)) or when P is taken to be the space of polynomials of some fixed maximum degree (e.g. [sent-241, score-0.120]
43 , x_m) for the set {α ∈ R^m : Σ_{i=1}^m α_i p(x_i) = 0 for all p ∈ P}. [sent-248, score-0.069]
44 We define F_φ(X) to be the Hilbert space of functions which is the completion of the set {Σ_{j=1}^m α_j φ(·, x_j) : m ∈ N, x_1, . [sent-262, score-0.094]
45 , x_m), which due to the definition of φ we may endow with the inner product ⟨Σ_{j=1}^m α_j φ(·, x_j), Σ_{k=1}^n β_k φ(·, y_k)⟩_{F_φ(X)} = Σ_{j=1}^m Σ_{k=1}^n α_j β_k φ(x_j, y_k). [sent-266, score-0.163]
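These definitions are easy to exercise numerically. The sketch below (an illustration under the stated definitions, not code from the paper) uses the classic c.p.d. example φ(x, y) = −‖x − y‖ with P = {1}: the Gram matrix Φ is indefinite, yet the induced squared norm αᵀΦα is non-negative for every coefficient vector α in P⊥(x_1, …, x_m), i.e. with Σ_i α_i = 0.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))                              # centres x_1, ..., x_m
Phi = -np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # phi(x, y) = -||x - y||, c.p.d. w.r.t. P = {1}

# Coefficient vectors in P_perp(x_1, ..., x_m): here simply sum(alpha) = 0.
alpha = rng.normal(size=30); alpha -= alpha.mean()
beta = rng.normal(size=30);  beta -= beta.mean()

inner = alpha @ Phi @ beta        # <sum_j a_j phi(., x_j), sum_k b_k phi(., x_k)>
sq_norm = alpha @ Phi @ alpha     # induced squared norm, non-negative on P_perp
print(sq_norm >= -1e-10)                  # True
print(np.linalg.eigvalsh(Phi).min() < 0)  # True: Phi itself is indefinite
```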
46 (x_m, y_m)} ⊂ X × R, there exists an s = s_{F_φ(X)} + s_P where s_{F_φ(X)} = Σ_{j=1}^m α_j φ(·, x_j) ∈ F_φ(X) and s_P = Σ_{k=1}^r β_k p_k ∈ P, such that s(x_i) = y_i, i = 1 . [sent-289, score-0.159]
47 A simple and elementary proof (which shows (17) is solvable when λ = 0) is given in (Wendland, 2004) and reproduced in the accompanying technical report (Walder & Chapelle, 2007). [sent-293, score-0.169]
48 Hence, returning to the main thread, we have the following lemma — our proof of which seems to be novel and particularly elementary. [sent-307, score-0.08]
49 Consider an arbitrary function s = s_{F_φ(X)} + s_P with s_{F_φ(X)} = Σ_{j=1}^m α_j φ(·, x_j) ∈ F_φ(X) and s_P = Σ_{k=1}^r β_k p_k ∈ P. [sent-320, score-0.218]
50 We can always write f as f = Σ_{j=1}^m (α_j + α̃_j) φ(·, x_j) + Σ_{l=1}^n b_l φ(·, z_l) + Σ_{k=1}^r c_k p_k. [sent-327, score-0.159]
51 If we define [P_x]_{i,j} = p_j(x_i), [P_z]_{i,j} = p_j(z_i), [Φ_xx]_{i,j} = φ(x_i, x_j), [Φ_xz]_{i,j} = φ(x_i, z_j), and [Φ_zx]_{i,j} = φ(z_i, x_j), then the condition (6) can hence be written P_x β = Φ_xx α̃ + Φ_xz b + P_x c, (7) and the definition of F_φ(X) requires that e.g. [sent-328, score-0.416]
52 , x_m), hence implying the constraints P_x^⊤ α = 0 and P_x^⊤(α + α̃) + P_z^⊤ b = 0. [sent-333, score-0.111]
53 Using these results it is now easy to prove an analog of the representer theorem for the p. [sent-344, score-0.148]
54 , f(x_m)) + Ω(‖P_φ(P)f‖²_{F_φ(X)}) (10) admits a minimiser of the form Σ_i α_i φ(·, x_i) + p, where p ∈ P. [sent-362, score-0.074]
55 Let s = Σ_{i=1}^m α_i φ(·, x_i) + p satisfy s(x_i) = f(x_i), i = 1 . [sent-365, score-0.074]
56 The m-th order thin-plate kernel φ_m : R^d × R^d → R is given by φ_m(x, y) = (−1)^{m−(d−2)/2} ‖x − y‖^{2m−d} log(‖x − y‖) if d ∈ 2N, and (−1)^{m−(d−1)/2} ‖x − y‖^{2m−d} if d ∈ (2N − 1), (11) for x ≠ y, and zero otherwise. [sent-376, score-0.175]
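A direct transcription of (11) as reconstructed here, for reference (a sketch, not the authors' code; it assumes 2m > d so that the exponent is positive):

```python
import numpy as np

def thin_plate_gram(X, Y, m=2):
    """Gram matrix of the m-th order thin-plate kernel phi_m of (11); assumes 2m > d."""
    d = X.shape[1]
    r = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    K = np.zeros_like(r)
    nz = r > 0                          # phi_m(x, y) = 0 for x = y
    if d % 2 == 0:                      # even dimension: log factor present
        K[nz] = (-1.0) ** (m - (d - 2) // 2) * r[nz] ** (2 * m - d) * np.log(r[nz])
    else:                               # odd dimension: no log factor
        K[nz] = (-1.0) ** (m - (d - 1) // 2) * r[nz] ** (2 * m - d)
    return K

X = np.random.default_rng(3).normal(size=(5, 2))
print(thin_plate_gram(X, X, m=2))       # second-order thin-plate in d = 2: r^2 log(r)
```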
57 The kernel induces the following norm on the space F_{φ_m}(R^d) of Definition 4. [sent-381, score-0.175]
58 dx_1 ⋯ dx_d, with the m-th order partial derivatives ∂^m f/(∂x_{i_1} ⋯ ∂x_{i_m}) in the integrand, where P : F_{φ_m}(R^d) → L_2(R^d) is a regularisation operator, implicitly defined above. [sent-387, score-0.231]
59 1, the process is actually more involved due to the log factor in the first case of (11), and it is necessary to use the fact that the kernel is c.p.d. [sent-391, score-0.175]
60 In Section 3 we showed that non-trivial kernels which are both radial and dilation scaled cannot be p.d. [sent-402, score-0.451]
61 — one of the most widely used kernel algorithms — has been applied only with p.d. [sent-410, score-0.175]
62 Hence we propose using the thin-plate kernel with the s.v.m. [sent-420, score-0.175]
63 dim/n 18/2086 20/3000∗ 60/2844 5/215 20/3000∗ 21/3000. Table 1: Comparison of Gaussian and thin-plate kernel with the s.v.m. [sent-483, score-0.175]
64 A star in the n column means that more examples were available but we kept only a maximum of 2000 per class in order to reduce the computational burden of the extensive number of cross validation and model selection training runs (see Section 7). [sent-489, score-0.112]
65 None of the data sets were linearly separable so we always used the normal (β unconstrained) version of the optimisation described in Section 6. [sent-490, score-0.148]
66 an arbitrary finite-dimensional space of functions P by extending the primal optimisation approach of (Chapelle, 2007) to the c.p.d. [sent-508, score-0.169]
67 solution can be formulated as arg min_{f ∈ F_φ(X)⊕P} of λ‖P_φ(P)f‖²_{F_φ(X)} + Σ_{i=1}^n max(0, 1 − y_i f(x_i))², (14). Note that for the second order thin-plate case we have X = R^d and P = π_1(R^d) (the space of constant and first order polynomials). [sent-515, score-0.094]
68 Hence dim (P) = d + 1 and we can take the basis to be pj (x) = [x]j for j = 1 . [sent-516, score-0.119]
69 p_{dim(P)} span P, the solution to (14) is given by f_svm(x) = Σ_{i=1}^n α_i φ(x_i, x) + Σ_{j=1}^{dim(P)} β_j p_j(x). [sent-524, score-0.121]
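Evaluating the resulting classifier is then just this finite expansion; a small illustrative helper (the constant function as the last basis element of π_1(R^d) is our assumption, following the text's p_j(x) = [x]_j for the first d elements):

```python
import numpy as np

def f_svm(x_new, X_train, alpha, beta, phi):
    """Evaluate f_svm(x) = sum_i alpha_i phi(x_i, x) + sum_j beta_j p_j(x)."""
    K = phi(X_train, x_new)                                   # (n_train, n_new) kernel values
    P_new = np.hstack([x_new, np.ones((x_new.shape[0], 1))])  # basis [x]_1, ..., [x]_d, 1 (assumed)
    return K.T @ alpha + P_new @ beta

# e.g. f_svm(X_test, X_train, alpha, beta, thin_plate_gram) with the Gram helper sketched after (11)
```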
70 Now, if we consider only the margin violators (those vectors which, at a given step of the optimisation process, satisfy y_i f(x_i) < 1), we can replace the max(0, ·) in (14) with (·). [sent-525, score-0.269]
71 In particular we make the QR factorisation (Golub & Van Loan, 1996) P = QR, where Q^⊤Q = I and R is square. [sent-531, score-0.084]
72 As a final step at the end of the optimisation process, we take the minimum-norm solution of the system β′ = Rβ, namely β = R^# β′, where R^# is the pseudo-inverse of R. [sent-533, score-0.146]
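A sketch of this re-parameterisation (our own illustration of the step, not the authors' code): optimise with the orthonormalised block Q in place of P, then map the recovered coefficients back through the pseudo-inverse of R.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
P = np.hstack([X, np.ones((100, 1))])     # basis of pi_1(R^d) evaluated at the data

Q, R = np.linalg.qr(P)                    # P = QR with Q'Q = I and R square
# ... run the optimisation using Q as the polynomial block, obtaining beta_tilde ...
beta_tilde = rng.normal(size=P.shape[1])  # placeholder for the optimised coefficients
beta = np.linalg.pinv(R) @ beta_tilde     # minimum-norm solution of beta_tilde = R beta
print(np.allclose(R @ beta, beta_tilde))  # consistent whenever R has full rank
```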
73 The precise algorithm is given in (Walder & Chapelle, 2007), where we also detail two efficient factorisation techniques, specific to the new s.v.m. [sent-541, score-0.084]
74 This is correct in that, since P lies in the null space of the regulariser ‖P_φ(P)·‖²_{F_φ(X)}, such solutions minimise (14), but may be undesirable for various reasons. [sent-551, score-0.132]
75 It is unclear how to deal with this problem — after all it implies that the regulariser is simply inappropriate for the problem at hand. [sent-555, score-0.069]
76 Nonetheless we still wish to apply a (non-linear) algorithm with the previously discussed invariances of the thin-plate. [sent-556, score-0.092]
77 The way we deal with this is simple — instead of minimising over F_φ(X) we consider only the finite-dimensional subspace given by {Σ_{j=1}^n α_j φ(·, x_j) : α ∈ P⊥(x_1, . [sent-566, score-0.131]
78 The closed form solution to the constrained quadratic programme is in this case given by (see (Walder & Chapelle, 2007)) α = −P_⊥ (P_⊥^⊤(λΦ + Φ_sx^⊤ Φ_sx) P_⊥)^{−1} P_⊥^⊤ Φ_sx^⊤ y_s (18), where Φ_sx = [Φ]_{s,:}, s is the current set of margin violators and P_⊥ the null space of P satisfying P^⊤ P_⊥ = 0. [sent-574, score-0.213]
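One way to realise this step numerically is sketched below (our reading of (18); the sign convention and transposes follow the reconstruction above and may differ from the authors' code). Writing α = P_⊥ γ turns the constrained quadratic programme, with squared hinge on the current violators, into a single linear solve; the thin_plate_gram helper is the one sketched after (11).

```python
import numpy as np
from scipy.linalg import null_space

def constrained_step(Phi, P, y, s, lam):
    """Minimise lam * a'Phi a + ||y_s - Phi[s, :] a||^2 subject to P'a = 0, via a = P_perp g."""
    P_perp = null_space(P.T)                    # columns span the null space of P'
    Phi_s = Phi[s, :]                           # rows of the current margin violators
    A = P_perp.T @ (lam * Phi + Phi_s.T @ Phi_s) @ P_perp
    b = P_perp.T @ Phi_s.T @ y[s]
    return P_perp @ np.linalg.solve(A, b)

# First step of the optimisation: with alpha = 0 every training point violates the margin.
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + X[:, 1])
Phi = thin_plate_gram(X, X, m=2)                # Gram matrix from the earlier sketch
P = np.hstack([X, np.ones((40, 1))])            # pi_1(R^2) basis evaluated at the data
alpha = constrained_step(Phi, P, y, s=np.arange(40), lam=1.0)
```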
79 The precise algorithm we use to optimise in this manner is given in the accompanying technical report (Walder & Chapelle, 2007), where we also detail efficient factorisation techniques. [sent-575, score-0.172]
80 with 1) the optimisation over Fφ (X ) ⊕ P as per Section 6. [sent-579, score-0.11]
81 1, and 2) the optimisation over a subspace of Fφ (X ) as per Section 6. [sent-580, score-0.11]
82 For a baseline we take the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)), and compare on real-world classification problems. [sent-583, score-0.175]
83 Table 1 provides numerical evidence supporting our claim that the thin-plate method is competitive with the Gaussian, in spite of its having one fewer hyperparameter. [sent-585, score-0.080]
84 For parameter selection, we performed five-fold cross validation on the four-fifths of the data available for training each split, over an exhaustive search of the algorithm parameter(s) (σ and λ for the Gaussian and happily just λ for the thin-plate). [sent-590, score-0.165]
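For concreteness, the Gaussian baseline's two-dimensional model selection can be sketched with a standard grid search (illustrative only; the grid values and the 1/C parameterisation of the regularisation are our assumptions, not the paper's exact protocol), whereas the analogous five-fold search for the thin-plate sweeps λ alone.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] * X[:, 1] + 0.2)

# Two-dimensional search for the Gaussian kernel: length scale (via gamma) and lambda (via 1/C).
param_grid = {"gamma": np.logspace(-3, 2, 6), "C": np.logspace(-2, 3, 6)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
# The corresponding thin-plate search would run the same five-fold protocol over lambda only.
```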
85 We ensured that the chosen parameters were well within the searched range by visually inspecting the cross validation error as a function of the parameters. [sent-592, score-0.112]
86 Happily, for the thin-plate we needed to cross validate to choose only the regularisation parameter λ, whereas for the Gaussian we had to choose both λ and the scale parameter σ. [sent-593, score-0.292]
87 The discovery of an equally effective algorithm which has only one parameter is important, since the Gaussian is probably the most popular and effective kernel used with the s.v.m. [sent-594, score-0.175]
88 the rest models, we used five-fold cross validation on the 7291 training examples to find the parameters, retrained on the full training set, and labeled the 2007 test examples according to the binary classifier with maximum output. [sent-602, score-0.112]
89 Hence the Gaussian did not perform significantly better, in spite of the extra parameter. [sent-606, score-0.08]
90 For the β = 0 variant (necessary only on linearly separable problems — presently only the USPS set), however, the cost is O(n_b² n_sv + n_b³), where n_b is the number of basis functions in the expansion. [sent-615, score-0.304]
91 For our USPS experiments we expanded on all m training points, but if n_sv ≪ m this is inefficient and probably unnecessary. [sent-616, score-0.106]
92 took only ∼40s in comparison to ∼17 minutes (with the use of various efficient factorisation techniques as detailed in the accompanying (Walder & Chapelle, 2007)) for the thin-plate. [sent-620, score-0.172]
93 Given that for the thin-plate cross validation needs to be performed over one fewer parameter, even in this most unfavourable scenario of n_sv ≪ m, the overall times of the algorithms are comparable. [sent-622, score-0.218]
94 Moreover, during cross validation one typically encounters larger numbers of violators for some suboptimal parameter configurations, in which cases the Gaussian and thin-plate training times are comparable. [sent-623, score-0.218]
95 8 Conclusion We have proven that there exist no non-trivial radial p.d. [sent-624, score-0.140]
96 kernels which are dilation invariant (or more accurately, dilation scaled), but rather only c.p.d. ones. [sent-626, score-0.550]
97 Such kernels have the advantage that, to take the s.v.m. [sent-630, score-0.152]
98 as an example, varying the absolute multiplicative scale (or length scale) of the data has the same effect as changing the regularisation parameter — hence one needs model selection to choose only one of these, in contrast to the widely used Gaussian kernel for example. [sent-633, score-0.448]
99 Accordingly we provided a compact introduction to the topic, including some novel analysis which includes a new, elementary and self-contained derivation of one particularly important result for the machine learning community, the representer theorem. [sent-647, score-0.229]
100 Conditionally positive definite kernels for svm based image recognition. [sent-653, score-0.152]
wordName wordTfidf (topN-words)
[('ws', 0.283), ('walder', 0.239), ('regularisation', 0.231), ('rd', 0.224), ('wendland', 0.186), ('chapelle', 0.175), ('kernel', 0.175), ('dilation', 0.162), ('oa', 0.159), ('xz', 0.159), ('kernels', 0.152), ('representer', 0.148), ('sx', 0.127), ('px', 0.118), ('optimisation', 0.11), ('gws', 0.106), ('nsv', 0.106), ('tikhonov', 0.106), ('wahba', 0.106), ('xim', 0.106), ('violators', 0.106), ('ft', 0.1), ('xj', 0.094), ('radial', 0.094), ('bijection', 0.092), ('invariances', 0.092), ('pz', 0.092), ('ot', 0.088), ('accompanying', 0.088), ('factorisation', 0.084), ('xx', 0.083), ('elementary', 0.081), ('usps', 0.081), ('lemma', 0.08), ('fleuret', 0.08), ('sahbi', 0.08), ('spite', 0.08), ('sf', 0.079), ('smola', 0.077), ('invariant', 0.074), ('xi', 0.074), ('conditionally', 0.074), ('sp', 0.071), ('regulariser', 0.069), ('xm', 0.069), ('nition', 0.069), ('transformation', 0.069), ('pj', 0.068), ('pk', 0.065), ('lkopf', 0.064), ('minimise', 0.063), ('sch', 0.061), ('cross', 0.061), ('arbitrary', 0.059), ('nb', 0.059), ('gt', 0.059), ('li', 0.058), ('hilbert', 0.056), ('ta', 0.056), ('polynomials', 0.056), ('invariance', 0.055), ('corollary', 0.054), ('satisfying', 0.054), ('bdp', 0.053), ('boughorbel', 0.053), ('fsvm', 0.053), ('gw', 0.053), ('happily', 0.053), ('minimiser', 0.053), ('thinplate', 0.053), ('margin', 0.053), ('dim', 0.051), ('validation', 0.051), ('zj', 0.05), ('gaussian', 0.049), ('arg', 0.048), ('brp', 0.046), ('zx', 0.046), ('minf', 0.046), ('exist', 0.046), ('orthonormal', 0.044), ('homogeneous', 0.044), ('scaled', 0.043), ('presently', 0.042), ('qr', 0.042), ('optimising', 0.042), ('regularised', 0.042), ('splice', 0.042), ('loan', 0.042), ('hence', 0.042), ('transformations', 0.039), ('uniqueness', 0.039), ('mika', 0.039), ('golub', 0.039), ('separable', 0.038), ('hsu', 0.037), ('minimising', 0.037), ('translation', 0.037), ('norm', 0.036), ('reproducing', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
2 0.17897195 62 nips-2007-Convex Learning with Invariances
Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola
Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1
3 0.13445736 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan
Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1
4 0.12719476 160 nips-2007-Random Features for Large-Scale Kernel Machines
Author: Ali Rahimi, Benjamin Recht
Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1
5 0.11828671 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
Author: Ronny Luss, Alexandre D'aspremont
Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.
6 0.1173546 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
7 0.10804859 109 nips-2007-Kernels on Attributed Pointsets with Applications
8 0.095503435 7 nips-2007-A Kernel Statistical Test of Independence
9 0.092785239 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
10 0.088784881 108 nips-2007-Kernel Measures of Conditional Dependence
11 0.075878844 112 nips-2007-Learning Monotonic Transformations for Classification
12 0.075218812 21 nips-2007-Adaptive Online Gradient Descent
13 0.07204251 24 nips-2007-An Analysis of Inference with the Universum
14 0.072034076 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
15 0.071555838 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
16 0.067909569 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
17 0.067532137 101 nips-2007-How SVMs can estimate quantiles and the median
18 0.066787831 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
19 0.063527919 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging
20 0.06202165 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
topicId topicWeight
[(0, -0.214), (1, 0.013), (2, -0.118), (3, 0.145), (4, -0.023), (5, -0.008), (6, 0.053), (7, 0.005), (8, -0.133), (9, 0.117), (10, 0.117), (11, -0.065), (12, 0.012), (13, 0.069), (14, -0.083), (15, -0.139), (16, 0.081), (17, -0.06), (18, -0.041), (19, 0.108), (20, 0.151), (21, 0.004), (22, 0.066), (23, 0.075), (24, -0.038), (25, -0.096), (26, -0.151), (27, 0.029), (28, -0.088), (29, 0.061), (30, 0.115), (31, -0.02), (32, 0.048), (33, -0.081), (34, -0.033), (35, 0.006), (36, -0.099), (37, -0.012), (38, -0.066), (39, -0.036), (40, -0.082), (41, -0.048), (42, -0.117), (43, -0.009), (44, 0.019), (45, 0.063), (46, 0.009), (47, -0.034), (48, -0.053), (49, -0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.94675112 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
2 0.73062736 62 nips-2007-Convex Learning with Invariances
Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola
Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1
3 0.71406394 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan
Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1
4 0.65397429 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
Author: Ronny Luss, Alexandre D'aspremont
Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.
5 0.65350312 24 nips-2007-An Analysis of Inference with the Universum
Author: Olivier Chapelle, Alekh Agarwal, Fabian H. Sinz, Bernhard Schölkopf
Abstract: We study a pattern classification algorithm which has recently been proposed by Vapnik and coworkers. It builds on a new inductive principle which assumes that in addition to positive and negative data, a third class of data is available, termed the Universum. We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a projected subspace (or, equivalently, with a data-dependent reduced kernel). We also provide experimental results. 1
6 0.64354211 160 nips-2007-Random Features for Large-Scale Kernel Machines
7 0.61555213 109 nips-2007-Kernels on Attributed Pointsets with Applications
8 0.58018941 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
9 0.57259798 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
10 0.56311774 7 nips-2007-A Kernel Statistical Test of Independence
11 0.54955822 108 nips-2007-Kernel Measures of Conditional Dependence
12 0.50321823 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
13 0.4980593 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
14 0.47420204 101 nips-2007-How SVMs can estimate quantiles and the median
15 0.47029936 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
16 0.44678891 112 nips-2007-Learning Monotonic Transformations for Classification
17 0.40908039 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
18 0.40171418 49 nips-2007-Colored Maximum Variance Unfolding
19 0.39206535 70 nips-2007-Discriminative K-means for Clustering
20 0.37822062 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
topicId topicWeight
[(0, 0.349), (5, 0.034), (13, 0.073), (16, 0.03), (21, 0.06), (34, 0.026), (35, 0.041), (47, 0.057), (49, 0.032), (83, 0.121), (85, 0.028), (87, 0.013), (90, 0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.6955058 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
2 0.45262355 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
3 0.44921285 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Gama
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
4 0.44662923 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng
Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1
5 0.44566035 134 nips-2007-Multi-Task Learning via Conic Programming
Author: Tsuyoshi Kato, Hisashi Kashima, Masashi Sugiyama, Kiyoshi Asai
Abstract: When we have several related tasks, solving them simultaneously is shown to be more effective than solving them individually. This approach is called multi-task learning (MTL) and has been studied extensively. Existing approaches to MTL often treat all the tasks as uniformly related to each other and the relatedness of the tasks is controlled globally. For this reason, the existing methods can lead to undesired solutions when some tasks are not highly related to each other, and some pairs of related tasks can have significantly different solutions. In this paper, we propose a novel MTL algorithm that can overcome these problems. Our method makes use of a task network, which describes the relation structure among tasks. This allows us to deal with intricate relation structures in a systematic way. Furthermore, we control the relatedness of the tasks locally, so all pairs of related tasks are guaranteed to have similar solutions. We apply the above idea to support vector machines (SVMs) and show that the optimization problem can be cast as a second order cone program, which is convex and can be solved efficiently. The usefulness of our approach is demonstrated through simulations with protein super-family classification and ordinal regression problems.
6 0.4455716 156 nips-2007-Predictive Matrix-Variate t Models
7 0.44263268 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
8 0.44244453 200 nips-2007-The Tradeoffs of Large Scale Learning
10 0.44182026 7 nips-2007-A Kernel Statistical Test of Independence
11 0.44180071 102 nips-2007-Incremental Natural Actor-Critic Algorithms
12 0.4414022 49 nips-2007-Colored Maximum Variance Unfolding
13 0.44139546 185 nips-2007-Stable Dual Dynamic Programming
14 0.44115841 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
15 0.44104993 86 nips-2007-Exponential Family Predictive Representations of State
16 0.43962994 24 nips-2007-An Analysis of Inference with the Universum
17 0.43881407 40 nips-2007-Bundle Methods for Machine Learning
18 0.43878236 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
19 0.43854532 187 nips-2007-Structured Learning with Approximate Inference
20 0.43853441 186 nips-2007-Statistical Analysis of Semi-Supervised Regression