nips nips2007 nips2007-118 knowledge-graph by maker-knowledge-mining

118 nips-2007-Learning with Transformation Invariant Kernels


Source: pdf

Author: Christian Walder, Olivier Chapelle

Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper considers kernels invariant to translation, rotation and dilation. [sent-6, score-0.226]

2 We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. [sent-9, score-0.528]

3 Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. [sent-16, score-0.081]

4 For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). [sent-27, score-0.228]

5 1 Introduction. Recent years have seen widespread application of reproducing kernel Hilbert space (r.k.h.s.) [sent-31, score-0.211]

6 based methods to machine learning problems (Schölkopf & Smola, 2002). [sent-35, score-0.064]

7 As a result, kernel methods have been analysed to considerable depth. [sent-36, score-0.175]

8 In spite of this, the aspects which we presently investigate seem to have received insufficient attention, at least within the machine learning community. [sent-37, score-0.122]

9 The first is transformation invariance of the kernel, a topic touched on in (Fleuret & Sahbi, 2003). [sent-38, score-0.124]

10 Note we do not mean by this the local invariance (or insensitivity) of an algorithm to application specific transformations which should not affect the class label, such as one pixel image translations (see e. [sent-39, score-0.094]

11 Rather we are referring to global invariance to transformations, in the way that radial kernels (i.e. [sent-42, score-0.301]

12 those of the form k(x, y) = φ(‖x − y‖)) are invariant to translations. [sent-44, score-0.074]
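As a quick numerical illustration of this translation invariance, here is a small Python sketch (not code from the paper; the Gaussian profile φ(r) = exp(−r²) is just an arbitrary choice of radial profile):

```python
import numpy as np

def radial_kernel(x, y, phi=lambda r: np.exp(-r ** 2)):
    # k(x, y) = phi(||x - y||) depends on its arguments only through their distance.
    return phi(np.linalg.norm(x - y))

rng = np.random.default_rng(0)
x, y, a = rng.normal(size=(3, 5))   # two inputs and a translation vector in R^5

# Translating both arguments by the same vector leaves the kernel value unchanged.
assert np.isclose(radial_kernel(x, y), radial_kernel(x + a, y + a))
```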

13 In Sections 2 and 3 we introduce the more general concept of transformation scaledness, focusing on translation, dilation and orthonormal transformations. [sent-45, score-0.275]

14 case in Section 4, giving novel elementary derivations of some key results, most notably a c.p.d. representer theorem. [sent-56, score-0.081]

15 thin-plate kernel which we discuss in Section 5, is not only richly non-linear, but enjoys a duality between the length-scale parameter and the regularisation parameter of Tikhonov regularised solutions such as the s.v.m. [sent-72, score-0.448]

16 In Section 7 we compare the resulting classifier (which has only a regularisation parameter) to that of the s.v.m. [sent-75, score-0.231]

17 with Gaussian kernel (which has an additional length scale parameter). [sent-78, score-0.175]

18 Let T be a bijection on X and F a Hilbert space of functions on some non-empty set X such that f → f ◦ T is a bijection on F. [sent-83, score-0.184]

19 For any Θ : F → R and T such that f → f ◦ T is a bijection of F, if the left hand side is unique then arg min_{f ∈ F} Θ(f) = (arg min_{f_T ∈ F} Θ(f_T ◦ T)) ◦ T. Proof. [sent-89, score-0.188]

20 Let f* = arg min_{f ∈ F} Θ(f) and f_T* = arg min_{f_T ∈ F} Θ(f_T ◦ T). [sent-90, score-0.142]

21 But since f → f ◦ T is a bijection on F, we also have ∀g ∈ F, Θ(f_T* ◦ T) ≤ Θ(g). [sent-92, score-0.092]

22 If F is T-scaled and the left hand side is unique then arg min_{f ∈ F} ‖f‖²_F + Σ_i L_i(f(x_i)) = (arg min_{f ∈ F} ‖f‖²_F / g_T(F) + Σ_i L_i(f(T x_i))) ◦ T. [sent-100, score-0.17]

23 with linear hinge loss for L_i(t) = max(0, 1 − y_i t), and kernel ridge regression for L_i(t) = (y_i − t)². [sent-105, score-0.175]

24 Let W_s, T_a and O_A be the dilation, translation and orthonormal transformations R^d → R^d defined for s ∈ R \ {0}, a ∈ R^d and orthonormal A : R^d → R^d by W_s x = sx, T_a x = x + a and O_A x = Ax respectively. [sent-109, score-0.164]
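To make these three families of maps concrete, the following sketch (not from the paper) defines them and checks how pairwise distances — and hence any radial kernel — behave under each:

```python
import numpy as np

def W(s):  # dilation W_s x = s x
    return lambda x: s * x

def T(a):  # translation T_a x = x + a
    return lambda x: x + a

def O(A):  # orthonormal transformation O_A x = A x
    return lambda x: A @ x

rng = np.random.default_rng(1)
d = 4
x, y, a = rng.normal(size=(3, d))
A, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthonormal matrix

dist = lambda u, v: np.linalg.norm(u - v)
assert np.isclose(dist(T(a)(x), T(a)(y)), dist(x, y))            # translation leaves distances unchanged
assert np.isclose(dist(O(A)(x), O(A)(y)), dist(x, y))            # so does an orthonormal transformation
assert np.isclose(dist(W(2.0)(x), W(2.0)(y)), 2.0 * dist(x, y))  # dilation rescales them by |s|
```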

25 on similarly dilated input patterns but with a regularisation parameter adjusted according to Corollary 2. [sent-120, score-0.231]

26 with a particular kernel, as we have just seen it is easy to demonstrate for the more general Tikhonov regularisation setting with any function norm satisfying our definition of transformation scaledness. [sent-125, score-0.39]

27 3 Transformation Scaled Reproducing Kernel Hilbert Spaces. We now derive the necessary and sufficient conditions for a reproducing kernel (r.k.h.s.) [sent-126, score-0.211]

28 (3), which we prove in the accompanying technical report (Walder & Chapelle, 2007). [sent-157, score-0.088]

29 It is now easy to see that, for example, the homogeneous polynomial kernel k(x, y) = ⟨x, y⟩^p corresponds to a W_s-scaled r.k.h.s. [sent-158, score-0.219]

30 Hence when the homogeneous polynomial kernel is used with the hard-margin s.v.m. [sent-163, score-0.219]

31 algorithm, the result is invariant to multiplicative scaling of the training and test data. [sent-166, score-0.074]
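The scaling property behind this invariance can be checked directly; a minimal sketch, assuming the homogeneous polynomial kernel k(x, y) = ⟨x, y⟩^p mentioned above:

```python
import numpy as np

def poly_kernel(X, Y, p=3):
    # Homogeneous polynomial kernel k(x, y) = <x, y>^p on the rows of X and Y.
    return (X @ Y.T) ** p

rng = np.random.default_rng(2)
X, s, p = rng.normal(size=(10, 5)), 0.37, 3

# Rescaling every input by s multiplies the whole Gram matrix by the constant
# s^(2p); such a constant rescaling does not change which hard-margin separator
# is found, only the scale of its coefficients.
assert np.allclose(poly_kernel(s * X, s * X, p), s ** (2 * p) * poly_kernel(X, X, p))
```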

32 ’s with radial kernels that are also W_s-scaled for all s ≠ 0. [sent-178, score-0.246]

33 Which we prove in the accompanying technical report (Walder & Chapelle, 2007). [sent-182, score-0.088]

34 Indeed the widely used Gaussian kernel satisfies both of the above invariances. [sent-195, score-0.175]

35 But if we now also assume that H is W_s-scaled for all s ≠ 0 — this time with arbitrary g_{W_s}(H) — then k(x, y) = g_{W_s}(H) k(W_s x, W_s y) = g_{W_{|s|}}(H) φ_OT(|s| ‖x − y‖), so that letting r = ‖x − y‖ we have that φ_OT(r) = g_{W_{|s|}}(H) φ_OT(|s| r) and hence by Lemma 3. [sent-196, score-0.101]

36 For the corresponding Gram matrix K = [[a, b d^p], [b d^p, a]] to be positive semi-definite we require 0 ≤ det(K) = a² − b² d^{2p}, but for arbitrary d > 0 and a < ∞, this implies b = 0. [sent-200, score-0.165]
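A tiny numerical version of this argument (a sketch; the values of a, b and p are arbitrary choices for illustration):

```python
import numpy as np

# For K = [[a, b * d**p], [b * d**p, a]], positive semi-definiteness requires
# det(K) = a**2 - b**2 * d**(2p) >= 0.  With a fixed and b != 0 this fails for
# large enough d, which is why b must vanish.
a, b, p = 1.0, 0.1, 2
for d in (1.0, 2.0, 5.0, 10.0):
    K = np.array([[a, b * d ** p], [b * d ** p, a]])
    print(f"d = {d:5.1f}   det(K) = {np.linalg.det(K):9.3f}   "
          f"min eigenvalue = {np.linalg.eigvalsh(K).min():8.3f}")
```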

37 kernel functions with the stated properties, such as the thin-plate kernel. [sent-204, score-0.175]

38 We discuss this case in detail in Section 5, after the following particularly elementary and in part novel introduction to c.p.d. [sent-205, score-0.081]

39 kernel functions – these are given by the following Definition 4. [sent-212, score-0.175]

40 A continuous function φ : X × X → R is conditionally positive definite with respect to (w.r.t. [sent-214, score-0.074]

41 kernels which corresponds to our definition when P = {1} (e.g. [sent-239, score-0.152]

42 (Schölkopf & Smola, 2002)) or when P is taken to be the space of polynomials of some fixed maximum degree (e.g. [sent-241, score-0.12]

43 We write P^⊥(x_1, …, x_m) for the set {α ∈ R^m : Σ_{i=1}^m α_i p(x_i) = 0 for all p ∈ P}. [sent-248, score-0.069]
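The set P^⊥(x_1, …, x_m) is just the null space of the matrix of basis-function values; a sketch assuming P = π₁(R^d), the space of affine functions used later for the thin-plate case:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)
d, m = 2, 6
X = rng.normal(size=(m, d))

# Basis of P = pi_1(R^d): the constant function and the d coordinate functions,
# evaluated at the points x_1, ..., x_m: [Px]_{i,j} = p_j(x_i).
Px = np.hstack([np.ones((m, 1)), X])        # shape (m, d + 1)

# alpha in P^perp(x_1, ..., x_m)  <=>  sum_i alpha_i p(x_i) = 0 for all p in P
# <=>  Px^T alpha = 0, so the columns of the null space of Px^T span it.
P_perp = null_space(Px.T)
assert np.allclose(Px.T @ P_perp, 0.0)
```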

44 We define F_φ(X) to be the Hilbert space of functions which is the completion of the set {Σ_{j=1}^m α_j φ(·, x_j) : m ∈ N, x_1, . [sent-262, score-0.094]

45 , x_m), which due to the definition of φ we may endow with the inner product ⟨Σ_{j=1}^m α_j φ(·, x_j), Σ_{k=1}^n β_k φ(·, y_k)⟩_{F_φ(X)} = Σ_{j=1}^m Σ_{k=1}^n α_j β_k φ(x_j, y_k). [sent-266, score-0.163]
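In coordinates this inner product is just a bilinear form in the coefficient vectors; a short sketch (the radial profile used for φ below is an arbitrary placeholder, not the paper's kernel):

```python
import numpy as np

def expansion_inner_product(alpha, X, beta, Y, phi):
    # < sum_j alpha_j phi(., x_j), sum_k beta_k phi(., y_k) >_{F_phi(X)}
    #   = sum_{j,k} alpha_j beta_k phi(x_j, y_k) = alpha^T Phi_XY beta,
    # where [Phi_XY]_{j,k} = phi(x_j, y_k).
    Phi_XY = np.array([[phi(x, y) for y in Y] for x in X])
    return float(alpha @ Phi_XY @ beta)

phi = lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2)   # placeholder radial kernel
rng = np.random.default_rng(4)
X, Y = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
alpha, beta = rng.normal(size=3), rng.normal(size=4)
print(expansion_inner_product(alpha, X, beta, Y, phi))
```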

46 (x_m, y_m)} ⊂ X × R, there exists an s = s_{F_φ(X)} + s_P where s_{F_φ(X)} = Σ_{j=1}^m α_j φ(·, x_j) ∈ F_φ(X) and s_P = Σ_{k=1}^r β_k p_k ∈ P, such that s(x_i) = y_i, i = 1 . [sent-289, score-0.159]

47 A simple and elementary proof (which shows (17) is solvable when λ = 0), is given in (Wendland, 2004) and reproduced in the accompanying technical report (Walder & Chapelle, 2007). [sent-293, score-0.169]

48 Hence, returning to the main thread, we have the following lemma — our proof of which seems to be novel and particularly elementary. [sent-307, score-0.08]

49 Consider an arbitrary function s = s_{F_φ(X)} + s_P with s_{F_φ(X)} = Σ_{j=1}^m α_j φ(·, x_j) ∈ F_φ(X) and s_P = Σ_{k=1}^r β_k p_k ∈ P. [sent-320, score-0.218]

50 We can always write f as f = Σ_{j=1}^m (α_j + α̃_j) φ(·, x_j) + Σ_{l=1}^n b_l φ(·, z_l) + Σ_{k=1}^r c_k p_k. [sent-327, score-0.159]

51 If we define [P_x]_{i,j} = p_j(x_i), [P_z]_{i,j} = p_j(z_i), [Φ_xx]_{i,j} = φ(x_i, x_j), [Φ_xz]_{i,j} = φ(x_i, z_j), and [Φ_zx]_{i,j} = φ(z_i, x_j), then the condition (6) can hence be written P_x β = Φ_xx α + Φ_xz b + P_x c, (7) and the definition of F_φ(X) requires that e. [sent-328, score-0.416]

52 , x_m), hence implying the constraints P_x^⊤ α = 0 and P_x^⊤ (α + α̃) + P_z^⊤ b = 0. [sent-333, score-0.111]

53 Using these results it is now easy to prove an analog of the representer theorem for the p.d. [sent-344, score-0.148]

54 , f(x_m)) + Ω(‖P_φ(P) f‖²_{F_φ(X)}) (10) has a minimiser of the form Σ_i α_i φ(·, x_i) + p, where p ∈ P. [sent-362, score-0.074]

55 Let s = Σ_{i=1}^m α_i φ(·, x_i) + p satisfy s(x_i) = f(x_i), i = 1 . [sent-365, score-0.074]

56 The m-th order thin-plate kernel φ_m : R^d × R^d → R is given by φ_m(x, y) = (−1)^{m−(d−2)/2} ‖x − y‖^{2m−d} log(‖x − y‖) if d ∈ 2N, and φ_m(x, y) = (−1)^{m−(d−1)/2} ‖x − y‖^{2m−d} if d ∈ (2N − 1), (11) for x ≠ y, and zero otherwise. [sent-376, score-0.175]
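A direct transcription of (11) as a function — a sketch, assuming the standard Wendland-style form of the m-th order thin-plate kernel as reconstructed above:

```python
import numpy as np

def thin_plate_kernel(x, y, m=2):
    # m-th order thin-plate kernel phi_m on R^d, following (11);
    # defined to be zero at x = y.
    x, y = np.asarray(x, float), np.asarray(y, float)
    d = x.shape[-1]
    r = np.linalg.norm(x - y)
    if r == 0.0:
        return 0.0
    if d % 2 == 0:   # d even: (-1)^(m-(d-2)/2) * r^(2m-d) * log r
        return (-1.0) ** (m - (d - 2) // 2) * r ** (2 * m - d) * np.log(r)
    # d odd: (-1)^(m-(d-1)/2) * r^(2m-d)
    return (-1.0) ** (m - (d - 1) // 2) * r ** (2 * m - d)

# In R^2 with m = 2 this is the familiar r^2 log r thin-plate spline.
x, y = np.array([0.0, 0.0]), np.array([1.5, 2.0])
r = np.linalg.norm(x - y)
assert np.isclose(thin_plate_kernel(x, y), r ** 2 * np.log(r))
```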

57 The kernel induces the following norm on the space F_{φ_m}(R^d) of Definition 4. [sent-381, score-0.211]

58 dx_1 ⋯ dx_d, with derivatives ∂^m/(∂x_{i_1} ⋯ ∂x_{i_m}), so that the associated map F_{φ_m}(R^d) → L_2(R^d) is a regularisation operator, implicitly defined above. [sent-387, score-0.231]

59 1, the process is actually more involved due to the log factor in the first case of (11), and it is necessary to use the fact that the kernel is c.p.d. [sent-391, score-0.175]

60 In Section 3 we showed that non-trivial kernels which are both radial and dilation scaled cannot be p.d. [sent-402, score-0.451]

61 — one of the most widely used kernel algorithms — has been applied only with p.d. [sent-410, score-0.175]

62 Hence we propose using the thin-plate kernel with the s.v.m. [sent-420, score-0.175]

63 Table 1: Comparison of the Gaussian and thin-plate kernels with the s.v.m. (dim/n per data set: 18/2086, 20/3000*, 60/2844, 5/215, 20/3000*, 21/3000). [sent-483, score-0.175]

64 A star in the n column means that more examples were available but we kept only a maximum of 2000 per class in order to reduce the computational burden of the extensive number of cross validation and model selection training runs (see Section 7). [sent-489, score-0.112]

65 None of the data sets were linearly separable so we always used the normal (β unconstrained) version of the optimisation described in Section 6. [sent-490, score-0.148]

66 an arbitrary finite dimensional space of functions P by extending the primal optimisation approach of (Chapelle, 2007) to the c.p.d. [sent-508, score-0.169]

67 solution can be formulated as the arg min over f ∈ F_φ(X) ⊕ P of λ ‖P_φ(P) f‖²_{F_φ(X)} + Σ_{i=1}^n max(0, 1 − y_i f(x_i))², (14). Note that for the second order thin-plate case we have X = R^d and P = π_1(R^d) (the space of constant and first order polynomials). [sent-515, score-0.094]

68 Hence dim(P) = d + 1 and we can take the basis to be p_j(x) = [x]_j for j = 1 . [sent-516, score-0.119]

69 p_{dim(P)} span P, the solution to (14) is given by f_svm(x) = Σ_{i=1}^n α_i φ(x_i, x) + Σ_{j=1}^{dim(P)} β_j p_j(x). [sent-524, score-0.121]
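Evaluating such an expansion is straightforward; a sketch, assuming the affine basis (the d coordinate functions plus a constant) implied by dim(P) = d + 1 — the helper names here are hypothetical:

```python
import numpy as np

def f_svm(x, X_train, alpha, beta, phi):
    # f_svm(x) = sum_i alpha_i phi(x_i, x) + sum_j beta_j p_j(x), assuming
    # p_1, ..., p_d are the coordinate functions and p_{d+1} is the constant 1.
    kernel_part = sum(a * phi(xi, x) for a, xi in zip(alpha, X_train))
    affine_part = beta[:-1] @ x + beta[-1]
    return kernel_part + affine_part
```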

70 Now, if we consider only the margin violators — those vectors which (at a given step of the optimisation process) satisfy y_i f(x_i) < 1, we can replace the max(0, ·) in (14) with (·). [sent-525, score-0.269]

71 In particular we make the QR factorisation (Golub & Van Loan, 1996) P = QR, where Q⊤Q = I and R is square. [sent-531, score-0.084]

72 As a final step at the end of the optimisation process, we take the minimum norm solution of the system β̃ = Rβ, namely β = R^# β̃, where R^# is the pseudo-inverse of R. [sent-533, score-0.146]
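The linear-algebra step here can be sketched as follows (the role of a transformed coefficient vector β̃ = Rβ is my reading of this passage, not a verified transcription; the basis matrix P below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 8, 3
P = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])   # hypothetical basis matrix

Q, R = np.linalg.qr(P)            # economy QR: P = Q R, Q^T Q = I, R square
assert np.allclose(Q.T @ Q, np.eye(R.shape[0]))

# If the optimiser works with beta_tilde = R beta, a minimum-norm beta is
# recovered at the end with the pseudo-inverse of R.
beta_tilde = rng.normal(size=R.shape[0])
beta = np.linalg.pinv(R) @ beta_tilde
assert np.allclose(R @ beta, beta_tilde)   # exact whenever R has full rank
```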

73 The precise algorithm is given in (Walder & Chapelle, 2007), where we also detail two efficient factorisation techniques, specific to the new s.v.m. [sent-541, score-0.084]

74 This is correct in that, since P lies in the null space of the regulariser ‖P_φ(P)·‖²_{F_φ(X)}, such solutions minimise (14), but may be undesirable for various reasons. [sent-551, score-0.132]

75 It is unclear how to deal with this problem — after all it implies that the regulariser is simply inappropriate for the problem at hand. [sent-555, score-0.069]

76 Nonetheless we still wish to apply a (non-linear) algorithm with the previously discussed invariances of the thin-plate. [sent-556, score-0.092]

77 The way we deal with this is simple — instead of minimising over F_φ(X) we consider only the finite dimensional subspace given by {Σ_{j=1}^n α_j φ(·, x_j) : α ∈ P^⊥(x_1, . [sent-566, score-0.131]

78 The closed form solution to the constrained quadratic programme is in this case given by (see (Walder & Chapelle, 2007)) α = −P_⊥ (P_⊥^⊤ (λΦ + Φ_sx^⊤ Φ_sx) P_⊥)^{−1} P_⊥^⊤ Φ_sx^⊤ y_s (18), where Φ_sx = [Φ]_{s,:}, s is the current set of margin violators and P_⊥ the null space of P satisfying P^⊤ P_⊥ = 0. [sent-574, score-0.213]
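For concreteness, here is one way to compute such a restricted solution as a plain constrained least-squares step (a sketch: the transposes and sign convention follow this least-squares reading rather than a verified transcription of (18)):

```python
import numpy as np
from scipy.linalg import null_space

def restricted_solution(Phi, P, violators, y, lam):
    # Minimise  lam * a^T Phi a + || Phi[violators, :] a - y[violators] ||^2
    # over coefficient vectors a constrained to the column space of P_perp,
    # the null space of P^T, by writing a = P_perp @ gamma and solving for gamma.
    P_perp = null_space(P.T)
    Phi_sx = Phi[violators, :]
    A = P_perp.T @ (lam * Phi + Phi_sx.T @ Phi_sx) @ P_perp
    b = P_perp.T @ Phi_sx.T @ y[violators]
    return P_perp @ np.linalg.solve(A, b)
```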

79 The precise algorithm we use to optimise in this manner is given in the accompanying technical report (Walder & Chapelle, 2007), where we also detail efficient factorisation techniques. [sent-575, score-0.172]

80 with 1) the optimisation over F_φ(X) ⊕ P as per Section 6. [sent-579, score-0.11]

81 1, and 2) the optimisation over a subspace of F_φ(X) as per Section 6. [sent-580, score-0.11]

82 For a baseline we take the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)), and compare on real world classification problems. [sent-583, score-0.175]

83 Table 1 provides numerical evidence supporting our claim that the thin-plate method is competitive with the Gaussian, in spite of its having one fewer hyperparameter. [sent-585, score-0.08]

84 For parameter selection, we performed five-fold cross validation on the four-fifths of the data available for training each split, over an exhaustive search of the algorithm parameter(s) (σ and λ for the Gaussian and happily just λ for the thin-plate). [sent-590, score-0.165]

85 We ensured that the chosen parameters were well within the searched range by visually inspecting the cross validation error as a function of the parameters. [sent-592, score-0.112]

86 Happily, for the thin-plate we needed to cross validate to choose only the regularisation parameter λ, whereas for the Gaussian we had to choose both λ and the scale parameter σ. [sent-593, score-0.292]
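The practical difference is simply the size of the search grid; a sketch of the model-selection loop (the `train_and_score` callable and the grid ranges are hypothetical placeholders, not the authors' settings):

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def select_by_cv(train_and_score, grid, X, y, n_splits=5, seed=0):
    # Five-fold cross validation: pick the parameter setting with the lowest
    # mean held-out error.  `train_and_score(params, X_tr, y_tr, X_te, y_te)`
    # is assumed to fit a classifier and return its error on the held-out fold.
    folds = list(KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X))
    mean_err = [np.mean([train_and_score(p, X[tr], y[tr], X[te], y[te])
                         for tr, te in folds]) for p in grid]
    return grid[int(np.argmin(mean_err))]

lambdas = np.logspace(-4, 2, 7)
sigmas = np.logspace(-2, 2, 5)
thin_plate_grid = [(lam,) for lam in lambdas]               # search over lambda only
gaussian_grid = list(itertools.product(lambdas, sigmas))    # search over (lambda, sigma)
```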

87 The discovery of an equally effective algorithm which has only one parameter is important, since the Gaussian is probably the most popular and effective kernel used with the s.v.m. [sent-594, score-0.175]

88 For the one vs. the rest models we used five-fold cross validation on the 7291 training examples to find the parameters, retrained on the full training set, and labeled the 2007 test examples according to the binary classifier with maximum output. [sent-602, score-0.112]

89 Hence the Gaussian did not perform significantly better, in spite of the extra parameter. [sent-606, score-0.08]

90 For the β = 0 variant (necessary only on linearly separable problems — presently only the USPS set) however, the cost is O(n_b² n_sv + n_b³), where n_b is the number of basis functions in the expansion. [sent-615, score-0.304]

91 For our USPS experiments we expanded on all m training points, but if n_sv ≪ m this is inefficient and probably unnecessary. [sent-616, score-0.106]

92 took only ∼ 40s in comparison to ∼ 17 minutes (with the use of various efficient factorisation techniques as detailed in the accompanying (Walder & Chapelle, 2007)) for the thin-plate. [sent-620, score-0.172]

93 Given that for the thin-plate cross validation needs to be performed over one fewer parameter, even in this most unfavourable scenario of n_sv ≪ m, the overall times of the algorithms are comparable. [sent-622, score-0.218]

94 Moreover, during cross validation one typically encounters larger numbers of violators for some suboptimal parameter configurations, in which cases the Gaussian and thin-plate training times are comparable. [sent-623, score-0.218]

95 8 Conclusion. We have proven that there exist no non-trivial radial p.d. [sent-624, score-0.14]

96 kernels which are dilation invariant (or more accurately, dilation scaled), but rather only c.p.d. [sent-626, score-0.55]

97 Such kernels have the advantage that, to take the s. [sent-630, score-0.152]

98 as an example, varying the absolute multiplicative scale (or length scale) of the data has the same effect as changing the regularisation parameter — hence one needs model selection to choose only one of these, in contrast to the widely used Gaussian kernel for example. [sent-633, score-0.448]

99 Accordingly we provided a compact introduction to the topic, including some novel analysis which includes a new, elementary and self-contained derivation of one particularly important result for the machine learning community, the representer theorem. [sent-647, score-0.229]

100 Conditionally positive definite kernels for svm based image recognition. [sent-653, score-0.152]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ws', 0.283), ('walder', 0.239), ('regularisation', 0.231), ('rd', 0.224), ('wendland', 0.186), ('chapelle', 0.175), ('kernel', 0.175), ('dilation', 0.162), ('oa', 0.159), ('xz', 0.159), ('kernels', 0.152), ('representer', 0.148), ('sx', 0.127), ('px', 0.118), ('optimisation', 0.11), ('gws', 0.106), ('nsv', 0.106), ('tikhonov', 0.106), ('wahba', 0.106), ('xim', 0.106), ('violators', 0.106), ('ft', 0.1), ('xj', 0.094), ('radial', 0.094), ('bijection', 0.092), ('invariances', 0.092), ('pz', 0.092), ('ot', 0.088), ('accompanying', 0.088), ('factorisation', 0.084), ('xx', 0.083), ('elementary', 0.081), ('usps', 0.081), ('lemma', 0.08), ('fleuret', 0.08), ('sahbi', 0.08), ('spite', 0.08), ('sf', 0.079), ('smola', 0.077), ('invariant', 0.074), ('xi', 0.074), ('conditionally', 0.074), ('sp', 0.071), ('regulariser', 0.069), ('xm', 0.069), ('nition', 0.069), ('transformation', 0.069), ('pj', 0.068), ('pk', 0.065), ('lkopf', 0.064), ('minimise', 0.063), ('sch', 0.061), ('cross', 0.061), ('arbitrary', 0.059), ('nb', 0.059), ('gt', 0.059), ('li', 0.058), ('hilbert', 0.056), ('ta', 0.056), ('polynomials', 0.056), ('invariance', 0.055), ('corollary', 0.054), ('satisfying', 0.054), ('bdp', 0.053), ('boughorbel', 0.053), ('fsvm', 0.053), ('gw', 0.053), ('happily', 0.053), ('minimiser', 0.053), ('thinplate', 0.053), ('margin', 0.053), ('dim', 0.051), ('validation', 0.051), ('zj', 0.05), ('gaussian', 0.049), ('arg', 0.048), ('brp', 0.046), ('zx', 0.046), ('minf', 0.046), ('exist', 0.046), ('orthonormal', 0.044), ('homogeneous', 0.044), ('scaled', 0.043), ('presently', 0.042), ('qr', 0.042), ('optimising', 0.042), ('regularised', 0.042), ('splice', 0.042), ('loan', 0.042), ('hence', 0.042), ('transformations', 0.039), ('uniqueness', 0.039), ('mika', 0.039), ('golub', 0.039), ('separable', 0.038), ('hsu', 0.037), ('minimising', 0.037), ('translation', 0.037), ('norm', 0.036), ('reproducing', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 118 nips-2007-Learning with Transformation Invariant Kernels

Author: Christian Walder, Olivier Chapelle

Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1

2 0.17897195 62 nips-2007-Convex Learning with Invariances

Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola

Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1

3 0.13445736 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1

4 0.12719476 160 nips-2007-Random Features for Large-Scale Kernel Machines

Author: Ali Rahimi, Benjamin Recht

Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1

5 0.11828671 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

Author: Ronny Luss, Alexandre D'aspremont

Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.

6 0.1173546 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

7 0.10804859 109 nips-2007-Kernels on Attributed Pointsets with Applications

8 0.095503435 7 nips-2007-A Kernel Statistical Test of Independence

9 0.092785239 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

10 0.088784881 108 nips-2007-Kernel Measures of Conditional Dependence

11 0.075878844 112 nips-2007-Learning Monotonic Transformations for Classification

12 0.075218812 21 nips-2007-Adaptive Online Gradient Descent

13 0.07204251 24 nips-2007-An Analysis of Inference with the Universum

14 0.072034076 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

15 0.071555838 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

16 0.067909569 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

17 0.067532137 101 nips-2007-How SVMs can estimate quantiles and the median

18 0.066787831 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling

19 0.063527919 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

20 0.06202165 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.214), (1, 0.013), (2, -0.118), (3, 0.145), (4, -0.023), (5, -0.008), (6, 0.053), (7, 0.005), (8, -0.133), (9, 0.117), (10, 0.117), (11, -0.065), (12, 0.012), (13, 0.069), (14, -0.083), (15, -0.139), (16, 0.081), (17, -0.06), (18, -0.041), (19, 0.108), (20, 0.151), (21, 0.004), (22, 0.066), (23, 0.075), (24, -0.038), (25, -0.096), (26, -0.151), (27, 0.029), (28, -0.088), (29, 0.061), (30, 0.115), (31, -0.02), (32, 0.048), (33, -0.081), (34, -0.033), (35, 0.006), (36, -0.099), (37, -0.012), (38, -0.066), (39, -0.036), (40, -0.082), (41, -0.048), (42, -0.117), (43, -0.009), (44, 0.019), (45, 0.063), (46, 0.009), (47, -0.034), (48, -0.053), (49, -0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94675112 118 nips-2007-Learning with Transformation Invariant Kernels

Author: Christian Walder, Olivier Chapelle

Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1

2 0.73062736 62 nips-2007-Convex Learning with Invariances

Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola

Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1

3 0.71406394 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1

4 0.65397429 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

Author: Ronny Luss, Alexandre D'aspremont

Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.

5 0.65350312 24 nips-2007-An Analysis of Inference with the Universum

Author: Olivier Chapelle, Alekh Agarwal, Fabian H. Sinz, Bernhard Schölkopf

Abstract: We study a pattern classification algorithm which has recently been proposed by Vapnik and coworkers. It builds on a new inductive principle which assumes that in addition to positive and negative data, a third class of data is available, termed the Universum. We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a projected subspace (or, equivalently, with a data-dependent reduced kernel). We also provide experimental results. 1

6 0.64354211 160 nips-2007-Random Features for Large-Scale Kernel Machines

7 0.61555213 109 nips-2007-Kernels on Attributed Pointsets with Applications

8 0.58018941 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

9 0.57259798 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

10 0.56311774 7 nips-2007-A Kernel Statistical Test of Independence

11 0.54955822 108 nips-2007-Kernel Measures of Conditional Dependence

12 0.50321823 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation

13 0.4980593 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

14 0.47420204 101 nips-2007-How SVMs can estimate quantiles and the median

15 0.47029936 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

16 0.44678891 112 nips-2007-Learning Monotonic Transformations for Classification

17 0.40908039 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

18 0.40171418 49 nips-2007-Colored Maximum Variance Unfolding

19 0.39206535 70 nips-2007-Discriminative K-means for Clustering

20 0.37822062 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.349), (5, 0.034), (13, 0.073), (16, 0.03), (21, 0.06), (34, 0.026), (35, 0.041), (47, 0.057), (49, 0.032), (83, 0.121), (85, 0.028), (87, 0.013), (90, 0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.6955058 118 nips-2007-Learning with Transformation Invariant Kernels

Author: Christian Walder, Olivier Chapelle

Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1

2 0.45262355 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1

3 0.44921285 84 nips-2007-Expectation Maximization and Posterior Constraints

Author: Kuzman Ganchev, Ben Taskar, João Gama

Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1

4 0.44662923 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1

5 0.44566035 134 nips-2007-Multi-Task Learning via Conic Programming

Author: Tsuyoshi Kato, Hisashi Kashima, Masashi Sugiyama, Kiyoshi Asai

Abstract: When we have several related tasks, solving them simultaneously is shown to be more effective than solving them individually. This approach is called multi-task learning (MTL) and has been studied extensively. Existing approaches to MTL often treat all the tasks as uniformly related to each other and the relatedness of the tasks is controlled globally. For this reason, the existing methods can lead to undesired solutions when some tasks are not highly related to each other, and some pairs of related tasks can have significantly different solutions. In this paper, we propose a novel MTL algorithm that can overcome these problems. Our method makes use of a task network, which describes the relation structure among tasks. This allows us to deal with intricate relation structures in a systematic way. Furthermore, we control the relatedness of the tasks locally, so all pairs of related tasks are guaranteed to have similar solutions. We apply the above idea to support vector machines (SVMs) and show that the optimization problem can be cast as a second order cone program, which is convex and can be solved efficiently. The usefulness of our approach is demonstrated through simulations with protein super-family classification and ordinal regression problems.

6 0.4455716 156 nips-2007-Predictive Matrix-Variate t Models

7 0.44263268 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

8 0.44244453 200 nips-2007-The Tradeoffs of Large Scale Learning

9 0.44225493 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

10 0.44182026 7 nips-2007-A Kernel Statistical Test of Independence

11 0.44180071 102 nips-2007-Incremental Natural Actor-Critic Algorithms

12 0.4414022 49 nips-2007-Colored Maximum Variance Unfolding

13 0.44139546 185 nips-2007-Stable Dual Dynamic Programming

14 0.44115841 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

15 0.44104993 86 nips-2007-Exponential Family Predictive Representations of State

16 0.43962994 24 nips-2007-An Analysis of Inference with the Universum

17 0.43881407 40 nips-2007-Bundle Methods for Machine Learning

18 0.43878236 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

19 0.43854532 187 nips-2007-Structured Learning with Approximate Inference

20 0.43853441 186 nips-2007-Statistical Analysis of Semi-Supervised Regression