nips nips2013 nips2013-170 knowledge-graph by maker-knowledge-mining

170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space


Source: pdf

Author: Xinhua Zhang, Wee Sun Lee, Yee Whye Teh

Abstract: Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are compactly encoded as linear functionals whose values are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Incorporating invariance information is important for many learning problems. [sent-12, score-0.236]

2 In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. [sent-15, score-0.58]

3 These invariances are compactly encoded as linear functionals whose values are penalized by some loss function. [sent-16, score-0.805]

4 Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. [sent-17, score-0.279]

5 For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. [sent-18, score-0.654]

6 Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. [sent-19, score-0.461]

7 One way to utilize this invariance is by assuming that the gradient of the function is small along the directions of transformation at each data instance. [sent-22, score-0.362]

8 However, in practice it usually leads to a large or infinite number of constraints, and hence tractable formulations inevitably rely on approximating or restricting the invariance under consideration. [sent-32, score-0.236]

9 the reproducing kernel Hilbert space (RKHS) induced by universal kernels. [sent-35, score-0.175]

10 However, tractable oracles are often unavailable, and so a simpler approximation can be performed by merely enforcing the invariance of f along some given directions, e. [sent-44, score-0.265]

11 The goal of our paper, therefore, is to develop a new framework that: (1) allows a variety of invariances to be compactly encoded over a rich family of functions like RKHS, and (2) allows the search of the optimal function to be formulated as a convex program that is efficiently solvable. [sent-49, score-0.503]

12 The key requirement of our approach is that the invariances can be characterized by linear functionals that are bounded (§ 2). [sent-50, score-0.768]

13 We give a representer theorem that guarantees that the cost can be minimized by linearly combining a finite number of basis functions. [sent-52, score-0.255]

14 Note that [23] also proposed an operator-based model for invariance, but did not derive a representer theorem and did not study the empirical performance. [sent-54, score-0.304]

15 We also show that a wide range of commonly used invariances can be encoded as bounded linear functionals. [sent-55, score-0.56]

16 These include derivatives, transformation invariances, and local averages in commonly used RKHSs such as those defined by Gaussian and polynomial kernels (§ 4). [sent-56, score-0.227]

17 Experiments show that the use of some of these invariances within our framework yields better or similar results compared to the state of the art. [sent-57, score-0.394]

18 For any x1 , . . . , xl ∈ X, the l × l Gram matrix K := (k(xi , xj ))ij is symmetric positive semi-definite. [sent-66, score-0.147]

19 Example kernels on Rn × Rn include the polynomial kernel of degree r, defined as k(x1, x2) = (x1 · x2 + 1)^r, as well as Gaussian kernels, defined as k(x1, x2) = κσ(x1, x2) where κσ(x1, x2) := exp(−‖x1 − x2‖^2/(2σ^2)). [sent-67, score-0.197]
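
For concreteness, here is a minimal NumPy sketch of these two example kernels and of the positive semi-definiteness of the Gram matrix from the previous sentence; the function names are our own illustrative choices, not from the paper.

```python
import numpy as np

def polynomial_kernel(x1, x2, r=3):
    """Polynomial kernel of degree r: k(x1, x2) = (x1 . x2 + 1)^r."""
    return (np.dot(x1, x2) + 1.0) ** r

def gaussian_kernel(x1, x2, sigma=1.0):
    """Gaussian kernel: k(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))."""
    diff = np.asarray(x1) - np.asarray(x2)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# The Gram matrix K_ij = k(x_i, x_j) is symmetric positive semi-definite.
X = np.random.randn(5, 2)
K = np.array([[gaussian_kernel(xi, xj, sigma=0.5) for xj in X] for xi in X])
assert np.all(np.linalg.eigvalsh(K) > -1e-10)
```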

20 Using the positive semi-definiteness of k, it is easy to show that H0 is an inner product space, and we call its completion under ⟨·, ·⟩ the reproducing kernel Hilbert space (RKHS) H induced by k. [sent-73, score-0.212]

21 In the case that V and W are normed, T is called bounded if c := sup_{x∈V, ‖x‖_V = 1} ‖T x‖_W is finite, and we call c the norm of the operator, denoted by ‖T‖. [sent-79, score-0.123]

22 Let H be an RKHS induced by a kernel k defined on X × X. [sent-83, score-0.118]

23 Then for any x ∈ X, the linear functional T : H → R defined as T(f) := f(x) is bounded since |f(x)| = |⟨f, k(x, ·)⟩| ≤ ‖k(x, ·)‖ · ‖f‖ = k(x, x)^{1/2} ‖f‖ by the Cauchy-Schwarz inequality. [sent-84, score-0.219]
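
A quick numerical illustration of this bound (our own sketch, not from the paper), assuming f is a finite kernel expansion f = Σi αi k(xi, ·), so that ‖f‖^2 = αᵀKα:

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    d = np.asarray(x1) - np.asarray(x2)
    return np.exp(-np.dot(d, d) / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # expansion points
alpha = rng.normal(size=10)           # expansion coefficients
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
f_norm = np.sqrt(alpha @ K @ alpha)   # ||f||_H for f = sum_i alpha_i k(x_i, .)

x = rng.normal(size=3)                # evaluation point
f_x = sum(a * gaussian_kernel(xi, x) for a, xi in zip(alpha, X))
bound = np.sqrt(gaussian_kernel(x, x)) * f_norm
assert abs(f_x) <= bound + 1e-12      # |f(x)| <= k(x, x)^{1/2} ||f||
```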

24 Boundedness of linear functionals is particularly useful thanks to the Riesz representation theorem, which establishes their one-to-one correspondence with elements of V [29]. [sent-85, score-0.347]

25 Every bounded linear functional L on a Hilbert space V can be represented in terms of an inner product L(x) = ⟨x, z⟩ for all x ∈ V, where the representer of the functional, z ∈ V, has norm ‖z‖ = ‖L‖ and is uniquely determined by L. [sent-87, score-0.513]

26 For any functional L on H, the representer z can be constructed as z(x) = ⟨z, k(x, ·)⟩ = L(k(x, ·)) for all x ∈ X. [sent-90, score-0.324]

27 Using Riesz’s representer theorem, it is not hard to show that for any bounded linear operator T : V → V where V is Hilbertian, there exists a unique bounded linear operator T ∗ : V → V such that ⟨T x, y⟩ = ⟨x, T ∗ y⟩ for all x, y ∈ V. [sent-92, score-0.551]

28 Suppose the functional L has the form L(f ) = T (f )(x0 ), where x0 ∈ X and T : H → H is a bounded linear operator on H. [sent-95, score-0.268]

29 Then the representer of L is z = T ∗ (k(x0 , ·)) because ∀ x ∈ X, z(x) = L(k(x, ·)) = T (k(x, ·))(x0 ) = ⟨T (k(x, ·)), k(x0 , ·)⟩ = ⟨k(x, ·), T ∗ (k(x0 , ·))⟩. (1) [sent-96, score-0.221]

30 Riesz’s theorem will be useful for our framework since it allows us to compactly represent functionals related to local invariances as elements of the RKHS. [sent-97, score-0.783]

31 In SSL, we wish to learn a target function f both from labeled data and from local invariances extracted from labeled and unlabeled data. [sent-99, score-0.672]

32 Let (x1 , y1 ), . . . , (xl , yl ) be the labeled training data, and ℓ1 (f (x), y) be the loss function on f when the training input x is labeled as y. [sent-103, score-0.239]

33 We measure deviations from local invariances around each labeled or unlabeled input instance, and express them as bounded linear functionals Ll+1 (f ), . . . , Ll+m (f ). [sent-107, score-0.959]

34 The linear functionals are associated with another convex loss function ℓ2 (Li (f )) penalizing violations of the local invariances. [sent-111, score-0.413]

35 As an example, the derivative of f with respect to an input feature at some training instance x is a linear functional in f , and the loss function penalizes large values of the derivative at x using, e. [sent-112, score-0.259]

36 Section 4 describes other local invariances we can consider and shows that these can be expressed as bounded linear functionals. [sent-115, score-0.547]

37 Putting together the loss and regularizer, we set out to minimize the regularized risk functional over f ∈ H: min_{f∈H} (1/2)‖f‖^2 + λ Σ_{i=1}^{l} ℓ1 (f (xi ), yi ) + ν Σ_{i=l+1}^{l+m} ℓ2 (Li (f )), (2) where λ, ν > 0. [sent-117, score-0.192]

38 In order to derive an efficient optimization procedure, we now establish a representer theorem showing that the optimal solution lies in the span of a finite number of functions associated with the labeled data and the representers of the functionals Li . [sent-120, score-0.786]

39 Let Li (i = l + 1, . . . , l + m) be bounded linear functionals on H with representers zi . [sent-127, score-0.607]

40 Here K̂ij = ⟨k̂i , k̂j ⟩, where k̂i = k(xi , ·) if i ≤ l, and k̂i = zi (·) otherwise. [sent-132, score-0.14]
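
To make the reduction concrete, here is a hedged sketch of how the finite-dimensional problem might be set up once K̂ is available: for f = Σi βi k̂i, both f(xi) and Li(f) are entries of K̂β, so the objective in (2) becomes a convex function of the coefficient vector β. The helper names, the random PSD stand-in for K̂, and the choice of logistic/squared losses are our own assumptions, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def reduced_objective(beta, K_hat, y, l, lam, nu):
    """Objective of (2) restricted to f = sum_i beta_i k_hat_i, so that
    f(x_i) = (K_hat @ beta)_i for i < l and L_i(f) = (K_hat @ beta)_i for i >= l."""
    v = K_hat @ beta
    reg = 0.5 * beta @ v                            # 0.5 ||f||^2
    labeled = np.logaddexp(0.0, -y * v[:l]).sum()   # logistic loss for l1
    invar = np.square(v[l:]).sum()                  # squared loss for l2
    return reg + lam * labeled + nu * invar

# Toy stand-in: any symmetric PSD K_hat of size (l+m) x (l+m), labels y in {-1, +1}.
rng = np.random.default_rng(0)
l, m = 4, 6
B = rng.normal(size=(l + m, l + m))
K_hat = B @ B.T
y = np.array([1, 1, -1, -1])
res = minimize(reduced_objective, np.zeros(l + m),
               args=(K_hat, y, l, 1.0, 0.1), method="L-BFGS-B")
beta_opt = res.x
```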

41 However, our model is quite different in that it uses the representers of the linear functionals corresponding to the invariance, rather than virtual samples drawn from the invariant neighborhood. [sent-135, score-0.603]

42 This could result in more compact models because the invariance (e. [sent-136, score-0.236]

43 The efficient computation of ⟨zi , zj ⟩ depends on the specific kernels and invariances, as we will show in Section 4. [sent-143, score-0.184]

44 In the simplest case, consider the commonly used graph Laplacian regularizer [10, 11] which, given the similarity measure wij between xi and xj , can be written as Σij wij (f (xi ) − f (xj ))^2 = Σij wij (Lij (f ))^2 , where Lij (f ) = ⟨f, k(xi , ·) − k(xj , ·)⟩ is linear and bounded. [sent-144, score-0.334]

45 Then ⟨zij , zpq ⟩ = k(xi , xp ) + k(xj , xq ) − k(xj , xp ) − k(xi , xq ). [sent-145, score-0.114]
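
A small sanity-check sketch of this formula (our own code with illustrative names; the kernel choice is arbitrary), using the fact that ⟨zij, zpq⟩ should equal Lpq(zij) = zij(xp) − zij(xq):

```python
import numpy as np

def k(a, b, sigma=0.5):
    d = np.asarray(a) - np.asarray(b)
    return np.exp(-d @ d / (2 * sigma**2))

def z_ij(x, xi, xj):
    """Representer of L_ij: z_ij = k(xi, .) - k(xj, .), evaluated at x."""
    return k(xi, x) - k(xj, x)

def z_inner(xi, xj, xp, xq):
    """<z_ij, z_pq> as given in the text."""
    return k(xi, xp) + k(xj, xq) - k(xj, xp) - k(xi, xq)

rng = np.random.default_rng(1)
xi, xj, xp, xq = rng.normal(size=(4, 2))
lhs = z_inner(xi, xj, xp, xq)
rhs = z_ij(xp, xi, xj) - z_ij(xq, xi, xj)   # L_pq applied to z_ij
assert np.isclose(lhs, rhs)
```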

46 Then ⟨zi , zj ⟩ = Li (zj ) = Ti (zj )(xi ) = ⟨Ti (zj ), k(xi , ·)⟩ = ⟨zj , Ti∗ (k(xi , ·))⟩ = Tj (Ti∗ (k(xi , ·)))(xj ). [sent-147, score-0.177]

47 A similar representer theorem can be established (see Appendix A). [sent-150, score-0.255]

48 4 Local Invariances as Bounded Linear Functionals Interestingly, many useful local invariances can be modeled as bounded linear functionals. [sent-151, score-0.573]

49 In general, boundedness hinges on the functional and the RKHS H. [sent-153, score-0.14]

50 When H is finite dimensional, such as that induced by linear or polynomial kernels, all linear functionals on H must be bounded: Theorem 3 ([29, Thm 2. [sent-154, score-0.375]

51 Linear functionals on a finite-dimensional normed space are bounded. [sent-156, score-0.303]

52 Then we are interested in linear functionals Lxi,d (f ) := ∂f (x)/∂x^d |x=xi , where x^d stands for the d-th component of the vector x. [sent-163, score-0.421]

53 Let us denote the representer of Lxi,d as zi,d . [sent-169, score-0.14]

54 The inner product between representers can be computed by definition: ⟨zi,d , zj,d′ ⟩ = ∂/∂y^{d′} zi,d (y)|y=xj = ∂/∂y^{d′} ⟨zi,d , k(y, ·)⟩|y=xj = ∂^2 k(x, y)/(∂x^d ∂y^{d′})|x=xi, y=xj . (6) [sent-172, score-0.223]

55 Applying (6) to the polynomial kernel k(x, y) = (⟨x, y⟩ + 1)^r , we derive ⟨zi,d , zj,d′ ⟩ = r(⟨xi , xj ⟩ + 1)^{r−2} [(r − 1) xi^{d′} xj^{d} + (⟨xi , xj ⟩ + 1) δ_{d=d′} ], (7) where δ_{d=d′} = 1 if d = d′ , and 0 otherwise. [sent-174, score-0.651]

56 So applying (5), ⟨zi,d , zj,d′ ⟩ = ∂/∂y^{d′} [−(xi^d − y^d )/σ^2 · k(xi , y)]|y=xj = k(xi , xj )/σ^4 · [σ^2 δ_{d=d′} − (xi^d − xj^d )(xi^{d′} − xj^{d′} )]. [sent-177, score-0.37]
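
The two closed forms above can be checked numerically against finite differences of the kernel. The sketch below is ours, with illustrative function names, and assumes that the inner product equals the mixed second derivative of k evaluated at (xi, xj), as in (6).

```python
import numpy as np

def poly_k(x, y, r):
    return (x @ y + 1.0) ** r

def gauss_k(x, y, sigma):
    d = x - y
    return np.exp(-d @ d / (2 * sigma**2))

def poly_z_inner(xi, xj, d, dp, r):
    """Closed form for <z_{i,d}, z_{j,d'}> with the degree-r polynomial kernel (cf. (7))."""
    s = xi @ xj + 1.0
    return r * s ** (r - 2) * ((r - 1) * xi[dp] * xj[d] + s * (d == dp))

def gauss_z_inner(xi, xj, d, dp, sigma):
    """Closed form for <z_{i,d}, z_{j,d'}> with the Gaussian kernel."""
    diff = xi - xj
    return gauss_k(xi, xj, sigma) / sigma**4 * (sigma**2 * (d == dp) - diff[d] * diff[dp])

def fd_second_derivative(k, xi, xj, d, dp, eps=1e-5):
    """Finite-difference estimate of d^2 k(x, y) / dx^d dy^dp at (xi, xj)."""
    def g(y):  # derivative in x^d by central differences
        e = np.zeros_like(xi)
        e[d] = eps
        return (k(xi + e, y) - k(xi - e, y)) / (2 * eps)
    e = np.zeros_like(xj)
    e[dp] = eps
    return (g(xj + e) - g(xj - e)) / (2 * eps)

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=(2, 3))
assert np.isclose(poly_z_inner(xi, xj, 0, 2, r=4),
                  fd_second_derivative(lambda a, b: poly_k(a, b, 4), xi, xj, 0, 2), atol=1e-4)
assert np.isclose(gauss_z_inner(xi, xj, 1, 1, sigma=0.8),
                  fd_second_derivative(lambda a, b: gauss_k(a, b, 0.8), xi, xj, 1, 1), atol=1e-4)
```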

57 Here we show that transformation invariance can be handled in our framework via representers in RKHS. [sent-181, score-0.482]

58 In particular, gradients with respect to the transformations are bounded linear functionals. [sent-182, score-0.155]

59 Suppose the linear functional that maps f ∈ F to ∂f (u)/∂u^j |u=u0 is bounded for any u0 and coordinate j, and its norm is denoted as Cj . [sent-201, score-0.287]

60 derivatives with respect to each of the components in α, must be bounded linear functionals on F. [sent-204, score-0.423]

61 The derivatives ∂f (I(g, α; S))/∂α^d |α=0 with respect to each of the components in α are bounded linear functionals on the RKHS defined by the polynomial and Gaussian kernels. [sent-210, score-0.456]

62 To compute the inner product between representers, let zg,d,S be the representer of Lg,d,S and denote vg,d,S = ∂I(g, α; S)/∂α^d |α=0 . [sent-211, score-0.258]
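
Since the chain rule gives Lg,d,S(f) = ⟨∇f(S), vg,d,S⟩, linearity of the representer suggests ⟨zg,d,S , zg′,d′,S′⟩ = vg,d,Sᵀ [∂^2 k(x, y)/∂x∂y |x=S, y=S′] vg′,d′,S′, i.e. a quadratic form in the cross-Hessian of the kernel. The sketch below illustrates this identity with a Gaussian kernel and a toy rotation of a 2-D point; the transformation, names, and numerical check are our own assumptions, not the paper's image model.

```python
import numpy as np

sigma = 0.7
def k(x, y):
    d = x - y
    return np.exp(-d @ d / (2 * sigma**2))

def cross_hessian(x, y):
    """d^2 k / dx dy for the Gaussian kernel: k(x,y)/sigma^4 * (sigma^2 I - (x-y)(x-y)^T)."""
    d = (x - y)[:, None]
    return k(x, y) / sigma**4 * (sigma**2 * np.eye(len(x)) - d @ d.T)

# Toy transformation: rotate a 2-D point S by angle alpha, I(alpha; S) = R(alpha) S,
# so v = dI/dalpha|_0 = [[0, -1], [1, 0]] S. (Illustrative choice only.)
def transform(alpha, S):
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s], [s, c]]) @ S

S, Sp = np.array([1.0, 0.3]), np.array([-0.4, 0.8])
J = np.array([[0.0, -1.0], [1.0, 0.0]])
v, vp = J @ S, J @ Sp

# Inner product of the two representers via the kernel cross-Hessian.
inner = v @ cross_hessian(S, Sp) @ vp

# Check against a nested finite difference of (alpha, alpha') -> k(I(alpha; S), I(alpha'; S')).
eps = 1e-4
def dk_dalpha(alpha_p):
    return (k(transform(eps, S), transform(alpha_p, Sp))
            - k(transform(-eps, S), transform(alpha_p, Sp))) / (2 * eps)
fd = (dk_dalpha(eps) - dk_dalpha(-eps)) / (2 * eps)
assert np.isclose(inner, fd, atol=1e-4)
```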

63 3 Local Averaging Using gradients to enforce the local invariance that the target function does not change much around data instances increases the number of basis functions by a factor of n, where n is the number of gradient directions that we use. [sent-218, score-0.395]

64 When we do not have useful information about the invariant directions, it may be useful to have methods that do not increase the number of basis functions by much. [sent-220, score-0.121]

65 Consider functionals Lxi (f ) = ∫X f (τ )p(xi − τ )dτ − f (xi ), (10) where p(·) is a probability density function centered at zero. [sent-221, score-0.286]

66 Minimizing a loss with such linear functionals will favor functions whose local averages given by the integral are close to the function values at data instances. [sent-222, score-0.389]
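
As an illustration of how the representer of (10) could be evaluated, the sketch below assumes p is an isotropic Gaussian N(0, ρ²I) together with a Gaussian kernel, and compares a Monte Carlo estimate of z_xi(x) = ∫ k(x, τ)p(xi − τ)dτ − k(x, xi) with the closed form obtained from the standard Gaussian convolution identity. This is our own example, not a formula from the paper.

```python
import numpy as np

sigma, rho, n = 0.6, 0.3, 2   # kernel bandwidth, smoothing width, input dimension

def k(x, y):
    d = x - y
    return np.exp(-d @ d / (2 * sigma**2))

def z_monte_carlo(x, x_i, n_samples=200000):
    """z_{x_i}(x) = E_{tau}[k(x, tau)] - k(x, x_i), with tau ~ N(x_i, rho^2 I) since p is symmetric."""
    rng = np.random.default_rng(0)
    tau = x_i + rho * rng.standard_normal((n_samples, n))
    vals = np.exp(-np.sum((x - tau) ** 2, axis=1) / (2 * sigma**2))
    return vals.mean() - k(x, x_i)

def z_closed_form(x, x_i):
    """Closed form when p = N(0, rho^2 I), via the Gaussian convolution identity."""
    s2 = sigma**2 + rho**2
    d = x - x_i
    return (sigma**2 / s2) ** (n / 2) * np.exp(-d @ d / (2 * s2)) - k(x, x_i)

x_i = np.array([0.2, -0.1])
x = np.array([0.5, 0.4])
assert abs(z_monte_carlo(x, x_i) - z_closed_form(x, x_i)) < 1e-2
```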

67 Hence, such loss functions may be appropriate when we believe that data instances from the same class are clustered together. [sent-224, score-0.121]

68 To use the framework we have developed, we need to select the probability density p(·) and the kernel k such that Lxi (f ) is a bounded linear functional. [sent-225, score-0.236]

69 Then the linear functional Lxi in (10) is bounded in the RKHS defined by k for any probability density p(·). [sent-228, score-0.247]

70 As a result, radial kernels such as Gaussian and exponential kernels make Lxi bounded. [sent-230, score-0.144]

71 k(x, x) is not bounded for the polynomial kernel, but that case is covered by Theorem 3. [sent-231, score-0.12]

72 To allow for efficient implementation, the inner product between representers of invariances must be computed efficiently. [sent-232, score-0.617]

73 In the batch setting, the major challenge is the cost of computing the gradient when the number of invariances is large. [sent-240, score-0.302]

74 For example, consider all derivatives at all labeled examples x1 , . . . , xl . [sent-241, score-0.136]

75 When the number of invariances is huge, a batch solver can be slow and coordinate-wise updates can be more efficient. [sent-263, score-0.265]

76 in (semi-)supervised learning using invariance to differentiation and transformation. [sent-318, score-0.326]

77 1 Invariance to Differentiation: Transduction on Two Moon Dataset As a proof of concept, we experimented with the “two moon” dataset shown in Figure 1, with only l = 2 labeled data instances (red circle and black square). [sent-320, score-0.143]

78 25 and the gradient invariances on both labeled and unlabeled data. [sent-322, score-0.585]

79 The losses ℓ1 and ℓ2 are logistic and squared loss respectively, with λ = 1, and ν ∈ {0, 0. [sent-323, score-0.13]

80 In Figure 1, from left to right our method lays more and more emphasis on placing the separation boundary in low density regions, which allows unlabeled data to improve the classification accuracy. [sent-327, score-0.119]

81 We also tried hinge loss for ℓ1 and ε-insensitive loss for ℓ2 with similar classification results. [sent-328, score-0.161]

82 We trained InvSVM with hinge loss for ℓ1 and squared loss for ℓ2 . [sent-336, score-0.161]

83 The differentiation invariance was used over the whole dataset, i. [sent-337, score-0.326]

84 For each fold of CV, InvSVM and LapSVM used the other 4 folds (4l/5 points) as labeled data, and the remaining t − 4l/5 points as unlabeled data. [sent-346, score-0.12]

85 The lower error of InvSVM suggests the superiority of using the gradient information, which is enabled by our representer-based approach. [sent-353, score-0.246]

86 3 Transformation Invariance Next we study the use of transformation invariance for supervised learning. [sent-357, score-0.296]

87 We used the handwritten digits from the MNIST dataset [38] and compared InvSVM with the virtual sample SVM (VirSVM) which constructs additional instances by applying the following transformations to the training data: 2-pixel shifts in 4 directions, rotations by ±10 degrees, scaling by ±0. [sent-358, score-0.156]

88 Figure 2: Test error (in percentage) of InvSVM versus SVM with virtual samples; axes show the error of VirSVM, and panel (d) is 7 (pos) vs 1 (neg). [sent-531, score-0.119]

89 As the real distribution of digits is imbalanced and invariance is more useful when the number of labeled data is low, we randomly chose n+ = 50 labeled images for one class and n− = 10 images for the other. [sent-533, score-0.515]

90 Logistic loss and ε-insensitive loss were used for ℓ1 and ℓ2 respectively. [sent-535, score-0.13]

91 7 Conclusion and Discussion We have shown how to model local invariances by using representers in RKHS. [sent-539, score-0.617]

92 This subsumes a wide range of invariances that are useful in practice, and the formulation can be optimized efficiently. [sent-540, score-0.42]

93 A potential application is to the problem of imbalanced data learning, where one wishes to keep the decision boundary further away from instances of the minority class and closer to the instances of majority class. [sent-542, score-0.165]

94 It will also be interesting to generalize the framework to invariances that are not directly linear. [sent-543, score-0.394]

95 For example, the total variation TV(f ) := ∫_a^b |f ′(x)| dx (a, b ∈ R) is not linear, but we can treat the integral of the absolute value as a loss function and take the derivative as the linear functional: Lx (f ) = f ′(x) and TV(f ) = ∫_a^b |Lx (f )| dx. [sent-544, score-0.125]

96 Finally, the norm of the linear functional does affect the optimization efficiency, and a detailed analysis will be useful for choosing the linear functionals. [sent-546, score-0.223]

97 Lie transformation groups, integral transforms, and invariant pattern recognition. [sent-557, score-0.129]

98 Transformation invariance in pattern recognition - tangent distance and tangent propagation. [sent-565, score-0.262]

99 Learning from labeled and unlabeled data using graph mincuts. [sent-599, score-0.154]

100 An introduction to model building with reproducing kernel Hilbert spaces. [sent-677, score-0.149]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('invsvm', 0.433), ('invariances', 0.394), ('functionals', 0.258), ('rkhs', 0.239), ('invariance', 0.236), ('representer', 0.221), ('lapsvm', 0.206), ('lxi', 0.206), ('representers', 0.186), ('xd', 0.134), ('virsvm', 0.124), ('functional', 0.103), ('xj', 0.102), ('xi', 0.094), ('kernel', 0.092), ('differentiation', 0.09), ('labeled', 0.087), ('bounded', 0.087), ('kernels', 0.072), ('svm', 0.071), ('invariant', 0.069), ('riesz', 0.067), ('neg', 0.067), ('unlabeled', 0.067), ('zj', 0.065), ('loss', 0.065), ('virtual', 0.061), ('pos', 0.06), ('transformation', 0.06), ('rotation', 0.06), ('xq', 0.057), ('reproducing', 0.057), ('instances', 0.056), ('moon', 0.055), ('sch', 0.052), ('hilbert', 0.051), ('cv', 0.05), ('derivatives', 0.049), ('operator', 0.049), ('zi', 0.047), ('xl', 0.045), ('normed', 0.045), ('rnl', 0.041), ('transformations', 0.039), ('laplacian', 0.039), ('local', 0.037), ('inner', 0.037), ('boundedness', 0.037), ('gradient', 0.037), ('norm', 0.036), ('chapelle', 0.035), ('compactly', 0.034), ('theorem', 0.034), ('ssl', 0.034), ('polynomial', 0.033), ('fold', 0.033), ('vs', 0.033), ('cj', 0.032), ('coordinate', 0.032), ('lij', 0.032), ('lx', 0.032), ('niyogi', 0.032), ('hinge', 0.031), ('ki', 0.031), ('ti', 0.031), ('derivative', 0.031), ('batch', 0.029), ('nite', 0.029), ('directions', 0.029), ('translation', 0.029), ('linear', 0.029), ('imbalanced', 0.029), ('rs', 0.029), ('oracles', 0.029), ('density', 0.028), ('wij', 0.028), ('teo', 0.028), ('bundle', 0.028), ('lkopf', 0.027), ('program', 0.026), ('support', 0.026), ('smola', 0.026), ('induced', 0.026), ('useful', 0.026), ('sq', 0.026), ('tangent', 0.026), ('kij', 0.025), ('discriminant', 0.025), ('transductive', 0.025), ('nl', 0.025), ('gaussian', 0.025), ('images', 0.025), ('encoded', 0.025), ('li', 0.025), ('commonly', 0.025), ('error', 0.025), ('boundary', 0.024), ('penalize', 0.024), ('risk', 0.024), ('convex', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

Author: Xinhua Zhang, Wee Sun Lee, Yee Whye Teh

Abstract: Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are compactly encoded as linear functionals whose values are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. 1

2 0.18119504 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

Author: Qichao Que, Mikhail Belkin

Abstract: We address the problem of estimating the ratio q/p where p is a density function and q is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1

3 0.12272844 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

Author: Le Song, Bo Dai

Abstract: Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernels embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. 1

4 0.10121352 166 nips-2013-Learning invariant representations and applications to face verification

Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio

Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformationinvariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identitypreserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well. 1

5 0.096300319 75 nips-2013-Convex Two-Layer Modeling

Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans

Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1

6 0.090095066 137 nips-2013-High-Dimensional Gaussian Process Bandits

7 0.083899789 156 nips-2013-Learning Kernels Using Local Rademacher Complexity

8 0.07981921 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

9 0.078183368 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

10 0.077099971 9 nips-2013-A Kernel Test for Three-Variable Interactions

11 0.065534726 171 nips-2013-Learning with Noisy Labels

12 0.064820945 173 nips-2013-Least Informative Dimensions

13 0.062685274 211 nips-2013-Non-Linear Domain Adaptation with Boosting

14 0.060705639 158 nips-2013-Learning Multiple Models via Regularized Weighting

15 0.060081534 149 nips-2013-Latent Structured Active Learning

16 0.059753422 224 nips-2013-On the Sample Complexity of Subspace Learning

17 0.058495574 289 nips-2013-Scalable kernels for graphs with continuous attributes

18 0.054619506 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

19 0.053215239 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

20 0.052704494 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.181), (1, 0.063), (2, 0.022), (3, -0.003), (4, 0.052), (5, 0.031), (6, -0.019), (7, 0.023), (8, -0.042), (9, 0.041), (10, -0.028), (11, -0.01), (12, -0.048), (13, -0.082), (14, 0.185), (15, 0.074), (16, -0.062), (17, 0.011), (18, -0.056), (19, -0.067), (20, -0.049), (21, 0.045), (22, 0.092), (23, 0.052), (24, -0.077), (25, 0.008), (26, 0.114), (27, 0.008), (28, 0.012), (29, 0.049), (30, 0.003), (31, 0.015), (32, 0.067), (33, -0.02), (34, -0.083), (35, -0.082), (36, -0.004), (37, 0.041), (38, -0.017), (39, 0.022), (40, 0.012), (41, 0.007), (42, -0.057), (43, 0.075), (44, -0.086), (45, -0.026), (46, -0.001), (47, -0.003), (48, -0.004), (49, -0.043)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91638571 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

Author: Xinhua Zhang, Wee Sun Lee, Yee Whye Teh

Abstract: Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are compactly encoded as linear functionals whose values are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. 1

2 0.83266032 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

Author: Qichao Que, Mikhail Belkin

Abstract: We address the problem of estimating the ratio q/p where p is a density function and q is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1

3 0.75601786 156 nips-2013-Learning Kernels Using Local Rademacher Complexity

Author: Corinna Cortes, Marius Kloft, Mehryar Mohri

Abstract: We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks. 1

4 0.74777395 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

Author: Michel Besserve, Nikos K. Logothetis, Bernhard Schölkopf

Abstract: Many applications require the analysis of complex interactions between time series. These interactions can be non-linear and involve vector valued as well as complex data structures such as graphs or strings. Here we provide a general framework for the statistical analysis of these dependencies when random variables are sampled from stationary time-series of arbitrary objects. To achieve this goal, we study the properties of the Kernel Cross-Spectral Density (KCSD) operator induced by positive definite kernels on arbitrary input domains. This framework enables us to develop an independence test between time series, as well as a similarity measure to compare different types of coupling. The performance of our test is compared to the HSIC test using i.i.d. assumptions, showing improvements in terms of detection errors, as well as the suitability of this approach for testing dependency in complex dynamical systems. This similarity measure enables us to identify different types of interactions in electrophysiological neural time series. 1

5 0.68717074 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko

Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1

6 0.6628077 9 nips-2013-A Kernel Test for Three-Variable Interactions

7 0.65806049 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions

8 0.62613887 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

9 0.61363649 76 nips-2013-Correlated random features for fast semi-supervised learning

10 0.56995523 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation

11 0.56280941 173 nips-2013-Least Informative Dimensions

12 0.5610072 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

13 0.52913886 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

14 0.52584106 244 nips-2013-Parametric Task Learning

15 0.5223397 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty

16 0.52026004 327 nips-2013-The Randomized Dependence Coefficient

17 0.51904631 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections

18 0.50610238 202 nips-2013-Multiclass Total Variation Clustering

19 0.50482297 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited

20 0.49533662 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.014), (13, 0.217), (16, 0.023), (19, 0.036), (33, 0.139), (34, 0.129), (36, 0.011), (41, 0.031), (49, 0.053), (56, 0.098), (70, 0.026), (85, 0.033), (89, 0.034), (93, 0.06), (95, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80702943 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

Author: Xinhua Zhang, Wee Sun Lee, Yee Whye Teh

Abstract: Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are compactly encoded as linear functionals whose values are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. 1

2 0.79574519 78 nips-2013-Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions

Author: Rishabh K. Iyer, Stefanie Jegelka, Jeff A. Bilmes

Abstract: We investigate three related and important problems connected to machine learning: approximating a submodular function everywhere, learning a submodular function (in a PAC-like setting [28]), and constrained minimization of submodular functions. We show that the complexity of all three problems depends on the “curvature” of the submodular function, and provide lower and upper bounds that refine and improve previous results [2, 6, 8, 27]. Our proof techniques are fairly generic. We either use a black-box transformation of the function (for approximation and learning), or a transformation of algorithms to use an appropriate surrogate function (for minimization). Curiously, curvature has been known to influence approximations for submodular maximization [3, 29], but its effect on minimization, approximation and learning has hitherto been open. We complete this picture, and also support our theoretical claims by empirical results. 1

3 0.78268456 294 nips-2013-Similarity Component Analysis

Author: Soravit Changpinyo, Kuan Liu, Fei Sha

Abstract: Measuring similarity is crucial to many learning tasks. To this end, metric learning has been the dominant paradigm. However, similarity is a richer and broader notion than what metrics entail. For example, similarity can arise from the process of aggregating the decisions of multiple latent components, where each latent component compares data in its own way by focusing on a different subset of features. In this paper, we propose Similarity Component Analysis (SCA), a probabilistic graphical model that discovers those latent components from data. In SCA, a latent component generates a local similarity value, computed with its own metric, independently of other components. The final similarity measure is then obtained by combining the local similarity values with a (noisy-)OR gate. We derive an EM-based algorithm for fitting the model parameters with similarity-annotated data from pairwise comparisons. We validate the SCA model on synthetic datasets where SCA discovers the ground-truth about the latent components. We also apply SCA to a multiway classification task and a link prediction task. For both tasks, SCA attains significantly better prediction accuracies than competing methods. Moreover, we show how SCA can be instrumental in exploratory analysis of data, where we gain insights about the data by examining patterns hidden in its latent components’ local similarity values. 1

4 0.75511205 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

Author: Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini

Abstract: We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bound targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies. 1

5 0.7288723 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints

Author: Rishabh K. Iyer, Jeff A. Bilmes

Abstract: We investigate two new optimization problems — minimizing a submodular function subject to a submodular lower bound constraint (submodular cover) and maximizing a submodular function subject to a submodular upper bound constraint (submodular knapsack). We are motivated by a number of real-world applications in machine learning including sensor placement and data subset selection, which require maximizing a certain submodular function (like coverage or diversity) while simultaneously minimizing another (like cooperative cost). These problems are often posed as minimizing the difference between submodular functions [9, 25] which is in the worst case inapproximable. We show, however, that by phrasing these problems as constrained optimization, which is more natural for many applications, we achieve a number of bounded approximation guarantees. We also show that both these problems are closely related and an approximation algorithm solving one can be used to obtain an approximation guarantee for the other. We provide hardness results for both problems thus showing that our approximation factors are tight up to log-factors. Finally, we empirically demonstrate the performance and good scalability properties of our algorithms. 1

6 0.72486252 201 nips-2013-Multi-Task Bayesian Optimization

7 0.71954137 173 nips-2013-Least Informative Dimensions

8 0.71461004 5 nips-2013-A Deep Architecture for Matching Short Texts

9 0.71111083 99 nips-2013-Dropout Training as Adaptive Regularization

10 0.7105999 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

11 0.70828396 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

12 0.7082203 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles

13 0.70748246 64 nips-2013-Compete to Compute

14 0.70676166 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

15 0.70671588 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

16 0.70626026 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

17 0.706204 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis

18 0.70612711 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

19 0.7056759 251 nips-2013-Predicting Parameters in Deep Learning

20 0.7052868 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits