nips nips2010 nips2010-280 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Meihong Wang, Fei Sha, Michael I. Jordan
Abstract: We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. We extend this idea and develop similarly motivated measures for unsupervised problems where covariates and responses are the same. Our empirical studies show that the resulting compact representation yields meaningful and appealing visualization and clustering of data. Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. [sent-12, score-0.567]
2 In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. [sent-13, score-0.313]
3 We extend this idea and develop similarly motivated measures for unsupervised problems where covariates and responses are the same. [sent-14, score-0.248]
4 Our empirical studies show that the resulting compact representation yields meaningful and appealing visualization and clustering of data. [sent-15, score-0.144]
5 Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions. [sent-16, score-0.227]
6 1 Introduction Dimensionality reduction is an important aspect of many statistical learning tasks. [sent-17, score-0.163]
7 In unsupervised dimensionality reduction, the primary interest is to preserve significant properties of the data in a low-dimensional representation. [sent-18, score-0.249]
8 In supervised dimensionality reduction, side information is available to influence the choice of the low-dimensional space. [sent-20, score-0.181]
9 To address this dilemma, there has been a growing interest in sufficient dimension reduction (SDR) and related techniques [5–8]. [sent-26, score-0.244]
10 This is ensured by requiring conditional independence among the three variables; i.e., Y ⊥⊥ X | B⊤X. [sent-28, score-0.176]
11 Recently, kernel methods have been adapted to this purpose. [sent-32, score-0.183]
12 In particular, kernel dimension reduction (KDR) develops a kernel-based contrast function that measures the degree of conditional independence [7]. [sent-33, score-0.588]
13 In this paper we show how the KDR framework can be used in the setting of unsupervised learning. [sent-36, score-0.122]
14 By exploiting the SDR and KDR frameworks, on the other hand, we can cast the unsupervised learning problem within a general nonparametric framework involving conditional independence, and in particular as one of optimizing kernel-based measures of independence. [sent-39, score-0.223]
15 We refer to this approach as “unsupervised kernel dimension reduction” (UKDR). [sent-40, score-0.31]
16 As we will show in an empirical investigation, the UKDR approach works well in practice, comparing favorably to other techniques for unsupervised dimension reduction. [sent-41, score-0.203]
17 We also provide some interesting analytical links of the UKDR approach to stochastic neighbor embedding (SNE) and t-distributed SNE (t-SNE) [14, 15]. [sent-43, score-0.189]
18 In Section 2, we review the SDR framework and discuss how kernels can be used to solve the SDR problem. [sent-45, score-0.113]
19 We show how the kernel-based approach can be used for unsupervised dimensionality reduction in Section 3. [sent-47, score-0.412]
20 2 Sufficient dimension reduction and measures of independence with kernels Discovering statistical (in)dependencies among random variables is a classical problem in statistics; examples of standard measures include Spearman’s ρ, Kendall’s τ and Pearson’s χ2 tests. [sent-54, score-0.66]
21 Recently, there has been growing interest in computing measures of independence in Reproducing Kernel Hilbert spaces (RKHSs) [7, 16]. [sent-55, score-0.231]
22 In particular, the resulting independence measures attain minimum values when random variables are independent. [sent-57, score-0.207]
23 These methods were originally developed in the context of independent component analysis [17] and have found applications in a variety of other problems, including clustering, feature selection, and dimensionality reduction [7, 8, 18–21]. [sent-58, score-0.29]
24 We will be applying these approaches to unsupervised dimensionality reduction. [sent-59, score-0.249]
25 To this end, we first describe briefly kernel-based measures of (conditional) independence, focusing on how they are applied to supervised dimensionality reduction. [sent-61, score-0.247]
26 1 Kernel dimension reduction for supervised learning In supervised dimensionality reduction for classification and regression, the response variable, Y ∈ Y, provides side information about the covariates, X ∈ X . [sent-63, score-0.642]
27 This problem is referred to as sufficient dimension reduction (SDR) in statistics, where it has been the subject of a large literature [22]. [sent-69, score-0.244]
28 Of special interest is the technique of kernel dimension reduction (KDR) that is based on assessing conditional independence in RKHS spaces [7]. [sent-73, score-0.57]
29 Concretely, we map the two variables X and Y to the RKHS spaces F and G induced by two positive semidefinite kernels K_X : X × X → R and K_Y : Y × Y → R. [sent-74, score-0.137]
30 The conditional covariance operator has an important property: for any projection B, C^B_{YY|X} ≥ C_{YY|X}, where the (partial) order is defined in terms of the trace operator. [sent-77, score-0.204]
31 Concretely, with N samples drawn from P(X, Y), we compute the corresponding kernel matrices K_{B⊤X} and K_Y. [sent-81, score-0.213]
32 The trace of the estimated conditional covariance operator C^B_{YY|X} is then defined as follows: Ĵ_{YY|X}(B⊤X, Y) = Trace[G_Y (G_{B⊤X} + N ε_N I_N)^{-1}], (3) where G_Y = H K_Y H and G_{B⊤X} = H K_{B⊤X} H. [sent-83, score-0.139]
33 ε_N is a regularizer, smoothing the kernel matrix. [sent-84, score-0.183]
34 The minimizer of the conditional independence measure yields the optimal projection B for kernel dimension reduction: B̂_{YY|X} = arg min_{B⊤B=I} Ĵ_{YY|X}(B⊤X, Y). (4) [sent-86, score-0.227]
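To make eqs. (3)-(4) concrete, the following is a minimal numpy sketch of the empirical KDR objective. The Gaussian kernel, its bandwidths sigma_x/sigma_y, and the regularizer value eps are illustrative assumptions rather than the paper's exact settings, and Y is assumed to be an N × d_y array.

```python
import numpy as np

def gaussian_kernel(A, sigma=1.0):
    """N x N Gaussian kernel matrix over the rows of A."""
    sq = np.sum(A * A, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T     # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kdr_objective(B, X, Y, sigma_x=1.0, sigma_y=1.0, eps=1e-3):
    """Empirical Trace[G_Y (G_{B^T X} + N*eps*I)^{-1}] of eq. (3)."""
    N = X.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N                # centering matrix
    G_y = H @ gaussian_kernel(Y, sigma_y) @ H
    G_z = H @ gaussian_kernel(X @ B, sigma_x) @ H      # centered kernel over B^T X
    return np.trace(G_y @ np.linalg.inv(G_z + N * eps * np.eye(N)))
```

Minimizing kdr_objective over orthonormal B gives the estimate in eq. (4); a sketch of the constrained optimization appears in the numerical-optimization section below.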
35 We defer discussion on choosing kernels as well as numerical optimization to later sections. [sent-87, score-0.113]
36 Indeed, another kernel-based measure of independence that can be optimized in the KDR context is the Hilbert-Schmidt Independence Criterion (HSIC) [16]. [sent-91, score-0.141]
37 It has been shown that for universal kernels such as Gaussian kernels the Hilbert-Schmidt norm of CXY is zero if and only if X and Y are independent [16]. [sent-93, score-0.226]
38 Given N samples from P(X, Y), the empirical estimate of HSIC is given by (up to a multiplicative constant): Ĵ_{XY}(X, Y) = Trace[H K_X H K_Y], (6) where K_X and K_Y are R^{N×N} kernel matrices computed over X and Y respectively. [sent-94, score-0.213]
39 To apply this independence measure to dimensionality reduction, we seek a projection B which maximizes Ĵ_{XY}(B⊤X, Y), such that the low-dimensional coordinates Z = B⊤X are maximally dependent on Y: B̂_{XY} = arg max_{B⊤B=I} Ĵ_{XY}(B⊤X, Y) = arg max_{B⊤B=I} Trace[H K_{B⊤X} H K_Y]. (7) [sent-95, score-0.44]
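A matching sketch of the empirical HSIC score of eqs. (6)-(7), reusing gaussian_kernel from the snippet above; the bandwidths are again illustrative assumptions.

```python
def hsic_objective(B, X, Y, sigma_x=1.0, sigma_y=1.0):
    """Empirical Trace[H K_{B^T X} H K_Y] of eqs. (6)-(7)."""
    N = X.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    K_z = gaussian_kernel(X @ B, sigma_x)    # kernel over the projected data B^T X
    K_y = gaussian_kernel(Y, sigma_y)        # kernel over the responses Y
    return np.trace(H @ K_z @ H @ K_y)
```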
40 It is interesting to note that the independence measures in eq. [sent-96, score-0.207]
41 Furthermore, JXY is slightly easier to use in practice as it does not depend on regularization parameters to smooth the kernel matrices. [sent-109, score-0.183]
42 The HSIC measure Ĵ_{XY} is also closely related to the technique of kernel alignment, which minimizes the angle between (vectorized) kernel matrices K_X and K_Y [23]. [sent-110, score-0.442]
43 The alignment technique has been used for clustering data X by assigning cluster labels Y so that the two kernel matrices are maximally aligned. [sent-112, score-0.337]
44 While both J_{YY|X} and J_{XY} have been used for supervised dimensionality reduction with known values of Y, they have not yet been applied to unsupervised dimensionality reduction, which is the direction that we pursue here. [sent-114, score-0.593]
45 3 Unsupervised kernel dimension reduction In unsupervised dimensionality reduction, the low-dimensional representation Z can be viewed as a compression of X. [sent-115, score-0.676]
46 In this section, we describe how this can be done, viewing unsupervised dimensionality reduction as a special type of supervised regression problem. [sent-120, score-0.494]
47 1 Linear unsupervised kernel dimension reduction Given a random variable X ∈ R^D, we consider the regression problem X̃ = f(B⊤X), where X̃ is a copy of X and Z = B⊤X ∈ R^M is the low-dimensional representation of X. [sent-124, score-0.577]
48 With a set of N samples from P(X), the linear projection B can be identified as the minimizer of the following kernel-based measure of independence: min_{B⊤B=I} Ĵ_{XX|B⊤X} = Trace[G_X (G_{B⊤X} + N ε_N I)^{-1}], (9) where G_X and G_{B⊤X} are centralized kernel matrices of K_X and K_{B⊤X} respectively. [sent-129, score-0.419]
49 (10) We refer collectively to this kernel-based dimension reduction method as linear unsupervised KDR (UKDR) and we use Ĵ(B⊤X, X) as a shorthand for the independence measure to be either minimized or maximized. [sent-131, score-0.507]
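In code, the unsupervised objectives are obtained from the sketches above simply by using a copy of X as the response; X (an N × D data matrix) and an orthonormal B (D × M) are assumed to be given.

```python
# Unsupervised UKDR objectives (a sketch): the "response" is X itself.
J_cond = kdr_objective(B, X, X)     # eq. (9): to be minimized over B
J_dep  = hsic_objective(B, X, X)    # HSIC-style variant: to be maximized over B
```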
50 2 Nonlinear unsupervised kernel dimension reduction For data with complicated multimodal distributions, linear transformation of the inputs X is unlikely to be sufficiently flexible to reveal useful structures. [sent-133, score-0.602]
51 For the purpose of better data visualization and exploratory data analysis, we describe several simple yet effective nonlinear extensions to linear UKDR. [sent-135, score-0.134]
52 The main idea is to find a linear subspace embedding of nonlinearly transformed X. [sent-136, score-0.181]
53 Our choice of random matrix W is motivated by earlier work in neural networks with an infinite number of hidden units, and recent work in large-scale kernel machines and deep learning kernels [24–26]. [sent-148, score-0.296]
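As one plausible instantiation of such a random nonlinear transform, here is a random-Fourier-feature sketch; the RBN and RSF transforms used in the paper are in the same spirit, but their exact form (radial basis or sparse Heaviside units) is not reproduced here, and n_features, sigma, and seed are arbitrary choices.

```python
def random_features(X, n_features=500, sigma=1.0, seed=0):
    """Random nonlinear expansion of X (random Fourier features as a stand-in)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))  # random directions
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)                # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# The linear projection B is then learned on Phi = random_features(X) rather than on X.
```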
54 3 Choice of kernels The independence measures Ĵ(B⊤X, X) are defined via kernels over B⊤X and X. [sent-156, score-0.433]
55 We have also experimented with other types of kernels; in particular we have found the following kernels to be of particular interest. [sent-158, score-0.113]
56 The kernel in eq. (11), when properly normalized, can be seen as the probability of a random walk from x_i to x_j: p_ij = P(x_i → x_j) = exp{−‖x_i − x_j‖²/σ_i²} / Σ_{j′≠i} exp{−‖x_i − x_{j′}‖²/σ_i²}. [sent-164, score-0.426]
57 Thus K_X(x_i, x_j) = Σ_k p_ik p_jk measures the similarity between x_i and x_j in terms of these local structures. [sent-168, score-0.218]
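A sketch of this random-walk kernel follows; for simplicity a single global bandwidth sigma replaces the per-point σ_i.

```python
def random_walk_kernel(X, sigma=1.0):
    """K_X = P P^T, so K_X(x_i, x_j) = sum_k p_ik p_jk with p_ij as above."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    A = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(A, 0.0)                      # exclude self-transitions (j != i)
    P = A / A.sum(axis=1, keepdims=True)          # random-walk probabilities p_ij
    return P @ P.T
```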
58 A Cauchy kernel is a positive semidefinite kernel and is given by C(u, v) = 1/(1 + ‖u − v‖²) = exp{−log(1 + ‖u − v‖²)}. (14) [sent-170, score-0.366]
59 We define K_{B⊤X}(x_i, x_j) = C(B⊤x_i, B⊤x_j). [sent-171, score-0.152]
60 Intuitively, the Cauchy kernel can be viewed as a Gaussian kernel in the transformed space φ(B ⊤X) such that φ(xi )⊤φ(xj ) = C(xi , xj ) [27]. [sent-172, score-0.454]
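A corresponding sketch of the Cauchy kernel of eq. (14), evaluated on the projected points Z = B⊤X.

```python
def cauchy_kernel(Z):
    """Cauchy kernel C(z_i, z_j) = 1 / (1 + ||z_i - z_j||^2) over the rows of Z."""
    sq = np.sum(Z * Z, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return 1.0 / (1.0 + d2)
```

Its heavy tails behave much like the Student-t similarities used by t-SNE, which is the connection discussed next.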
61 These two types of kernels are closely related to t-distributed stochastic neighbor embedding (tSNE), a state-of-the-art technique for dimensionality reduction [15]. [sent-173, score-0.616]
62 4 Numerical optimization We apply gradient-based techniques (with line search) to optimize either independence measure. [sent-176, score-0.141]
63 The techniques constrain the projection matrix B to lie on the Grassmann-Stiefel manifold. [sent-177, score-0.093]
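A minimal sketch of one constrained update is shown below: a Euclidean gradient step on J(B) followed by a QR retraction back onto the set of matrices with orthonormal columns. The line search and the exact manifold machinery used in the paper are omitted, and the step size is an arbitrary choice.

```python
def stiefel_step(B, grad_J, step=0.1, maximize=True):
    """One gradient step on J(B), retracted so that B^T B = I still holds."""
    B_new = B + (step if maximize else -step) * grad_J
    Q, R = np.linalg.qr(B_new)                 # QR retraction onto the Stiefel manifold
    return Q * np.sign(np.diag(R))             # fix the sign ambiguity of QR
```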
64 They differ in terms of how the embeddings are constrained (see text for details). [sent-189, score-0.18]
65 The complexity of computing gradients is quadratic in the number of data points as the kernel matrix needs to be computed. [sent-195, score-0.183]
66 Standard tricks—such as chunking—for handling large kernel matrices apply, though our empirical work has not used them. [sent-196, score-0.213]
67 A more efficient implementation can bring the complexity to quadratic in D and linear in M, the dimensionality of the low-dimensional space. [sent-198, score-0.127]
68 4 Experiments We compare the performance of our proposed methods for unsupervised kernel dimension reduction (UKDR) to a state-of-the-art method, specifically t-distributed stochastic neighbor embedding (tSNE) [15]. [sent-200, score-0.738]
69 In addition to visual examination of 2D embedding quality, we also investigate the performance of the resulting low-dimensional representations in classification. [sent-202, score-0.149]
70 In all of the experiments reported in this section, we have used the independence measure Ĵ_{B⊤X,X}(B⊤X, X) of eq. [sent-203, score-0.141]
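For orientation, here is an end-to-end sketch that strings the earlier snippets together into a nonlinear UKDR embedding. It assumes an HSIC-style objective with a Gaussian kernel over X and a Cauchy kernel over Z, a fixed number of iterations, and slow finite-difference gradients; the paper instead uses analytical gradients with line search, so this is illustrative only.

```python
def numerical_gradient(f, B, h=1e-5):
    """Finite-difference gradient of a scalar function f at matrix B (slow, for illustration)."""
    G = np.zeros_like(B)
    for idx in np.ndindex(*B.shape):
        E = np.zeros_like(B)
        E[idx] = h
        G[idx] = (f(B + E) - f(B - E)) / (2.0 * h)
    return G

def ukdr_embed(X, M=2, n_iter=100, step=0.1, sigma_x=1.0, seed=0):
    """Nonlinear UKDR sketch: maximize Trace[H K_Z H K_X] over an orthonormal B."""
    Phi = random_features(X)                            # nonlinear lift of X
    N, F = Phi.shape
    H = np.eye(N) - np.ones((N, N)) / N
    K_x = gaussian_kernel(X, sigma_x)                   # kernel over the original data
    rng = np.random.default_rng(seed)
    B, _ = np.linalg.qr(rng.normal(size=(F, M)))        # random orthonormal start

    def J(B):
        return np.trace(H @ cauchy_kernel(Phi @ B) @ H @ K_x)

    for _ in range(n_iter):
        B = stiefel_step(B, numerical_gradient(J, B), step, maximize=True)
    return Phi @ B                                      # low-dimensional embedding Z
```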
71 We use t-SNE and our proposed method to yield 1D embeddings of these data points, plotted in Fig. 1. [sent-208, score-0.157]
72 Fig. 1(b) plots a typical embedding by t-SNE, where we see that there is significant overlap between the clusters. [sent-212, score-0.149]
73 In Fig. 1(c), the embedding is computed as the linear projection of the RBN-transformed original data. [sent-215, score-0.214]
74 In Fig. 1(d), the embedding is unconstrained and free to take any value on the 1D axis, corresponding to the “nonparametric embedding” presented in section 3. [sent-217, score-0.149]
75 2 Images of handwritten digits Our second data set is a set of 2007 images of USPS handwritten digits [20]. [sent-219, score-0.11]
76 The first row was generated with kernel PCA, Laplacian eigenmaps and t-SNE. [sent-229, score-0.209]
77 The second row was generated with our UKDR method with Gaussian kernels for both the low-dimensional coordinates Z and X. [sent-231, score-0.173]
78 The difference between the three embeddings is whether Z is constrained as a linear projection of the original X (linear UKDR), an RBN-transformed X (RBN UKDR), or a Random Sparse Feature transform of X (RSF UKDR). [sent-232, score-0.245]
79 The bandwidth for the Gaussian kernel over X was 0. [sent-239, score-0.215]
80 Indeed, the quality of the embeddings is on par with that of t-SNE. [sent-242, score-0.157]
81 In the third row of the figure, the embedding Z is constrained to be RSF UKDR. [sent-243, score-0.198]
82 However, instead of using Gaussian kernels (as in the second row), we have used Cauchy kernels. [sent-244, score-0.113]
83 The kernels over X are Gaussian, Random Walk, and Diffusion Map kernels [29], respectively. [sent-245, score-0.226]
84 In general, in contrast to the embeddings in the second row, using a Cauchy kernel for the embedding space Z leads to tighter clusters. [sent-246, score-0.489]
85 Additionally, the embedding by the diffusion map kernel is the most visually appealing one, outperforming t-SNE by significantly increasing the separation of digits 1 and 4 from the others. [sent-247, score-0.466]
86 (h) Random Walk+Cauchy, (i) Diffusion+Cauchy. Figure 2: 2D embedding results for the USPS-500 dataset by existing approaches, shown in the first row. [sent-298, score-0.149]
87 In Fig. 3, we compare the embeddings of t-SNE and unsupervised KDR on the full USPS 2007 data set. [sent-309, score-0.279]
88 Both t-SNE and unsupervised KDR lead to visually appealing clustering of data. [sent-311, score-0.21]
89 In the UKDR framework, using an RBN transformation to parameterize the embedding performs slightly better than using the RSF transformation. [sent-312, score-0.176]
90 Table 1: Classification errors on the USPS-2007 data set with different dimensionality reduction techniques. [sent-331, score-0.29]
91 Finally, as another way to assess the quality of the low-dimensional embeddings discovered by these methods, we used these embeddings as inputs to supervised classifiers. [sent-332, score-0.394]
92 As the dimensionality goes up, t-SNE starts to perform better than our method but only marginally. [sent-339, score-0.127]
93 PCA is expected to perform well with very high dimensionality as it recovers pairwise distances the best. [sent-340, score-0.127]
94 The superior classification performance by our method is highly desirable when the target dimensionality is very much constrained. [sent-341, score-0.127]
95 However, using these embeddings as inputs to classifiers suggests that the embedding by nonlinear UKDR is of higher quality. [sent-352, score-0.388]
96 5 Conclusions We propose a novel technique for unsupervised dimensionality reduction. [sent-353, score-0.273]
97 The algorithm identifies low-dimensional representations of input data by optimizing independence measures computed in a reproducing kernel Hilbert space. [sent-355, score-0.419]
98 Dimension reduction and visualization in discriminant analysis (with discussion). [sent-399, score-0.241]
99 On principal Hessian directions for data visualization and dimension reduction: another application of Stein’s lemma. [sent-425, score-0.159]
100 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-510, score-0.429]
wordName wordTfidf (topN-words)
[('ukdr', 0.538), ('kdr', 0.323), ('rsf', 0.215), ('sdr', 0.194), ('kernel', 0.183), ('jxy', 0.172), ('rbn', 0.172), ('reduction', 0.163), ('embeddings', 0.157), ('embedding', 0.149), ('independence', 0.141), ('jy', 0.132), ('cauchy', 0.13), ('dimensionality', 0.127), ('unsupervised', 0.122), ('cy', 0.113), ('kernels', 0.113), ('kx', 0.113), ('ky', 0.104), ('hsic', 0.094), ('hrsf', 0.086), ('dimension', 0.081), ('visualization', 0.078), ('trace', 0.072), ('measures', 0.066), ('kb', 0.065), ('projection', 0.065), ('cxy', 0.065), ('hky', 0.065), ('sne', 0.065), ('tsne', 0.065), ('covariates', 0.06), ('pca', 0.058), ('gb', 0.058), ('nonlinear', 0.056), ('xj', 0.056), ('supervised', 0.054), ('gx', 0.046), ('maximally', 0.046), ('autoencoder', 0.043), ('bxy', 0.043), ('hkb', 0.043), ('sliced', 0.043), ('pij', 0.042), ('gretton', 0.042), ('neighbor', 0.04), ('walk', 0.04), ('xi', 0.04), ('diffusion', 0.039), ('autoencoders', 0.038), ('classi', 0.036), ('rkhs', 0.035), ('conditional', 0.035), ('heaviside', 0.035), ('bishop', 0.035), ('rbf', 0.035), ('coordinates', 0.034), ('appealing', 0.034), ('smola', 0.034), ('song', 0.034), ('maxb', 0.033), ('fukumizu', 0.033), ('jb', 0.033), ('digits', 0.032), ('semide', 0.032), ('clustering', 0.032), ('transformed', 0.032), ('bandwidth', 0.032), ('operator', 0.032), ('digit', 0.031), ('borgwardt', 0.031), ('matrices', 0.03), ('classical', 0.03), ('southern', 0.029), ('reproducing', 0.029), ('manifold', 0.028), ('regression', 0.028), ('seek', 0.027), ('mistakes', 0.027), ('concretely', 0.027), ('sha', 0.027), ('transformation', 0.027), ('angeles', 0.026), ('row', 0.026), ('inputs', 0.026), ('usps', 0.026), ('hilbert', 0.025), ('gy', 0.025), ('gaussian', 0.024), ('los', 0.024), ('offset', 0.024), ('technique', 0.024), ('spaces', 0.024), ('radial', 0.024), ('bach', 0.023), ('handwritten', 0.023), ('constrained', 0.023), ('visually', 0.022), ('alignment', 0.022), ('bottleneck', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 280 nips-2010-Unsupervised Kernel Dimension Reduction
Author: Meihong Wang, Fei Sha, Michael I. Jordan
Abstract: We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. We extend this idea and develop similarly motivated measures for unsupervised problems where covariates and responses are the same. Our empirical studies show that the resulting compact representation yields meaningful and appealing visualization and clustering of data. Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions.
2 0.16579762 250 nips-2010-Spectral Regularization for Support Estimation
Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo
Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
3 0.16237916 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
Author: Andreas Christmann, Ingo Steinwart
Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classification and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1
4 0.15127422 124 nips-2010-Inductive Regularized Learning of Kernel Functions
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
5 0.13016942 133 nips-2010-Kernel Descriptors for Visual Recognition
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.
6 0.117366 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls
7 0.10265755 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
9 0.089857869 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
10 0.085589588 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
11 0.08477769 94 nips-2010-Feature Set Embedding for Incomplete Data
12 0.07791736 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
13 0.07077764 287 nips-2010-Worst-Case Linear Discriminant Analysis
14 0.068840958 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
15 0.066151083 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
16 0.065687619 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
17 0.062368207 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
18 0.061875176 176 nips-2010-Multiple Kernel Learning and the SMO Algorithm
19 0.061186008 221 nips-2010-Random Projections for $k$-means Clustering
20 0.058069091 146 nips-2010-Learning Multiple Tasks using Manifold Regularization
topicId topicWeight
[(0, 0.181), (1, 0.076), (2, 0.033), (3, -0.04), (4, 0.148), (5, 0.027), (6, 0.229), (7, -0.055), (8, 0.024), (9, 0.048), (10, -0.02), (11, -0.022), (12, -0.019), (13, -0.093), (14, 0.078), (15, 0.023), (16, 0.016), (17, -0.076), (18, -0.048), (19, -0.06), (20, -0.056), (21, -0.054), (22, -0.06), (23, -0.065), (24, 0.047), (25, 0.036), (26, 0.055), (27, 0.027), (28, -0.106), (29, -0.085), (30, 0.105), (31, -0.051), (32, -0.048), (33, 0.006), (34, 0.063), (35, -0.005), (36, -0.045), (37, -0.016), (38, 0.024), (39, 0.039), (40, 0.023), (41, 0.08), (42, 0.007), (43, -0.063), (44, 0.007), (45, -0.008), (46, 0.05), (47, 0.052), (48, -0.068), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.93776661 280 nips-2010-Unsupervised Kernel Dimension Reduction
Author: Meihong Wang, Fei Sha, Michael I. Jordan
Abstract: We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. We extend this idea and develop similarly motivated measures for unsupervised problems where covariates and responses are the same. Our empirical studies show that the resulting compact representation yields meaningful and appealing visualization and clustering of data. Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions.
2 0.80657709 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
Author: Andreas Christmann, Ingo Steinwart
Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classification and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1
3 0.79825473 124 nips-2010-Inductive Regularized Learning of Kernel Functions
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
4 0.75577426 250 nips-2010-Spectral Regularization for Support Estimation
Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo
Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
5 0.65758699 133 nips-2010-Kernel Descriptors for Visual Recognition
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.
6 0.59931558 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls
7 0.56040484 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
9 0.52290142 233 nips-2010-Scrambled Objects for Least-Squares Regression
10 0.51745301 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
11 0.51744592 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance
12 0.47706011 287 nips-2010-Worst-Case Linear Discriminant Analysis
13 0.47586423 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis
14 0.47412249 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
15 0.44692144 94 nips-2010-Feature Set Embedding for Incomplete Data
16 0.44025555 114 nips-2010-Humans Learn Using Manifolds, Reluctantly
17 0.43904388 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
18 0.43206257 176 nips-2010-Multiple Kernel Learning and the SMO Algorithm
19 0.42793405 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
20 0.41408911 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions
topicId topicWeight
[(13, 0.062), (17, 0.016), (27, 0.06), (30, 0.079), (35, 0.014), (45, 0.166), (50, 0.046), (52, 0.069), (60, 0.056), (66, 0.254), (77, 0.04), (78, 0.014), (90, 0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.75584465 280 nips-2010-Unsupervised Kernel Dimension Reduction
Author: Meihong Wang, Fei Sha, Michael I. Jordan
Abstract: We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. We extend this idea and develop similarly motivated measures for unsupervised problems where covariates and responses are the same. Our empirical studies show that the resulting compact representation yields meaningful and appealing visualization and clustering of data. Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions.
2 0.73082149 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1
3 0.70338041 282 nips-2010-Variable margin losses for classifier design
Author: Hamed Masnadi-shirazi, Nuno Vasconcelos
Abstract: The problem of controlling the margin of a classifier is studied. A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss, and its margin enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes consistent loss functions, of variable margin, are derived. These families are then used to design boosting style algorithms with explicit control of the classification margin. The new algorithms generalize well established approaches, such as LogitBoost. Experimental results show that the proposed variable margin losses outperform the fixed margin counterparts used by existing algorithms. Finally, it is shown that best performance can be achieved by cross-validating the margin parameter. 1
4 0.65424538 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
5 0.64504832 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
6 0.64191735 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
7 0.64146894 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
8 0.64110875 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
9 0.64045018 222 nips-2010-Random Walk Approach to Regret Minimization
10 0.639121 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
11 0.63856542 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
12 0.63815314 148 nips-2010-Learning Networks of Stochastic Differential Equations
13 0.63725203 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
14 0.63456494 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
15 0.63358015 63 nips-2010-Distributed Dual Averaging In Networks
16 0.63304305 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
17 0.63302422 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
18 0.63203907 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
19 0.63196576 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
20 0.63105464 155 nips-2010-Learning the context of a category