nips nips2011 nips2011-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ioannis A. Gkioulekas, Todd Zickler
Abstract: We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. [sent-5, score-0.66]
2 We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. [sent-6, score-0.612]
3 We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. [sent-7, score-0.184]
4 Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. [sent-8, score-0.331]
5 Principal component analysis (PCA) [3], locality preserving projections (LPP) [4], and neighborhood preserving embedding (NPE) [5] are some common approaches. [sent-11, score-0.377]
6 They seek to reveal underlying structure using the global geometry, local distances, and local linear structure, respectively, of the signals in their original domain; and have been extended in many ways [6–8]. [sent-12, score-0.248]
7 On the other hand, it is commonly observed that geometric relations between signals in their original domain are only weakly linked to useful underlying structure. [sent-13, score-0.262]
8 In the latter case, however, it can be difficult to design a kernel that is beneficial for a particular signal class, and ad hoc selections are not always appropriate. [sent-16, score-0.211]
9 In this paper, we also address dimensionality reduction through an intermediate higher-dimensional space: we consider the case in which input signals are samples from an underlying dictionary model. [sent-17, score-0.73]
10 This generative model naturally suggests using the hidden covariate vectors as intermediate features, and learning a linear projection (of the original domain) to approximately preserve the Euclidean geometry of these vectors. [sent-18, score-0.209]
11 Throughout the paper, we emphasize a particular instance of this model that is related to sparse coding, motivated by studies suggesting that data-adaptive sparse representations are appropriate for signals such as natural images and facial images [12, 13], and enable state-of-the-art performance for denoising, deblurring, and classification tasks [14–19]. [sent-19, score-0.539]
12 (Footnote 1: Other linear methods, most notably linear discriminant analysis (LDA), exploit class labels to learn projections.) [sent-21, score-0.426]
13 Formally, we assume our input signal to be well-represented by a sparse linear model [20], previously used for probabilistic sparse coding. [sent-22, score-0.495]
14 Based on this generative model, we formulate learning a linear projection as an optimization problem with the objective of preservation, in expectation, of pairwise inner products between sparse codes, without having to explicitly obtain the sparse representation for each new sample. [sent-23, score-0.716]
15 We discuss applicability of our results to general dictionary models, and nonlinear extensions. [sent-25, score-0.376]
16 Finally, by applying our method to the visualization, clustering, and classification of facial images, texture patches, and general images, we show experimentally that it improves our ability to uncover useful structure. [sent-26, score-0.208]
17 (Section 2: The sparse linear model.) We use RN to denote the ambient space of the input signals, and assume that each signal x ∈ RN is generated as the sum of a noise term ε ∈ RN and a linear combination of the columns, or atoms, of an N × K dictionary matrix D = [d1 , . . . , dK ]. [sent-28, score-0.71]
18 We are interested in the sparse linear model [20], according to which the elements of a are a-priori independent from ε and are identically and independently drawn from a Laplace distribution, $p(a) = \prod_{i=1}^{K} p(a_i)$, $p(a_i) = \frac{1}{2\tau} \exp\left(-\frac{|a_i|}{\tau}\right)$. [sent-33, score-0.24]
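As a concrete illustration of this generative model, the following is a minimal numpy sketch that draws codes from an i.i.d. Laplace prior and synthesizes a signal as a noisy combination of dictionary atoms; the dimensions, the unit-norm atom convention, and all variable names are illustrative assumptions, not taken from the paper's implementation.

```python
# Minimal sketch of the sparse linear model x = D a + eps (eqs. 1-2).
import numpy as np

rng = np.random.default_rng(0)

N, K = 64, 256          # signal dimension and number of dictionary atoms (illustrative)
tau, sigma = 0.1, 0.01  # Laplace scale and Gaussian noise standard deviation

D = rng.standard_normal((N, K))
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms, a common convention

a = rng.laplace(loc=0.0, scale=tau, size=K)  # i.i.d. Laplace prior on the sparse codes
eps = rng.normal(scale=sigma, size=N)        # Gaussian noise term
x = D @ a + eps                              # observed signal
```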
19 Several efficient algorithms exist for dictionary learning [21–23], and we assume in our analysis that a dictionary D adapted to the signals of interest is given. [sent-35, score-0.78]
20 Typically, the model (1) with an appropriate dictionary D is employed as a means for feature extraction, in which input signals x in RN are mapped to higher-dimensional feature vectors a ∈ RK . [sent-37, score-0.507]
21 When inferring features a (termed sparse codes) through maximum-a-posteriori (MAP) estimation, they are solutions to $\min_a \frac{1}{2\sigma^2} \|x - Da\|_2^2 + \frac{1}{\tau} \|a\|_1$ (3). [sent-38, score-0.188]
22 This problem, known as the lasso [25], is a convex relaxation of the more general problem of sparse coding [26] (in the rest of the paper we use both terms interchangeably). [sent-39, score-0.242]
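A hedged sketch of this MAP inference step using scikit-learn's lasso solver is shown below; the mapping of the regularization constant to σ and τ follows from matching the two objectives up to a constant factor, and the helper name `sparse_code` is ours.

```python
# Sketch of MAP sparse coding (eq. 3) via the lasso.
# sklearn's Lasso minimizes (1/(2*n)) ||x - D a||_2^2 + alpha ||a||_1 with n = len(x),
# so alpha = sigma**2 / (tau * N) reproduces (3) up to an overall constant.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(D, x, tau, sigma, max_iter=10000):
    N = D.shape[0]
    lasso = Lasso(alpha=sigma**2 / (tau * N), fit_intercept=False, max_iter=max_iter)
    lasso.fit(D, x)          # D acts as the design matrix, x as the target
    return lasso.coef_       # estimated sparse code a

# Usage (with D, x, tau, sigma from the generative sketch above):
# a_hat = sparse_code(D, x, tau=0.1, sigma=0.01)
```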
23 Typically, we are interested in projections that reveal useful structure in a given set of input signals. [sent-42, score-0.251]
24 When a suitable feature transform can be found, this structure may exist as simple Euclidean geometry and be encoded in pairwise Euclidean distances or inner products between feature vectors. [sent-44, score-0.255]
25 This is used, for example, in support vector machines and nearest-neighbor classifiers based on Euclidean distance, as well as k-means and spectral clustering based on pairwise inner products. [sent-45, score-0.233]
26 For the problem of dimensionality reduction, this motivates learning a projection matrix L such that, for any two input samples, the inner product between their resulting low-dimensional representations is close to that of their corresponding high-dimensional features. [sent-46, score-0.32]
27 Here we solve it for the case of the sparse linear model of Section 2, under which the feature vectors are the sparse codes. [sent-50, score-0.428]
28 Through comparison with (5), we observe that (6) is a trade-off between bringing inner products of sparse codes and their projections close (first term), and suppressing noise (second and third terms). [sent-59, score-0.605]
29 The solution to (7) is similar—and in the noiseless case identical—to the whitening transform of the atoms of D. [sent-65, score-0.173]
30 When the atoms are centered at the origin, this essentially means that solving (4) for the sparse linear model amounts to performing PCA on dictionary atoms learned from training samples instead of the training samples themselves. [sent-66, score-1.129]
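The rough sketch below mirrors this observation by eigendecomposing DDᵀ and keeping the M largest eigenvectors, i.e., PCA on the dictionary atoms rather than the training samples; the paper's exact per-eigenvalue rescaling g(λ_M) in (8), which depends on τ and σ, is not reproduced here, so identity scaling is used as a stand-in.

```python
# Sketch of the closed-form projection: top-M eigenvectors of D D^T.
import numpy as np

def learn_projection(D, M):
    """Return an M x N projection matrix L from an N x K dictionary D."""
    evals, evecs = np.linalg.eigh(D @ D.T)   # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:M]        # indices of the M largest eigenpairs
    V_M = evecs[:, top]                      # N x M eigenvector block
    return V_M.T                             # rows span the principal subspace of the atoms

# Usage: L = learn_projection(D, M=16); y = L @ x
```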
31 The above result can also be interpreted in the setting of [28]: dimensionality reduction in the case of the sparse linear model with the objective of (4) corresponds to kernel PCA using the kernel $DD^T$, modulo centering and normalization. [sent-67, score-0.696]
32 (Subsection 1: Other dictionary models.) Even though we have presented our results using the sparse linear model described in Section 2, it is important to realize that our analysis is not limited to this model. [sent-69, score-0.551]
33 The above assumptions apply for several other popular dictionary models. [sent-71, score-0.311]
34 In the context of sparse coding, other sparsity-inducing priors that have been proposed in the literature, such as Student’s t-distribution [31], also fall into the same framework. [sent-73, score-0.188]
35 We choose to emphasize the sparse linear model, however, due to the apparent structure present in dictionaries learned using this model, and its empirical success in diverse applications. [sent-74, score-0.29]
36 We denote by Φ : RN → H a mapping from the signal domain to a reproducing kernel Hilbert space H associated with a kernel function k : RN × RN → R [33]. [sent-82, score-0.459]
37 Using a dictionary of atoms $\{\tilde{d}_i \in H,\ i = 1, \ldots, K\}$, we extend the sparse linear model of Section 2 by replacing (1) for each x ∈ RN with $\Phi(x) = \tilde{D}a + \tilde{\varepsilon}$ (12), where $\tilde{D}a \equiv \sum_{i=1}^{K} a_i \tilde{d}_i$. [sent-86, score-0.378]
38 For a ∈ RK we make the same assumptions as in the sparse linear model. [sent-87, score-0.24]
39 The term $\tilde{\varepsilon}$ denotes a Gaussian process over the domain RN whose sample paths are functions in H and with covariance operator $C_{\tilde{\varepsilon}} = \sigma^2 I$, where I is the identity operator on H [33, 34]. [sent-88, score-0.218]
40 This nonlinear extension of the sparse linear model is valid only in finite-dimensional spaces H. [sent-89, score-0.399]
41 In the infinite dimensional case, constructing a Gaussian process with both sample paths in H and identity covariance operator is not possible, as that would imply that the identity operator in H has finite Hilbert-Schmidt norm [33, 34]. [sent-90, score-0.241]
42 In the supplementary material, we discuss an alternative to (12) that resolves these problems by requiring that all Φ (x) be in the subspace spanned by the atoms of D. [sent-93, score-0.167]
43 In the kernel case, the equivalent of the projection matrix L (transposed) is a compact, linear operator V : H → RM , that maps an element x ∈ RN to y = VΦ (x) ∈ RM . [sent-95, score-0.364]
44 If we consider optimizing over S, we prove that (4) reduces to $\min_S\; 4\tau^4 \sum_{i=1}^{K} \sum_{j=1}^{K} \left( \langle S\tilde{d}_i, S\tilde{d}_j \rangle_H - \delta_{ij} \right)^2 + 4\tau^2\sigma^2 \sum_{i=1}^{K} \langle S\tilde{d}_i, S\tilde{d}_i \rangle_H + \|S\|_{HS}^2$ (14), where $\|\cdot\|_{HS}$ is the Hilbert-Schmidt norm. [sent-97, score-0.255]
45 Shown at high resolution and at their respective projections are identity-averaged faces across the dataset for various illuminations, poses, and expressions. [sent-100, score-0.255]
46 Insets show projections of samples from only two distinct identities. [sent-101, score-0.306]
47 We can replace $L = B^T K_{DD}^{1/2}$ to turn (16) into an equivalent problem over L of the form (6), with $K_{DD}^{1/2}$ instead of D, and thus use (8) to obtain $B = V_M\, \mathrm{diag}(g(\lambda_M))$ (17), where, similar to the linear case, $\lambda_M$ and $V_M$ are the M largest eigenpairs of the matrix $K_{DD}$, and g is defined as in the linear case. [sent-107, score-0.233]
48 As in the linear case, this is similar to the result of applying kernel PCA on the dictionary D instead of the training samples. [sent-112, score-0.583]
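A sketch of this kernel variant follows: eigendecompose the K × K kernel matrix of the dictionary atoms and project a new sample through K kernel evaluations, analogously to kernel PCA on the dictionary. The RBF kernel with γ = 8 (the value reported later for the texture experiments) is assumed here, and the g(λ_M) scaling of (17) is again omitted as a simplification.

```python
# Sketch of the kernel projection learned from the dictionary Gram matrix K_DD.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def learn_kernel_projection(D_pre, M, gamma=8.0):
    """D_pre: K x N matrix of dictionary pre-images d_i in R^N."""
    K_DD = rbf_kernel(D_pre, D_pre, gamma=gamma)   # K x K kernel matrix
    evals, evecs = np.linalg.eigh(K_DD)
    top = np.argsort(evals)[::-1][:M]
    return evecs[:, top]                           # K x M eigenvector block B

def project(B, D_pre, x, gamma=8.0):
    k_x = rbf_kernel(D_pre, x.reshape(1, -1), gamma=gamma).ravel()  # K kernel evaluations
    return B.T @ k_x                                                # M-dimensional representation
```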
49 (Subsection 3: Computational considerations.) It is interesting to compare the proposed method in the nonlinear case with kernel PCA, in terms of computational and memory requirements. [sent-117, score-0.209]
50 If we require dictionary atoms to have pre-images in RN, that is D = {Φ(di), di ∈ RN, i = 1, . . . , K} [36], [sent-118, score-0.516]
51 then the proposed algorithm requires calculating and decomposing the K × K kernel matrix $K_{DD}$ when learning V, and performing K kernel evaluations for projecting a new sample x. [sent-121, score-0.328]
52 For kernel PCA, on the other hand, the S × S matrix $K_{XX}$ and S kernel evaluations are needed respectively, where X = {Φ(xi), xi ∈ RN, i = 1, . . . , S} [sent-122, score-0.328]
53 and the Φ(xi) are the representations of the training samples in H, with S ≫ K. [sent-125, score-0.228]
54 In this case, the proposed method also requires calculating $K_{XX}$ during learning and S kernel evaluations for out-of-sample projections, but only the eigendecomposition of the K × K matrix $F^T K_{XX}^2 F$ is required. [sent-127, score-0.184]
55 On the other hand, we have assumed so far, in both the linear and nonlinear cases, that a dictionary is given. [sent-128, score-0.428]
56 When this is not true, we need to take into account the cost of learning a dictionary, which greatly outweighs the computational savings described above, despite advances in dictionary learning algorithms [21, 22]. [sent-129, score-0.311]
57 In the kernel case, while imposing the pre-image constraint has the advantages mentioned above, it also makes dictionary learning a harder nonlinear optimization problem, due to the need for evaluation of kernel derivatives. [sent-130, score-0.664]
58 In the linear case, the computational savings from applying (linear) PCA to the dictionary instead of the training samples are usually negligible, and therefore the difference in required computation becomes even more severe. [sent-131, score-0.532]
59 (Figure 2 caption.) From left to right: CMU PIE (varying value of M); CMU PIE (varying number of training samples); Brodatz texture patches; Caltech-101. [sent-133, score-0.23]
60 (Section 4: Experimental validation.) In order to evaluate our proposed method, we compare it with other unsupervised dimensionality reduction methods on visualization, clustering, and classification tasks. [sent-135, score-0.22]
61 We use facial images in the linear case, and texture patches and images of object categories in the kernel case. [sent-136, score-0.687]
62 We visualize the dataset by projecting all face samples to M = 2 dimensions using LPP and the proposed method, as shown in Figure 1. [sent-138, score-0.167]
63 We separately show the projections of samples from two distinct individuals, and see that different identities are mapped to parallel, shifted ellipsoids, easily separated by a nearest-neighbor classifier. [sent-141, score-0.344]
64 We compare against three baseline methods (PCA, NPE, and LPP), their linear extensions (spectral regression “SRLPP” [7], spatially smooth LPP “SmoothLPP” [8]), and random projections (see Section 5). [sent-145, score-0.265]
65 We produce 20 random splits into training and testing sets, learn a dictionary and projection matrices from the training set, and use the obtained low-dimensional representations with a k-nearest neighbor classifier (k = 4) to classify the test samples, as is common in the literature. [sent-146, score-0.592]
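A sketch of this evaluation protocol is given below, assuming a generic `reduce_fn` that learns an M × N projection (and, internally, a dictionary if needed) from the training set; it could wrap PCA, LPP, NPE, or the projection sketched earlier. Split sizes and helper names are placeholders.

```python
# Sketch of the classification protocol: random splits, projection, 4-NN classifier.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, labels, reduce_fn, n_splits=20, train_size=6800, seed=0):
    """X: samples as rows; train_size=6800 ~ 100 samples per each of 68 individuals."""
    accs = []
    for s in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, train_size=train_size, random_state=seed + s, stratify=labels)
        L = reduce_fn(X_tr)                       # learn the projection from training data only
        knn = KNeighborsClassifier(n_neighbors=4)
        knn.fit(X_tr @ L.T, y_tr)                 # classify in the low-dimensional space
        accs.append(knn.score(X_te @ L.T, y_te))
    return float(np.mean(accs))
```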
66 In Figure 2, we show the average recognition accuracy for the various methods as the number of projections is varied, when using 100 training samples for each of the 68 individuals in the dataset. [sent-147, score-0.382]
67 Also, we compare the proposed method with the best performing alternative, when the number of training samples per individual is varied from 40 to 120. [sent-148, score-0.169]
68 However, it can only be used when there are enough training samples to learn a dictionary, a limitation that does not apply to the other methods. [sent-150, score-0.169]
69 We extract 12 × 12 patches and use those from the training images to learn dictionaries and projections for the Gaussian kernel. [sent-153, score-0.499]
70 For 20000 training samples, KPCA and KLPP require storing and processing a 20000×20000 kernel matrix, as opposed to 512 × 512 for our method. [sent-161, score-0.22]
71 On the other hand, training a dictionary with K = 512 for this dataset takes approximately 2 hours, on an 8 core machine and using a C++ implementation of the learning algorithm, as opposed to the few minutes required for the eigendecompositions in KPCA and KLPP. [sent-162, score-0.387]
72 Following [36], we set the kernel parameter γ = 8, and use their method for dictionary learning with K = 512 and λ = 0.30, [sent-166, score-0.455]
73 but with a conjugate gradient optimizer for the dictionary update step. [sent-167, score-0.311]
74 Firstly, we use 30 training samples from each class to learn a dictionary and projections using KPCA, KLPP, and the proposed method. [sent-180, score-0.382]
75 We also perform unsupervised clustering experiments, where we randomly select 30 samples from each of the 20 classes used in [43] to learn projections with the three methods, over a range of values for M between 10 and 150. [sent-184, score-0.43]
76 We combine each with three clustering algorithms, k-means, spectral clustering [44], and affinity propagation [43] (using negative Euclidean distances of the low-dimensional representations as similarities). [sent-185, score-0.297]
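A sketch of this clustering comparison on low-dimensional representations Y follows, using scikit-learn's implementations of the three algorithms; scoring the partitions with normalized mutual information is our choice of metric for the sketch, and affinity propagation's default negative squared Euclidean affinity stands in for the negative-distance similarities described above.

```python
# Sketch of the unsupervised clustering comparison on projected data Y (samples as rows).
from sklearn.cluster import KMeans, SpectralClustering, AffinityPropagation
from sklearn.metrics import normalized_mutual_info_score

def cluster_and_score(Y, labels, n_clusters=20):
    preds = {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Y),
        "spectral": SpectralClustering(n_clusters=n_clusters).fit_predict(Y),
        "affinity": AffinityPropagation().fit_predict(Y),   # chooses its own number of clusters
    }
    return {name: normalized_mutual_info_score(labels, p) for name, p in preds.items()}
```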
77 (Section 5: Discussion and future directions.) As we remarked in Section 3, the proposed method uses available training samples to learn D and ignores them afterwards, relying exclusively on the assumed generative model and the correlation information in D. [sent-188, score-0.214]
78 To see how this approach could fail, consider the degenerate case when D is the identity matrix, that is the signal and sparse domains coincide. [sent-189, score-0.295]
79 Better use of the training samples within our framework can be made by adopting a richer probabilistic model, using available data to train it, naturally with appropriate regularization to avoid overfitting, and then minimizing (4) for the learned model. [sent-191, score-0.169]
80 An orthogonal approach is to forgo adopting a generative model, and learn a projection matrix directly from training samples using an appropriate empirical loss function. [sent-195, score-0.324]
81 One possibility is minimizing $\|A^T A - X^T L^T L X\|_F^2$, where the columns of X and A are the training samples and corresponding sparse code estimates, which is an instance of multidimensional scaling [46] (as modified to achieve linear induction). [sent-196, score-0.409]
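A sketch of this empirical alternative is given below: plain gradient descent on ‖AᵀA − XᵀLᵀLX‖²_F over L, using the closed-form gradient 4 L X R Xᵀ with R = XᵀLᵀLX − AᵀA. The step size, initialization, and iteration count are arbitrary illustrative choices, not the paper's method.

```python
# Sketch: fit L by gradient descent on the Gram-matrix matching loss.
import numpy as np

def fit_projection_mds(X, A, M, lr=1e-4, n_iters=500, seed=0):
    """X: N x S training samples, A: K x S sparse code estimates."""
    rng = np.random.default_rng(seed)
    L = 0.01 * rng.standard_normal((M, X.shape[0]))
    G = A.T @ A                            # S x S target Gram matrix of sparse codes
    for _ in range(n_iters):
        R = X.T @ L.T @ (L @ X) - G        # residual between the two Gram matrices
        grad = 4.0 * L @ X @ R @ X.T       # gradient of ||R||_F^2 with respect to L
        L -= lr * grad
    return L
```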
82 For the sparse linear model case, objective function (4) is related to the Restricted Isometry Property (RIP) [47], used in the compressed sensing literature as a condition enabling reconstruction of a sparse vector a ∈ RK from linear measurements y ∈ RM when M ≪ K. [sent-197, score-0.635]
83 Despite this property, our experiments demonstrate that a learned matrix L is in practice more useful than random projections (see left of Figure 2). [sent-200, score-0.253]
84 The formal guarantees that preservation of Euclidean geometry of sparse codes is possible with few linear projections are unique for the sparse linear model, thus further justifying our choice to emphasize this model throughout the paper. [sent-201, score-0.923]
85 Another quantity used in compressed sensing is the mutual coherence of D̃ [49], and its approximate minimization has been proposed as a way for learning L for signal reconstruction [50, 51]. [sent-202, score-0.222]
86 (Footnote: the model parameters are selected from a range of values, to achieve about 10% non-zero coefficients in the sparse codes and small reconstruction error for the training samples.) [sent-207, score-0.335]
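One way to implement such a selection heuristic is sketched below: sweep candidate values of τ, compute sparse codes for a batch of training signals with the lasso helper sketched earlier, and keep the value whose fraction of non-zero coefficients is closest to 10%. Only the sparsity criterion is implemented here; the candidate grid and function names are illustrative assumptions.

```python
# Sketch of the sparsity-level heuristic for choosing the regularization parameter.
import numpy as np

def pick_tau(D, X_train, sigma, candidates=(0.01, 0.03, 0.1, 0.3, 1.0), target=0.10):
    """X_train: N x n matrix of training signals (columns). Uses sparse_code from above."""
    best_tau, best_gap = None, np.inf
    for tau in candidates:
        codes = np.stack([sparse_code(D, x, tau, sigma) for x in X_train.T])
        nnz_frac = np.mean(codes != 0)          # fraction of non-zero coefficients
        gap = abs(nnz_frac - target)
        if gap < best_gap:
            best_tau, best_gap = tau, gap
    return best_tau
```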
87 In Section 3, we motivated preserving inner products in the sparse domain by considering existing algorithms that employ sparse codes. [sent-213, score-0.653]
88 As our understanding of sparse coding continues to improve [52], there is motivation for considering other structure in RK . [sent-214, score-0.242]
89 Possibilities include preservation of the linear subspace structure (as determined by the support of the sparse codes) or of local group relations in the sparse domain. [sent-215, score-0.634]
90 Linear dimensionality reduction has traditionally been used for data preprocessing and visualization, but we are also beginning to see its utility for low-power sensors. [sent-217, score-0.168]
91 A sensor can be designed to record linear projections of an input signal, instead of the signal itself, with projections implemented through a low-power physical process like optical filtering. [sent-218, score-0.545]
92 An example for visual sensing is described in [2], where a heuristically-modified version of our linear approach is employed to select projections for face detection. [sent-220, score-0.417]
93 Rigorously extending our analysis to this domain will require accounting for noise and constraints on the projections (for example non-negativity, limited resolution) induced by fabrication processes. [sent-221, score-0.275]
94 Image denoising via sparse and redundant representations over learned dictionaries. [sent-322, score-0.293]
95 Blind motion deblurring from a single image using sparse approximation. [sent-330, score-0.265]
96 Classification and clustering via dictionary learning with structured incoherence and shared features. [sent-353, score-0.383]
97 Bayesian inference and optimal design for the sparse linear model. [sent-364, score-0.24]
98 From sparse solutions of systems of equations to sparse modeling of signals and images. [sent-408, score-0.534]
99 A kernel view of the dimensionality reduction of manifolds. [sent-423, score-0.312]
100 Learning to sense sparse signals: Simultaneous sensing matrix and sparsifying dictionary optimization. [sent-559, score-0.617]
wordName wordTfidf (topN-words)
[('dictionary', 0.311), ('dd', 0.298), ('projections', 0.213), ('sparse', 0.188), ('klpp', 0.175), ('kpca', 0.175), ('signals', 0.158), ('lpp', 0.146), ('pie', 0.146), ('kernel', 0.144), ('atoms', 0.12), ('preservation', 0.117), ('facial', 0.112), ('rn', 0.106), ('cmu', 0.105), ('cai', 0.099), ('pca', 0.097), ('texture', 0.096), ('samples', 0.093), ('da', 0.093), ('sd', 0.09), ('eigenpairs', 0.088), ('zickler', 0.088), ('di', 0.085), ('reduction', 0.084), ('dimensionality', 0.084), ('lm', 0.083), ('preserving', 0.082), ('patches', 0.082), ('rk', 0.081), ('sensing', 0.078), ('images', 0.078), ('deblurring', 0.077), ('compressed', 0.077), ('training', 0.076), ('illumination', 0.074), ('face', 0.074), ('clustering', 0.072), ('codes', 0.071), ('projection', 0.07), ('signal', 0.067), ('inner', 0.067), ('representer', 0.066), ('products', 0.066), ('nonlinear', 0.065), ('domain', 0.062), ('representations', 0.059), ('pose', 0.059), ('brodatz', 0.058), ('gkioulekas', 0.058), ('npe', 0.058), ('operator', 0.058), ('rip', 0.058), ('iccv', 0.056), ('sapiro', 0.056), ('spectral', 0.054), ('euclidean', 0.054), ('coding', 0.054), ('ai', 0.053), ('rm', 0.053), ('noiseless', 0.053), ('diag', 0.053), ('ponce', 0.052), ('unsupervised', 0.052), ('linear', 0.052), ('seas', 0.051), ('cvpr', 0.05), ('classi', 0.05), ('dictionaries', 0.05), ('hilbert', 0.05), ('spaces', 0.049), ('ij', 0.049), ('subspace', 0.047), ('denoising', 0.046), ('object', 0.045), ('ip', 0.045), ('qk', 0.045), ('lt', 0.045), ('generative', 0.045), ('bach', 0.045), ('dimensional', 0.045), ('lx', 0.044), ('probabilistically', 0.044), ('mairal', 0.043), ('geometry', 0.042), ('army', 0.042), ('reproducing', 0.042), ('faces', 0.042), ('relations', 0.042), ('mentioned', 0.041), ('scholkopf', 0.04), ('harvard', 0.04), ('distances', 0.04), ('matrix', 0.04), ('pairwise', 0.04), ('laplace', 0.04), ('identity', 0.04), ('han', 0.039), ('mapped', 0.038), ('reveal', 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
Author: Ioannis A. Gkioulekas, Todd Zickler
Abstract: We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. 1
2 0.2342646 259 nips-2011-Sparse Estimation with Structured Dictionaries
Author: David P. Wipf
Abstract: In the vast majority of recent work on sparse estimation algorithms, performance has been evaluated using ideal or quasi-ideal dictionaries (e.g., random Gaussian or Fourier) characterized by unit ℓ2 norm, incoherent columns or features. But in reality, these types of dictionaries represent only a subset of the dictionaries that are actually used in practice (largely restricted to idealized compressive sensing applications). In contrast, herein sparse estimation is considered in the context of structured dictionaries possibly exhibiting high coherence between arbitrary groups of columns and/or rows. Sparse penalized regression models are analyzed with the purpose of finding, to the extent possible, regimes of dictionary invariant performance. In particular, a Type II Bayesian estimator with a dictionarydependent sparsity penalty is shown to have a number of desirable invariance properties leading to provable advantages over more conventional penalties such as the ℓ1 norm, especially in areas where existing theoretical recovery guarantees no longer hold. This can translate into improved performance in applications such as model selection with correlated features, source localization, and compressive sensing with constrained measurement directions. 1
3 0.214562 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Author: Zhen J. Xiang, Hao Xu, Peter J. Ramadge
Abstract: Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. 1
4 0.15307686 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and contrast normalization. We investigate the architecture of HMP, and show that all three components are critical for good performance. To speed up the orthogonal matching pursuit, we propose a batch tree orthogonal matching pursuit that is particularly suitable to encode a large number of observations that share the same large dictionary. HMP is scalable and can efficiently handle full-size images. In addition, HMP enables linear support vector machines (SVM) to match the performance of nonlinear SVM while being scalable to large datasets. We compare HMP with many state-of-the-art algorithms including convolutional deep belief networks, SIFT based single layer sparse coding, and kernel based feature learning. HMP consistently yields superior accuracy on three types of image classification problems: object recognition (Caltech-101), scene recognition (MIT-Scene), and static event recognition (UIUC-Sports). 1
5 0.14521585 261 nips-2011-Sparse Filtering
Author: Jiquan Ngiam, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, Andrew Y. Ng
Abstract: Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. However, many existing feature learning algorithms are hard to use and require extensive hyperparameter tuning. In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. Instead, it optimizes a simple cost function – the sparsity of 2 -normalized features – which can easily be implemented in a few lines of MATLAB code. Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities. 1
6 0.14227284 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
7 0.14192453 276 nips-2011-Structured sparse coding via lateral inhibition
8 0.1294522 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
9 0.12306175 54 nips-2011-Co-regularized Multi-view Spectral Clustering
10 0.11421989 258 nips-2011-Sparse Bayesian Multi-Task Learning
11 0.11311071 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
12 0.11184762 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data
13 0.11116455 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
14 0.11040796 285 nips-2011-The Kernel Beta Process
15 0.10747492 260 nips-2011-Sparse Features for PCA-Like Linear Regression
16 0.10652998 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
17 0.10108829 186 nips-2011-Noise Thresholds for Spectral Clustering
18 0.095613286 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
19 0.095561124 165 nips-2011-Matrix Completion for Multi-label Image Classification
20 0.091142818 264 nips-2011-Sparse Recovery with Brownian Sensing
topicId topicWeight
[(0, 0.308), (1, 0.123), (2, -0.087), (3, -0.09), (4, -0.091), (5, 0.124), (6, 0.132), (7, 0.164), (8, 0.089), (9, 0.001), (10, -0.03), (11, 0.109), (12, 0.065), (13, 0.044), (14, -0.045), (15, -0.104), (16, -0.031), (17, -0.006), (18, 0.017), (19, 0.0), (20, 0.143), (21, 0.054), (22, -0.041), (23, -0.068), (24, 0.016), (25, -0.074), (26, 0.008), (27, 0.089), (28, -0.142), (29, 0.062), (30, -0.122), (31, -0.017), (32, -0.012), (33, -0.077), (34, -0.06), (35, -0.067), (36, 0.013), (37, -0.051), (38, 0.009), (39, 0.138), (40, 0.076), (41, -0.009), (42, 0.015), (43, -0.048), (44, -0.099), (45, -0.002), (46, -0.003), (47, -0.031), (48, -0.048), (49, 0.125)]
simIndex simValue paperId paperTitle
same-paper 1 0.9542706 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
Author: Ioannis A. Gkioulekas, Todd Zickler
Abstract: We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. 1
2 0.80240923 259 nips-2011-Sparse Estimation with Structured Dictionaries
Author: David P. Wipf
Abstract: In the vast majority of recent work on sparse estimation algorithms, performance has been evaluated using ideal or quasi-ideal dictionaries (e.g., random Gaussian or Fourier) characterized by unit ℓ2 norm, incoherent columns or features. But in reality, these types of dictionaries represent only a subset of the dictionaries that are actually used in practice (largely restricted to idealized compressive sensing applications). In contrast, herein sparse estimation is considered in the context of structured dictionaries possibly exhibiting high coherence between arbitrary groups of columns and/or rows. Sparse penalized regression models are analyzed with the purpose of finding, to the extent possible, regimes of dictionary invariant performance. In particular, a Type II Bayesian estimator with a dictionarydependent sparsity penalty is shown to have a number of desirable invariance properties leading to provable advantages over more conventional penalties such as the ℓ1 norm, especially in areas where existing theoretical recovery guarantees no longer hold. This can translate into improved performance in applications such as model selection with correlated features, source localization, and compressive sensing with constrained measurement directions. 1
3 0.75795907 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Author: Zhen J. Xiang, Hao Xu, Peter J. Ramadge
Abstract: Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. 1
4 0.75476801 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
Author: Nobuyuki Morioka, Shin'ichi Satoh
Abstract: Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision. For visual object category recognition, 1 regularized sparse coding is combined with the spatial pyramid representation to obtain state-of-the-art performance. However, because of its iterative optimization, applying sparse coding onto every local feature descriptor extracted from an image database can become a major bottleneck. To overcome this computational challenge, this paper presents “Generalized Lasso based Approximation of Sparse coding” (GLAS). By representing the distribution of sparse coefficients with slice transform, we fit a piece-wise linear mapping function with the generalized lasso. We also propose an efficient post-refinement procedure to perform mutual inhibition between bases which is essential for an overcomplete setting. The experiments show that GLAS obtains a comparable performance to 1 regularized sparse coding, yet achieves a significant speed up demonstrating its effectiveness for large-scale visual recognition problems. 1
5 0.72371095 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and contrast normalization. We investigate the architecture of HMP, and show that all three components are critical for good performance. To speed up the orthogonal matching pursuit, we propose a batch tree orthogonal matching pursuit that is particularly suitable to encode a large number of observations that share the same large dictionary. HMP is scalable and can efficiently handle full-size images. In addition, HMP enables linear support vector machines (SVM) to match the performance of nonlinear SVM while being scalable to large datasets. We compare HMP with many state-of-the-art algorithms including convolutional deep belief networks, SIFT based single layer sparse coding, and kernel based feature learning. HMP consistently yields superior accuracy on three types of image classification problems: object recognition (Caltech-101), scene recognition (MIT-Scene), and static event recognition (UIUC-Sports). 1
6 0.6209535 276 nips-2011-Structured sparse coding via lateral inhibition
7 0.61642194 143 nips-2011-Learning Anchor Planes for Classification
8 0.59956568 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
9 0.59485561 285 nips-2011-The Kernel Beta Process
10 0.54886943 261 nips-2011-Sparse Filtering
11 0.54242373 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
12 0.53077751 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
14 0.52281302 209 nips-2011-Orthogonal Matching Pursuit with Replacement
15 0.51031452 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
16 0.5005427 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements
17 0.49028867 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
18 0.48636332 225 nips-2011-Probabilistic amplitude and frequency demodulation
19 0.48628631 260 nips-2011-Sparse Features for PCA-Like Linear Regression
20 0.47375199 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
topicId topicWeight
[(0, 0.051), (4, 0.05), (20, 0.053), (26, 0.03), (31, 0.093), (33, 0.031), (43, 0.083), (45, 0.142), (57, 0.039), (65, 0.036), (73, 0.137), (74, 0.078), (83, 0.037), (84, 0.022), (99, 0.055)]
simIndex simValue paperId paperTitle
1 0.92025346 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1
same-paper 2 0.87548977 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
Author: Ioannis A. Gkioulekas, Todd Zickler
Abstract: We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. 1
3 0.8436839 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
4 0.833929 276 nips-2011-Structured sparse coding via lateral inhibition
Author: Arthur D. Szlam, Karol Gregor, Yann L. Cun
Abstract: This work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified. 1
5 0.83292866 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
6 0.83194685 244 nips-2011-Selecting Receptive Fields in Deep Networks
7 0.82929993 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
8 0.82854426 265 nips-2011-Sparse recovery by thresholded non-negative least squares
9 0.82768291 263 nips-2011-Sparse Manifold Clustering and Embedding
10 0.82516593 156 nips-2011-Learning to Learn with Compound HD Models
11 0.82455957 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
12 0.82217276 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
13 0.82069129 180 nips-2011-Multiple Instance Filtering
14 0.82031792 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
15 0.82001561 231 nips-2011-Randomized Algorithms for Comparison-based Search
16 0.81861287 199 nips-2011-On fast approximate submodular minimization
17 0.81854165 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
18 0.81767237 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
19 0.81765246 25 nips-2011-Adaptive Hedge
20 0.8168869 227 nips-2011-Pylon Model for Semantic Segmentation