nips nips2011 nips2011-261 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jiquan Ngiam, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, Andrew Y. Ng
Abstract: Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. However, many existing feature learning algorithms are hard to use and require extensive hyperparameter tuning. In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. Instead, it optimizes a simple cost function – the sparsity of ℓ2-normalized features – which can easily be implemented in a few lines of MATLAB code. Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. [sent-4, score-0.288]
2 In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. [sent-6, score-0.503]
3 In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. [sent-7, score-0.455]
4 Instead, it optimizes a simple cost function – the sparsity of ℓ2-normalized features – which can easily be implemented in a few lines of MATLAB code. [sent-8, score-0.453]
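The cost referred to here is easy to state concretely. The following NumPy sketch is our own illustration of that claim (the paper's reference implementation is in MATLAB; the function name and array shapes below are our choices, not the authors'): it normalizes each feature across examples, then each example across features, and sums the absolute values.

```python
import numpy as np

def sparse_filtering_cost(F):
    """Sparsity of l2-normalized features (illustrative sketch, not the authors' MATLAB code).

    F : array of shape (num_features, num_examples) holding feature activations.
    """
    F_tilde = F / np.linalg.norm(F, axis=1, keepdims=True)            # normalize each feature (row) across examples
    F_hat = F_tilde / np.linalg.norm(F_tilde, axis=0, keepdims=True)  # normalize each example (column)
    return np.abs(F_hat).sum()                                        # l1 penalty on the normalized features

cost = sparse_filtering_cost(np.abs(np.random.randn(64, 1000)))       # toy activations
```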
5 Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. [sent-9, score-0.359]
6 We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities. [sent-10, score-0.481]
7 1 Introduction: Unsupervised feature learning has recently emerged as a viable alternative to manually designing feature representations. [sent-11, score-0.362]
8 In many audio [1, 2], image [3, 4], and video [5] tasks, learned features have matched or outperformed features specifically designed for such tasks. [sent-12, score-0.627]
9 For example, the sparse RBM [6, 7] has up to half a dozen hyperparameters and an intractable objective function, making it hard to tune and monitor convergence. [sent-14, score-0.38]
10 In this work, we present sparse filtering, a new feature learning algorithm which is easy to implement and essentially hyperparameter-free. [sent-15, score-0.455]
11 Sparse filtering works by optimizing exclusively for sparsity in the feature distribution. [sent-18, score-0.408]
12 Moreover, the hyperparameter-free approach means that sparse filtering works well on a range of data modalities without the need for specific tuning on each modality. [sent-21, score-0.344]
13 This allows us to easily learn feature representations that are well-suited for a variety of tasks, including object classification and phone classification. [sent-22, score-0.508]
14 Comparison of tunable hyperparameters in various feature learning algorithms. [sent-24, score-0.322]
15 These feature learning approaches have been successfully used to learn good feature representations for a wide variety of tasks [1, 2, 3, 4, 5]. [sent-26, score-0.482]
16 However, they are also often challenging to implement, requiring the tuning of various hyperparameters; see Table 1 for a comparison of tunable hyperparameters in several popular feature learning algorithms. [sent-27, score-0.353]
17 Though ICA has only one tunable hyperparameter, it scales poorly to large sets of features or large inputs. [sent-29, score-0.312]
18 To this end, we only focus on a few key properties of our features – population sparsity, lifetime sparsity, and high dispersal – without explicitly modeling the data distribution. [sent-31, score-0.895]
19 3 Feature distributions: The feature learning methods discussed in the previous section can all be viewed as generating particular feature distributions. [sent-34, score-0.362]
20 For instance, sparse coding represents each example using a few non-zero coefficients (features). [sent-35, score-0.464]
21 A feature distribution oriented approach can provide insights into designing new algorithms based on optimizing for desirable properties of the feature distribution. [sent-36, score-0.412]
22 For clarity, let us consider a feature distribution matrix over a finite dataset, where each row is a feature, each column is an example, and each entry $f_j^{(i)}$ is the activity of feature j on example i. [sent-37, score-0.564]
23 We consider the following as desirable properties of the feature distribution: Sparse features per example (Population Sparsity). [sent-39, score-0.41]
24 This notion is known as population sparsity [13, 14] and is considered a principle adopted by the early visual cortex as an efficient means of coding. [sent-43, score-0.363]
25 ICA is unable to learn overcomplete feature representations unless one resorts to extremely expensive approximate orthogonalization algorithms [12]. [sent-45, score-0.426]
26 This property is known as lifetime sparsity [13, 14]. [sent-50, score-0.427]
27 Concretely, we consider the mean squared activations of each feature obtained by averaging the squared values in the feature matrix across the columns (examples). [sent-53, score-0.449]
28 While high dispersal is not strictly necessary for good feature representations, we found that enforcing high dispersal prevents degenerate situations in which the same features are always active [14]. [sent-55, score-1.053]
29 For overcomplete representations, high dispersal translates to having fewer “inactive” features. [sent-56, score-0.315]
30 As an example, principal component analysis (PCA) codes do not generally satisfy high dispersal since the codes that correspond to the largest eigenvalues are almost always active. [sent-57, score-0.349]
31 For instance, [14] showed that population sparsity and lifetime sparsity are not necessarily correlated. [sent-59, score-0.762]
32 For example, the sparse RBM [6] works by constraining the expected activation of a feature (over its lifetime) to be close to a target value. [sent-62, score-0.556]
33 , each basis has unit norm) that normalize each feature, and further optimizes for the lifetime sparsity of the features it learns. [sent-65, score-0.753]
34 Sparse autoencoders [16] also explicitly optimize for lifetime sparsity. [sent-66, score-0.369]
35 On the other hand, clustering-based methods such as k-means [17] can be seen as enforcing an extreme form of population sparsity where each cluster centroid corresponds to a feature and only one feature is allowed to be active per example. [sent-67, score-0.824]
36 Sparse coding [11] is also typically seen as enforcing population sparsity. [sent-69, score-0.42]
37 In this work, we use the feature distribution view to derive a simple feature learning algorithm that solely optimizes for population sparsity while enforcing high dispersal. [sent-70, score-0.816]
38 In our experiments, we found that realizing these two properties was sufficient to allow us to learn overcomplete representations; we also argue later that these two properties are jointly sufficient to ensure lifetime sparsity. [sent-71, score-0.336]
39 4 Sparse filtering: In this section, we will show how the sparse filtering objective captures the aforementioned principles. [sent-72, score-0.322]
40 Concretely, let $f_j^{(i)}$ represent the j-th feature value (rows) for the i-th example (columns), where $f_j^{(i)} = w_j^T x^{(i)}$. [sent-74, score-0.524]
41 Specifically, we first normalize each feature to be equally active by dividing each feature by its ℓ2 norm across all examples: $\tilde{f}_j = f_j / \|f_j\|_2$. [sent-76, score-0.797]
42 The normalized features are optimized for sparseness using the ℓ1 penalty. [sent-78, score-0.333]
43 For a dataset of M examples, this gives us the sparse filtering objective (Eqn. 1): minimize $\sum_{i=1}^{M} \|\hat{f}^{(i)}\|_1 = \sum_{i=1}^{M} \big\| \tilde{f}^{(i)} / \|\tilde{f}^{(i)}\|_2 \big\|_1$. (1) [sent-79, score-0.322]
44 1 Optimizing for population sparsity: The term $\|\hat{f}^{(i)}\|_1 = \big\| \tilde{f}^{(i)} / \|\tilde{f}^{(i)}\|_2 \big\|_1$ measures the population sparsity of the features on the i-th example. [sent-82, score-0.899]
45 Since the normalized features $\hat{f}^{(i)}$ are constrained to lie on the unit ℓ2-ball, this objective is minimized when the features are sparse (Fig. [sent-83, score-0.827]
46 Notice that the sparseness of the features (in the ℓ1 sense) is maximized when the examples are on the axes. [sent-89, score-0.286]
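As a small numerical illustration of this geometric picture (our own example, not from the paper), compare a point on an axis of the unit ℓ2-ball with a point between axes: both have unit ℓ2 norm, but the on-axis (sparse) point has the smaller ℓ1 norm, which is what the objective rewards.

```python
import numpy as np

on_axis = np.array([1.0, 0.0])                 # maximally sparse point on the unit l2-ball
off_axis = np.array([1.0, 1.0]) / np.sqrt(2)   # equally active features, also unit l2 norm

print(np.linalg.norm(on_axis), np.abs(on_axis).sum())    # 1.0, 1.0
print(np.linalg.norm(off_axis), np.abs(off_axis).sum())  # 1.0, ~1.414
```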
47 One property of normalizing features is that it implicitly introduces competition between features. [sent-93, score-0.291]
48 Since we are minimizing $\|\hat{f}^{(i)}\|_1$, the objective encourages the normalized features $\hat{f}^{(i)}$ to be sparse and mostly close to zero. [sent-97, score-0.369]
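A minimal sketch of this competition effect (our own toy example): because each example is rescaled to unit ℓ2 norm, raising one raw feature value necessarily shrinks the normalized values of the others.

```python
import numpy as np

def normalize(f):
    return f / np.linalg.norm(f)   # project the example onto the unit l2-ball

f = np.array([1.0, 1.0, 1.0])
f_boosted = np.array([5.0, 1.0, 1.0])    # only the first raw feature grows

print(normalize(f))          # [0.577 0.577 0.577]
print(normalize(f_boosted))  # [0.962 0.192 0.192]; the remaining features are suppressed
```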
49 This measure is commonly used to characterize the sparsity of neuron activations in the brain. [sent-101, score-0.36]
50 2 Optimizing for high dispersal: Recall that for high dispersal we want every feature to be equally active. [sent-104, score-0.723]
51 Specifically, we want the mean squared activation of each feature to be roughly equal. [sent-105, score-0.283]
52 In our formulation of sparse filtering, we first normalize each feature so that it is equally active by dividing each feature by its ℓ2 norm across the examples: $\tilde{f}_j = f_j / \|f_j\|_2$. [sent-106, score-1.096]
53 This has the same effect as constraining each feature to have the same expected squared value, $\mathbb{E}_{x^{(i)} \sim \mathcal{D}}[(f_j^{(i)})^2] = 1$, thus enforcing high dispersal. [sent-107, score-0.308]
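A quick numerical check (our own, using the definitions above) that dividing each feature by its ℓ2 norm across examples equalizes the mean squared activation of every feature:

```python
import numpy as np

rng = np.random.default_rng(0)
F = np.abs(rng.normal(size=(64, 1000)))                 # raw activations: 64 features x 1000 examples
F_tilde = F / np.linalg.norm(F, axis=1, keepdims=True)  # per-feature l2 normalization

mean_sq = (F_tilde ** 2).mean(axis=1)                   # mean squared activation of each feature
print(mean_sq.min(), mean_sq.max())                     # identical (1/1000): every feature is equally active
```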
54 3 Optimizing for lifetime sparsity: We found that optimizing for population sparsity and enforcing high dispersal led to lifetime sparsity in our features. [sent-109, score-1.569]
55 To understand how lifetime sparsity is achieved, first notice that a feature distribution which is population sparse must have many non-active (zero) entries in the feature distribution matrix. [sent-110, score-1.277]
56 Therefore, every feature must have a significant number of zero entries and be lifetime sparse. [sent-112, score-0.46]
57 This implies that optimizing for population sparsity and high dispersal is sufficient to define a good feature distribution. [sent-113, score-0.824]
58 4 Deep sparse filtering: Since the sparse filtering objective is agnostic about the method which generates the feature matrix, one is relatively free to choose the feedforward network that computes the features. [sent-115, score-0.777]
59 In this way, sparse filtering presents itself as a natural framework for training deep networks. [sent-119, score-0.391]
60 Training a deep network with sparse filtering can be achieved using the canonical greedy layerwise approach [7, 19]. [sent-120, score-0.38]
61 In particular, after training a single layer of features with sparse filtering, one can compute the normalized features $\hat{f}^{(i)}$ and then use these as input to sparse filtering for learning another layer of features. [sent-121, score-1.34]
62 In practice, we find that greedy layer-wise training with sparse filtering learns meaningful representations on the next layer (Sec. [sent-122, score-0.571]
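A sketch of the greedy layer-wise recipe described here. The helper names and the use of random weights as a stand-in trainer are our own (a real trainer would minimize the sparse filtering objective, as sketched after the next sentence); the point is only to show how the normalized first-layer features become the input to the second layer.

```python
import numpy as np

def normalized_features(W, X, eps=1e-8):
    """Soft-absolute activations, normalized per feature (rows), then per example (columns)."""
    F = np.sqrt(eps + (W @ X) ** 2)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return F / np.linalg.norm(F, axis=0, keepdims=True)

def train_layer(X, num_features):
    # Stand-in for a sparse filtering trainer; returns random weights so the recipe runs end to end.
    return np.random.randn(num_features, X.shape[0])

X = np.random.randn(192, 1000)       # toy data: input_dim x num_examples
W1 = train_layer(X, 256)             # first layer
H1 = normalized_features(W1, X)      # normalized first-layer features
W2 = train_layer(H1, 256)            # second layer, trained greedily on the first layer's outputs
```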
63 5 Experiments: In our experiments, we adopted the soft-absolute function $f_j^{(i)} = \sqrt{\epsilon + (w_j^T x^{(i)})^2} \approx |w_j^T x^{(i)}|$ as our activation function, setting $\epsilon = 10^{-8}$, and used an off-the-shelf L-BFGS [20] package to optimize the sparse filtering objective until convergence. [sent-125, score-0.548]
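A runnable sketch of this optimization step on a tiny toy problem, assuming SciPy rather than the MATLAB L-BFGS package used in the paper; for brevity we let SciPy approximate the gradient numerically, whereas a practical implementation would supply an analytic gradient.

```python
import numpy as np
from scipy.optimize import minimize

def sparse_filtering_cost(w_flat, X, num_features, eps=1e-8):
    W = w_flat.reshape(num_features, X.shape[0])
    F = np.sqrt(eps + (W @ X) ** 2)                        # soft-absolute activation
    F = F / np.linalg.norm(F, axis=1, keepdims=True)       # equalize feature activity (high dispersal)
    F = F / np.linalg.norm(F, axis=0, keepdims=True)       # place each example on the unit l2-ball
    return np.abs(F).sum()                                 # population sparsity (l1) penalty

X = np.random.randn(20, 200)                               # toy data: input_dim x num_examples
num_features = 30
w0 = 0.01 * np.random.randn(num_features * X.shape[0])

result = minimize(sparse_filtering_cost, w0, args=(X, num_features),
                  method="L-BFGS-B", options={"maxiter": 50})
W_learned = result.x.reshape(num_features, X.shape[0])
```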
64 1 Timing and scaling up. Figure 2: Timing comparisons between sparse coding, ICA, sparse autoencoders and sparse filtering over different input sizes. [sent-127, score-0.028]
65 In this section, we examine the efficiency of the sparse filtering algorithm by comparing it against ICA, sparse coding, and sparse autoencoders. [sent-128, score-0.822]
66 For each image size, we learned a complete set of features (i. [sent-131, score-0.355]
67 We implemented sparse autoencoders as described in Coates et al. [sent-134, score-0.393]
68 However, with 32 × 32 image patches (3072-dimensional inputs), sparse coding, sparse autoencoders and ICA were significantly slower to converge than sparse filtering (Fig. [sent-138, score-1.045]
69 For ICA, each iteration of the algorithm (FastICA [12]) requires orthogonalizing the bases learned; since the cost of orthogonalization is cubic in the number of features, the algorithm can be very slow when the number of features is large. [sent-140, score-0.288]
70 For sparse coding, as the number of features increased, it took significantly longer to solve the ℓ1-regularized least squares problem for finding the coefficients. [sent-141, score-0.503]
71 We obtained an overall speedup of at least 4x over sparse coding and ICA when learning features from 32 × 32 image patches. [sent-142, score-0.756]
72 In contrast to ICA, optimizing the sparse filtering objective does not require the expensive cubic-time whitening step. [sent-143, score-0.408]
73 For the larger input dimensions, sparse coding and sparse autoencoders did not converge in a reasonable time (<3 hours). [sent-144, score-0.857]
74 2 Natural images: In this section, we applied sparse filtering to learn features from 200,000 randomly sampled patches (16x16) from natural images [9]. [sent-146, score-0.637]
75 The first layer of features learned by sparse filtering corresponded to Gabor-like edge detectors, similar to those learned by standard sparse feature learning methods [6, 9, 10, 11, 16]. [sent-150, score-1.205]
76 More interestingly, when we learned a second layer of features using greedy layer-wise stacking on the features produced by the first layer, it discovered meaningful features that pool the first-layer features (Fig. [sent-151, score-1.288]
77 We highlight that the second layer of features was learned using the same algorithm without any tuning or preprocessing of the data. [sent-153, score-0.444]
78 Figure 3: Learned pooling units in a second layer using sparse filtering. [sent-156, score-0.395]
79 We show the most strongly connected first layer units for each second layer unit; each column corresponds to a second layer unit. [sent-157, score-0.363]
80 To obtain features from the large image, we followed the protocol of [17]: features were extracted densely from all locations in each image and later pooled into quadrants. [sent-174, score-0.586]
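A sketch of the quadrant-pooling step of this protocol (our own illustration; the exact pooling function of [17] is not reproduced here, and the choice of sum pooling is an assumption):

```python
import numpy as np

def pool_quadrants(feature_map):
    """Pool a (height, width, num_features) map of densely extracted features into 2x2 quadrants."""
    h, w, _ = feature_map.shape
    top, left = h // 2, w // 2
    quadrants = [feature_map[:top, :left], feature_map[:top, left:],
                 feature_map[top:, :left], feature_map[top:, left:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])   # shape: (4 * num_features,)

dense_features = np.random.rand(27, 27, 256)   # features computed at every image location
image_descriptor = pool_quadrants(dense_features)
print(image_descriptor.shape)                  # (1024,)
```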
81 For a fair comparison, the number of features learnt was also set to be consistent with the number of features used by [17]. [sent-179, score-0.458]
82 In accordance with the recommended STL-10 testing protocol [17], we performed supervised training on each of the 10 supervised training folds and reported the mean accuracy on the full test set along with the standard deviation across the 10 training folds (Table 2). [sent-180, score-0.325]
83 Random weight baselines have been shown to perform remarkably well on a variety of tasks [23], and provide a means of distinguishing the effect of our divisive normalization scheme versus the effect of feature learning. [sent-185, score-0.451]
84 Test accuracy for phone classification using features learned from MFCCs. [sent-188, score-0.448]
85 Using sparse filtering, we learned 256 features from contiguous groups of 11 MFCC frames. [sent-205, score-0.566]
86 For comparison, we also learned sets of 256 features in a similar way using sparse coding [11, 30] and ICA [12]. [sent-206, score-0.756]
87 To evaluate the relative performances of the different feature sets (MFCC, ICA, sparse coding and sparse filtering), we used a linear SVM, choosing the regularization coefficient C by cross-validation on the development set. [sent-208, score-0.919]
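A sketch of this classifier stage, assuming scikit-learn (the paper does not name an SVM implementation) and synthetic stand-ins for the learned features; C is selected by validating on the development split, as described.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for learned features and phone labels on the train and development splits.
X_train, y_train = np.random.randn(500, 256), np.random.randint(0, 39, 500)
X_dev, y_dev = np.random.randn(200, 256), np.random.randint(0, 39, 200)

best_C, best_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):                 # candidate regularization coefficients
    clf = LinearSVC(C=C).fit(X_train, y_train)
    acc = clf.score(X_dev, y_dev)                # validate on the development set
    if acc > best_acc:
        best_C, best_acc = C, acc
```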
88 We found that the features learned using sparse filtering outperformed MFCC features alone and ICA features; they were also competitive with sparse coding and faster to compute. [sent-209, score-1.259]
89 Using an RBF kernel [31] gave performances competitive with state-of-the-art methods when MFCCs were combined with learned sparse filtering features (Table 3). [sent-210, score-0.566]
90 In contrast, concatenating ICA and sparse coding features with MFCCs resulted in decreased performance when compared to MFCCs alone. [sent-211, score-0.719]
91 Indeed, these pipelines are built on top of feature representations that can be derived from a variety of sources, including sparse filtering. [sent-213, score-0.591]
92 Conversely, sparse filtering uses divisive normalization as an integral component of the feature learning process to introduce competition between features, resulting in population sparse representations. [sent-220, score-1.195]
93 2 Connections to ICA and sparse coding: The sparse filtering objective can be viewed as a normalized version of the ICA objective. [sent-222, score-0.833]
94 In sparse filtering, we replace the objective with a normalized sparsity penalty, where the response of the filters is divided by the norm of all the filter responses ($\|Wx\|_1 / \|Wx\|_2$). [sent-227, score-0.546]
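To make the comparison concrete, a small sketch (our own) of the two per-example penalties on the same filter responses; the ICA version additionally requires W to be orthonormal, a constraint that sparse filtering drops.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))       # filters
x = rng.normal(size=64)             # one input example
responses = W @ x

ica_penalty = np.abs(responses).sum()                             # ICA: ||Wx||_1 (with W orthonormal)
sf_penalty = ica_penalty / np.linalg.norm(responses)              # sparse filtering: ||Wx||_1 / ||Wx||_2
```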
95 Similarly, one can apply the normalization idea to the sparse coding framework. [sent-229, score-0.551]
96 In particular, sparse filtering resembles the ℓ1/ℓ2 sparsity penalty that has been used in non-negative matrix factorization [35]. [sent-230, score-0.514]
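A minimal sketch of the ℓ1/ℓ2 ratio referred to here (our own example): it is small for sparse vectors and grows toward the square root of the dimension for dense ones.

```python
import numpy as np

def l1_l2_penalty(v):
    """l1/l2 sparsity penalty: 1.0 for a one-hot vector, sqrt(len(v)) for a uniform one."""
    return np.abs(v).sum() / np.linalg.norm(v)

print(l1_l2_penalty(np.array([1.0, 0.0, 0.0, 0.0])))   # 1.0  (sparse)
print(l1_l2_penalty(np.ones(4)))                       # 2.0  (dense)
```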
97 Thus, instead of the usual ℓ1 penalty that is used in conjunction with sparse coding (i. [sent-231, score-0.527]
98 Unsupervised feature learning for audio classification using convolutional deep belief networks. [sent-253, score-0.296]
99 Linear spatial pyramid matching using sparse coding for image classification. [sent-261, score-0.527]
100 Efficient learning of sparse representations with an energy-based model. [sent-357, score-0.338]
wordName wordTfidf (topN-words)
[('ica', 0.334), ('ltering', 0.309), ('sparse', 0.274), ('dispersal', 0.258), ('lifetime', 0.25), ('features', 0.229), ('coding', 0.19), ('feature', 0.181), ('sparsity', 0.177), ('mfcc', 0.176), ('population', 0.158), ('phone', 0.156), ('fj', 0.152), ('divisive', 0.13), ('layer', 0.121), ('autoencoders', 0.119), ('normalization', 0.087), ('tunable', 0.083), ('mfccs', 0.083), ('lmgmm', 0.077), ('rbf', 0.075), ('activation', 0.074), ('enforcing', 0.072), ('lters', 0.072), ('deep', 0.072), ('ranzato', 0.071), ('svm', 0.071), ('unsupervised', 0.069), ('phonetic', 0.068), ('representations', 0.064), ('image', 0.063), ('penalty', 0.063), ('learned', 0.063), ('competition', 0.062), ('hyperparameter', 0.059), ('orthogonalization', 0.059), ('timit', 0.059), ('hyperparameters', 0.058), ('overcomplete', 0.057), ('sparseness', 0.057), ('classi', 0.056), ('active', 0.055), ('object', 0.051), ('rbms', 0.051), ('normalize', 0.05), ('optimizing', 0.05), ('concretely', 0.048), ('objective', 0.048), ('optimizes', 0.047), ('normalized', 0.047), ('folds', 0.046), ('pipelines', 0.045), ('training', 0.045), ('audio', 0.043), ('patches', 0.041), ('modalities', 0.039), ('wj', 0.039), ('protocol', 0.038), ('koh', 0.037), ('expensive', 0.036), ('schwartz', 0.035), ('coates', 0.035), ('boltzmann', 0.034), ('greedy', 0.034), ('gracefully', 0.034), ('hyv', 0.034), ('meaningful', 0.033), ('rbm', 0.033), ('inputs', 0.032), ('images', 0.032), ('codes', 0.031), ('tuning', 0.031), ('activations', 0.031), ('speech', 0.03), ('cvpr', 0.03), ('supervised', 0.03), ('component', 0.029), ('entries', 0.029), ('learn', 0.029), ('matlab', 0.028), ('recognition', 0.028), ('hmm', 0.028), ('filtering', 0.028), ('squared', 0.028), ('visual', 0.028), ('denoising', 0.027), ('constraining', 0.027), ('notice', 0.027), ('variety', 0.027), ('sensory', 0.027), ('extracted', 0.027), ('timing', 0.026), ('decreased', 0.026), ('coef', 0.026), ('equally', 0.026), ('weight', 0.026), ('activity', 0.025), ('increased', 0.025), ('row', 0.025), ('formulation', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 261 nips-2011-Sparse Filtering
Author: Jiquan Ngiam, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, Andrew Y. Ng
Abstract: Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. However, many existing feature learning algorithms are hard to use and require extensive hyperparameter tuning. In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. Instead, it optimizes a simple cost function – the sparsity of ℓ2-normalized features – which can easily be implemented in a few lines of MATLAB code. Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities. 1
2 0.35738248 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning
Author: Quoc V. Le, Alexandre Karpenko, Jiquan Ngiam, Andrew Y. Ng
Abstract: Independent Components Analysis (ICA) and its variants have been successfully used for unsupervised feature learning. However, standard ICA requires an orthonormality constraint to be enforced, which makes it difficult to learn overcomplete features. In addition, ICA is sensitive to whitening. These properties make it challenging to scale ICA to high dimensional data. In this paper, we propose a robust soft reconstruction cost for ICA that allows us to learn highly overcomplete sparse features even on unwhitened data. Our formulation reveals formal connections between ICA and sparse autoencoders, which have previously been observed only empirically. Our algorithm can be used in conjunction with off-the-shelf fast unconstrained optimizers. We show that the soft reconstruction cost can also be used to prevent replicated features in tiled convolutional neural networks. Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performances on a wide variety of object recognition tasks. We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets. 1
3 0.23125519 244 nips-2011-Selecting Receptive Fields in Deep Networks
Author: Adam Coates, Andrew Y. Ng
Abstract: Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters can grow quadratically in the width of the network, thus necessitating hand-coded “local receptive fields” that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. 1
4 0.18739389 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and contrast normalization. We investigate the architecture of HMP, and show that all three components are critical for good performance. To speed up the orthogonal matching pursuit, we propose a batch tree orthogonal matching pursuit that is particularly suitable to encode a large number of observations that share the same large dictionary. HMP is scalable and can efficiently handle full-size images. In addition, HMP enables linear support vector machines (SVM) to match the performance of nonlinear SVM while being scalable to large datasets. We compare HMP with many state-of-the-art algorithms including convolutional deep belief networks, SIFT based single layer sparse coding, and kernel based feature learning. HMP consistently yields superior accuracy on three types of image classification problems: object recognition (Caltech-101), scene recognition (MIT-Scene), and static event recognition (UIUC-Sports). 1
5 0.17475 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
Author: Yan Karklin, Eero P. Simoncelli
Abstract: Efficient coding provides a powerful principle for explaining early sensory coding. Most attempts to test this principle have been limited to linear, noiseless models, and when applied to natural images, have yielded oriented filters consistent with responses in primary visual cortex. Here we show that an efficient coding model that incorporates biologically realistic ingredients – input and output noise, nonlinear response functions, and a metabolic cost on the firing rate – predicts receptive fields and response nonlinearities similar to those observed in the retina. Specifically, we develop numerical methods for simultaneously learning the linear filters and response nonlinearities of a population of model neurons, so as to maximize information transmission subject to metabolic costs. When applied to an ensemble of natural images, the method yields filters that are center-surround and nonlinearities that are rectifying. The filters are organized into two populations, with On- and Off-centers, which independently tile the visual space. As observed in the primate retina, the Off-center neurons are more numerous and have filters with smaller spatial extent. In the absence of noise, our method reduces to a generalized version of independent components analysis, with an adapted nonlinear “contrast” function; in this case, the optimal filters are localized and oriented.
6 0.16624776 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
7 0.14521585 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
8 0.13217086 276 nips-2011-Structured sparse coding via lateral inhibition
9 0.12688105 258 nips-2011-Sparse Bayesian Multi-Task Learning
10 0.12360382 298 nips-2011-Unsupervised learning models of primary cortical receptive fields and receptive field plasticity
11 0.11919285 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
12 0.11510332 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
13 0.11050227 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
14 0.11021894 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
15 0.10661249 156 nips-2011-Learning to Learn with Compound HD Models
16 0.10316253 259 nips-2011-Sparse Estimation with Structured Dictionaries
17 0.10206772 180 nips-2011-Multiple Instance Filtering
18 0.10179754 250 nips-2011-Shallow vs. Deep Sum-Product Networks
19 0.10016808 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
20 0.099817954 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
topicId topicWeight
[(0, 0.26), (1, 0.198), (2, 0.017), (3, 0.05), (4, 0.021), (5, 0.157), (6, 0.204), (7, 0.301), (8, -0.024), (9, -0.161), (10, -0.082), (11, -0.12), (12, 0.13), (13, -0.003), (14, -0.006), (15, 0.01), (16, 0.021), (17, -0.003), (18, 0.063), (19, -0.007), (20, 0.011), (21, -0.062), (22, -0.001), (23, -0.04), (24, 0.018), (25, -0.007), (26, -0.018), (27, 0.071), (28, -0.077), (29, -0.06), (30, 0.025), (31, -0.079), (32, 0.026), (33, 0.01), (34, -0.08), (35, -0.056), (36, 0.044), (37, 0.04), (38, 0.05), (39, -0.094), (40, 0.054), (41, 0.083), (42, -0.077), (43, 0.012), (44, 0.166), (45, 0.147), (46, -0.006), (47, -0.011), (48, -0.063), (49, -0.074)]
simIndex simValue paperId paperTitle
1 0.96740133 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning
Author: Quoc V. Le, Alexandre Karpenko, Jiquan Ngiam, Andrew Y. Ng
Abstract: Independent Components Analysis (ICA) and its variants have been successfully used for unsupervised feature learning. However, standard ICA requires an orthonormality constraint to be enforced, which makes it difficult to learn overcomplete features. In addition, ICA is sensitive to whitening. These properties make it challenging to scale ICA to high dimensional data. In this paper, we propose a robust soft reconstruction cost for ICA that allows us to learn highly overcomplete sparse features even on unwhitened data. Our formulation reveals formal connections between ICA and sparse autoencoders, which have previously been observed only empirically. Our algorithm can be used in conjunction with off-the-shelf fast unconstrained optimizers. We show that the soft reconstruction cost can also be used to prevent replicated features in tiled convolutional neural networks. Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performances on a wide variety of object recognition tasks. We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets. 1
same-paper 2 0.95878005 261 nips-2011-Sparse Filtering
Author: Jiquan Ngiam, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, Andrew Y. Ng
Abstract: Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. However, many existing feature learning algorithms are hard to use and require extensive hyperparameter tuning. In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. Instead, it optimizes a simple cost function – the sparsity of ℓ2-normalized features – which can easily be implemented in a few lines of MATLAB code. Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities. 1
3 0.77740628 244 nips-2011-Selecting Receptive Fields in Deep Networks
Author: Adam Coates, Andrew Y. Ng
Abstract: Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters can grow quadratically in the width of the network, thus necessitating hand-coded “local receptive fields” that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. 1
4 0.73211122 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and contrast normalization. We investigate the architecture of HMP, and show that all three components are critical for good performance. To speed up the orthogonal matching pursuit, we propose a batch tree orthogonal matching pursuit that is particularly suitable to encode a large number of observations that share the same large dictionary. HMP is scalable and can efficiently handle full-size images. In addition, HMP enables linear support vector machines (SVM) to match the performance of nonlinear SVM while being scalable to large datasets. We compare HMP with many state-of-the-art algorithms including convolutional deep belief networks, SIFT based single layer sparse coding, and kernel based feature learning. HMP consistently yields superior accuracy on three types of image classification problems: object recognition (Caltech-101), scene recognition (MIT-Scene), and static event recognition (UIUC-Sports). 1
5 0.68879092 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
Author: Nobuyuki Morioka, Shin'ichi Satoh
Abstract: Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision. For visual object category recognition, ℓ1 regularized sparse coding is combined with the spatial pyramid representation to obtain state-of-the-art performance. However, because of its iterative optimization, applying sparse coding onto every local feature descriptor extracted from an image database can become a major bottleneck. To overcome this computational challenge, this paper presents “Generalized Lasso based Approximation of Sparse coding” (GLAS). By representing the distribution of sparse coefficients with slice transform, we fit a piece-wise linear mapping function with the generalized lasso. We also propose an efficient post-refinement procedure to perform mutual inhibition between bases which is essential for an overcomplete setting. The experiments show that GLAS obtains a comparable performance to ℓ1 regularized sparse coding, yet achieves a significant speed up demonstrating its effectiveness for large-scale visual recognition problems. 1
6 0.65349299 298 nips-2011-Unsupervised learning models of primary cortical receptive fields and receptive field plasticity
7 0.59360647 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
8 0.56514132 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
9 0.54175842 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
10 0.53434634 276 nips-2011-Structured sparse coding via lateral inhibition
11 0.52749074 143 nips-2011-Learning Anchor Planes for Classification
12 0.51747721 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
13 0.51581472 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
14 0.48367807 293 nips-2011-Understanding the Intrinsic Memorability of Images
15 0.47632515 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
16 0.46054983 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
17 0.44086367 275 nips-2011-Structured Learning for Cell Tracking
18 0.43893051 260 nips-2011-Sparse Features for PCA-Like Linear Regression
19 0.43636194 156 nips-2011-Learning to Learn with Compound HD Models
20 0.43488094 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
topicId topicWeight
[(0, 0.028), (4, 0.021), (20, 0.027), (26, 0.019), (31, 0.075), (33, 0.021), (43, 0.057), (45, 0.141), (57, 0.046), (65, 0.346), (74, 0.071), (83, 0.031), (84, 0.016), (99, 0.038)]
simIndex simValue paperId paperTitle
1 0.95277584 298 nips-2011-Unsupervised learning models of primary cortical receptive fields and receptive field plasticity
Author: Maneesh Bhand, Ritvik Mudur, Bipin Suresh, Andrew Saxe, Andrew Y. Ng
Abstract: The efficient coding hypothesis holds that neural receptive fields are adapted to the statistics of the environment, but is agnostic to the timescale of this adaptation, which occurs on both evolutionary and developmental timescales. In this work we focus on that component of adaptation which occurs during an organism’s lifetime, and show that a number of unsupervised feature learning algorithms can account for features of normal receptive field properties across multiple primary sensory cortices. Furthermore, we show that the same algorithms account for altered receptive field properties in response to experimentally altered environmental statistics. Based on these modeling results we propose these models as phenomenological models of receptive field plasticity during an organism’s lifetime. Finally, due to the success of the same models in multiple sensory areas, we suggest that these algorithms may provide a constructive realization of the theory, first proposed by Mountcastle [1], that a qualitatively similar learning algorithm acts throughout primary sensory cortices. 1
2 0.9489516 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
Author: Ke Chen, Ahmad Salman
Abstract: Speech conveys different yet mixed information ranging from linguistic to speaker-specific components, and each of them should be exclusively used in a specific task. However, it is extremely difficult to extract a specific information component given the fact that nearly all existing acoustic representations carry all types of speech information. Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference of irrelevant information. In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs. As a result, a multi-objective loss function is proposed for learning speaker-specific characteristics and regularization via normalizing interference of non-speaker related information and avoiding information loss. With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that a resultant speaker-specific representation is insensitive to text/languages spoken and environmental mismatches and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and relate our approach to previous work. 1
3 0.79187751 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
4 0.78452975 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning
Author: Quoc V. Le, Alexandre Karpenko, Jiquan Ngiam, Andrew Y. Ng
Abstract: Independent Components Analysis (ICA) and its variants have been successfully used for unsupervised feature learning. However, standard ICA requires an orthonormality constraint to be enforced, which makes it difficult to learn overcomplete features. In addition, ICA is sensitive to whitening. These properties make it challenging to scale ICA to high dimensional data. In this paper, we propose a robust soft reconstruction cost for ICA that allows us to learn highly overcomplete sparse features even on unwhitened data. Our formulation reveals formal connections between ICA and sparse autoencoders, which have previously been observed only empirically. Our algorithm can be used in conjunction with off-the-shelf fast unconstrained optimizers. We show that the soft reconstruction cost can also be used to prevent replicated features in tiled convolutional neural networks. Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performances on a wide variety of object recognition tasks. We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets. 1
same-paper 5 0.7707479 261 nips-2011-Sparse Filtering
Author: Jiquan Ngiam, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, Andrew Y. Ng
Abstract: Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. However, many existing feature learning algorithms are hard to use and require extensive hyperparameter tuning. In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. Instead, it optimizes a simple cost function – the sparsity of ℓ2-normalized features – which can easily be implemented in a few lines of MATLAB code. Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities. 1
6 0.76419669 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
7 0.71693462 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
8 0.68803847 244 nips-2011-Selecting Receptive Fields in Deep Networks
9 0.62457538 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
10 0.57389593 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
11 0.5680871 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
12 0.56347066 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
13 0.55856007 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
14 0.55135918 287 nips-2011-The Manifold Tangent Classifier
15 0.55050826 276 nips-2011-Structured sparse coding via lateral inhibition
16 0.54703015 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)
17 0.54137772 35 nips-2011-An ideal observer model for identifying the reference frame of objects
18 0.53858846 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
19 0.5381763 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
20 0.53567588 263 nips-2011-Sparse Manifold Clustering and Embedding