nips nips2008 nips2008-62 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. [sent-7, score-0.647]
2 We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. [sent-8, score-0.632]
3 We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. [sent-11, score-0.082]
4 1 Introduction Sparse approximation is a key technique developed in engineering and the sciences which approximates an input signal, X, in terms of a “sparse” combination of fixed bases B. [sent-12, score-0.23]
5 In this notation, each input signal forms a column of an input matrix X, and is generated by multiplying a set of basis vectors B by a column from a coefficient matrix W, while f(z) is an optional transfer function. [sent-14, score-0.395]
6 This relationship is only approximate, as the input data is assumed to be corrupted by random noise. [sent-15, score-0.112]
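A minimal sketch may make this generative setup concrete; the dimensions, sparsity level, and noise scale below are illustrative assumptions, and the transfer function is taken to be the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_bases, n_examples = 20, 50, 5
B = rng.normal(size=(n_features, n_bases))       # columns of B are the fixed bases
# mostly-zero coefficient matrix W: each column holds the sparse weights for one signal
mask = rng.random((n_bases, n_examples)) < 0.1
W = np.where(mask, rng.normal(size=(n_bases, n_examples)), 0.0)

noise = 0.01 * rng.normal(size=(n_features, n_examples))
X = B @ W + noise                                # each column of X is one noisy input signal
```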
7 Sparse coding [2] is closely connected to Independent Component Analysis as well as to certain approaches to matrix factorization. [sent-17, score-0.304]
8 It extends sparse approximation by learning a basis matrix B which represents well a collection of related input signals–the input matrix X–in addition to performing optimization to compute the best set of weights Ŵ. [sent-18, score-0.552]
9 Unfortunately, existing sparse coding algorithms that leverage an efficient, convex sparse approximation step to perform inference on the latent weight vector [4] are difficult to integrate into a larger learning architecture. [sent-19, score-0.921]
10 It has been convincingly demonstrated that back-propagation is a crucial tool for tuning an existing generative model’s output in order to improve supervised performance on a discriminative task. [sent-20, score-0.111]
11 Unfortunately, existing sparse coding architectures produce a latent representation Ŵ that is an unstable, discontinuous function of the inputs and bases; an arbitrarily small change in input can lead to the selection of a completely different set of latent weights. [sent-22, score-0.783]
12 We present an advantageous new approach to coding that uses smoother priors which preserve the sparsity benefits of L1 -regularization while allowing efficient convex inference and producing stable latent representations Ŵ. [sent-23, score-0.536]
13 In particular we examine a prior based on minimizing KL-divergence to the uniform distribution, which has long been used for approximation problems [6, 7]. [sent-24, score-0.113]
14 We show this increased stability leads to better semi-supervised classification performance across a wide variety of applications for classifiers using the latent representation Ŵ as input. [sent-25, score-0.201]
15 Additionally, because of the smoothness of the KL-divergence prior, B can be optimized discriminatively for a particular application by gradient descent, leading to outstanding empirical performance. [sent-26, score-0.091]
16 X_j is the jth column of X, X^i is the ith row of X, and X^i_j is the element in the ith row and jth column. [sent-29, score-0.098]
17 3 Generative Model Sparse coding fits a generative model (1) to unlabeled data, and the MAP estimates of the latent variables of this model have been shown to be useful as input for prediction problems [4]. [sent-32, score-0.708]
18 (1) divides the latent variables into two independent groups, the coefficients W and the basis B, which combine to form the matrix of input examples X. [sent-33, score-0.296]
19 The Maximum A Posteriori (MAP) approximation replaces the integration over W and B in (1) with the maximum value of P(X|W, B)P(W)P(B), and the values of the latent variables at the maximum, Ŵ and B̂, are the MAP estimates. [sent-35, score-0.133]
20 Finding Ŵ given B is an approximation problem; solving for Ŵ and B̂ simultaneously over a set of independent examples is a coding problem. [sent-36, score-0.388]
21 P(X) = ∫_B ∫_W P(X|W, B) P(W) P(B) dW dB = ∫_B P(B) ∏_i ∫_{W_i} P(X_i | W_i, B) P(W_i) dW_i dB   (1). Given B, the negative log of the generative model can be optimized independently for each example, and it is denoted for a generic example x by L in (2). [sent-37, score-0.106]
22 L decomposes into the sum of two terms: a loss function D_L(x ‖ f(Bw)) between an input example and the reconstruction produced by the transfer function f, and a regularization function D_P(w ‖ p) that measures a distance between the coefficients for the example w and a parameter vector p. [sent-38, score-0.501]
23 A regularization constant λ controls the relative weight of these two terms. [sent-39, score-0.2]
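Assembled from the two preceding sentences, the per-example objective L therefore has the following shape; this is a reconstruction from the surrounding prose, not a transcription of the paper's own equations (2)-(3).

```latex
% Per-example objective: Bregman reconstruction loss plus a lambda-weighted regularizer
\[
  L(w) \;=\; D_L\!\bigl(x \,\|\, f(Bw)\bigr) \;+\; \lambda\, D_P\!\bigl(w \,\|\, p\bigr)
\]
```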
24 Every exponential family distribution defines a Bregman divergence which serves as a matching loss function for estimating the parameters of the distribution. [sent-42, score-0.171]
25 One common choice for the loss/transfer functions is the squared loss function with its matching linear transfer function, D_L(x ‖ f(Bw)) = Σ_i (x_i − B^i w)², which is the matching Bregman divergence for x drawn from a multidimensional Gaussian distribution. [sent-43, score-0.295]
26 The regularization function D_P(w ‖ p) is also often a Bregman divergence, but may be chosen for other features such as the sparsity of the resulting MAP estimate ŵ. [sent-44, score-0.258]
27 A vector is commonly called sparse if many elements are exactly zero. [sent-45, score-0.281]
28 That matching transfer function is the gradient ∇φ of the function φ which is associated with the Bregman divergence D_φ(x ‖ y) = φ(x) − φ(y) − ⟨x − y, ∇φ(y)⟩. [sent-47, score-0.357]
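A small sketch of these two ingredients, assuming the squared-loss / linear-transfer pairing named above; the function names are illustrative, not the authors' code.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """General Bregman divergence D_phi(x || y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    return phi(x) - phi(y) - float((x - y) @ grad_phi(y))

def squared_loss(x, B, w):
    """Matching loss for the linear transfer f(z) = z: D_L(x || Bw) = sum_i (x_i - B^i w)^2."""
    r = x - B @ w
    return float(r @ r)

# Sanity check: with phi(v) = ||v||^2 the Bregman divergence is the squared Euclidean distance.
rng = np.random.default_rng(0)
B = rng.normal(size=(6, 4)); w = rng.random(4); x = rng.normal(size=6)
d = bregman_divergence(lambda v: float(v @ v), lambda v: 2.0 * v, x, B @ w)
assert np.isclose(d, squared_loss(x, B, w))
```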
29 ... containing interesting structure from unlabeled data. [sent-49, score-0.098]
30 However, of these only L1 leads to an efficient, convex procedure for inference, and even this prior does not produce differentiable MAP estimates. [sent-50, score-0.133]
31 We argue that if the latent weight vector ŵ is to be used as input to a classifier, a better definition of "sparsity" is that most elements in ŵ can be replaced by elements in a constant vector p without significantly increasing the loss. [sent-51, score-0.234]
32 One regularization function that produces this form of pseudo-sparsity is the KL-divergence KL(w ‖ p). [sent-52, score-0.25]
33 L1 regularization provides sparse solutions because its Fenchel dual [13] is the max function, meaning only the most useful basis vectors participate in the reconstruction. [sent-54, score-0.59]
34 A differentiable approximation to max_i x_i is a sum of exponentials, Σ_i e^{x_i}, whose dual is the KL-divergence (4). [sent-55, score-0.124]
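A short worked check of this duality claim using standard convex conjugacy (a sketch; the domain and constant conventions are assumptions): the conjugate of the sum of exponentials is the unnormalized negative entropy, and the Bregman divergence that entropy generates is exactly the unnormalized KL of (4) below.

```latex
\[
  f(x) = \sum_i e^{x_i}, \qquad
  f^*(y) = \sup_x \Bigl\{ \langle x, y \rangle - \sum_i e^{x_i} \Bigr\}
         = \sum_i \bigl( y_i \log y_i - y_i \bigr), \quad y_i > 0 .
\]
% The Bregman divergence generated by phi(w) = sum_i (w_i log w_i - w_i):
\[
  D_\phi(w \,\|\, p) \;=\; \sum_i \Bigl( w_i \log \tfrac{w_i}{p_i} - w_i + p_i \Bigr).
\]
```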
35 Regularization with KL has proven useful in online learning, where it is the implicit prior of the exponentiated gradient descent (EGD) algorithm. [sent-56, score-0.395]
36 The form of KL we use (4) is the full Bregman divergence of the negative entropy function. [sent-58, score-0.169]
37 For sparse coding, however, it is inconvenient to assume that ‖w‖_1 = ‖p‖_1 = 1, so we use the full unnormalized KL instead. [sent-60, score-0.546]
38 D_P(w ‖ p) = Σ_i ( w_i log(w_i / p_i) − w_i + p_i )   (4). For the prior vector p we use a uniform vector whose L1 magnitude equals the expected L1 magnitude of ŵ. [sent-61, score-0.337]
39 Changing p affects the magnitude of the KL term, so λ in (2) must be adjusted to balance the loss term in the sparse coding objective function (small values of p require small values of λ). [sent-64, score-0.636]
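A minimal sketch of Eq. (4) and of the uniform prior just described; the helper names and the way the expected L1 magnitude is supplied are assumptions.

```python
import numpy as np

def unnormalized_kl(w, p):
    """D_P(w || p) = sum_i ( w_i * log(w_i / p_i) - w_i + p_i ), for strictly positive w, p."""
    return float(np.sum(w * np.log(w / p) - w + p))

def uniform_prior(n_bases, expected_l1):
    """Uniform vector p whose L1 magnitude equals the expected L1 magnitude of w."""
    return np.full(n_bases, expected_l1 / n_bases)
```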
40 Below we provide a) an efficient procedure for inferring ŵ in this model; b) an algorithm for iteratively updating the bases B; and c) show that this model leads to differentiable estimates of ŵ. [sent-65, score-0.135]
41 4 Implementation To compute ŵ with KL-regularization, we minimize (3) using exponentiated gradient descent (EGD) with backtracking until convergence (5). [sent-67, score-0.267]
42 The gradient of the objective function (2) with respect to the coefficient for the jth basis vector wj is given in (6) for matching loss/transfer function pairs. [sent-69, score-0.395]
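A sketch of this inference step for the squared-loss / linear-transfer case, where the gradient is 2 B^T(Bw − x) + λ log(w/p); the step-size schedule, backtracking rule, and iteration count are assumptions, not the paper's exact procedure.

```python
import numpy as np

def objective(x, B, w, p, lam):
    r = x - B @ w
    return float(r @ r) + lam * float(np.sum(w * np.log(w / p) - w + p))

def egd_infer(x, B, p, lam, eta=0.1, n_iters=200):
    """KL-regularized sparse approximation by exponentiated gradient descent with backtracking."""
    w = p.copy()                                  # start at the prior (strictly positive)
    f_old = objective(x, B, w, p, lam)
    for _ in range(n_iters):
        grad = 2.0 * B.T @ (B @ w - x) + lam * np.log(w / p)
        w_new = w * np.exp(-eta * grad)           # multiplicative update keeps w positive
        f_new = objective(x, B, w_new, p, lam)
        if f_new > f_old:                         # backtrack: shrink the step and retry
            eta *= 0.5
            continue
        w, f_old = w_new, f_new
    return w
```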
43 Because L1 -regularization produces both positive and negative weights, to compare L1 and KL regularization on the same basis we expand the basis used for KL by adding the negation of each basis vector, which is equivalent to allowing negative weights (see Appendix B). [sent-72, score-0.682]
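The basis expansion just mentioned is a one-liner; a nonnegative weight vector over the expanded basis acts like a signed weight vector over the original one (a sketch, not Appendix B itself).

```python
import numpy as np

def expand_basis(B):
    """Return [B, -B]: nonnegative weights w over [B, -B] reconstruct as B @ (w[:k] - w[k:])."""
    return np.hstack([B, -B])
```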
44 During sparse coding the basis matrix B is updated by Stochastic Gradient Descent (SGD), giving the update rule B_{t+1} = B_t − η ∂L/∂B. [sent-73, score-0.654]
45 This update equation does not depend on the prior chosen. (Footnotes: −H(x) = x log(x); in our experiments, if the ratio of backtracking steps to total steps was more than 0. ...) [sent-74, score-0.126]
46 SGD implements an implicit L2 regularizer and is suitable for online learning; however, because the magnitude of w is explicitly penalized, the columns of B were constrained to have unit L2 norm to prevent the trivial solution of infinitely large B and infinitely small w. [sent-79, score-0.14]
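A sketch of one such basis update for a single example, using the linear-transfer gradient of Eq. (7) below and renormalizing the columns to unit L2 norm afterwards; the learning rate and the placement of the renormalization are assumptions.

```python
import numpy as np

def sgd_basis_update(B, x, w, eta=0.01):
    """One stochastic gradient step on the basis B for a single example (x, w)."""
    grad_B = np.outer(B @ w - x, w)               # dL/dB^i_j = w_j * (f(B^i w) - x_i), f linear
    B = B - eta * grad_B
    B /= np.linalg.norm(B, axis=0, keepdims=True) # keep each column at unit L2 norm
    return B
```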
47 ∂L/∂B^i_j = w_j ( f(B^i w) − x_i )   (7). 5 Modifying a Generative Model For A Discriminative Task Sparse coding builds a generative model from unlabeled data that captures structure in that data by learning a basis B. [sent-82, score-0.368]
48 Our hope is that the MAP estimate of basis coefficients ŵ produced for each input vector x will be useful for predicting a response y associated with x. [sent-83, score-0.27]
49 However, the sparse coding objective function only cares about reconstructing the input well, and does not attempt to make w useful as input for any particular task. [sent-84, score-0.736]
50 Fortunately, since priors such as KL-divergence regularization produce solutions that are smooth with respect to small changes in B and x, B can be modified through back-propagation to make ŵ more useful for prediction. [sent-85, score-0.298]
51 Then if the regularization function is twice differentiable with respect to w, we can use implicit differentiation on (8) to compute the gradient of ŵ with respect to B and x [14]. [sent-87, score-0.41]
52 For KL-regularization and the simple case of a linear transfer function with squared loss, ∂ŵ/∂B is given in (9), where e_i is a unit vector whose ith element is 1. [sent-88, score-0.137]
53 Note that the ability to compute ∂ŵ/∂x means that multiple layers of sparse coding could be used. [sent-90, score-0.546]
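A minimal sketch of this implicit differentiation for the squared-loss / linear-transfer objective used in the earlier sketches; the scaling constants follow that objective and are an assumption, not a transcription of the paper's Eqs. (8)-(9).

```python
import numpy as np

def _hessian(B, w_hat, lam):
    # Hessian of the objective at the minimizer: 2 B^T B + lam * diag(1 / w_hat)
    return 2.0 * B.T @ B + lam * np.diag(1.0 / w_hat)

def dw_dx(B, w_hat, lam):
    """Jacobian dw/dx, from differentiating 2 B^T (B w - x) + lam*log(w/p) = 0 in x."""
    return np.linalg.solve(_hessian(B, w_hat, lam), 2.0 * B.T)

def dloss_dB(B, x, w_hat, lam, g):
    """Back-propagate a downstream gradient g = dLoss/dw through the coder to the basis B."""
    v = np.linalg.solve(_hessian(B, w_hat, lam), g)
    return -2.0 * (np.outer(B @ w_hat - x, v) + np.outer(B @ v, w_hat))
```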
54 In each application, the ŵ produced using KL-regularization were more useful for prediction than those produced with L1 regularization, due to the stability and differentiability provided by KL. [sent-92, score-0.559]
55 1 Sparsity KL-regularization retained the desirable pseudo-sparsity characteristics of L1 , namely that each example, x, produces only a few large elements in w. [sent-94, score-0.089]
56 2 Stability Because the gradient of the KL-divergence regularization function goes to ∞ with increasing w, it produces MAP estimates w that change smoothly with x and B (see Appendix A for more details). [sent-97, score-0.341]
57 Figure 1: Left: Mean coefficient distribution over the 10,000-digit MNIST test set for various regularization functions. [sent-98, score-0.321]
58 Table 1: The 10,000 images of handwritten digits in the MNIST test set were used to show the stability benefits of KL-regularization. [sent-123, score-0.377]
59 KL-regularization provides representations that are significantly more stable with respect to both uncorrelated additive Gaussian noise (Left), and correlated noise from translating the digit image in a random direction (Right). [sent-125, score-0.121]
60 Table 1 quantifies how KL regularization significantly reduces the effect on ŵ of adding noise to the input x. [sent-126, score-0.317]
61 This stability improves the usefulness of w for prediction. [sent-127, score-0.188]
62 The switch to KL regularization makes these clusters more distinct, and applying back-propagation further separates the clusters. [sent-130, score-0.2]
63 Figure 2: Shown is the distribution of the eight most confusable digit classes in the input space and in the coefficient spaces produced by sparse approximation. [sent-131, score-0.485]
64 L1 regularization (center) discovers structure in the unlabeled data, but still produces more overlap between classes than KL sparse approximation (right) does with the same basis trained with L1 sparse coding. [sent-134, score-0.992]
65 3 Improved Prediction Performance On all applications, the stability provided by KL-regularization improved performance over L1 , and back-propagation further improved performance when the training set had residual error after an output classifier was trained. [sent-137, score-0.158]
66 1 Handwritten Digit Classification We tested our algorithm on the benchmark MNIST handwritten digits dataset [16]. [sent-140, score-0.257]
67 Each example was first reduced to 180D from 768D by PCA, and then sparse coding was performed using a linear transfer function and squared loss. [sent-142, score-0.683]
68 The validation set was used to pick the regularization constant, λ, and the prior mean for KL, p. [sent-143, score-0.261]
69 Switching from L1 -regularized to KL-regularized sparse approximation improved performance in all cases (Table 2). [sent-145, score-0.294]
70 30%, which improves on the best results reported for other shallow-architecture permutation-invariant classifiers operating on the same data set without prior knowledge about the problem (see Table 4). [sent-151, score-0.096]
71 Table 2: The ability to optimize the generative model with back-propagation leads to significant performance increases when the training set is not separable by the model learned on the unlabeled data. [sent-177, score-0.209]
72 Shown is the misclassification rate on the MNIST digit classification task. [sent-178, score-0.121]
73 Table 3: The stability afforded by the KL-prior improves the performance of all classifier types over the L1 prior. [sent-196, score-0.155]
74 Table 4: Test set error of various classifiers on the MNIST handwritten digits database. [sent-204, score-0.257]
75 Convolutional Neural Networks produce the best results on this problem, but they are not invariant to permutations in the input since they contain a strong prior about how pixels are connected. [sent-210, score-0.136]
76 The handwritten English characters dataset they used consists of 16x8-pixel images of lowercase letters. [sent-212, score-0.266]
77 In keeping with their work, we padded and scaled the images to match the 28x28 pixel size of the MNIST data, projected onto the same PCA basis that was used for the MNIST digits, and learned a basis from the MNIST digits by L1 -regularized sparse coding. [sent-213, score-0.564]
78 This basis was then used for sparse approximation of the English characters, along with a linear transfer function and squared loss. [sent-214, score-0.539]
79 In this application as well, Table 5 shows that simply switching to a KL prior from L1 for sparse approximation significantly improves the performance of a maxent classifier. [sent-215, score-0.47]
80 Furthermore, the KL prior allows online improvement of the sparse coding basis as more labeled data for the character-recognition task becomes available. [sent-216, score-0.76]
81 This improvement increases with the size of the training set, as more information becomes available about the target character recognition task. [sent-217, score-0.142]
82 3 Comparison to sLDA: Movie Review Sentiment Regression KL-regularized sparse coding bears some similarities to the supervised LDA (sLDA) model introduced in [19], and we provide results for the movie review sentiment classification task [20] used in that work. [sent-246, score-0.787]
83 Since the input is a probability distribution, we use a normalized exponential transfer function, f(B, w) = e^{Bw} / ‖e^{Bw}‖_1, to compute the reconstruction of the input. [sent-248, score-0.212]
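A sketch of this transfer function, with a max-subtraction added for numerical stability (that detail is an assumption).

```python
import numpy as np

def normalized_exp_transfer(B, w):
    """f(B, w) = exp(Bw) / ||exp(Bw)||_1, i.e. a probability-vector reconstruction."""
    z = B @ w
    z = z - z.max()              # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()
```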
84 Table 6 shows that the sparse coding generative model we use is competitive with and perhaps slightly better than LDA. [sent-250, score-0.619]
85 Table 6: Movie review sentiment prediction task; the algorithms compared are LDA [19], 64D unsupervised KL sparse coding, 256D unsupervised KL sparse coding, L1 -regularized regression [19], sLDA [19], L2 -regularized regression, and 256D KL-regularized coding with backprop. [sent-259, score-1.765]
86 KL-regularized sparse coding compares favorably with LDA and sLDA. [sent-260, score-0.546]
87 7 Conclusion This paper demonstrates on a diverse set of applications the advantages of using a differentiable, smooth prior for sparse coding. [sent-261, score-0.303]
88 In particular, a KL-divergence regularization function has significant advantages over other sparse priors such as L1 because it retains the important aspects of sparsity, while adding stability and differentiability to the MAP estimate ŵ. [sent-262, score-0.2] [sent-266, score-0.53]
89 (Footnote 9: Available at http://ai. edu/˜btaskar/ocr/ Footnote 10: Given that the word counts used as input are very sparse to begin with, classifiers whose regret bounds depend on the L2 norm of the gradient of the input (such as L2 -regularized least squares) do quite well, achieving a predictive R2 value on this application of 0. ...) [sent-264, score-0.519]
91 Differentiability in particular is shown to lead to state-of-the-art performance by allowing the generative model learned from unlabeled data by sparse coding to be adapted to a supervised loss function. [sent-267, score-0.251]
92 Tropp, “Algorithms for simultaneous sparse approximation: part ii: Convex relaxation,” Signal Process. [sent-273, score-0.242]
93 Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1? [sent-280, score-0.446]
94 Ng, “Self-taught learning: Transfer learning from unlabeled data,” in ICML ’07: Proceedings of the 24th international conference on Machine learning, 2007. [sent-295, score-0.098]
95 Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. [sent-323, score-0.171]
96 Smaragdis, “Sparse overcomplete latent variable decomposition of counts data,” in NIPS, 2007. [sent-331, score-0.151]
97 Warmuth, “Exponentiated gradient versus gradient descent for linear predictors,” Information and Computation, pp. [sent-334, score-0.238]
98 Lippert, "Value regularization and Fenchel duality," The Journal of Machine Learning Research, vol. [sent-342, score-0.246]
99 Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in ICDAR ’03: Proceedings of the Seventh International Conference on Document Analysis and Recognition. [sent-375, score-0.087]
100 Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales,” in Proceedings of the ACL, 2005, pp. [sent-385, score-0.108]
wordName wordTfidf (topN-words)
[('kl', 0.386), ('coding', 0.304), ('sparse', 0.242), ('regularization', 0.2), ('mnist', 0.184), ('bregman', 0.171), ('backprop', 0.16), ('handwritten', 0.151), ('bw', 0.146), ('transfer', 0.137), ('classi', 0.127), ('egd', 0.122), ('digit', 0.121), ('stability', 0.12), ('basis', 0.108), ('sentiment', 0.108), ('digits', 0.106), ('coef', 0.102), ('unlabeled', 0.098), ('movie', 0.095), ('slda', 0.091), ('gradient', 0.091), ('wj', 0.089), ('latent', 0.081), ('maxent', 0.08), ('dl', 0.076), ('input', 0.075), ('generative', 0.073), ('differentiable', 0.072), ('map', 0.072), ('divergence', 0.071), ('character', 0.069), ('sgd', 0.068), ('differentiability', 0.068), ('dp', 0.066), ('entropy', 0.065), ('backtracking', 0.065), ('bases', 0.063), ('appendix', 0.063), ('lowercase', 0.062), ('prior', 0.061), ('lda', 0.061), ('wi', 0.06), ('priors', 0.058), ('english', 0.058), ('matching', 0.058), ('sparsity', 0.058), ('descent', 0.056), ('exponentiated', 0.055), ('geophysics', 0.053), ('characters', 0.053), ('nn', 0.053), ('approximation', 0.052), ('cients', 0.051), ('produces', 0.05), ('pca', 0.049), ('jth', 0.049), ('bradley', 0.049), ('magnitude', 0.048), ('table', 0.047), ('produced', 0.047), ('implicit', 0.047), ('er', 0.047), ('fenchel', 0.046), ('cation', 0.045), ('online', 0.045), ('ers', 0.045), ('document', 0.044), ('convolutional', 0.043), ('adding', 0.042), ('loss', 0.042), ('dw', 0.041), ('approximates', 0.04), ('useful', 0.04), ('lp', 0.039), ('elements', 0.039), ('backpropagation', 0.039), ('superscripts', 0.039), ('training', 0.038), ('lee', 0.038), ('supervised', 0.038), ('subscripts', 0.038), ('corrupted', 0.037), ('prediction', 0.037), ('counts', 0.036), ('preserve', 0.035), ('db', 0.035), ('raina', 0.035), ('recognition', 0.035), ('improves', 0.035), ('bengio', 0.034), ('overcomplete', 0.034), ('negative', 0.033), ('bj', 0.033), ('bt', 0.033), ('robotics', 0.033), ('usefulness', 0.033), ('examples', 0.032), ('regression', 0.032), ('ts', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
2 0.22101159 226 nips-2008-Supervised Dictionary Learning
Author: Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, Francis R. Bach
Abstract: It is now well established that sparse signal models are well suited for restoration tasks and can be effectively learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of purely reconstructive ones. This paper proposes a new step in that direction, with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary and discriminative class models. The linear version of the proposed model admits a simple probabilistic interpretation, while its most general variant admits an interpretation in terms of kernels. An optimization framework for learning all the components of the proposed model is presented, along with experimental results on standard handwritten digit and texture classification tasks. 1
3 0.17075519 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
Author: Kai Yu, Wei Xu, Yihong Gong
Abstract: In this paper we aim to train deep neural networks for rapid visual recognition. The task is highly challenging, largely due to the lack of a meaningful regularizer on the functions realized by the networks. We propose a novel regularization method that takes advantage of kernel methods, where an oracle kernel function represents prior knowledge about the recognition task of interest. We derive an efficient algorithm using stochastic gradient descent, and demonstrate encouraging results on a wide range of recognition tasks, in terms of both accuracy and speed. 1
4 0.16947286 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
Author: Tong Zhang
Abstract: We study learning formulations with non-convex regularization that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1
5 0.16681021 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
6 0.15593839 214 nips-2008-Sparse Online Learning via Truncated Gradient
7 0.15360081 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
8 0.12567566 118 nips-2008-Learning Transformational Invariants from Natural Movies
9 0.1138758 216 nips-2008-Sparse probabilistic projections
10 0.11264227 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
11 0.11215384 196 nips-2008-Relative Margin Machines
12 0.10590298 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
13 0.10500816 66 nips-2008-Dynamic visual attention: searching for coding length increments
14 0.096437104 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints
15 0.096397042 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
16 0.094989449 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
17 0.091457911 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
18 0.090618469 202 nips-2008-Robust Regression and Lasso
19 0.090301998 238 nips-2008-Theory of matching pursuit
20 0.089132927 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
topicId topicWeight
[(0, -0.325), (1, -0.136), (2, -0.053), (3, 0.019), (4, 0.01), (5, 0.079), (6, -0.1), (7, 0.046), (8, -0.154), (9, 0.125), (10, 0.078), (11, 0.003), (12, -0.099), (13, 0.014), (14, -0.08), (15, -0.162), (16, -0.069), (17, -0.063), (18, 0.055), (19, 0.02), (20, -0.075), (21, -0.048), (22, -0.079), (23, -0.073), (24, -0.079), (25, -0.028), (26, -0.027), (27, -0.093), (28, -0.138), (29, -0.063), (30, -0.036), (31, -0.049), (32, 0.06), (33, 0.029), (34, 0.03), (35, 0.039), (36, 0.049), (37, 0.005), (38, 0.066), (39, 0.022), (40, -0.036), (41, 0.077), (42, -0.017), (43, -0.016), (44, -0.116), (45, 0.027), (46, -0.091), (47, 0.049), (48, 0.044), (49, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.97495979 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
2 0.83154625 226 nips-2008-Supervised Dictionary Learning
Author: Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, Francis R. Bach
Abstract: It is now well established that sparse signal models are well suited for restoration tasks and can be effectively learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of purely reconstructive ones. This paper proposes a new step in that direction, with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary and discriminative class models. The linear version of the proposed model admits a simple probabilistic interpretation, while its most general variant admits an interpretation in terms of kernels. An optimization framework for learning all the components of the proposed model is presented, along with experimental results on standard handwritten digit and texture classification tasks. 1
3 0.70794308 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
Author: Kai Yu, Wei Xu, Yihong Gong
Abstract: In this paper we aim to train deep neural networks for rapid visual recognition. The task is highly challenging, largely due to the lack of a meaningful regularizer on the functions realized by the networks. We propose a novel regularization method that takes advantage of kernel methods, where an oracle kernel function represents prior knowledge about the recognition task of interest. We derive an efficient algorithm using stochastic gradient descent, and demonstrate encouraging results on a wide range of recognition tasks, in terms of both accuracy and speed. 1
4 0.66076541 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
Author: Tong Zhang
Abstract: Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory. 1
5 0.63585472 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
6 0.62382162 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
7 0.61670709 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
8 0.61255395 194 nips-2008-Regularized Learning with Networks of Features
9 0.60260433 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints
10 0.57527953 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
11 0.5687663 185 nips-2008-Privacy-preserving logistic regression
12 0.54713327 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
13 0.54297757 196 nips-2008-Relative Margin Machines
14 0.53993827 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
15 0.52292949 128 nips-2008-Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
16 0.51538724 66 nips-2008-Dynamic visual attention: searching for coding length increments
17 0.51341558 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
18 0.50497377 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
19 0.5020774 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
20 0.49777862 228 nips-2008-Support Vector Machines with a Reject Option
topicId topicWeight
[(6, 0.114), (7, 0.126), (12, 0.051), (14, 0.143), (15, 0.014), (28, 0.156), (57, 0.107), (59, 0.011), (63, 0.048), (71, 0.028), (77, 0.056), (83, 0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.89330637 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
2 0.84015465 75 nips-2008-Estimating vector fields using sparse basis field expansions
Author: Stefan Haufe, Vadim V. Nikulin, Andreas Ziehe, Klaus-Robert Müller, Guido Nolte
Abstract: We introduce a novel framework for estimating vector fields using sparse basis field expansions (S-FLEX). The notion of basis fields, which are an extension of scalar basis functions, arises naturally in our framework from a rotational invariance requirement. We consider a regression setting as well as inverse problems. All variants discussed lead to second-order cone programming formulations. While our framework is generally applicable to any type of vector field, we focus in this paper on applying it to solving the EEG/MEG inverse problem. It is shown that significantly more precise and neurophysiologically more plausible location and shape estimates of cerebral current sources from EEG/MEG measurements become possible with our method when comparing to the state-of-the-art. 1
3 0.82942879 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
4 0.82897687 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
5 0.82405436 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
Author: Mehmet K. Muezzinoglu, Alexander Vergara, Ramon Huerta, Thomas Nowotny, Nikolai Rulkov, Henry Abarbanel, Allen Selverston, Mikhail Rabinovich
Abstract: The odor transduction process has a large time constant and is susceptible to various types of noise. Therefore, the olfactory code at the sensor/receptor level is in general a slow and highly variable indicator of the input odor in both natural and artificial situations. Insects overcome this problem by using a neuronal device in their Antennal Lobe (AL), which transforms the identity code of olfactory receptors to a spatio-temporal code. This transformation improves the decision of the Mushroom Bodies (MBs), the subsequent classifier, in both speed and accuracy. Here we propose a rate model based on two intrinsic mechanisms in the insect AL, namely integration and inhibition. Then we present a MB classifier model that resembles the sparse and random structure of insect MB. A local Hebbian learning procedure governs the plasticity in the model. These formulations not only help to understand the signal conditioning and classification methods of insect olfactory systems, but also can be leveraged in synthetic problems. Among them, we consider here the discrimination of odor mixtures from pure odors. We show on a set of records from metal-oxide gas sensors that the cascade of these two new models facilitates fast and accurate discrimination of even highly imbalanced mixtures from pure odors. 1
6 0.82188064 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
7 0.82113451 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
8 0.82052743 143 nips-2008-Multi-label Multiple Kernel Learning
9 0.81996787 226 nips-2008-Supervised Dictionary Learning
10 0.81974077 202 nips-2008-Robust Regression and Lasso
11 0.8196696 200 nips-2008-Robust Kernel Principal Component Analysis
12 0.81826961 66 nips-2008-Dynamic visual attention: searching for coding length increments
13 0.81812239 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
14 0.81809509 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
15 0.8179161 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
16 0.81759751 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
17 0.81514895 137 nips-2008-Modeling Short-term Noise Dependence of Spike Counts in Macaque Prefrontal Cortex
18 0.81190068 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
19 0.8103081 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
20 0.8098191 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification