nips nips2013 nips2013-251 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Given only a few weight values for each feature it is possible to accurately predict the remaining values. [sent-12, score-0.244]
2 In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. [sent-15, score-0.237]
3 1 Introduction Recent work on scaling deep networks has led to the construction of the largest artificial neural networks to date. [sent-16, score-0.483]
4 It is now possible to train networks with tens of millions [13] or even over a billion parameters [7, 16]. [sent-17, score-0.242]
5 If we can reduce the number of parameters which must be learned and communicated over the network of fixed size, then we can reduce the number of machines required to train it, and hence also reduce the overhead of coordination in a distributed framework. [sent-30, score-0.397]
6 In this work we study techniques for reducing the number of free parameters in neural networks by exploiting the fact that the weights in learned networks tend to be structured. [sent-31, score-0.499]
7 Our technique is also completely orthogonal to the choice of activation function as well as other learning optimizations; it can work alongside other recent advances in neural network training such as dropout [12], rectified units [20] and maxout [9] without modification. [sent-33, score-0.369]
8 1 Figure 1: The first column in each block shows four learned features (parameters of a deep model). [sent-34, score-0.4]
9 From left to right the blocks are: (1) a convnet trained on STL-10, (2) an MLP trained on MNIST, (3) a convnet trained on CIFAR-10, (4) Reconstruction ICA trained on Hyvärinen's natural image dataset, (5) Reconstruction ICA trained on STL-10. [sent-37, score-0.557]
10 The intuition motivating the techniques in this paper is the well known observation that the first layer features of a neural network trained on natural image patches tend to be globally smooth with local edge features, similar to local Gabor features [6, 13]. [sent-38, score-0.835]
11 Given this structure, representing the value of each pixel in the feature separately is redundant, since it is highly likely that the value of a pixel will be equal to a weighted average of its neighbours. [sent-39, score-0.236]
12 By factoring the weight matrix we are able to directly control the size of the parameterization by controlling the rank of the weight matrix. [sent-45, score-0.25]
13 We show that by carefully constructing one of the factors, while learning only the other factor, we can train networks with vastly fewer parameters which achieve the same performance as full networks with the same structure. [sent-47, score-0.371]
14 In the best cases we are able to predict more than 95% of the parameters of a network without any drop in predictive accuracy. [sent-54, score-0.251]
15 Copies of these parameters can be safely distributed across machines without any of the synchronization overhead incurred by distributing dynamic parameters. [sent-65, score-0.295]
16 2 Low rank weight matrices. Deep networks are composed of several layers of transformations of the form h = g(vW), where v is an n_v-dimensional input, h is an n_h-dimensional output, and W is an n_v × n_h matrix of parameters. [sent-66, score-0.629]
17 A column of W contains the weights connecting each unit in the visible layer to a single unit in the hidden layer. [sent-67, score-0.614]
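As an illustration of this low-rank parameterization (not code from the paper; the layer sizes, the tanh nonlinearity, and all names below are assumptions made for the sketch), a layer h = g(vW) can be written with W factored into a fixed dictionary U and a learned factor V:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h, n_alpha = 784, 500, 64                 # visible units, hidden units, dictionary size (rank)

U = rng.standard_normal((n_v, n_alpha)) / np.sqrt(n_v)   # fixed dictionary: static parameters
V = rng.standard_normal((n_alpha, n_h)) * 0.01            # learned factor: dynamic parameters

def layer(v, U, V, g=np.tanh):
    """Low-rank layer h = g(vW) with W = UV; only V would be updated during training."""
    return g(v @ U @ V)

v = rng.standard_normal((1, n_v))                # one input example
h = layer(v, U, V)                               # shape (1, n_h)

print(h.shape, (n_alpha * n_h) / (n_v * n_h))    # fraction of weights that stay dynamic (~8%)
```

With these illustrative sizes only n_α·n_h = 32,000 of the 392,000 weights in the layer remain dynamic.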
18 3 Feature prediction. We can exploit the structure in the features of a deep network to represent the features in a much lower dimensional space. [sent-78, score-0.516]
19 To do this we consider the weights connected to a single hidden unit as a function w : W → ℝ. [sent-79, score-0.227]
20 In this view the columns of U form a dictionary of basis functions, and the features of the network are linear combinations of these features parameterized by V. [sent-83, score-0.649]
21 The problem thus becomes one of choosing a good base dictionary for representing network features. [sent-84, score-0.397]
22 3.1 Choice of dictionary. The base dictionary for feature prediction can be constructed in several ways. [sent-86, score-0.722]
23 An obvious choice is to train a single layer unsupervised model and use the features from that model as a dictionary. [sent-87, score-0.433]
24 One way to achieve this is via kernel ridge regression [25]. [sent-92, score-0.234]
25 The kernel enables us to make smooth predictions of the parameter vector over the entire domain W using the standard kernel ridge predictor: w = k_α^T (K_α + λI)^{-1} w_α, where k_α is a matrix whose elements are given by (k_α)_{ij} = k(i, j) for i ∈ α and j ∈ W, and λ is a ridge regularization coefficient. [sent-96, score-0.468]
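A direct NumPy transcription of this predictor (a sketch, not the authors' code; the function name, the default λ value, and the index convention are assumptions):

```python
import numpy as np

def predict_feature(K, alpha, w_alpha, lam=1e-3):
    """Kernel ridge prediction w = k_alpha^T (K_alpha + lam I)^{-1} w_alpha.

    K       : (n, n) kernel matrix over all weight locations in one feature
    alpha   : indices of the weights that were actually learned
    w_alpha : the learned weight values at those indices
    lam     : ridge regularization coefficient (lambda in the text)
    """
    K_aa = K[np.ix_(alpha, alpha)]                      # kernel among the observed locations
    coef = np.linalg.solve(K_aa + lam * np.eye(len(alpha)), w_alpha)
    return K[:, alpha] @ coef                           # predicted values over the whole domain
```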
26 3.2 A concrete example. In this section we describe the feature prediction process as it applies to features derived from image patches using kernel ridge regression, since the intuition is strongest in this case. [sent-99, score-0.614]
27 We defer a discussion of how to select a kernel for deep layers as well as for non-image data in the visible layer to a later section. [sent-100, score-0.796]
28 If v is a vectorized image patch corresponding to the visible layer of a standard neural network, then the hidden activity induced by this patch is given by h = g(vW), where g is the network nonlinearity and W = [w_1, . . . , w_{n_h}] [sent-102, score-0.955]
29 is a weight matrix whose columns each correspond to features which are to be matched to the visible layer. [sent-105, score-0.37]
30 For image patches, where we expect smoothness in pixel space, an appropriate kernel is the squared exponential kernel k(i, j) = exp(-((i_x - j_x)^2 + (i_y - j_y)^2) / (2σ^2)), where σ is a length scale parameter which controls the degree of smoothness. [sent-116, score-0.448]
31 Here α has a convenient interpretation as the set of pixel locations in the image, each corresponding to a basis function in the dictionary defined by the kernel. [sent-117, score-0.442]
32 More generically we will use α to index a collection of dictionary elements in the remainder of the paper, even when a dictionary element may not correspond directly to a pixel location as in this example. [sent-118, score-0.641]
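Putting the kernel and the predictor together, a minimal sketch (the patch size, length scale, observed index set, and stand-in weights are all illustrative) that builds the squared exponential kernel over the pixel locations of an 8 × 8 patch and fills in a feature from 16 observed pixels:

```python
import numpy as np

def se_kernel_grid(height, width, length_scale=1.0):
    """Squared exponential kernel k(i, j) = exp(-((ix-jx)^2 + (iy-jy)^2) / (2 sigma^2))
    over the pixel locations of a height x width patch."""
    ys, xs = np.mgrid[0:height, 0:width]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)   # (n_pixels, 2)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)       # squared pixel distances
    return np.exp(-d2 / (2.0 * length_scale ** 2))

rng = np.random.default_rng(1)
K = se_kernel_grid(8, 8, length_scale=2.0)               # 64 x 64 kernel over pixel locations
alpha = rng.choice(64, size=16, replace=False)           # the pixels whose weights are learned
w_alpha = rng.standard_normal(16)                        # stand-in for those learned weights
lam = 1e-3                                               # ridge regularization coefficient
coef = np.linalg.solve(K[np.ix_(alpha, alpha)] + lam * np.eye(16), w_alpha)
w_full = K[:, alpha] @ coef                              # smooth prediction over all 64 pixels
```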
33 3.3 Interpretation as pooling. So far we have motivated our technique as a method for predicting features in a neural network; however, the same approach can also be interpreted as a linear pooling process. [sent-120, score-0.312]
34 Recall that the hidden activations in a standard neural network before applying the nonlinearity are given by g^{-1}(h) = vW. [sent-121, score-0.264]
35 Under this interpretation we can think of a predicted layer as being composed of two layers internally. [sent-124, score-0.559]
36 The first is a linear layer which applies a fixed pooling operator given by U_α, and the second is an ordinary fully connected layer with |α| visible units. [sent-125, score-0.845]
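A quick numerical check of this equivalence (shapes and values are purely illustrative): pooling the input with a fixed U_α and then applying a small dense layer gives the same pre-activations as multiplying by the predicted weight matrix U_α w_α.

```python
import numpy as np

rng = np.random.default_rng(2)
n_v, n_alpha, n_h = 64, 16, 32

U_alpha = rng.standard_normal((n_v, n_alpha))    # fixed pooling operator (dictionary columns)
w_alpha = rng.standard_normal((n_alpha, n_h))    # dynamic parameters of the small dense layer
v = rng.standard_normal((5, n_v))                # a batch of inputs

pooled_then_dense = (v @ U_alpha) @ w_alpha      # fixed linear pooling, then |alpha|-unit dense layer
predicted_weights = v @ (U_alpha @ w_alpha)      # same layer, viewed as one predicted weight matrix

assert np.allclose(pooled_then_dense, predicted_weights)   # identical pre-activations
```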
37 . . . , α_J corresponding to elements from a dictionary U. [sent-134, score-0.285]
38 The output of the layer is obtained by concatenating the output of each column. [sent-144, score-0.308]
39 Introducing additional columns into the network increases the number of static parameters but the number of dynamic parameters remains fixed. [sent-151, score-0.523]
40 The increase in static parameters comes from the fact that each column has its own dictionary. [sent-152, score-0.225]
41 The reason that there is not a corresponding increase in the number of dynamic parameters is that for a fixed size hidden layer the hidden units are divided between the columns. [sent-153, score-0.798]
42 The number of dynamic parameters depends only on the number of hidden units and the size of each dictionary. [sent-154, score-0.387]
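A minimal sketch of such a columnar layer (the sizes, number of columns, and random dictionaries are assumptions made for illustration): each of J columns owns a fixed dictionary and an equal share of the hidden units, so adding columns grows the static count while the dynamic count stays at n_α × n_h.

```python
import numpy as np

rng = np.random.default_rng(3)
n_v, n_h, J, n_alpha = 64, 32, 4, 8              # hidden units are split evenly across J columns
n_h_per_col = n_h // J

dictionaries = [rng.standard_normal((n_v, n_alpha)) for _ in range(J)]             # static, one per column
factors = [rng.standard_normal((n_alpha, n_h_per_col)) * 0.01 for _ in range(J)]   # dynamic, learned

def columnar_layer(v, g=np.tanh):
    # Each column pools with its own dictionary; the column outputs are concatenated.
    return g(np.concatenate([v @ U @ w for U, w in zip(dictionaries, factors)], axis=1))

v = rng.standard_normal((2, n_v))
h = columnar_layer(v)                            # shape (2, n_h)

static_params = J * n_v * n_alpha                # grows linearly with the number of columns
dynamic_params = J * n_alpha * n_h_per_col       # = n_alpha * n_h, independent of J
print(h.shape, static_params, dynamic_params)
```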
43 As above, we re-order the operations to obtain g^{-1}(h) = (vU_α)w_α, resulting in a structure similar to a layer in an ordinary MLP. [sent-158, score-0.352]
44 3.5 Constructing dictionaries. We now turn our attention to selecting an appropriate dictionary for different layers of the network. [sent-165, score-0.547]
45 The appropriate choice of dictionary inevitably depends on the structure of the weight space. [sent-166, score-0.369]
46 When the weight space has a topological structure where we expect smoothness, for example when the weights correspond to pixels in an image patch, we can choose a kernel-based dictionary to enforce the type of smoothness we expect. [sent-167, score-0.606]
47 An obvious choice here is to use a shallow unsupervised feature learning method, such as an autoencoder, to build a dictionary for the layer. [sent-169, score-0.428]
48 Since the correlations in hidden activities depend on the weights in lower layers we cannot initialize kernels in deep layers in this way without training the previous layers. [sent-172, score-0.664]
49 We handle this by pre-training each layer as an autoencoder. [sent-173, score-0.308]
50 We construct the kernel from the empirical covariance of the hidden units over the data, computed with the pre-trained weights. [sent-174, score-0.307]
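One way this construction could look in code (a sketch that assumes the activations of the pre-trained layer below are already available; the array sizes, jitter, and λ are illustrative, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
H = rng.standard_normal((1000, 128))             # stand-in for pre-trained hidden activities (examples x units)

# Empirical covariance of the hidden units, used as the kernel for predicting weights in the layer above.
Hc = H - H.mean(axis=0, keepdims=True)
K = Hc.T @ Hc / (H.shape[0] - 1) + 1e-6 * np.eye(H.shape[1])   # small jitter keeps K well conditioned

alpha = rng.choice(128, size=32, replace=False)  # the learned subset of one deep feature's weights
w_alpha = rng.standard_normal(32)                # stand-in for those learned values
lam = 1e-3
coef = np.linalg.solve(K[np.ix_(alpha, alpha)] + lam * np.eye(32), w_alpha)
w_full = K[:, alpha] @ coef                      # predicted weights for all 128 incoming units
```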
51 Once each layer has been pre-trained in this way ... (Footnote 1: The vectorized filter bank W = U_α w_α must be reshaped before the convolution takes place.) [sent-175, score-0.395]
52 Figure 4: Left: Comparing the performance of different dictionaries when predicting the weights in the first two layers of an MLP network on MNIST. [sent-200, score-0.566]
53 The legend shows the dictionary type in layer1–layer2 (see main text for details). [sent-201, score-0.285]
54 ... we fine-tune the entire network with backpropagation, but in this phase the kernel parameters are fixed. [sent-203, score-0.288]
55 We train several MLP models on MNIST using different strategies for constructing the dictionary, different numbers of columns and different degrees of reduction in the number of dynamic parameters used in each feature. [sent-207, score-0.305]
56 The networks in this experiment all have two hidden layers with a 784–500–500–10 architecture and use a sigmoid activation function. [sent-209, score-0.435]
57 In all cases we perform parameter prediction in the first and second layers only; the final softmax layer is never predicted. [sent-211, score-0.676]
58 This layer contains approximately 1% of the total network parameters, so a substantial savings is possible even if features in this layer are not predicted. [sent-212, score-0.813]
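A back-of-the-envelope check of that figure, counting only weights and ignoring biases:

```python
# 784-500-500-10 MLP weight counts; the softmax layer is well under 1% of the total.
w1, w2, w3 = 784 * 500, 500 * 500, 500 * 10
total = w1 + w2 + w3
print(total, w3 / total)   # 647000, ~0.008
```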
59 We divide the hidden units in each layer equally between columns (so each column connects to 50 units in the layer above). [sent-214, score-1.082]
60 The different dictionaries are as follows: nokernel is an ordinary model with no feature prediction (shown as a horizontal line). [sent-215, score-0.449]
61 ...Con is random connections (the dictionary is random columns of the identity matrix). [sent-218, score-0.367]
62 AE is a dictionary pre-trained as an autoencoder. [sent-232, score-0.418]
63 The networks used in this experiment have two hidden layers with 1024 units. [sent-241, score-0.376]
64 4.2 Convolutional network. Figure 5 shows the performance of a convnet [17] on CIFAR-10. [sent-244, score-0.231]
65 The first convolutional layer filters the 32 × 32 × 3 input image using 48 filters of size 8 × 8 × 3. [sent-245, score-0.564]
66 The second convolutional layer applies 64 filters of size 8 × 8 × 48 to the output of the first layer. [sent-246, score-0.51]
67 The third convolutional layer further transforms the output of the second layer by applying 64 filters of size 5 × 5 × 64. [sent-247, score-0.818]
68 The output of the third layer is input to a fully connected layer with 500 hidden units and finally into a softmax layer with 10 outputs. [sent-248, score-1.251]
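For orientation, an illustrative count of the convolutional filter weights in this architecture, and of the dynamic weights that would remain if half of each filter were predicted (the fully connected layer is left out because its input size depends on pooling details not quoted here):

```python
# (number of filters, weights per filter) for the three convolutional layers described above.
conv_layers = [(48, 8 * 8 * 3), (64, 8 * 8 * 48), (64, 5 * 5 * 64)]
p = 0.5                                           # illustrative fraction of weights kept dynamic
for n_filters, w_per_filter in conv_layers:
    full = n_filters * w_per_filter
    print(full, int(full * p))                    # full weight count vs. dynamic weights at 50% prediction
```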
69 The convolutional layers each have one column and the fully connected layer has five columns. [sent-250, score-0.798]
70 Convolutional layers have a natural topological structure to exploit, so we use a dictionary constructed with the squared exponential kernel in each convolutional layer. [sent-251, score-0.835]
71 The input to the fully connected layer at the top of the network comes from a convolutional layer so we use ridge regression with the squared exponential kernel to predict parameters in this layer as well. [sent-252, score-1.723]
72 RICA is a single layer architecture, and we predict parameters using a squared exponential kernel dictionary with a length scale of 1.0. [sent-259, score-0.882]
73 The nokernel line shows the performance of RICA with no feature prediction on the same task. [sent-261, score-0.287]
74 In both cases we are able to predict more than half of the dynamic parameters without a substantial drop in accuracy. [sent-262, score-0.249]
75 One of the models is ordinary RICA with no parameter prediction and the other has 50% of the parameters in each feature predicted using a squared exponential kernel dictionary with a length scale of 1.0. [sent-264, score-0.768]
76 Since 50% of the parameters in each feature are predicted, the second model has twice as many features with the same number of dynamic parameters. [sent-265, score-0.362]
77 5 Related work and future directions. Several other methods for limiting the number of parameters in a neural network have been explored in the literature. [sent-266, score-0.234]
78 The most common approach to limiting the number of parameters is to use locally connected features [6]. [sent-269, score-0.223]
79 The size of the parameterization of locally connected networks can be further reduced by using tiled convolutional networks [10] in which groups of feature weights which tile the input are tied together. [sent-270, score-0.783]
80 Figure 6: Left: Comparison of the performance of RICA with and without parameter prediction on CIFAR-10 and STL-10. [sent-300, score-0.424]
81 Right: Comparison of RICA, and RICA with 50% parameter prediction using the same number of dynamic parameters (i.e. ...). [sent-301, score-0.241]
82 Convolutional neural networks [13] are even more restrictive and force a feature to have tied weights for all receptive fields. [sent-307, score-0.384]
83 The approach of [23] involves approximating linear dictionaries with other dictionaries in a similar manner to how we approximate network features. [sent-310, score-0.348]
84 The authors of [22] study approximating convolutional filter banks with linear combinations of separable filters. [sent-312, score-0.241]
85 Both of these works focus on shallow single layer models, in contrast to our focus on deep networks. [sent-313, score-0.533]
86 The techniques described in this paper are orthogonal to the parameter reduction achieved by tying weights in a tiled or convolutional pattern. [sent-314, score-0.364]
87 Tying weights effectively reduces the number of feature maps by constraining features at different locations to share parameters. [sent-315, score-0.281]
88 Our approach reduces the number of parameters required to represent each feature and it is straightforward to incorporate into a tiled or convolutional network. [sent-316, score-0.431]
89 [3] control the number of parameters by removing connections between layers in a convolutional network at random. [sent-318, score-0.531]
90 Recent work has shown that state of the art results on several benchmark tasks in computer vision can be achieved by training neural networks with several columns of representation [2, 13]. [sent-328, score-0.26]
91 Unlike the work of [2], we do not consider deep columns in this paper; however, collimation is an attractive way of increasing parallelism within a network, as the columns operate completely independently. [sent-332, score-0.34]
92 Major differences between our technique and the factored RBM include the fact that the factored RBM is a specific model, whereas our technique can be applied more broadly—even to factored RBMs. [sent-336, score-0.46]
93 In addition, in a factored RBM all factors are learned, whereas in our approach the dictionary is fixed judiciously. [sent-337, score-0.393]
94 Using different types of kernels to encode different types of prior knowledge on the weight space, or even learning the kernel functions directly as part of the optimization procedure as in [27] are possibilities that deserve exploration. [sent-343, score-0.225]
95 When no natural topology on the weight space is available we infer a topology for the dictionary from empirical statistics; however, it may be possible to instead construct the dictionary to induce a desired topology on the weight space directly. [sent-344, score-0.894]
96 This has parallels to other work on inducing topology in representations [10] as well as work on learning pooling structures in deep networks [4]. [sent-345, score-0.412]
97 6 Conclusion. We have shown how to achieve significant reductions in the number of dynamic parameters in deep models. [sent-346, score-0.359]
98 The idea is orthogonal but complementary to recent advances in deep learning, such as dropout, rectified units and maxout. [sent-347, score-0.277]
99 It creates many avenues for future work, such as improving large scale industrial implementations of deep networks, but also brings into question whether we have the right parameterizations in deep learning. [sent-348, score-0.352]
100 Improving neural networks by preventing co-adaptation of feature detectors. [sent-434, score-0.272]
wordName wordTfidf (topN-words)
[('rica', 0.357), ('layer', 0.308), ('dictionary', 0.285), ('convolutional', 0.202), ('deep', 0.176), ('layers', 0.144), ('nokernel', 0.135), ('ridge', 0.131), ('networks', 0.129), ('convnet', 0.119), ('dictionaries', 0.118), ('network', 0.112), ('dynamic', 0.11), ('preform', 0.108), ('factored', 0.108), ('ica', 0.105), ('kernel', 0.103), ('hidden', 0.103), ('units', 0.101), ('feature', 0.094), ('features', 0.085), ('weight', 0.084), ('columns', 0.082), ('columnar', 0.081), ('lters', 0.079), ('column', 0.079), ('static', 0.073), ('parameters', 0.073), ('ranzato', 0.073), ('pixel', 0.071), ('nv', 0.07), ('smoothness', 0.07), ('technique', 0.068), ('mlp', 0.068), ('vu', 0.068), ('machines', 0.067), ('phone', 0.066), ('nh', 0.066), ('predict', 0.066), ('visible', 0.065), ('connected', 0.065), ('predicted', 0.064), ('coates', 0.064), ('rbm', 0.062), ('tiled', 0.062), ('autoencoder', 0.062), ('learned', 0.06), ('architecture', 0.059), ('weights', 0.059), ('softmax', 0.058), ('prediction', 0.058), ('pooling', 0.055), ('topological', 0.054), ('ciresan', 0.054), ('emp', 0.054), ('lcehre', 0.054), ('rigamonti', 0.054), ('wnh', 0.054), ('image', 0.054), ('trained', 0.053), ('patch', 0.053), ('receptive', 0.053), ('uv', 0.052), ('topology', 0.052), ('kt', 0.052), ('proportion', 0.051), ('dean', 0.05), ('patches', 0.05), ('shallow', 0.049), ('neural', 0.049), ('recti', 0.048), ('lowrank', 0.048), ('iy', 0.048), ('squared', 0.047), ('lter', 0.046), ('vectorized', 0.046), ('reconstruction', 0.045), ('overhead', 0.045), ('ordinary', 0.044), ('phones', 0.044), ('vw', 0.044), ('lang', 0.044), ('timit', 0.044), ('copies', 0.044), ('speech', 0.043), ('extremely', 0.043), ('parameterization', 0.043), ('locations', 0.043), ('interpretation', 0.043), ('rubinstein', 0.041), ('montr', 0.041), ('tying', 0.041), ('reshaped', 0.041), ('train', 0.04), ('maxout', 0.039), ('banks', 0.039), ('factoring', 0.039), ('krizhevsky', 0.039), ('intuition', 0.039), ('kernels', 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
2 0.27233282 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
Author: Michiel Hermans, Benjamin Schrauwen
Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1
3 0.25816742 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
4 0.20471068 5 nips-2013-A Deep Architecture for Matching Short Texts
Author: Zhengdong Lu, Hang Li
Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. 1
5 0.20171933 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
6 0.19117932 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
7 0.18377578 30 nips-2013-Adaptive dropout for training deep neural networks
8 0.1701676 64 nips-2013-Compete to Compute
9 0.14869171 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
10 0.14703515 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
11 0.14639826 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
12 0.13861178 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
13 0.13678145 65 nips-2013-Compressive Feature Learning
14 0.13463968 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
15 0.12544033 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
16 0.1236204 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
17 0.11859081 84 nips-2013-Deep Neural Networks for Object Detection
18 0.11708864 160 nips-2013-Learning Stochastic Feedforward Neural Networks
19 0.10611805 339 nips-2013-Understanding Dropout
20 0.10496832 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
topicId topicWeight
[(0, 0.252), (1, 0.144), (2, -0.196), (3, -0.117), (4, 0.104), (5, -0.19), (6, -0.114), (7, 0.087), (8, -0.006), (9, -0.188), (10, 0.19), (11, 0.003), (12, -0.049), (13, 0.038), (14, 0.075), (15, 0.104), (16, -0.022), (17, -0.042), (18, -0.065), (19, -0.073), (20, 0.05), (21, -0.067), (22, 0.046), (23, -0.032), (24, 0.04), (25, -0.026), (26, -0.14), (27, -0.001), (28, 0.107), (29, 0.002), (30, 0.034), (31, 0.044), (32, -0.001), (33, 0.045), (34, -0.039), (35, 0.109), (36, -0.006), (37, 0.016), (38, 0.029), (39, -0.089), (40, 0.052), (41, -0.096), (42, 0.029), (43, -0.046), (44, -0.084), (45, 0.026), (46, -0.009), (47, -0.05), (48, -0.074), (49, -0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.96775764 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
2 0.89096612 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
Author: Michiel Hermans, Benjamin Schrauwen
Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1
3 0.83081836 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
4 0.81555158 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
Author: Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
Abstract: As massively parallel computations have become broadly available with modern GPUs, deep architectures trained on very large datasets have risen in popularity. Discriminatively trained convolutional neural networks, in particular, were recently shown to yield state-of-the-art performance in challenging image classification benchmarks such as ImageNet. However, elements of these architectures are similar to standard hand-crafted representations used in computer vision. In this paper, we explore the extent of this analogy, proposing a version of the stateof-the-art Fisher vector image encoding that can be stacked in multiple layers. This architecture significantly improves on standard Fisher vectors, and obtains competitive results with deep convolutional networks at a smaller computational learning cost. Our hybrid architecture allows us to assess how the performance of a conventional hand-crafted image classification pipeline changes with increased depth. We also show that convolutional networks and Fisher vector encodings are complementary in the sense that their combination further improves the accuracy. 1
5 0.74417877 64 nips-2013-Compete to Compute
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
6 0.69091266 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
7 0.67898637 5 nips-2013-A Deep Architecture for Matching Short Texts
8 0.6713044 30 nips-2013-Adaptive dropout for training deep neural networks
9 0.63408327 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
10 0.61189234 160 nips-2013-Learning Stochastic Feedforward Neural Networks
11 0.61062616 85 nips-2013-Deep content-based music recommendation
12 0.60651284 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
13 0.58497286 84 nips-2013-Deep Neural Networks for Object Detection
14 0.58436412 75 nips-2013-Convex Two-Layer Modeling
15 0.56767881 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
16 0.55775112 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking
17 0.54387259 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
18 0.52724749 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
19 0.49421921 65 nips-2013-Compressive Feature Learning
20 0.48931772 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
topicId topicWeight
[(2, 0.015), (16, 0.041), (33, 0.191), (34, 0.13), (37, 0.147), (41, 0.023), (49, 0.044), (56, 0.098), (70, 0.05), (85, 0.03), (89, 0.045), (93, 0.115), (95, 0.01)]
simIndex simValue paperId paperTitle
1 0.92150062 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
Author: Yanshuai Cao, Marcus A. Brubaker, David Fleet, Aaron Hertzmann
Abstract: We propose an efficient optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-ofart performance in discrete cases and competitive results in the continuous case. 1
same-paper 2 0.91334504 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
3 0.90166843 290 nips-2013-Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?
Author: Qiang Liu, Alex Ihler, Mark Steyvers
Abstract: We study the problem of estimating continuous quantities, such as prices, probabilities, and point spreads, using a crowdsourcing approach. A challenging aspect of combining the crowd’s answers is that workers’ reliabilities and biases are usually unknown and highly diverse. Control items with known answers can be used to evaluate workers’ performance, and hence improve the combined results on the target items with unknown answers. This raises the problem of how many control items to use when the total number of items each workers can answer is limited: more control items evaluates the workers better, but leaves fewer resources for the target items that are of direct interest, and vice versa. We give theoretical results for this problem under different scenarios, and provide a simple rule of thumb for crowdsourcing practitioners. As a byproduct, we also provide theoretical analysis of the accuracy of different consensus methods. 1
4 0.87296629 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
Author: Tamir Hazan, Subhransu Maji, Joseph Keshet, Tommi Jaakkola
Abstract: In this work we develop efficient methods for learning random MAP predictors for structured label problems. In particular, we construct posterior distributions over perturbations that can be adjusted via stochastic gradient methods. We show that any smooth posterior distribution would suffice to define a smooth PAC-Bayesian risk bound suitable for gradient methods. In addition, we relate the posterior distributions to computational properties of the MAP predictors. We suggest multiplicative posteriors to learn super-modular potential functions that accompany specialized MAP predictors such as graph-cuts. We also describe label-augmented posterior models that can use efficient MAP approximations, such as those arising from linear program relaxations. 1
5 0.8676821 99 nips-2013-Dropout Training as Adaptive Regularization
Author: Stefan Wager, Sida Wang, Percy Liang
Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1
6 0.86609316 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
7 0.86456287 30 nips-2013-Adaptive dropout for training deep neural networks
8 0.86318451 64 nips-2013-Compete to Compute
9 0.8622309 5 nips-2013-A Deep Architecture for Matching Short Texts
10 0.8618142 183 nips-2013-Mapping paradigm ontologies to and from the brain
11 0.8607989 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
12 0.86038315 201 nips-2013-Multi-Task Bayesian Optimization
13 0.85583246 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
14 0.85569167 331 nips-2013-Top-Down Regularization of Deep Belief Networks
15 0.85223573 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
16 0.85106122 301 nips-2013-Sparse Additive Text Models with Low Rank Background
17 0.85030633 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
18 0.84950697 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
19 0.84920222 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
20 0.8481344 215 nips-2013-On Decomposing the Proximal Map