nips nips2013 nips2013-64 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). [sent-2, score-0.58]
2 NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. [sent-4, score-0.281]
3 One of the long-studied properties of biological neural circuits which has yet to fully impact the machine learning community is the nature of local competition. [sent-7, score-0.144]
4 In this paper, we propose a biologically inspired mechanism for artificial neural networks that is based on local competition, and ultimately relies on local winner-take-all (LWTA) behavior. [sent-9, score-0.23]
5 Our experiments also show evidence that a type of modularity emerges in LWTA networks trained in a supervised setting, such that different modules (subnetworks) respond to different inputs. [sent-12, score-0.195]
6 We then show how LWTA networks perform on a variety of tasks, and how it helps buffer against catastrophic forgetting. [sent-15, score-0.228]
7 2 Neuroscience Background. Competitive interactions between neurons and neural circuits have long played an important role in biological models of brain processes. [sent-16, score-0.402]
8 The earliest models to describe the emergence of winner-take-all (WTA) behavior from local competition were based on Grossberg’s shunting short-term memory equations [4], which showed that a center-surround structure not only enables WTA dynamics, but also contrast enhancement, and normalization. [sent-21, score-0.148]
9 Analysis of their dynamics showed that networks with slower-than-linear signal functions uniformize input patterns; linear signal functions preserve and normalize input patterns; and faster-than-linear signal functions enable WTA dynamics. [sent-22, score-0.22]
10 The functional properties of competitive interactions have been further studied to show, among other things, the effects of distance-dependent kernels [8], inhibitory time lags [8, 9], development of self-organizing maps [10, 11, 12], and the role of WTA networks in attention [13]. [sent-24, score-0.242]
11 Biological models have also been extended to show how competitive interactions in spiking neural networks give rise to (soft) WTA dynamics [14], as well as how they may be efficiently constructed in VLSI [15, 16]. [sent-25, score-0.283]
12 Although competitive interactions, and WTA dynamics have been studied extensively in the biological literature, it is only more recently that they have been considered from computational or machine learning perspectives. [sent-26, score-0.126]
13 For example, Maass [17, 18] showed that feedforward neural networks with WTA dynamics as the only non-linearity are as computationally powerful as networks with threshold or sigmoidal gates, and that networks employing only soft WTA competition are universal function approximators. [sent-27, score-0.616]
14 Moreover, these results hold, even when the network weights are strictly positive—a finding which has ramifications for our understanding of biological neural circuits, as well as the development of neural networks for pattern recognition. [sent-28, score-0.352]
15 Nonetheless, networks employing local competition have existed since the late 80s [21], and, along with [22], serve as a primary inspiration for the present work. [sent-30, score-0.249]
16 More recently, maxout networks [19] have leveraged locally competitive interactions in combination with a technique known as dropout [20] to obtain the best results on certain benchmark problems. [sent-31, score-0.376]
17 3 Networks with local winner-take-all blocks. This section describes the general network architecture with locally competing neurons. [sent-32, score-0.217]
18 The network consists of B blocks which are organized into layers (Figure 1). [sent-33, score-0.239]
19 Each block i, i = 1, ..., B, contains n computational units (neurons) and produces an output vector y_i, determined by the local interactions between the individual neuron activations in the block: y_i^j = g(h_i^1, h_i^2, ..., h_i^n), (1) [sent-36, score-0.298]
20 where h_i^j, j = 1, ..., n, is the activation of the j-th neuron in block i, computed by: h_i^j = f(w_ij^T x), (2) where x is the input vector from neurons in the previous layer, w_ij is the weight vector of neuron j in block i, and f(·) is a (generally non-linear) activation function. [sent-41, score-0.661]
21 The output activations y are passed as inputs to the next layer. [sent-42, score-0.119]
22 In order to investigate the capabilities of the hard winner-take-all interaction function in isolation, f(x) = x is used, so that the block-wise competition is the only non-linearity in the network. Figure 1: A Local Winner-Take-All (LWTA) network with blocks of size two, showing the winning neuron in each block (shaded) for a given input example. [sent-48, score-0.346]
23 Activations flow forward only through the winning neurons; errors are backpropagated through the active neurons. [sent-49, score-0.147]
24 The active neurons form a subnetwork of the full network which changes depending on the inputs. [sent-51, score-0.443]
25 During training the error signal is only backpropagated through the winning neurons. [sent-54, score-0.142]
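To make equations (1)-(2) and the winner-take-all selection concrete, the following is a minimal NumPy sketch (an illustration under assumed shapes and block layout, not the authors' code) of one LWTA layer with f(x) = x and the hard interaction in which the winner keeps its activation and the other neurons in its block output zero; the same winner mask can be used to gate the backward pass so that errors reach only the winning neurons.

    import numpy as np

    def lwta_layer(x, W, block_size=2):
        """x: (batch, d_in); W: (d_in, n_blocks * block_size). Returns (y, winner mask)."""
        h = x @ W                                        # h_i^j = w_ij^T x, with f(x) = x
        blocks = h.reshape(h.shape[0], -1, block_size)   # (batch, n_blocks, block_size)
        mask = (blocks == blocks.max(axis=2, keepdims=True)).astype(h.dtype)
        y = (blocks * mask).reshape(h.shape)             # losing neurons output zero
        return y, mask.reshape(h.shape)                  # mask also gates backpropagated errors

    x = np.random.randn(4, 10)                           # hypothetical batch of 4 inputs
    W = np.random.randn(10, 6)                           # 3 blocks of size 2
    y, mask = lwta_layer(x, W)
    print(y.shape, mask.sum(axis=1))                     # one winner per block: 3 winners per example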
26 In an LWTA layer, there are as many neurons active at any one time for a given input pattern as there are blocks (footnote 1). [sent-55, score-0.385]
27 We denote a layer with blocks of size n as LWTA-n. [sent-56, score-0.177]
28 For each input pattern presented to a network, only a subgraph of the full network is active. [sent-57, score-0.116]
29 Training on a dataset consists of simultaneously training an exponential number of models that share parameters, as well as learning which model should be active for each pattern. [sent-60, score-0.119]
30 Unlike networks with sigmoidal units, where all of the free parameters need to be set properly for all input patterns, only a subset is used for any given input, so that patterns coming from very different sub-distributions can potentially be modelled more efficiently through specialization. [sent-61, score-0.239]
31 This modular property is similar to that of networks with rectified linear units (ReLU), which have recently been shown to be very good at several learning tasks (links with ReLU are discussed in section 4.3). [sent-62, score-0.232]
32 4 Comparison with related methods. 4.1 Max-pooling. Neural networks with max-pooling layers [23] have been found to be very useful, especially for image classification tasks where they have achieved state-of-the-art performance [24, 25]. [sent-65, score-0.216]
33 These layers are usually used in convolutional neural networks to subsample the representation obtained after convolving the input with a learned filter, by dividing the representation into pools and selecting the maximum in each one. [sent-66, score-0.357]
34 1 However, there is always the possibility that the winning neuron in a block has an activation of exactly zero, so that the block has no output. [sent-68, score-0.321]
35 (a) In max-pooling, each group of neurons in a layer has a single set of output weights that transmits the winning unit's activation. [sent-77, score-0.56]
36 The activations flow into subsequent units via a different set of connections depending on the winning unit. [sent-82, score-0.232]
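As a small illustration of this difference (a sketch with made-up numbers, not taken from the figure), the snippet below contrasts the two operations for groups/blocks of size two: max-pooling collapses each group to a single output that always feeds the same outgoing weights, while LWTA keeps one output position per neuron, so the winning value reaches the next layer through that neuron's own outgoing connections.

    import numpy as np

    h = np.array([[0.3, 0.8, -0.2, 0.5]])          # pre-activations: two groups/blocks of size 2

    # Max-pooling: one output per group, always routed through the group's single weight set
    maxpool_out = h.reshape(1, 2, 2).max(axis=2)    # shape (1, 2) -> [[0.8, 0.5]]

    # LWTA: one output per neuron, losers zeroed, so routing depends on which neuron wins
    blocks = h.reshape(1, 2, 2)
    mask = blocks == blocks.max(axis=2, keepdims=True)
    lwta_out = (blocks * mask).reshape(1, 4)        # shape (1, 4) -> [[0.0, 0.8, 0.0, 0.5]]

    print(maxpool_out, lwta_out)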
37 4.2 Dropout. Dropout [20] can be interpreted as a model-averaging technique that jointly trains several models sharing subsets of parameters and input dimensions, or as data augmentation when applied to the input layer [19, 20]. [sent-85, score-0.193]
38 This is achieved by probabilistically omitting (“dropping”) units from a network for each example during training, so that those neurons do not participate in forward/backward propagation. [sent-86, score-0.435]
39 Consider, hypothetically, training an LWTA network with blocks of size two, and selecting the winner in each block at random. [sent-87, score-0.278]
40 This is similar to training a neural network with a dropout probability of 0.5. [sent-88, score-0.287]
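The analogy can be seen directly by comparing the two kinds of masks (a toy sketch under the stated assumption of blocks of size two): a random-winner LWTA mask keeps exactly one unit per pair, whereas an independent dropout mask with p = 0.5 may keep both units of a pair or neither.

    import numpy as np

    rng = np.random.default_rng(0)
    n_blocks = 4

    # Random-winner LWTA: exactly one survivor per block of two
    winners = rng.integers(0, 2, size=n_blocks)
    lwta_mask = np.zeros((n_blocks, 2))
    lwta_mask[np.arange(n_blocks), winners] = 1.0

    # Dropout with p = 0.5: each unit kept independently with probability 0.5
    dropout_mask = (rng.random((n_blocks, 2)) < 0.5).astype(float)

    print(lwta_mask.ravel())      # exactly one kept unit per pair
    print(dropout_mask.ravel())   # pairs may be 11, 10, 01, or 00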
41 Dropout is a regularization technique while in LWTA the interaction between neurons in a block replaces the per-neuron non-linear activation. [sent-91, score-0.3]
42 Dropout is believed to improve generalization performance since it forces the units to learn independent features, without relying on other units being active. [sent-92, score-0.196]
43 During testing, when propagating an input through the network, all units in a layer trained with dropout are used with their output weights suitably scaled. [sent-93, score-0.44]
44 A fraction of the units will be inactive for each input pattern depending on their total inputs. [sent-95, score-0.191]
45 4.3 Rectified Linear units. Rectified Linear Units (ReLU) are simply linear neurons that clamp negative activations to zero (f(x) = x if x > 0, f(x) = 0 otherwise). [sent-99, score-0.435]
46 ReLU networks were shown to be useful for Restricted Boltzmann Machines [26], outperformed sigmoidal activation functions in deep neural networks [27], and have been used to obtain the best results on several benchmark problems across multiple domains [24, 28]. [sent-100, score-0.5]
47 Consider an LWTA block with two neurons compared to two ReLU neurons, where x1 and x2 are the weighted sum of the inputs to each neuron. [sent-101, score-0.329]
48 The difference is that in LWTA the two neurons are never both active or both inactive at the same time, and the activations and errors flow through exactly one neuron in the block. [sent-104, score-0.433]
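The contrast can be written out as a small truth-table-style example (reconstructed here as an illustration of the behaviour described above, not copied from the paper's comparison table): two ReLU neurons clamp their negative inputs independently, so both may be active or both silent, while an LWTA-2 block with f(x) = x always passes exactly one of the two values and zeroes the other.

    def relu_pair(x1, x2):
        # Two independent ReLU neurons
        return (max(x1, 0.0), max(x2, 0.0))

    def lwta2_block(x1, x2):
        # Hard WTA with f(x) = x: the larger pre-activation wins, the other outputs zero
        return (x1, 0.0) if x1 >= x2 else (0.0, x2)

    for x1, x2 in [(2.0, 1.0), (1.0, 2.0), (2.0, -1.0), (-1.0, -2.0)]:
        print((x1, x2), "ReLU:", relu_pair(x1, x2), "LWTA-2:", lwta2_block(x1, x2))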
49 For ReLU neurons, being inactive (saturation) is a potential drawback, since a neuron that stays inactive receives no error signal and cannot recover on its own. Table 1: Comparison of rectified linear activation and LWTA-2. [sent-105, score-0.388]
50 Continued research along these lines validates this hypothesis [29], but it is expected that it is possible to train ReLU networks better. [sent-108, score-0.134]
51 While many of the above arguments for and against ReLU networks apply to LWTA networks, there is a notable difference. [sent-109, score-0.134]
52 During training of an LWTA network, inactive neurons can become active due to training of the other neurons in the same block. [sent-110, score-0.665]
53 5 Experiments. In the following experiments, LWTA networks were tested on various supervised learning datasets, demonstrating their ability to learn useful internal representations without utilizing any other non-linearities. [sent-112, score-0.158]
54 In order to clearly assess the utility of local competition, no special strategies such as augmenting data with transformations, noise or dropout were used. [sent-113, score-0.147]
55 We also did not encourage sparse representations in the hidden layers by adding activation penalties to the objective function, a common technique also for ReLU units. [sent-114, score-0.233]
56 L2 weight decay was used for the convolutional network (section 5.2). [sent-118, score-0.171]
57 5.1 Permutation Invariant MNIST. The MNIST handwritten digit recognition task consists of 70,000 28×28 images (60,000 training, 10,000 test) of the 10 digits, centered by their center of mass [33]. [sent-122, score-0.148]
58 Table 2: Test set errors on the permutation invariant MNIST dataset for methods without data augmentation or unsupervised pre-training, comparing Sigmoid [32], ReLU [27], ReLU + dropout in hidden layers [20], and LWTA-2 activations. [sent-129, score-0.284]
59 Table 3: Test set errors on the MNIST dataset for convolutional architectures with no data augmentation. [sent-133, score-0.135]
60 Results marked with an asterisk use layer-wise unsupervised feature learning to pre-train the network and global fine tuning. [sent-134, score-0.116]
61 Architectures compared in Table 3: 2-layer CNN + 2-layer MLP [34]*, 2-layer ReLU CNN + 2-layer LWTA-2, 3-layer ReLU CNN [35], 2-layer CNN + 2-layer MLP [36]*, 3-layer ReLU CNN + stochastic pooling [33], and 3-layer maxout + dropout [19]; test errors are reported for each architecture. [sent-135, score-0.498]
62 The best network, with a test error of 1.28%, consisted of three LWTA layers of 500 blocks followed by a 10-way softmax layer. [sent-143, score-0.192]
63 The performance of LWTA is comparable to that of a ReLU network with dropout in the hidden layers. [sent-146, score-0.208]
64 Using dropout in the input layer as well, lower error rates have been reported. [sent-147, score-0.224]
65 5.2 Convolutional Network on MNIST. For this experiment, a convolutional network (CNN) was used, consisting of 7 × 7 filters in the first layer followed by a second layer of 6 × 6 filters, with 16 and 32 maps respectively, and ReLU activation. [sent-151, score-0.416]
66 Every convolutional layer is followed by a 2 × 2 max-pooling operation. [sent-152, score-0.191]
67 We then use two LWTA-2 layers each with 64 blocks and finally a 10-way softmax output layer. [sent-153, score-0.22]
68 5.3 Amazon Sentiment Analysis. LWTA networks were tested on the Amazon sentiment analysis dataset [37], since ReLU units have been shown to perform well in this domain [27, 38]. [sent-158, score-0.309]
69 ReLU activation was used on this dataset in the context of unsupervised learning with denoising autoencoders to obtain sparse feature representations which were used for classification. [sent-165, score-0.18]
70 We trained an LWTA-2 network with three layers of 500 blocks each in a supervised setting to directly classify each review as positive or negative using a 2-way softmax output layer. [sent-166, score-0.428]
71 Table 4: LWTA networks outperform sigmoid and ReLU activation in remembering dataset P1 after training on dataset P2. [sent-173, score-0.385]
72 6 Implicit long term memory. This section examines the effect of the LWTA architecture on catastrophic forgetting. [sent-186, score-0.124]
73 That is, does the fact that the network implements multiple models allow it to retain information about dataset A, even after being trained on a different dataset B? [sent-187, score-0.208]
74 To test for this implicit long term memory, the MNIST training and test sets were each divided into two parts, P1 containing only digits {0, 1, 2, 3, 4}, and P2 consisting of the remaining digits {5, 6, 7, 8, 9}. [sent-188, score-0.22]
75 Three different network architectures were compared: (1) three LWTA layers each with 500 blocks of size 2, (2) three layers each with 1000 sigmoidal neurons, and (3) three layers each of 1000 ReLU neurons. [sent-189, score-0.487]
76 All networks have a 5-way softmax output layer representing the probability of an example belonging to each of the five classes. [sent-190, score-0.317]
77 All networks were initialized with the same parameters, and trained with a fixed learning rate and momentum. [sent-191, score-0.195]
78 The weights for the output layer (corresponding to the softmax classifier) were then stored, and the network was trained further, starting with new initial random output layer weights, to reach the same log-likelihood value on P2. [sent-195, score-0.474]
79 Finally, the output layer weights saved from P1 were restored, and the network was evaluated on the P1 test set. [sent-196, score-0.23]
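The following is a self-contained sketch of this evaluation protocol (a toy reconstruction under stated assumptions, not the paper's code): a tiny NumPy network with one LWTA-2 hidden layer and a softmax output is trained on a synthetic stand-in for P1, its output-layer weights are stored, the network is trained further on a stand-in for P2 with freshly initialized output weights, and finally the stored P1 output weights are restored for evaluation. The architecture size, learning rate, number of epochs, and the synthetic Gaussian data are all assumptions made for brevity.

    import numpy as np

    rng = np.random.default_rng(0)
    D, BLOCKS, C = 20, 32, 5          # input dim, LWTA-2 blocks, classes per task

    def make_task(seed):
        """Synthetic stand-in for P1/P2: five Gaussian clusters per task."""
        r = np.random.default_rng(seed)
        means = r.normal(0, 3, size=(C, D))
        X = np.concatenate([r.normal(m, 1.0, size=(200, D)) for m in means])
        y = np.repeat(np.arange(C), 200)
        return X, y

    def forward(X, W1, W2):
        h = X @ W1                                   # pre-activations, f(x) = x
        blocks = h.reshape(len(X), BLOCKS, 2)
        mask = (blocks == blocks.max(axis=2, keepdims=True)).astype(float)
        a = (blocks * mask).reshape(len(X), -1)      # losers output zero
        logits = a @ W2
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return a, mask.reshape(len(X), -1), p

    def train(X, y, W1, W2, epochs=200, lr=0.05):
        Y = np.eye(C)[y]
        for _ in range(epochs):
            a, mask, p = forward(X, W1, W2)
            dlogits = (p - Y) / len(X)               # softmax cross-entropy gradient
            dW2 = a.T @ dlogits
            dh = (dlogits @ W2.T) * mask             # errors flow only through winners
            dW1 = X.T @ dh
            W1 -= lr * dW1
            W2 -= lr * dW2
        return W1, W2

    def accuracy(X, y, W1, W2):
        return (forward(X, W1, W2)[2].argmax(axis=1) == y).mean()

    X1, y1 = make_task(1)                             # stand-in for P1
    X2, y2 = make_task(2)                             # stand-in for P2
    W1 = rng.normal(0, 0.1, size=(D, 2 * BLOCKS))
    W2 = rng.normal(0, 0.1, size=(2 * BLOCKS, C))

    W1, W2 = train(X1, y1, W1, W2)                    # phase 1: train on P1
    W2_p1 = W2.copy()                                 # store P1 output-layer weights
    W2 = rng.normal(0, 0.1, size=(2 * BLOCKS, C))     # fresh softmax weights
    W1, W2 = train(X2, y2, W1, W2)                    # phase 2: train on P2
    print("P1 accuracy after training on P2:", accuracy(X1, y1, W1, W2_p1))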
80 Table 4 shows that the LWTA network remembers what was learned from P1 much better than sigmoid and ReLU networks, though it is notable that the sigmoid network performs much worse than both LWTA and ReLU. [sent-198, score-0.28]
81 While the test error values depend on the learning rate and momentum used, LWTA networks tended to remember better than the ReLU network by about a factor of two in most cases, and sigmoid networks always performed much worse. [sent-199, score-0.439]
82 Although standard network architectures are known to suffer from catastrophic forgetting, we show here, for the first time, not only that ReLU networks are actually quite good in this regard, but also that they are outperformed by LWTA networks. [sent-200, score-0.346]
83 The neurons encoding specific features in one dataset are not affected much during training on another dataset, whereas neurons encoding common features can be reused. [sent-202, score-0.587]
84 7 Analysis of subnetworks. A network with a single LWTA-m layer of N blocks consists of m^N subnetworks which can be selected and trained for individual examples while training over a dataset. [sent-204, score-0.635]
85 After training, we expect the subnetworks of active neurons for examples from the same class to have more neurons in common than the subnetworks activated for examples from different classes. [sent-205, score-0.961]
86 In the case of relatively simple datasets like MNIST, it is possible to examine the number of common neurons between mean subnetworks which are used for each class. [sent-206, score-0.457]
87 To do this, we recorded which neurons in the layer were active for each example in a subset of 10,000 examples. [sent-207, score-0.405]
88 For each class, the subnetwork consisting of neurons active for at least 90% of the examples was designated the representative mean subnetwork, which was then compared to all other class subnetworks by counting the number of neurons in common. [sent-208, score-0.808]
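A minimal NumPy sketch of this analysis (an assumed re-implementation with placeholder activity records, not the authors' analysis code) is given below: record a boolean winner matrix over examples, form each class's mean subnetwork from the neurons active for at least 90% of that class's examples, and count the neurons two classes share. The normalization of the overlap fraction is an assumption made here for illustration.

    import numpy as np

    def class_mean_subnetworks(active, labels, n_classes, threshold=0.9):
        """active: (n_examples, n_neurons) boolean winner records; returns (n_classes, n_neurons)."""
        return np.stack([active[labels == c].mean(axis=0) >= threshold
                         for c in range(n_classes)])

    def overlap_fraction(sub_a, sub_b):
        # Fraction of shared neurons, normalized by the larger subnetwork (assumed convention)
        return (sub_a & sub_b).sum() / max(sub_a.sum(), sub_b.sum(), 1)

    # Toy stand-in for recorded winner masks on 10,000 examples and 1,000 neurons
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 10, size=10000)
    prefs = rng.random((10, 1000)) < 0.3                     # class-specific preferred neurons
    active = rng.random((10000, 1000)) < np.where(prefs[labels], 0.97, 0.03)

    subs = class_mean_subnetworks(active, labels, n_classes=10)
    print(overlap_fraction(subs[3], subs[8]))                # e.g. the digit pair "3" and "8"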
89 Figure 3a shows the fraction of neurons in common between the mean subnetworks of each pair of digits. [sent-209, score-0.486]
90 Digits that are morphologically similar such as “3” and “8” have subnetworks with more neurons in common than the subnetworks for digits “1” and “2” or “1” and “5” which are intuitively less similar. [sent-210, score-0.721]
91 To verify that this subnetwork specialization is a result of training, we also looked at the fraction of common neurons between all pairs of digits for the untrained network. [sent-211, score-0.495]
92 (Figure 3 plot data: legend "untrained" / "trained"; axis "Fraction of neurons in common"; digit labels 0-9; x-axis "MNIST digit pairs".) [sent-220, score-0.27]
93 Figure 3: (a) Each entry in the matrix denotes the fraction of neurons that a pair of MNIST digits has in common, on average, in the subnetworks that are most active for each of the two digit classes. [sent-221, score-0.677]
94 (b) The fraction of neurons in common in the subnetworks of each of the 55 possible digit pairs, before and after training. [sent-222, score-0.531]
95 Clearly, the subnetworks were much more similar prior to training, and the full network has learned to partition its parameters to reflect the structure of the data. [sent-224, score-0.278]
96 8 Conclusion and future research directions. Our LWTA networks automatically self-modularize into multiple parameter-sharing subnetworks responding to different input representations. [sent-225, score-0.346]
97 Without significant degradation of state-of-the-art results on digit recognition and sentiment analysis, LWTA networks also avoid catastrophic forgetting, thus retaining useful representations of one set of inputs even after being trained to classify another. [sent-226, score-0.489]
98 Hinton, Srivastava, Krizhevsky, Sutskever, and Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, 2012. [sent-308, score-0.17]
99 Simard, Steinkraus, and Platt, Best practices for convolutional neural networks applied to visual document analysis. [sent-359, score-0.25]
100 Zeiler and Fergus, Stochastic pooling for regularization of deep convolutional neural networks. [sent-369, score-0.152]
wordName wordTfidf (topN-words)
[('lwta', 0.608), ('relu', 0.421), ('neurons', 0.246), ('subnetworks', 0.187), ('wta', 0.151), ('erent', 0.143), ('networks', 0.134), ('dropout', 0.117), ('layer', 0.111), ('di', 0.108), ('activation', 0.103), ('units', 0.098), ('catastrophic', 0.094), ('network', 0.091), ('competition', 0.085), ('layers', 0.082), ('convolutional', 0.08), ('recti', 0.078), ('digits', 0.077), ('cnn', 0.072), ('winning', 0.072), ('blocks', 0.066), ('rey', 0.066), ('activations', 0.062), ('mnist', 0.062), ('trained', 0.061), ('subnetwork', 0.058), ('sigmoidal', 0.057), ('biological', 0.055), ('block', 0.054), ('geo', 0.049), ('rupesh', 0.049), ('sigmoid', 0.049), ('sentiment', 0.049), ('active', 0.048), ('maxout', 0.048), ('forgetting', 0.046), ('digit', 0.045), ('softmax', 0.044), ('continual', 0.044), ('juergen', 0.044), ('training', 0.043), ('interactions', 0.042), ('inactive', 0.039), ('neuron', 0.038), ('dynamics', 0.036), ('deep', 0.036), ('neural', 0.036), ('yoshua', 0.035), ('competitive', 0.035), ('brazier', 0.033), ('cudamat', 0.033), ('dvds', 0.033), ('interneuron', 0.033), ('rgen', 0.033), ('shunting', 0.033), ('sohrob', 0.033), ('yann', 0.032), ('augmentation', 0.032), ('inhibitory', 0.031), ('momentum', 0.031), ('mf', 0.03), ('architecture', 0.03), ('local', 0.03), ('negative', 0.029), ('fraction', 0.029), ('gnumpy', 0.029), ('forget', 0.029), ('nns', 0.029), ('vlsi', 0.029), ('inputs', 0.029), ('dataset', 0.028), ('output', 0.028), ('ow', 0.027), ('backpropagated', 0.027), ('mary', 0.027), ('wolfgang', 0.027), ('classify', 0.027), ('architectures', 0.027), ('hj', 0.026), ('recognition', 0.026), ('er', 0.026), ('srivastava', 0.026), ('unsupervised', 0.025), ('antoine', 0.025), ('aurelio', 0.025), ('input', 0.025), ('organization', 0.025), ('common', 0.024), ('erence', 0.024), ('hippocampal', 0.024), ('winner', 0.024), ('representations', 0.024), ('consisting', 0.023), ('circuits', 0.023), ('electronics', 0.023), ('bordes', 0.023), ('formation', 0.023), ('xavier', 0.023), ('patterns', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 64 nips-2013-Compete to Compute
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
2 0.21971199 30 nips-2013-Adaptive dropout for training deep neural networks
Author: Jimmy Ba, Brendan Frey
Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1
3 0.1701676 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
4 0.16883615 267 nips-2013-Recurrent networks of coupled Winner-Take-All oscillators for solving constraint satisfaction problems
Author: Hesham Mostafa, Lorenz K. Mueller, Giacomo Indiveri
Abstract: We present a recurrent neuronal network, modeled as a continuous-time dynamical system, that can solve constraint satisfaction problems. Discrete variables are represented by coupled Winner-Take-All (WTA) networks, and their values are encoded in localized patterns of oscillations that are learned by the recurrent weights in these networks. Constraints over the variables are encoded in the network connectivity. Although there are no sources of noise, the network can escape from local optima in its search for solutions that satisfy all constraints by modifying the effective network connectivity through oscillations. If there is no solution that satisfies all constraints, the network state changes in a seemingly random manner and its trajectory approximates a sampling procedure that selects a variable assignment with a probability that increases with the fraction of constraints satisfied by this assignment. External evidence, or input to the network, can force variables to specific values. When new inputs are applied, the network re-evaluates the entire set of variables in its search for states that satisfy the maximum number of constraints, while being consistent with the external input. Our results demonstrate that the proposed network architecture can perform a deterministic search for the optimal solution to problems with non-convex cost functions. The network is inspired by canonical microcircuit models of the cortex and suggests possible dynamical mechanisms to solve constraint satisfaction problems that can be present in biological networks, or implemented in neuromorphic electronic circuits. 1
5 0.16676927 339 nips-2013-Understanding Dropout
Author: Pierre Baldi, Peter J. Sadowski
Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1
6 0.14337701 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
7 0.13485058 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
8 0.12680694 331 nips-2013-Top-Down Regularization of Deep Belief Networks
9 0.12575954 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit
10 0.11936332 5 nips-2013-A Deep Architecture for Matching Short Texts
11 0.11038092 210 nips-2013-Noise-Enhanced Associative Memories
12 0.10881405 99 nips-2013-Dropout Training as Adaptive Regularization
13 0.10415907 344 nips-2013-Using multiple samples to learn mixture models
14 0.1009387 121 nips-2013-Firing rate predictions in optimal balanced networks
15 0.10028106 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
16 0.095454812 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
17 0.093969278 208 nips-2013-Neural representation of action sequences: how far can a simple snippet-matching model take us?
18 0.093267284 75 nips-2013-Convex Two-Layer Modeling
19 0.086786918 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
20 0.085372023 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
topicId topicWeight
[(0, 0.156), (1, 0.104), (2, -0.191), (3, -0.14), (4, -0.081), (5, -0.152), (6, -0.104), (7, -0.012), (8, 0.02), (9, -0.123), (10, 0.235), (11, -0.015), (12, -0.011), (13, 0.065), (14, 0.068), (15, 0.017), (16, -0.048), (17, 0.035), (18, 0.059), (19, -0.042), (20, 0.044), (21, 0.024), (22, -0.077), (23, -0.003), (24, -0.023), (25, -0.057), (26, -0.03), (27, 0.026), (28, -0.005), (29, 0.004), (30, -0.008), (31, -0.071), (32, 0.029), (33, 0.048), (34, -0.045), (35, 0.03), (36, 0.074), (37, 0.046), (38, 0.047), (39, 0.015), (40, -0.049), (41, -0.033), (42, 0.022), (43, 0.059), (44, 0.012), (45, -0.005), (46, -0.082), (47, -0.122), (48, -0.031), (49, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.94956678 64 nips-2013-Compete to Compute
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
2 0.78052449 30 nips-2013-Adaptive dropout for training deep neural networks
Author: Jimmy Ba, Brendan Frey
Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1
3 0.6862666 339 nips-2013-Understanding Dropout
Author: Pierre Baldi, Peter J. Sadowski
Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1
4 0.66081548 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
Author: Michiel Hermans, Benjamin Schrauwen
Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1
5 0.64839083 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
7 0.59851313 121 nips-2013-Firing rate predictions in optimal balanced networks
8 0.5719409 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
10 0.5624423 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit
11 0.55398297 5 nips-2013-A Deep Architecture for Matching Short Texts
12 0.55040014 331 nips-2013-Top-Down Regularization of Deep Belief Networks
13 0.5458473 210 nips-2013-Noise-Enhanced Associative Memories
14 0.51092637 99 nips-2013-Dropout Training as Adaptive Regularization
15 0.5005846 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
16 0.49944809 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
17 0.47643629 208 nips-2013-Neural representation of action sequences: how far can a simple snippet-matching model take us?
18 0.47249398 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
19 0.46046612 157 nips-2013-Learning Multi-level Sparse Representations
20 0.44043389 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
topicId topicWeight
[(2, 0.012), (16, 0.027), (17, 0.164), (21, 0.013), (33, 0.135), (34, 0.109), (37, 0.015), (41, 0.018), (49, 0.094), (56, 0.065), (70, 0.092), (85, 0.025), (89, 0.026), (93, 0.086), (95, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.86138773 64 nips-2013-Compete to Compute
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
2 0.81953019 284 nips-2013-Robust Spatial Filtering with Beta Divergence
Author: Wojciech Samek, Duncan Blythe, Klaus-Robert Müller, Motoaki Kawanabe
Abstract: The efficiency of Brain-Computer Interfaces (BCI) largely depends upon a reliable extraction of informative features from the high-dimensional EEG signal. A crucial step in this protocol is the computation of spatial filters. The Common Spatial Patterns (CSP) algorithm computes filters that maximize the difference in band power between two conditions, thus it is tailored to extract the relevant information in motor imagery experiments. However, CSP is highly sensitive to artifacts in the EEG data, i.e. few outliers may alter the estimate drastically and decrease classification performance. Inspired by concepts from the field of information geometry we propose a novel approach for robustifying CSP. More precisely, we formulate CSP as a divergence maximization problem and utilize the property of a particular type of divergence, namely beta divergence, for robustifying the estimation of spatial filters in the presence of artifacts in the data. We demonstrate the usefulness of our method on toy data and on EEG recordings from 80 subjects. 1
3 0.79975343 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
Author: Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, Michael Jordan
Abstract: Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment. 1
4 0.79225731 121 nips-2013-Firing rate predictions in optimal balanced networks
Author: David G. Barrett, Sophie Denève, Christian K. Machens
Abstract: How are firing rates in a spiking network related to neural input, connectivity and network function? This is an important problem because firing rates are a key measure of network activity, in both the study of neural computation and neural network dynamics. However, it is a difficult problem, because the spiking mechanism of individual neurons is highly non-linear, and these individual neurons interact strongly through connectivity. We develop a new technique for calculating firing rates in optimal balanced networks. These are particularly interesting networks because they provide an optimal spike-based signal representation while producing cortex-like spiking activity through a dynamic balance of excitation and inhibition. We can calculate firing rates by treating balanced network dynamics as an algorithm for optimising signal representation. We identify this algorithm and then calculate firing rates by finding the solution to the algorithm. Our firing rate calculation relates network firing rates directly to network input, connectivity and function. This allows us to explain the function and underlying mechanism of tuning curves in a variety of systems. 1
5 0.7700389 157 nips-2013-Learning Multi-level Sparse Representations
Author: Ferran Diego Andilla, Fred A. Hamprecht
Abstract: Bilinear approximation of a matrix is a powerful paradigm of unsupervised learning. In some applications, however, there is a natural hierarchy of concepts that ought to be reflected in the unsupervised analysis. For example, in the neurosciences image sequence considered here, there are the semantic concepts of pixel → neuron → assembly that should find their counterpart in the unsupervised analysis. Driven by this concrete problem, we propose a decomposition of the matrix of observations into a product of more than two sparse matrices, with the rank decreasing from lower to higher levels. In contrast to prior work, we allow for both hierarchical and heterarchical relations of lower-level to higher-level concepts. In addition, we learn the nature of these relations rather than imposing them. Finally, we describe an optimization scheme that allows to optimize the decomposition over all levels jointly, rather than in a greedy level-by-level fashion. The proposed bilevel SHMF (sparse heterarchical matrix factorization) is the first formalism that allows to simultaneously interpret a calcium imaging sequence in terms of the constituent neurons, their membership in assemblies, and the time courses of both neurons and assemblies. Experiments show that the proposed model fully recovers the structure from difficult synthetic data designed to imitate the experimental data. More importantly, bilevel SHMF yields plausible interpretations of real-world Calcium imaging data. 1
6 0.76901722 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
7 0.76834542 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit
8 0.75593394 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning
9 0.75017792 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
10 0.74730974 183 nips-2013-Mapping paradigm ontologies to and from the brain
11 0.74669027 5 nips-2013-A Deep Architecture for Matching Short Texts
12 0.74667495 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
13 0.74646842 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
14 0.74556106 331 nips-2013-Top-Down Regularization of Deep Belief Networks
15 0.74476922 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
16 0.74370944 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
17 0.74304038 251 nips-2013-Predicting Parameters in Deep Learning
19 0.73624414 30 nips-2013-Adaptive dropout for training deep neural networks
20 0.73584032 301 nips-2013-Sparse Additive Text Models with Low Rank Background