nips nips2013 nips2013-334 knowledge-graph by maker-knowledge-mining

334 nips-2013-Training and Analysing Deep Recurrent Neural Networks


Source: pdf

Author: Michiel Hermans, Benjamin Schrauwen

Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. [sent-4, score-0.266]

2 In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. [sent-5, score-0.39]

3 Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. [sent-6, score-1.349]

4 We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. [sent-8, score-0.347]

5 Introduction: Over the last decade, machine learning has seen the rise of neural networks composed of multiple layers, which are often termed deep neural networks (DNN). [sent-10, score-0.299]

6 Each layer processes some part of the task we wish to solve, and passes it on to the next. [sent-13, score-0.528]

7 In this sense, the DNN can be seen as a processing pipeline, in which each layer solves a part of the task before passing it on to the next, until finally the last layer provides the output. [sent-14, score-1.002]

8 One type of network that debatably falls into the category of deep networks is the recurrent neural network (RNN). [sent-15, score-0.625]

9 The comparison to common deep networks falls short, however, when we consider the functionality of the network architecture. [sent-17, score-0.319]

10 For RNNs, the primary function of the layers is to introduce memory, not hierarchical processing. [sent-18, score-0.218]

11 New information is added in every ‘layer’ (every network iteration), and the network can pass this information on for an indefinite number of network updates, essentially providing the RNN with unlimited memory depth. [sent-19, score-0.411]

12 Whereas in DNNs input is only presented at the bottom layer, and output is only produced at the highest layer, RNNs generally receive input and produce output at each time step. [sent-20, score-0.293]

13 As such, the network updates do not provide hierarchical processing of the information per se, except in the respect that older data (provided several time steps ago) passes through the recursion more often. [sent-21, score-0.232]

14 More likely, the recurrent weights in an RNN learn during the training phase to select what information they need to pass onwards, and what they need to discard. [sent-23, score-0.237]

15 Each layer can be interpreted as an RNN that receives the time series of the previous layer as input. [sent-29, score-1.002]

16 Right: The two alternative architectures that we study in this paper, where the looped arrows represent the recurrent weights. [sent-30, score-0.23]

17 Either only the top layer connects to the output (DRNN-1O), or all layers do (DRNN-AO). [sent-31, score-0.836]

18 One potential weakness of a common RNN is that we may need complex, hierarchical processing of the current network input, but this information only passes through one layer of processing before going to the output. [sent-32, score-0.713]

19 Common RNNs do not explicitly support multiple time scales, and any temporal hierarchy that is present in the input signal needs to be embedded implicitly in the network dynamics. [sent-36, score-0.291]

20 The architecture we study in this paper is essentially a common DNN (a multilayer perceptron) with temporal feedback loops in each layer, which we call a deep recurrent neural network (DRNN). [sent-43, score-0.57]

21 At each network update, new information travels up the hierarchy, and temporal context is added in each layer (see Figure 1). [sent-44, score-0.698]

22 Each layer in the hierarchy is a recurrent neural network, and each subsequent layer receives the hidden state of the previous layer as input time series. [sent-46, score-1.882]

23 We suspect that DRNNs are well suited to capture temporal hierarchies, and character-based language modeling is an excellent real-world task to validate this claim, as the distribution of characters is highly nonlinear and covers both short- and long-term dependencies. [sent-49, score-0.332]

24 Using only stochastic gradient descent (SGD) we are able to get state-of-the-art performance for recurrent networks on a Wikipedia-based text corpus, which was previously only obtained using the far more advanced Hessian-free training algorithm [19]. [sent-51, score-0.426]

25 We denote the hidden state of the i-th layer with ā_i(t). [sent-57, score-0.565]

26 Here, W_i and Z_i are the recurrent connections and the connections from the lower layer or input time series, respectively. [sent-59, score-0.8]
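
As an illustration of that per-layer update, here is a minimal NumPy sketch; the tanh nonlinearity, the explicit bias term, and the toy dimensions are assumptions made for the example rather than details taken from this summary:

```python
import numpy as np

def drnn_step(a_prev, x_t, W, Z, b):
    """One DRNN time step: layer i combines its own previous state (via W[i]) with
    the freshly computed state of the layer below (via Z[i]); the bottom layer
    receives the external input x_t instead of a lower layer's state."""
    a_new, lower = [], x_t
    for W_i, Z_i, b_i, a_i_prev in zip(W, Z, b, a_prev):
        a_i = np.tanh(W_i @ a_i_prev + Z_i @ lower + b_i)
        a_new.append(a_i)
        lower = a_i                      # the next layer sees this layer's new state
    return a_new

# Toy sizes: 3 layers of 100 units, one-hot characters from a 95-symbol alphabet.
rng = np.random.default_rng(0)
n_layers, n_hid, n_in = 3, 100, 95
W = [rng.normal(scale=0.01, size=(n_hid, n_hid)) for _ in range(n_layers)]
Z = [rng.normal(scale=0.01, size=(n_hid, n_in if i == 0 else n_hid)) for i in range(n_layers)]
b = [np.zeros(n_hid) for _ in range(n_layers)]
a = [np.zeros(n_hid) for _ in range(n_layers)]
a = drnn_step(a, np.eye(n_in)[7], W, Z, b)   # feed a single one-hot character
```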

27 The bottom layer has fading memory of the input signal. [sent-62, score-0.675]

28 The next layer has fading memory of the hidden state of the bottom layer, and consequently a fading memory of the input which reaches further in the past, and so on for each additional layer. [sent-63, score-0.817]

29 We will consider two scenarios: that where only the highest layer in the hierarchy couples to the output (DRNN-1O), and that where all layers do (DRNN-AO). [sent-67, score-0.88]

30 In the two respective cases, y(t) is given by y(t) = softmax(U ā_L(t)) (1), where U is the matrix with the output weights, and y(t) = softmax(Σ_{i=1}^{L} U_i ā_i(t)) (2), such that U_i corresponds to the output weights of the i-th layer. [sent-68, score-0.3]
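
A short sketch of the two readouts corresponding to Eq. (1) and Eq. (2); the list of per-layer states `a` and the weight shapes are assumed to line up and are illustrative only:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def output_drnn_1o(a, U):
    """Eq. (1): only the top layer's hidden state a[-1] feeds the softmax output."""
    return softmax(U @ a[-1])

def output_drnn_ao(a, U_list):
    """Eq. (2): every layer has its own output weights U_i; their contributions are
    summed before the softmax."""
    return softmax(sum(U_i @ a_i for U_i, a_i in zip(U_list, a)))
```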

31 The reason that we use output connections at each layer is twofold. [sent-70, score-0.642]

32 If we use backpropagation through time, the error will propagate from the top layer down the hierarchy, but it will have diminished in magnitude once it reaches the lower layers, such that they are not trained effectively. [sent-72, score-0.603]

33 Adding output connections at each layer amends this problem to some degree as the training error reaches all layers directly. [sent-73, score-0.869]

34 Secondly, having output connections at each layer provides us with a crude measure of its role in solving the task. [sent-74, score-0.642]

35 We can for instance measure the decay of performance by leaving out an individual layer’s contribution, or study which layer contributes most to predicting characters in specific instances. [sent-75, score-0.786]

36 In the case where we use output connections at the top layer only, we use an incremental layer-wise method to train the network, which was necessary to reach good performance. [sent-87, score-0.717]

37 We add layers one by one, and at all times an output layer exists only at the current top layer. [sent-88, score-0.836]

38 When adding a layer, the previous output weights are discarded and new output weights are initialised connecting from the new top layer. [sent-89, score-0.301]

39 In this way each layer has at least some time during training in which it is directly coupled to the output, and as such can be trained effectively. [sent-90, score-0.595]

40 Over the course of each of these training stages we used the same training strategy as described before: training the full network with BPTT and linearly reducing the learning rate to zero before a new layer is added. [sent-91, score-0.75]
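
A hedged sketch of that schedule; `add_layer`, `bptt_gradients`, `sgd_update`, and `sample_batch` are hypothetical helper names standing in for whatever training code is actually used:

```python
def linear_lr(eta0, step, total_steps):
    """Learning rate annealed linearly from eta0 down to zero within one stage."""
    return eta0 * (1.0 - step / float(total_steps))

def train_incremental(model, data, stage_steps, eta0):
    """Layers are added one at a time; the old output weights are discarded and
    fresh ones attached to the new top layer; the *full* network is then trained
    with BPTT for the whole stage."""
    for total_steps in stage_steps:          # one stage (number of updates) per added layer
        model.add_layer()                    # hypothetical: new top layer + new output weights
        for step in range(total_steps):
            grads = model.bptt_gradients(data.sample_batch())
            model.sgd_update(grads, lr=linear_lr(eta0, step, total_steps))
```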

41 Notice the difference to common layer-wise training schemes where only a single layer is trained at a time. [sent-92, score-0.628]

42 We always train the full network after each layer is added. [sent-93, score-0.641]

43 4 billion characters long, of which the final 10 million is used for testing. [sent-96, score-0.246]

44 5, except for the top layer of the DRNN-1O, where we picked η0 = 0. [sent-114, score-0.553]

45 The network with output connections only at the top layer had a different number of parameter updates per training stage, T = {0. [sent-117, score-0.88]

46 As such, for each additional layer the network is trained for more iterations. [sent-121, score-0.668]

47 All sequences were 250 characters long, and the first 50 characters were disregarded during the backwards pass, as they may have insufficient temporal context. [sent-123, score-0.5]
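
A minimal sketch of how such a burn-in can be handled when computing the per-sequence loss in bits per character; masking the loss (and hence the gradient) of the first 50 characters is one way to realise the description above, not necessarily the authors' exact implementation:

```python
import numpy as np

def sequence_bpc(char_probs, targets, burn_in=50):
    """Bits-per-character over one 250-character sequence, ignoring the first
    `burn_in` characters because they lack sufficient temporal context."""
    bits = np.array([-np.log2(p[t]) for p, t in zip(char_probs, targets)])
    return bits[burn_in:].mean()
```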

48 In [19] only 86 characters are used, but most of the additional characters in our set are exceedingly rare, such that cross-entropy is not affected meaningfully by this difference. [sent-134, score-0.342]

49 In our experience the networks are so large that there is very little difference in performance for different initialisations. The decision for 5 layers is based on a previous set of experiments (results not shown). [sent-135, score-0.255]

50 Figure 2: Increase in BPC on the test set from removing the output contribution of a single layer of the DRNN-AO. [sent-150, score-1.187]

51 Both DRNNs perform well and are roughly similar to the state-of-the-art for recurrent networks with the same number of trainable parameters, which was established with a multiplicative RNN (MRNN), trained with Hessian-free optimization in the course of 5 days on a cluster of 8 GPUs. [sent-153, score-0.375]

52 To check how each layer influences performance in the case of the DRNN-AO, we performed tests in which the output of a single layer is set to zero. [sent-156, score-1.102]

53 If for instance removing the top layer output contribution does not significantly harm performance, this essentially means that it is redundant (as it does no preprocessing for higher layers). [sent-158, score-0.704]

54 Furthermore, we can use this test to get an overall indication of what role a particular layer has in producing output. [sent-159, score-0.535]

55 Note that these experiments only have a limited interpretability, as the individual layer contributions are likely not independent. [sent-160, score-0.553]

56 Perhaps some layers provide a strong negative output bias which compensates for the strong positive bias of another, or strong synergies might exist between them. [sent-161, score-0.283]

57 First we measure the increase in test BPC by removing a single layer’s output contribution, which can then be used as an indicator for the importance of this layer for directly generating output. [sent-162, score-0.658]
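
A sketch of that ablation for the DRNN-AO readout; the per-time-step hidden states, the output weight matrices, and the bits-per-character definition used here are assumptions consistent with the description, not code from the paper:

```python
import numpy as np

def bpc(prob_seq, targets):
    """Bits-per-character: mean negative log2 probability assigned to the true characters."""
    return float(np.mean([-np.log2(p[t]) for p, t in zip(prob_seq, targets)]))

def ablated_probs(hidden_seq, U_list, drop_layer):
    """DRNN-AO output probabilities with one layer's output contribution set to zero."""
    probs = []
    for a in hidden_seq:                     # a: list of per-layer hidden states at one time step
        logits = sum(U_i @ a_i for i, (U_i, a_i) in enumerate(zip(U_list, a))
                     if i != drop_layer)
        e = np.exp(logits - logits.max())
        probs.append(e / e.sum())
    return probs

# Increase in BPC from zeroing layer k's contribution (cf. Figure 2):
#   bpc(ablated_probs(hidden_seq, U_list, k), targets) - bpc(full_probs, targets)
```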

58 The contribution of the top layer is the most important, and that of the bottom layer the second most important. [sent-164, score-1.15]

59 The intermediate layers contribute less to the direct output and seem to be more important in preprocessing the data for the top layer. [sent-165, score-0.335]

60 As in [19], we also used the networks in a generative mode, where we use the output probabilities of the DRNN-AO to recursively sample a new input character in order to complete a given sentence. [sent-166, score-0.315]
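
A sketch of that sampling loop; `model.step(one_hot)` is a hypothetical interface that advances the hidden states by one character and returns the softmax output distribution:

```python
import numpy as np

def complete_sentence(model, prompt, n_chars, alphabet, rng=np.random.default_rng(0)):
    """Warm the network up on the given sentence, then repeatedly sample the next
    character from the output distribution and feed it back in as the next input."""
    index = {c: i for i, c in enumerate(alphabet)}
    eye = np.eye(len(alphabet))
    p = None
    for c in prompt:                         # prompt is assumed to be non-empty
        p = model.step(eye[index[c]])
    generated = []
    for _ in range(n_chars):
        c = alphabet[rng.choice(len(alphabet), p=p)]
        generated.append(c)
        p = model.step(eye[index[c]])
    return prompt + "".join(generated)
```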

61 The left one is generated by the intact network, the middle one by leaving out the contribution of the first layer, and the right one by leaving out the contribution of the top layer. [sent-188, score-0.287]

62 [Figure 3 plot residue; recoverable labels: curves for the RNN, DRNN-1O and DRNN-AO and for layers 1 to 5; axes "normalised average distance", "average increase in BPC", and "nr. of presented characters".] [sent-189, score-5.073]

64 Figure 3: Left panel: normalised average distance between hidden states of a perturbed and unperturbed network as a function of presented characters. [sent-191, score-0.528]

65 The coloured full lines are for the individual layers of the DRNN-1O, and the coloured dashed lines are those of the layers of the DRNN-AO. [sent-193, score-0.484]

66 Right panel: Average increase in BPC between a perturbed and unperturbed network as a function of presented characters. [sent-195, score-0.251]

67 Coloured lines correspond to the individual contributions of the layers in the DRNN-AO. [sent-197, score-0.235]

68 The text sample of the intact network shows short-term correct grammar, phrases, punctuation and mostly existing words. [sent-200, score-0.332]

69 The text sample with the bottom layer output contribution disabled very rapidly becomes ‘unstable’, and starts to produce long strings of rare characters, indicating that the contribution of the bottom layer is essential in modeling some of the most basic statistics of the Wikipedia text corpus. [sent-201, score-1.652]

70 We verified this further by using such a random string of characters as initialization of the intact network, and observed that it consistently fell back to producing ‘normal’ text. [sent-202, score-0.288]

71 The text sample with the top layer disabled is interesting in the sense that it produces roughly word-length strings of common characters (letters and spaces), of which substrings resemble common syllables. [sent-203, score-1.022]

72 This suggests that the top layer output contribution captures text statistics longer than word-length sequences. [sent-204, score-0.846]

73 Time scales: In order to gauge at what time scale each individual layer operates, we have performed several experiments on the models. [sent-205, score-0.564]

74 First of all we considered an experiment in which we run the DRNN on two identical text sequences from the test set, but after 100 characters we introduce a typo in one of them (by replacing it by a character randomly sampled from the full set). [sent-206, score-0.585]

75 We record the hidden states after the typo as a function of time for both the perturbed and unperturbed network. [sent-207, score-0.457]
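
A sketch of that perturbation experiment; `make_model()`, `model.step(char)` and `model.hidden()` (returning a list of per-layer state vectors) are hypothetical interfaces, and the exact normalisation of the distance is an assumption:

```python
import numpy as np

def typo_perturbation(make_model, text, alphabet, typo_pos=100, rng=np.random.default_rng(0)):
    """Run two copies of the network on the same text, one with the character at
    typo_pos replaced by a randomly sampled character, and record the normalised
    distance between each layer's hidden states from that point on."""
    clean, perturbed = make_model(), make_model()
    typo = text[:typo_pos] + rng.choice(list(alphabet)) + text[typo_pos + 1:]
    dists = []
    for t, (c, cp) in enumerate(zip(text, typo)):
        clean.step(c)
        perturbed.step(cp)
        if t >= typo_pos:
            dists.append([np.linalg.norm(hp - h) / (np.linalg.norm(h) + 1e-12)
                          for h, hp in zip(clean.hidden(), perturbed.hidden())])
    return np.array(dists)                   # rows: characters after the typo, columns: layers
```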

76 Figure 4: Network output example for a particularly long phrase between parentheses (296 characters), sampled from the test set. [sent-211, score-0.592]

77 The vertical dashed lines indicate the opening and closing parentheses in the input text sequence. [sent-212, score-0.494]

78 Top panel: output traces for the closing parenthesis character for each layer in the DRNN-AO. [sent-213, score-1.028]

79 Bottom panel: total predicted output probability of the closing parenthesis sign of the DRNN-AO. [sent-215, score-0.408]

80 In order to do so we measured the average difference in BPC between normal text and a perturbed copy, in which we replaced the first 100 characters by text randomly sampled from elsewhere in the test set. [sent-218, score-0.537]
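
A sketch of that measurement; `eval_bits(text)` is a hypothetical helper returning the per-character negative log2 probabilities the network assigns to `text`, and the fragment counts and lengths are illustrative:

```python
import numpy as np

def context_swap_bpc(eval_bits, corpus, n_fragments=50, frag_len=250, swap_len=100,
                     rng=np.random.default_rng(0)):
    """Average increase in BPC, per position after the swapped context, when the
    first swap_len characters of a fragment are replaced by text taken from
    elsewhere in the corpus."""
    diffs = []
    for _ in range(n_fragments):
        i = rng.integers(0, len(corpus) - frag_len)
        frag = corpus[i:i + frag_len]
        j = rng.integers(0, len(corpus) - swap_len)
        perturbed = corpus[j:j + swap_len] + frag[swap_len:]
        diffs.append(np.asarray(eval_bits(perturbed))[swap_len:]
                     - np.asarray(eval_bits(frag))[swap_len:])
    return np.mean(diffs, axis=0)            # BPC increase vs. characters after the swap
```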

81 The left panel shows how fast each individual layer in the DRNNs forgets the typo-perturbation. [sent-222, score-0.605]

82 The DRNN-AO has very short time-scales in the three bottom layers and longer memory only appears for the two top ones, whereas in the DRNN-1O, the bottom two layers have relatively short time scales, but the top three layers have virtually the same, very long time scale. [sent-224, score-0.862]

83 This is almost certainly caused by the way in which we trained the DRNN-1O, such that intermediate layers already assumed long memory when they were at the top of the hierarchy. [sent-225, score-0.379]

84 The time scales of the individual layers of the DRNN-AO are also depicted (by using the perturbed hidden states of an individual layer and the unperturbed states for the other layers for generating output), which largely confirms the result from the typo-perturbation test. [sent-230, score-1.106]

85 We have also performed a test to see what the time scales of an untrained DRNN are (by performing the typo test), which showed that here the differences in time-scales for each layer were far smaller (results not shown). [sent-232, score-0.662]

86 Long-term interactions (parentheses): In order to get a clearer picture of some of the long-term dependencies the DRNNs have learned, we look at their capability of closing parentheses, even when the phrase between parentheses is long. [sent-234, score-0.473]

87 To see how well the networks remember the opening of a parenthesis, we observe the DRNN-AO output for the closing parenthesis character. [sent-235, score-0.414]

88 We show both the output probability and the individual layers' output contributions; results on the DRNN-1O are qualitatively similar. [sent-237, score-0.228]

89 The per-layer contributions for the closing parenthesis are taken before they are added up and sent to the softmax function. [sent-238, score-0.409]

90 The output of the top layer for the closing parenthesis is increased strongly for the whole duration of the phrase, and is reduced immediately after it is closed. [sent-239, score-0.961]

91 The total output probability shows a similar pattern, with momentarily high probabilities for the closing parenthesis during the parenthesized phrase and extremely low probabilities elsewhere. [sent-240, score-0.441]

92 When several sentences appear between parentheses (which occasionally happens in the text corpus), the network reduces the closing bracket probability (i. [sent-242, score-0.569]

93 Similarly, if a sentence starts with an opening bracket it will not increase closing parenthesis probability at all, essentially ignoring it. [sent-245, score-0.452]

94 The fact that the DRNN is able to remember the opening parenthesis for sequences longer than it has been trained on indicates that it has learned to model parentheses as a pseudo-stable attractor-like state, rather than memorizing parenthesized phrases of different lengths. [sent-247, score-0.484]

95 A test is deemed unsuccessful if the closing parenthesis does not appear within 500 characters, or if the network produces a second opening parenthesis. [sent-249, score-0.423]
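
A sketch of that success criterion; `sample_chars(model, prompt)` is a hypothetical generator that yields one sampled character at a time, starting from a prompt that ends just after an opening parenthesis:

```python
from itertools import islice

def parenthesis_test(model, prompt, sample_chars, max_chars=500):
    """Deemed successful only if a closing parenthesis appears within max_chars
    sampled characters and no second opening parenthesis is produced first."""
    for c in islice(sample_chars(model, prompt), max_chars):
        if c == ")":
            return True
        if c == "(":
            return False
    return False
```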

96 The results presented in this section hint at the fact that DRNNs might find it easier than common RNNs to learn long-term relations between input characters. [sent-253, score-0.28]

97 We also present experimental evidence for the emergence of a hierarchy of time scales in the layers of the DRNNs. [sent-260, score-0.279]

98 Finally, we have demonstrated that in certain cases the DRNNs can have extensive memory spanning several hundred characters. [sent-261, score-0.283]

99 Another one is to extend common pre-training schemes, such as the deep belief network approach [9] and deep auto-encoders [10, 20], to DRNNs. [sent-263, score-0.344]

100 Sequence labelling in structured domains with hierarchical recurrent neural networks. [sent-315, score-0.222]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('layer', 0.501), ('rnn', 0.333), ('drnns', 0.259), ('characters', 0.223), ('bpc', 0.203), ('drnn', 0.203), ('recurrent', 0.193), ('layers', 0.183), ('closing', 0.161), ('parenthesis', 0.147), ('rnns', 0.147), ('character', 0.119), ('network', 0.117), ('text', 0.117), ('parentheses', 0.111), ('output', 0.1), ('deep', 0.097), ('hierarchy', 0.096), ('typo', 0.092), ('phrase', 0.09), ('opening', 0.081), ('networks', 0.072), ('unperturbed', 0.065), ('intact', 0.065), ('dnn', 0.06), ('memory', 0.06), ('trainable', 0.06), ('temporal', 0.054), ('wikipedia', 0.052), ('top', 0.052), ('contribution', 0.051), ('trained', 0.05), ('softmax', 0.05), ('life', 0.049), ('initialised', 0.049), ('architecture', 0.047), ('perturbed', 0.046), ('coloured', 0.045), ('dnns', 0.045), ('folded', 0.045), ('fading', 0.045), ('bottom', 0.045), ('speech', 0.044), ('training', 0.044), ('panel', 0.043), ('connections', 0.041), ('bracket', 0.04), ('normalised', 0.04), ('architectures', 0.037), ('hidden', 0.037), ('phrases', 0.037), ('bptt', 0.037), ('disabled', 0.037), ('lstm', 0.037), ('mrnn', 0.037), ('paq', 0.037), ('scales', 0.035), ('hierarchical', 0.035), ('long', 0.034), ('leaving', 0.034), ('test', 0.034), ('common', 0.033), ('forgets', 0.033), ('punctuation', 0.033), ('graves', 0.033), ('echo', 0.033), ('ghent', 0.033), ('parenthesized', 0.033), ('language', 0.032), ('martens', 0.03), ('timit', 0.03), ('perturbation', 0.03), ('neural', 0.029), ('ens', 0.028), ('consortium', 0.028), ('older', 0.028), ('individual', 0.028), ('wi', 0.028), ('rare', 0.027), ('theano', 0.027), ('ai', 0.027), ('passes', 0.027), ('context', 0.026), ('prize', 0.026), ('strings', 0.026), ('longer', 0.025), ('tasks', 0.025), ('updates', 0.025), ('inde', 0.025), ('contributions', 0.024), ('zi', 0.024), ('tanh', 0.024), ('input', 0.024), ('sentences', 0.023), ('suspect', 0.023), ('corpus', 0.023), ('million', 0.023), ('increase', 0.023), ('train', 0.023), ('mohamed', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

Author: Michiel Hermans, Benjamin Schrauwen

Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1

2 0.27233282 251 nips-2013-Predicting Parameters in Deep Learning

Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas

Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1

3 0.23008844 331 nips-2013-Top-Down Regularization of Deep Belief Networks

Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim

Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1

4 0.21123239 75 nips-2013-Convex Two-Layer Modeling

Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans

Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1

5 0.19447953 5 nips-2013-A Deep Architecture for Matching Short Texts

Author: Zhengdong Lu, Hang Li

Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. 1

6 0.17619461 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

7 0.14003336 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

8 0.13485058 64 nips-2013-Compete to Compute

9 0.13151756 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

10 0.11374186 226 nips-2013-One-shot learning by inverting a compositional causal process

11 0.10961588 84 nips-2013-Deep Neural Networks for Object Detection

12 0.10400105 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality

13 0.088583879 30 nips-2013-Adaptive dropout for training deep neural networks

14 0.086170167 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines

15 0.085817635 160 nips-2013-Learning Stochastic Feedforward Neural Networks

16 0.083339117 267 nips-2013-Recurrent networks of coupled Winner-Take-All oscillators for solving constraint satisfaction problems

17 0.080342561 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

18 0.073992223 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion

19 0.073837399 339 nips-2013-Understanding Dropout

20 0.065342851 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.16), (1, 0.082), (2, -0.174), (3, -0.118), (4, 0.084), (5, -0.167), (6, -0.081), (7, 0.058), (8, 0.038), (9, -0.168), (10, 0.186), (11, -0.002), (12, -0.008), (13, 0.048), (14, 0.047), (15, 0.036), (16, 0.04), (17, 0.001), (18, -0.026), (19, -0.024), (20, 0.036), (21, -0.124), (22, 0.08), (23, 0.035), (24, 0.069), (25, 0.06), (26, -0.178), (27, -0.009), (28, 0.08), (29, 0.026), (30, -0.007), (31, -0.025), (32, -0.046), (33, -0.061), (34, -0.017), (35, 0.129), (36, -0.034), (37, 0.019), (38, -0.017), (39, -0.033), (40, 0.05), (41, -0.052), (42, -0.008), (43, -0.03), (44, -0.007), (45, 0.044), (46, 0.001), (47, -0.111), (48, -0.111), (49, -0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9735868 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

Author: Michiel Hermans, Benjamin Schrauwen

Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1

2 0.82882428 251 nips-2013-Predicting Parameters in Deep Learning

Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas

Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1

3 0.82444739 331 nips-2013-Top-Down Regularization of Deep Belief Networks

Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim

Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1

4 0.72892594 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

Author: Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

Abstract: As massively parallel computations have become broadly available with modern GPUs, deep architectures trained on very large datasets have risen in popularity. Discriminatively trained convolutional neural networks, in particular, were recently shown to yield state-of-the-art performance in challenging image classification benchmarks such as ImageNet. However, elements of these architectures are similar to standard hand-crafted representations used in computer vision. In this paper, we explore the extent of this analogy, proposing a version of the stateof-the-art Fisher vector image encoding that can be stacked in multiple layers. This architecture significantly improves on standard Fisher vectors, and obtains competitive results with deep convolutional networks at a smaller computational learning cost. Our hybrid architecture allows us to assess how the performance of a conventional hand-crafted image classification pipeline changes with increased depth. We also show that convolutional networks and Fisher vector encodings are complementary in the sense that their combination further improves the accuracy. 1

5 0.69042206 64 nips-2013-Compete to Compute

Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber

Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1

6 0.60381633 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

7 0.59100562 5 nips-2013-A Deep Architecture for Matching Short Texts

8 0.57596713 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines

9 0.5479582 75 nips-2013-Convex Two-Layer Modeling

10 0.54569954 30 nips-2013-Adaptive dropout for training deep neural networks

11 0.54494798 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

12 0.51781636 160 nips-2013-Learning Stochastic Feedforward Neural Networks

13 0.51707619 85 nips-2013-Deep content-based music recommendation

14 0.51123357 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

15 0.50074321 84 nips-2013-Deep Neural Networks for Object Detection

16 0.48418358 267 nips-2013-Recurrent networks of coupled Winner-Take-All oscillators for solving constraint satisfaction problems

17 0.47721592 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

18 0.45782623 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

19 0.45303476 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality

20 0.42396879 264 nips-2013-Reciprocally Coupled Local Estimators Implement Bayesian Information Integration Distributively


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.205), (16, 0.036), (33, 0.194), (34, 0.071), (36, 0.011), (41, 0.024), (43, 0.012), (49, 0.04), (56, 0.079), (70, 0.065), (85, 0.029), (89, 0.025), (93, 0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86516881 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

Author: Michiel Hermans, Benjamin Schrauwen

Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1

2 0.78543323 354 nips-2013-When in Doubt, SWAP: High-Dimensional Sparse Recovery from Correlated Measurements

Author: Divyanshu Vats, Richard Baraniuk

Abstract: We consider the problem of accurately estimating a high-dimensional sparse vector using a small number of linear measurements that are contaminated by noise. It is well known that standard computationally tractable sparse recovery algorithms, such as the Lasso, OMP, and their various extensions, perform poorly when the measurement matrix contains highly correlated columns. We develop a simple greedy algorithm, called SWAP, that iteratively swaps variables until a desired loss function cannot be decreased any further. SWAP is surprisingly effective in handling measurement matrices with high correlations. We prove that SWAP can easily be used as a wrapper around standard sparse recovery algorithms for improved performance. We theoretically quantify the statistical guarantees of SWAP and complement our analysis with numerical results on synthetic and real data.

3 0.7813707 156 nips-2013-Learning Kernels Using Local Rademacher Complexity

Author: Corinna Cortes, Marius Kloft, Mehryar Mohri

Abstract: We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks. 1

4 0.7602849 331 nips-2013-Top-Down Regularization of Deep Belief Networks

Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim

Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1

5 0.75607967 30 nips-2013-Adaptive dropout for training deep neural networks

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1

6 0.75530708 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

7 0.74832469 251 nips-2013-Predicting Parameters in Deep Learning

8 0.74754441 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

9 0.74705786 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

10 0.74701971 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

11 0.74535334 64 nips-2013-Compete to Compute

12 0.74443632 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

13 0.74415433 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

14 0.74402928 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

15 0.74279302 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

16 0.74229819 335 nips-2013-Transfer Learning in a Transductive Setting

17 0.74138534 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality

18 0.7390554 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

19 0.73829818 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

20 0.73740858 301 nips-2013-Sparse Additive Text Models with Low Rank Background