nips nips2008 nips2008-237 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ilya Sutskever, Geoffrey E. Hinton, Graham W. Taylor
Abstract: The Temporal Restricted Boltzmann Machine (TRBM) is a probabilistic model for sequences that is able to successfully model (i.e., generate nice-looking samples of) several very high dimensional sequences, such as motion capture data and the pixels of low resolution videos of balls bouncing in a box. The major disadvantage of the TRBM is that exact inference is extremely hard, since even computing a Gibbs update for a single variable of the posterior is exponentially expensive. This difficulty has necessitated the use of a heuristic inference procedure, that nonetheless was accurate enough for successful learning. In this paper we introduce the Recurrent TRBM, which is a very slight modification of the TRBM for which exact inference is very easy and exact gradient learning is almost tractable. We demonstrate that the RTRBM is better than an analogous TRBM at generating motion capture and videos of bouncing balls. 1
Reference: text
sentIndex sentText sentNum sentScore
1 (i.e., generate nice-looking samples of) several very high dimensional sequences, such as motion capture data and the pixels of low resolution videos of balls bouncing in a box. [sent-5, score-0.359]
2 The major disadvantage of the TRBM is that exact inference is extremely hard, since even computing a Gibbs update for a single variable of the posterior is exponentially expensive. [sent-6, score-0.133]
3 This difficulty has necessitated the use of a heuristic inference procedure, that nonetheless was accurate enough for successful learning. [sent-7, score-0.079]
4 In this paper we introduce the Recurrent TRBM, which is a very slight modification of the TRBM for which exact inference is very easy and exact gradient learning is almost tractable. [sent-8, score-0.142]
5 We demonstrate that the RTRBM is better than an analogous TRBM at generating motion capture and videos of bouncing balls. [sent-9, score-0.277]
6 It was shown to be able to generate realistic motion capture data [14], and low resolution videos of 2 balls bouncing in a box [13], as well as complete and denoise such sequences. [sent-14, score-0.33]
7 As a probabilistic model, the TRBM is a directed graphical model consisting of a sequence of Restricted Boltzmann Machines (RBMs) [3], where the state of one or more previous RBMs determines the biases of the RBM in the next timestep. [sent-15, score-0.105]
8 This probabilistic formulation straightforwardly implies a learning procedure where approximate inference is followed by learning. [sent-16, score-0.071]
9 Exact inference in TRBMs, on the other hand, is highly non-trivial, since computing even a single Gibbs update requires computing the ratio of two RBM partition functions. [sent-18, score-0.11]
10 The approximate inference procedure used in [13] was heuristic and was not even derived from a variational principle. [sent-19, score-0.071]
11 Despite the similarity, exact inference is very easy in the RTRBM and computing the gradient of the log likelihood is feasible (up to the error introduced by the use of Contrastive Divergence). [sent-21, score-0.183]
12 We demonstrate that the RTRBM is able to generate more realistic samples than an equivalent TRBM for the motion capture data and for the pixels of videos of bouncing balls. [sent-22, score-0.306]
13 In general, we use i to index visible vectors V and j to index hidden vectors H. [sent-31, score-0.136]
14 Using this equation (Eq. 2) does not change the form of the gradients or the conditional distribution P(H|v). [sent-33, score-0.066]
15 Computing the exact values of the expectations ⟨·⟩_{P(H,V)} is computationally intractable, and much work has been done on methods for computing approximate values for the expectations that are good enough for practical learning and inference tasks (e. [sent-37, score-0.095]
16 We will approximate the gradients with respect to the RBM’s parameters using the Contrastive Divergence [3] learning procedure, CDn , whose updates are computed by the following algorithm. [sent-40, score-0.063]
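For concreteness, here is a minimal NumPy sketch of a CD-n update for a binary RBM on a single training vector; the function and variable names (cd_n_update, b_v, b_h) are illustrative choices, not the authors' code, and a mini-batch version would average the statistics over several vectors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_n_update(v0, W, b_v, b_h, n=1, lr=1e-3, rng=np.random):
    """One CD-n parameter update of a binary RBM from a single visible vector v0.

    W has shape (n_visible, n_hidden); b_v and b_h are the visible and hidden biases.
    """
    h0 = sigmoid(W.T @ v0 + b_h)                 # positive-phase hidden probabilities
    v, h = v0, h0
    for _ in range(n):                           # n steps of block Gibbs sampling
        h_s = (rng.random_sample(h.shape) < h).astype(float)
        v = sigmoid(W @ h_s + b_v)               # reconstruction probabilities
        h = sigmoid(W.T @ v + b_h)
    # CD-n gradient estimate: data statistics minus reconstruction statistics.
    dW = np.outer(v0, h0) - np.outer(v, h)
    return W + lr * dW, b_v + lr * (v0 - v), b_h + lr * (h0 - h)
```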
17 The RBM also plays a critical role in deep belief networks [4], [5], but we do not use this connection in this paper. [sent-46, score-0.051]
18 The TRBM, as described in the introduction, is a sequence of RBMs arranged in such a way that in any given timestep, the RBM’s biases depend only on the state of the RBM in the previous timestep. [sent-48, score-0.052]
19 Figure 1: The graphical structure of a TRBM: a directed sequence of RBMs. [sent-50, score-0.056]
20 The TRBM defines a probability distribution P(V_1^T = v_1^T, H_1^T = h_1^T) by the equation P(v_1^T, h_1^T) = ∏_{t=2}^T P(v_t, h_t | h_{t−1}) · P_0(v_1, h_1) (4), which is identical to the defining equation of the HMM. [sent-56, score-0.629]
21 The conditional distribution P (Vt , Ht |ht−1 ) is that of an RBM, whose biases for Ht are a function of ht−1 . [sent-57, score-0.053]
22 Specifically, P(v_t, h_t | h_{t−1}) = exp(v_t^⊤ b_V + v_t^⊤ W h_t + h_t^⊤ (b_H + W' h_{t−1})) / Z(h_{t−1}) (5), where b_V, b_H and W are as in Eq. [sent-58, score-1.76]
23 5, except that the (undefined) term W' h_0 is replaced by the term b_init, so the hidden units receive a special initial bias at P_0; we will often write P(V_1, H_1 | h_0) for P_0(V_1, H_1) and W' h_0 for b_init. [sent-66, score-0.231]
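To make the role of h_{t−1} concrete, the following is a minimal NumPy sketch of the per-timestep conditionals implied by Eq. 5, with b_init standing in for the undefined W' h_0 at t = 1; all function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_bias(h_prev, W_prime, b_h, b_init):
    """Dynamic hidden bias b_H + W' h_{t-1} of the RBM at time t (b_init at t = 1)."""
    return b_h + (b_init if h_prev is None else W_prime @ h_prev)

def p_h_given_v(v_t, h_prev, W, W_prime, b_h, b_init):
    """Factorial conditional P(H_t = 1 | v_t, h_{t-1}); W has shape (n_visible, n_hidden)."""
    return sigmoid(W.T @ v_t + hidden_bias(h_prev, W_prime, b_h, b_init))

def p_v_given_h(h_t, W, b_v):
    """Factorial conditional P(V_t = 1 | h_t); the visible bias does not depend on time."""
    return sigmoid(W @ h_t + b_v)
```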
24 It follows from these equations that the TRBM is a directed graphical model that has an (undirected) RBM at each timestep (a related directed sequence of Boltzmann Machines has been considered in [7]). [sent-67, score-0.125]
25 As in most probabilistic models, the weight update is computed by solving the inference problem and computing the weight update as if the inferred variables were observed. [sent-68, score-0.164]
26 If the hidden variables are observed, equation 4 implies that the gradient of the log likelihood with respect to the TRBM's parameters is ∑_{t=1}^T ∇ log P(v_t, h_t | h_{t−1}), and each term, being the gradient of the log likelihood of an RBM, can be approximated using CD_n. [sent-70, score-0.81]
27 Inference in a TRBM. Unfortunately, the TRBM's inference problem is harder than that of a typical undirected graphical model, because even computing the probability P(H_t^(j) = 1 | everything else) involves evaluating the exact ratio of two RBM partition functions, which can be seen from Eq. [sent-72, score-0.116]
28 This difficulty necessitated the use of a heuristic inference procedure [13], which is based on the observation that the distribution P(H_t | h_{t−1}, v_1^t) = P(H_t | h_{t−1}, v_t) is factorial by definition. [sent-74, score-0.458]
29 The statement h ∼ P'(H) means that h is sampled from the factorial distribution P'(H), so each h^(j) is set to 1 with its corresponding probability. (Footnote 2: this is a slightly simplified description of the inference procedure in [13].) [sent-78, score-0.094]
30 The conditional distribution Q(V_t, H'_t | h_{t−1}) is given by the equation Q(v_t, h'_t | h_{t−1}) = exp(v_t^⊤ W h'_t + v_t^⊤ b_V + h'_t^⊤ (b_H + W' h_{t−1})) / Z(h_{t−1}), which is essentially the same as the TRBM's conditional distribution P from equation 5. [sent-81, score-2.393]
31 The symbol P stands for the distribution of some TRBM, while the symbol Q stands for the distribution defined by an RTRBM. [sent-86, score-0.058]
32 Note that the outcome of the operation h_t ← P(H_t | v_t, h_{t−1}) is s(W v_t + W' h_{t−1} + b_H). [sent-87, score-0.355]
33 To sample from a TRBM P , we need to perform a directed pass, sampling from each RBM on every timestep. [sent-91, score-0.049]
34 sample h_t ∼ P(H_t | v_t, h_{t−1}), where step 1 requires sampling from the marginals of a Boltzmann Machine (by integrating out H_t), which involves running a Markov chain. [sent-95, score-0.565]
35 By definition, RTRBMs and TRBMs are parameterized in the same way, so from now on we will assume that P and Q have identical parameters, which are W, W ′ , bV , bH , and binit . [sent-96, score-0.074]
36 set h_t ← P(H_t | v_t, h_{t−1}). We can infer that Q(V_t | h_{t−1}) = P(V_t | h_{t−1}) because of step 1 in Algorithm 3, which is also consistent with the equation given in figure 2 where H'_t is integrated out. [sent-100, score-0.571]
37 The difference may seem small, since the operations ht ∼ P (Ht |vt , ht−1 ) and ht ← P (Ht |vt , ht−1 ) appear similar. [sent-102, score-1.082]
38 However, this difference significantly alters the inference and learning procedures of the RTRBM; in particular, it can already be seen that Ht are real-valued for the RTRBM. [sent-103, score-0.044]
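For concreteness, here is a hedged NumPy sketch of sequence generation in the spirit of Algorithm 3: at each step v_t is approximately sampled from the conditional RBM by a short Gibbs chain, and h_t is then set deterministically rather than sampled; the chain length and all names are illustrative choices, not the authors' settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_rtrbm(T, W, W_prime, b_v, b_h, b_init, gibbs_steps=25, rng=np.random):
    """Generate T binary frames; W has shape (n_visible, n_hidden)."""
    n_v, n_h = W.shape
    frames, h_prev = [], None
    for t in range(T):
        bias_h = b_h + (b_init if h_prev is None else W_prime @ h_prev)
        # Step 1: approximate a sample from P(V_t | h_{t-1}) with a short Gibbs chain.
        v = (rng.random_sample(n_v) < 0.5).astype(float)
        for _ in range(gibbs_steps):
            h = (rng.random_sample(n_h) < sigmoid(W.T @ v + bias_h)).astype(float)
            v = (rng.random_sample(n_v) < sigmoid(W @ h + b_v)).astype(float)
        frames.append(v)
        # Step 2: the next hidden state is the deterministic mean, not a sample.
        h_prev = sigmoid(W.T @ v + bias_h)
    return np.stack(frames)
```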
39 The reason inference is easy is similar to the reason inference in square ICAs is easy [1]: there is a unique and easily computable value of the hidden variables that has nonzero posterior probability. [sent-108, score-0.232]
40 Any other value for h1 is never produced by a generative process that outputs v1 and thus has posterior probability 0. [sent-111, score-0.056]
41 As before, since v2 was produced at the end of step 1, the fact that step 2 has been executed implies that h2 can be computed by h2 ← P(H_2 | v_2, h_1) (recall that at this point h1 is known with absolute certainty). [sent-116, score-0.057]
42 If the same reasoning is repeated t times, then all of h_1^t is uniquely determined and easily computed when v_1^t is known. [sent-117, score-0.56]
43 This is because Q(H_t | v_t, h_{t−1}) = δ_{s(W v_t + b_H + W' h_{t−1})}(H_t). [sent-119, score-0.339]
44 The resulting inference algorithm is simple: Algorithm 4 (inference in RTRBMs) for 1 ≤ t ≤ T : 1. [sent-120, score-0.044]
45 h_t ← P(H_t | v_t, h_{t−1}). Let h(v)_1^T denote the output of the inference algorithm on input v_1^T, in which case the posterior is described by Q(H_1^T | v_1^T) = δ_{h(v)_1^T}(H_1^T). [sent-121, score-0.601]
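The exact inference procedure of Algorithm 4 amounts to a single deterministic forward sweep; a minimal NumPy sketch, using the same illustrative parameter names as the sketches above, is:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rtrbm_infer(V, W, W_prime, b_h, b_init):
    """Exact RTRBM inference: V has shape (T, n_visible); returns h(v) of shape (T, n_hidden)."""
    H, h_prev = [], None
    for v_t in V:
        bias_h = b_h + (b_init if h_prev is None else W_prime @ h_prev)
        h_prev = sigmoid(W.T @ v_t + bias_h)     # h_t <- P(H_t | v_t, h_{t-1})
        H.append(h_prev)
    return np.stack(H)
```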
46 2 Learning in RTRBMs. Learning in RTRBMs may seem easy once inference is solved, since the main difficulty in learning TRBMs is the inference problem. [sent-123, score-0.112]
47 To be precise, the gradient ∇ log Q(v_1^T, h_1^T) is undefined because δ_{s(W' h_{t−1} + b_H + W v_t)}(h_t) is not, in general, a continuous function of W. [sent-125, score-0.371]
48 Notice that the RTRBM's log probability satisfies log Q(v_1^T) = ∑_{t=1}^T log Q(v_t | v_1^{t−1}), so we could try computing the sum ∇ ∑_{t=1}^T log Q(v_t | v_1^{t−1}). [sent-127, score-0.188]
49 The key observation that makes the computation feasible is the equation Q(V_t | v_1^{t−1}) = Q(V_t | h(v)_{t−1}) (8), where h(v)_{t−1} is the value computed by the RTRBM inference algorithm with inputs v_1^{t−1}. [sent-128, score-0.093]
50 The equality Q(V_t | v_1^{t−1}) = Q(V_t | h(v)_{t−1}) allows us to define a recurrent neural network (RNN) [10] whose parameters are identical to those of the RTRBM, and whose cost function is equal to the log likelihood of the RTRBM. [sent-131, score-0.117]
51 This is useful because it is easy to compute gradients with respect to the RNN’s parameters using the backpropagation through time algorithm [10]. [sent-132, score-0.092]
52 The RNN has a pair of variables at each timestep, {(v_t, r_t)}_{t=1}^T, where v_t are the input variables and r_t are the RNN's hidden variables (all of which are deterministic). [sent-133, score-0.741]
53 The hiddens r_1^T are computed by the equation r_t = s(W v_t + b_H + W' r_{t−1}) (9), where W' r_{t−1} is replaced with b_init when t = 1. [sent-134, score-0.593]
54 This is a valid definition of an RNN whose cumulative objective for the sequence v_1^T is O = ∑_{t=1}^T log Q(v_t | r_{t−1}) (10), where Q(v_1 | r_0) = Q_0(v_1). [sent-137, score-0.057]
55 But since r_t as computed in equation 9 on input v_1^t is identical to h(v)_t, the equality log Q(v_t | r_{t−1}) = log Q(v_t | v_1^{t−1}) holds. [sent-138, score-0.298]
56 Substituting into Eq. 10 yields O = ∑_{t=1}^T log Q(v_t | r_{t−1}) = ∑_{t=1}^T log Q(v_t | v_1^{t−1}) = log Q(v_1^T) (11), which is the log probability of the corresponding RTRBM. [sent-140, score-0.168]
57 T This means that ∇O = ∇ log Q(v1 ) can be computed with the backpropagation through time algorithm [10], where the contribution of the gradient from each timestep is computed with Contrastive Divergence. [sent-141, score-0.193]
58 3 Details of the backpropagation through time algorithm. The backpropagation through time algorithm is identical to the usual backpropagation algorithm where the feedforward neural network is turned "on its side". [sent-143, score-0.141]
59 Specifically, the algorithm maintains a term ∂O/∂r_t which is computed from ∂O/∂r_{t+1} and ∂ log Q(v_{t+1} | r_t)/∂r_t using the chain rule, by the equation ∂O/∂r_t = W'^⊤ (r_{t+1}.(1 − r_{t+1}).∂O/∂r_{t+1}) + W'^⊤ ∂ log Q(v_{t+1} | r_t)/∂b_H (12), where a.b denotes componentwise multiplication, the factor r_{t+1}.(1 − r_{t+1}) arises from the derivative of the logistic function s'(x) = s(x).(1 − s(x)), and ∂ log Q(v_{t+1} | r_t)/∂b_H is computed by CD. [sent-144, score-0.091] [sent-146, score-0.042] [sent-148, score-0.159] [sent-149, score-0.061]
63 Once ∂O/∂r_t is computed for all t, the gradients of the parameters can be computed using the following equations: Eq. 13 gives ∂O/∂W' in terms of the terms r_t.(1 − r_t).∂O/∂r_t and r_{t−1}, and ∂O/∂W = ∑_{t=1}^T (r_t.(1 − r_t).∂O/∂r_t) v_t^⊤ + ∑_{t=1}^T ∂ log Q(v_t | r_{t−1})/∂W (14). The first summation in Eq. [sent-150, score-0.069] [sent-154, score-0.042]
65 14 arises from the use of W as weights for inference for computing rt and the second summation arises from the use of W as RBM parameters for computing log Q(vt |rt−1 ). [sent-155, score-0.321]
66 Each term of the form ∂ log Q(vt+1 |rt )/∂W is also computed with CD. [sent-156, score-0.061]
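Under the conventions of Eq. 12 and Eq. 14 as reconstructed above, a minimal NumPy sketch of the backward pass might look as follows; the arrays dlogQ_dbh and dlogQ_dW are assumed to hold CD estimates of ∂ log Q(v_{t+1}|r_t)/∂b_H and ∂ log Q(v_t|r_{t−1})/∂W (the CD routine itself is not shown), all names are illustrative rather than the authors' code, and the W' gradient of Eq. 13 is omitted but would accumulate analogously from dO_dr.

```python
import numpy as np

def rtrbm_bptt_w_grad(V, R, W_prime, dlogQ_dbh, dlogQ_dW):
    """Sketch of backpropagation through time for O = sum_t log Q(v_t | r_{t-1}).

    V, R         : arrays of shape (T, n_v) and (T, n_h) from the forward pass.
    dlogQ_dbh[t] : CD estimate of d log Q(V[t+1] | R[t]) / d b_H, for t = 0..T-2.
    dlogQ_dW[t]  : CD estimate of d log Q(V[t]   | R[t-1]) / d W (shape (n_v, n_h)).
    Returns the gradient of O with respect to W and the dO/dr_t terms.
    """
    T, n_h = R.shape
    dO_dr = np.zeros((T, n_h))                   # r_T does not affect O, so its row stays 0
    for t in range(T - 2, -1, -1):               # Eq. 12, run backwards through time
        back = R[t + 1] * (1.0 - R[t + 1]) * dO_dr[t + 1]    # logistic derivative term
        dO_dr[t] = W_prime.T @ (back + dlogQ_dbh[t])
    # Eq. 14: first sum from W's use in the recurrence, second from its use as RBM weights.
    dW = sum(np.outer(V[t], R[t] * (1.0 - R[t]) * dO_dr[t]) for t in range(T))
    dW = dW + sum(dlogQ_dW)
    return dW, dO_dr
```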
67 It is also seen that the gradient would be computed exactly if CD were to return the exact gradient of the RBM’s log probability. [sent-159, score-0.146]
68 The results in [14, 13] were obtained using TRBMs that had several delay-taps, which means that each hidden unit could directly observe several previous timesteps. [sent-161, score-0.067]
69 To demonstrate that the RTRBM learns to use the hidden units to store information, we did not use delay-taps for either the RTRBM or the TRBM, which causes the results to be worse (though not by much) than in [14, 13]. [sent-162, score-0.117]
70 In all experiments, the RTRBM and the TRBM had the same number of hidden units, their parameters were initialized in the same manner, and they were trained for the same number of weight updates. [sent-164, score-0.107]
71 When sampling from the TRBM, we would use the sampling procedure of the RTRBM using the TRBM’s parameters to eliminate the additional noise from its hidden units. [sent-165, score-0.119]
72 Unfortunately, the evaluation metric is entirely qualitative since computing the log probability on a test set is infeasible for both the TRBM and the RTRBM. [sent-167, score-0.062]
73 Figure 3: This figure shows the receptive fields of the first 36 hidden units of the RTRBM on the left, and the corresponding hidden-to-hidden weights between these units on the right: the ith row on the right corresponds to the ith receptive field on the left, when counted left-to-right. [sent-169, score-0.205]
74 Hidden units 18 and 19 exhibit unusually strong hidden-to-hidden connections; they are also the ones with the weakest visible-hidden connections, which effectively makes them belong to another hidden layer. [sent-170, score-0.117]
75 1 Videos of bouncing balls. We used a dataset consisting of videos of 3 balls bouncing in a box. [sent-172, score-0.366]
76 The videos are of length 100 and of resolution 30×30. [sent-173, score-0.142]
77 The task is to learn to generate videos at the pixel level. [sent-175, score-0.128]
78 Both the RTRBM and the TRBM had 400 hidden units. [sent-179, score-0.067]
79 Samples from these models are provided as videos 1,2 (RTRBM) and videos 3,4 (TRBM). [sent-180, score-0.256]
80 The real-values in the videos are the conditional probabilities of the pixels [13]. [sent-183, score-0.157]
81 The RTRBM’s samples are noticeably better than the TRBM’s samples; a key difference between these samples is that the balls produced by the TRBM moved in a random walk, while those produced by the RTRBM moved in a more persistent direction. [sent-184, score-0.153]
82 An examination of the visible to hidden connection weights of the RTRBM reveals a number of hidden units that are not connected to visible units. [sent-185, score-0.308]
83 These units have the most active hidden to hidden connections, which must be used to propagate information through time. [sent-186, score-0.184]
84 In particular, these units are the only units that do not have a strong self connection (i.e., a strong diagonal weight in W'). [sent-187, score-0.116]
85 No such separation of units is found in the TRBM and all its hidden units have large visible to hidden connections. [sent-190, score-0.281]
86 2 Motion capture data. We used a dataset that represents human motion capture data by sequences of joint angles, translations, and rotations of the base of the spine [14]. [sent-192, score-0.103]
87 Each frame has 49 dimensions, and both models have 200 hidden units. [sent-194, score-0.078]
88 The data is real-valued, so the TRBM and the RTRBM were adapted to have Gaussian visible variables using equation 2. [sent-195, score-0.09]
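For reference, a minimal sketch of the Gaussian-visible conditionals under the common unit-variance convention (an assumption here, since equation 2 itself is not reproduced in this extract); the names are illustrative, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gaussian_visible_conditionals(v, h, W, b_v, b_h, rng=np.random):
    """Conditionals of an RBM with real-valued, unit-variance visible units."""
    p_h = sigmoid(W.T @ v + b_h)                     # hidden conditional is unchanged
    v_mean = W @ h + b_v                             # mean of the Gaussian P(V | h)
    v_sample = v_mean + rng.standard_normal(v_mean.shape)
    return p_h, v_mean, v_sample
```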
89 The samples produced by the RTRBM exhibit less sticking and foot-skate than those produced by the TRBM; samples from these models are provided as videos 6,7 (RTRBM) and videos 8,9 (TRBM); video 10 is a sample training sequence. [sent-196, score-0.351]
90 9, where the gradient was normalized by the length of the sequence for each gradient computation. [sent-200, score-0.079]
91 The weights are updated after computing the gradient on a single sequence. [sent-201, score-0.066]
92 The visible to hidden weights, W , were initialized with static CD5 (without using the (R)TRBM learning rules) on 30 sequences (which resulted in 30 weight updates) with learning rate of 0. [sent-203, score-0.158]
93 The weights W ′ and the biases were initialized with a sample from spherical Gaussian of standard-deviation 0. [sent-207, score-0.074]
94 For the bouncing balls problem the initial learning rate was 0. [sent-209, score-0.119]
95 6 Conclusions. In this paper we introduced the RTRBM, which is a probabilistic model as powerful as the intractable TRBM, with exact inference and an almost exact learning procedure. [sent-212, score-0.098]
96 The common disadvantage of the RTRBM is that it is a recurrent neural network, a type of model known to have difficulties learning to use its hidden units to their full potential [2]. [sent-213, score-0.185]
97 For Matlab playback of motion and generation of videos, we have adapted portions of Neil Lawrence’s motion capture toolbox (http://www. [sent-220, score-0.119]
98 A tutorial on hidden Markov models and selected applications in speech recognition. [sent-277, score-0.067]
99 Training restricted Boltzmann machines using approximations to the likelihood gradient. [sent-314, score-0.101]
100 A new class of upper bounds on the log partition function. [sent-323, score-0.057]
wordName wordTfidf (topN-words)
[('ht', 0.541), ('trbm', 0.5), ('rtrbm', 0.386), ('vt', 0.339), ('rbm', 0.177), ('rt', 0.148), ('bh', 0.137), ('videos', 0.128), ('bv', 0.114), ('rnn', 0.091), ('bouncing', 0.08), ('rtrbms', 0.068), ('trbms', 0.068), ('hidden', 0.067), ('boltzmann', 0.066), ('binit', 0.057), ('motion', 0.05), ('units', 0.05), ('visible', 0.047), ('timestep', 0.044), ('inference', 0.044), ('log', 0.042), ('contrastive', 0.04), ('balls', 0.039), ('biases', 0.037), ('backpropagation', 0.037), ('hinton', 0.034), ('recurrent', 0.034), ('gradient', 0.032), ('equation', 0.03), ('cdn', 0.03), ('produced', 0.026), ('factorial', 0.025), ('directed', 0.025), ('rbms', 0.024), ('easy', 0.024), ('necessitated', 0.023), ('xta', 0.023), ('exact', 0.021), ('connections', 0.021), ('restricted', 0.021), ('disadvantage', 0.021), ('gradients', 0.02), ('unde', 0.02), ('nh', 0.02), ('momentum', 0.02), ('ilya', 0.02), ('computing', 0.02), ('computed', 0.019), ('capture', 0.019), ('osindero', 0.018), ('sutskever', 0.018), ('deep', 0.018), ('identical', 0.017), ('belief', 0.017), ('weight', 0.017), ('stands', 0.017), ('samples', 0.016), ('nv', 0.016), ('conditional', 0.016), ('operation', 0.016), ('connection', 0.016), ('graphical', 0.016), ('posterior', 0.016), ('partition', 0.015), ('sequence', 0.015), ('sequences', 0.015), ('url', 0.015), ('moved', 0.015), ('procedure', 0.015), ('weights', 0.014), ('machines', 0.014), ('resolution', 0.014), ('generative', 0.014), ('culty', 0.013), ('neural', 0.013), ('blind', 0.013), ('sampling', 0.013), ('pixels', 0.013), ('variables', 0.013), ('updates', 0.013), ('heuristic', 0.012), ('cd', 0.012), ('probabilistic', 0.012), ('initialized', 0.012), ('executed', 0.012), ('symbol', 0.012), ('receptive', 0.012), ('arises', 0.011), ('frame', 0.011), ('divergence', 0.011), ('vectors', 0.011), ('wainwright', 0.011), ('tion', 0.011), ('sample', 0.011), ('parameters', 0.011), ('update', 0.011), ('done', 0.01), ('statement', 0.01), ('pass', 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
Author: Ilya Sutskever, Geoffrey E. Hinton, Graham W. Taylor
Abstract: The Temporal Restricted Boltzmann Machine (TRBM) is a probabilistic model for sequences that is able to successfully model (i.e., generate nice-looking samples of) several very high dimensional sequences, such as motion capture data and the pixels of low resolution videos of balls bouncing in a box. The major disadvantage of the TRBM is that exact inference is extremely hard, since even computing a Gibbs update for a single variable of the posterior is exponentially expensive. This difficulty has necessitated the use of a heuristic inference procedure, that nonetheless was accurate enough for successful learning. In this paper we introduce the Recurrent TRBM, which is a very slight modification of the TRBM for which exact inference is very easy and exact gradient learning is almost tractable. We demonstrate that the RTRBM is better than an analogous TRBM at generating motion capture and videos of bouncing balls. 1
2 0.16747773 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures threeway interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. 1
3 0.11770208 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries
Author: Silvia Chiappa, Jens Kober, Jan R. Peters
Abstract: Motor primitives or motion templates have become an important concept for both modeling human motor control as well as generating robot behaviors using imitation learning. Recent impressive results range from humanoid robot movement generation to timing models of human motions. The automatic generation of skill libraries containing multiple motion templates is an important step in robot learning. Such a skill learning system needs to cluster similar movements together and represent each resulting motion template as a generative model which is subsequently used for the execution of the behavior by a robot system. In this paper, we show how human trajectories captured as multi-dimensional time-series can be clustered using Bayesian mixtures of linear Gaussian state-space models based on the similarity of their dynamics. The appropriate number of templates is automatically determined by enforcing a parsimonious parametrization. As the resulting model is intractable, we introduce a novel approximation method based on variational Bayes, which is especially designed to enable the use of efficient inference algorithms. On recorded human Balero movements, this method is not only capable of finding reasonable motion templates but also yields a generative model which works well in the execution of this complex task on a simulated anthropomorphic SARCOS arm.
4 0.10793016 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
Author: Steffen Bickel, Christoph Sawade, Tobias Scheffer
Abstract: We address the problem of learning classifiers for several related tasks that may differ in their joint distribution of input and output variables. For each task, small – possibly even empty – labeled samples and large unlabeled samples are available. While the unlabeled samples reflect the target distribution, the labeled samples may be biased. This setting is motivated by the problem of predicting sociodemographic features for users of web portals, based on the content which they have accessed. Here, questionnaires offered to a portion of each portal’s users produce biased samples. We derive a transfer learning procedure that produces resampling weights which match the pool of all examples to the target distribution of any given task. Transfer learning enables us to make predictions even for new portals with few or no training data and improves the overall prediction accuracy. 1
5 0.089729905 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
Author: Tanya Schmah, Geoffrey E. Hinton, Steven L. Small, Stephen Strother, Richard S. Zemel
Abstract: Neuroimaging datasets often have a very large number of voxels and a very small number of training cases, which means that overfitting of models for this data can become a very serious problem. Working with a set of fMRI images from a study on stroke recovery, we consider a classification task for which logistic regression performs poorly, even when L1- or L2- regularized. We show that much better discrimination can be achieved by fitting a generative model to each separate condition and then seeing which model is most likely to have generated the data. We compare discriminative training of exactly the same set of models, and we also consider convex blends of generative and discriminative training. 1
6 0.074193478 244 nips-2008-Unifying the Sensory and Motor Components of Sensorimotor Adaptation
7 0.064251892 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
8 0.056112587 234 nips-2008-The Infinite Factorial Hidden Markov Model
9 0.055551417 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates
10 0.052821901 224 nips-2008-Structured ranking learning using cumulative distribution networks
11 0.050668456 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
12 0.044957951 206 nips-2008-Sequential effects: Superstition or rational behavior?
13 0.043362174 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
14 0.0431365 168 nips-2008-Online Metric Learning and Fast Similarity Search
15 0.041599836 219 nips-2008-Spectral Hashing
16 0.040587403 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
17 0.039051052 44 nips-2008-Characteristic Kernels on Groups and Semigroups
18 0.037235074 118 nips-2008-Learning Transformational Invariants from Natural Movies
19 0.036169924 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
20 0.032945823 119 nips-2008-Learning a discriminative hidden part model for human action recognition
topicId topicWeight
[(0, -0.101), (1, 0.015), (2, 0.043), (3, -0.026), (4, 0.012), (5, 0.028), (6, -0.024), (7, 0.06), (8, 0.073), (9, -0.028), (10, 0.011), (11, 0.08), (12, 0.014), (13, 0.111), (14, -0.1), (15, -0.139), (16, -0.16), (17, -0.131), (18, -0.072), (19, 0.115), (20, -0.037), (21, -0.051), (22, -0.017), (23, -0.022), (24, 0.037), (25, 0.101), (26, -0.027), (27, 0.125), (28, 0.083), (29, 0.097), (30, -0.018), (31, 0.012), (32, -0.044), (33, -0.018), (34, -0.008), (35, 0.048), (36, -0.067), (37, -0.079), (38, -0.064), (39, 0.072), (40, 0.088), (41, -0.08), (42, -0.091), (43, 0.063), (44, 0.053), (45, 0.063), (46, -0.004), (47, -0.052), (48, -0.116), (49, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.95203781 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
Author: Ilya Sutskever, Geoffrey E. Hinton, Graham W. Taylor
Abstract: The Temporal Restricted Boltzmann Machine (TRBM) is a probabilistic model for sequences that is able to successfully model (i.e., generate nice-looking samples of) several very high dimensional sequences, such as motion capture data and the pixels of low resolution videos of balls bouncing in a box. The major disadvantage of the TRBM is that exact inference is extremely hard, since even computing a Gibbs update for a single variable of the posterior is exponentially expensive. This difficulty has necessitated the use of a heuristic inference procedure, that nonetheless was accurate enough for successful learning. In this paper we introduce the Recurrent TRBM, which is a very slight modification of the TRBM for which exact inference is very easy and exact gradient learning is almost tractable. We demonstrate that the RTRBM is better than an analogous TRBM at generating motion capture and videos of bouncing balls. 1
2 0.71060944 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures threeway interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. 1
3 0.64214665 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
Author: Tanya Schmah, Geoffrey E. Hinton, Steven L. Small, Stephen Strother, Richard S. Zemel
Abstract: Neuroimaging datasets often have a very large number of voxels and a very small number of training cases, which means that overfitting of models for this data can become a very serious problem. Working with a set of fMRI images from a study on stroke recovery, we consider a classification task for which logistic regression performs poorly, even when L1- or L2- regularized. We show that much better discrimination can be achieved by fitting a generative model to each separate condition and then seeing which model is most likely to have generated the data. We compare discriminative training of exactly the same set of models, and we also consider convex blends of generative and discriminative training. 1
4 0.41325808 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries
Author: Silvia Chiappa, Jens Kober, Jan R. Peters
Abstract: Motor primitives or motion templates have become an important concept for both modeling human motor control as well as generating robot behaviors using imitation learning. Recent impressive results range from humanoid robot movement generation to timing models of human motions. The automatic generation of skill libraries containing multiple motion templates is an important step in robot learning. Such a skill learning system needs to cluster similar movements together and represent each resulting motion template as a generative model which is subsequently used for the execution of the behavior by a robot system. In this paper, we show how human trajectories captured as multi-dimensional time-series can be clustered using Bayesian mixtures of linear Gaussian state-space models based on the similarity of their dynamics. The appropriate number of templates is automatically determined by enforcing a parsimonious parametrization. As the resulting model is intractable, we introduce a novel approximation method based on variational Bayes, which is especially designed to enable the use of efficient inference algorithms. On recorded human Balero movements, this method is not only capable of finding reasonable motion templates but also yields a generative model which works well in the execution of this complex task on a simulated anthropomorphic SARCOS arm.
5 0.40279061 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
6 0.39303014 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
7 0.35315171 234 nips-2008-The Infinite Factorial Hidden Markov Model
8 0.34909052 244 nips-2008-Unifying the Sensory and Motor Components of Sensorimotor Adaptation
9 0.31438014 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
10 0.3012386 110 nips-2008-Kernel-ARMA for Hand Tracking and Brain-Machine interfacing During 3D Motor Control
11 0.2866683 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
12 0.2851631 249 nips-2008-Variational Mixture of Gaussian Process Experts
13 0.25955534 236 nips-2008-The Mondrian Process
14 0.25447068 65 nips-2008-Domain Adaptation with Multiple Sources
15 0.25316086 211 nips-2008-Simple Local Models for Complex Dynamical Systems
16 0.25295371 233 nips-2008-The Gaussian Process Density Sampler
17 0.24244347 81 nips-2008-Extracting State Transition Dynamics from Multiple Spike Trains with Correlated Poisson HMM
18 0.23274691 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
19 0.23103981 118 nips-2008-Learning Transformational Invariants from Natural Movies
20 0.23080748 219 nips-2008-Spectral Hashing
topicId topicWeight
[(6, 0.043), (7, 0.059), (12, 0.039), (28, 0.145), (32, 0.049), (39, 0.35), (57, 0.078), (63, 0.025), (77, 0.029), (78, 0.012), (83, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.71891284 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
Author: Ilya Sutskever, Geoffrey E. Hinton, Graham W. Taylor
Abstract: The Temporal Restricted Boltzmann Machine (TRBM) is a probabilistic model for sequences that is able to successfully model (i.e., generate nice-looking samples of) several very high dimensional sequences, such as motion capture data and the pixels of low resolution videos of balls bouncing in a box. The major disadvantage of the TRBM is that exact inference is extremely hard, since even computing a Gibbs update for a single variable of the posterior is exponentially expensive. This difficulty has necessitated the use of a heuristic inference procedure, that nonetheless was accurate enough for successful learning. In this paper we introduce the Recurrent TRBM, which is a very slight modification of the TRBM for which exact inference is very easy and exact gradient learning is almost tractable. We demonstrate that the RTRBM is better than an analogous TRBM at generating motion capture and videos of bouncing balls. 1
2 0.62331605 128 nips-2008-Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
Author: Doug Downey, Oren Etzioni
Abstract: Is accurate classification possible in the absence of hand-labeled data? This paper introduces the Monotonic Feature (MF) abstraction—where the probability of class membership increases monotonically with the MF’s value. The paper proves that when an MF is given, PAC learning is possible with no hand-labeled data under certain assumptions. We argue that MFs arise naturally in a broad range of textual classification applications. On the classic “20 Newsgroups” data set, a learner given an MF and unlabeled data achieves classification accuracy equal to that of a state-of-the-art semi-supervised learner relying on 160 hand-labeled examples. Even when MFs are not given as input, their presence or absence can be determined from a small amount of hand-labeled data, which yields a new semi-supervised learning method that reduces error by 15% on the 20 Newsgroups data. 1
3 0.47703761 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
Author: Kate Saenko, Trevor Darrell
Abstract: Polysemy is a problem for methods that exploit image search engines to build object category models. Existing unsupervised approaches do not take word sense into consideration. We propose a new method that uses a dictionary to learn models of visual word sense from a large collection of unlabeled web data. The use of LDA to discover a latent sense space makes the model robust despite the very limited nature of dictionary definitions. The definitions are used to learn a distribution in the latent space that best represents a sense. The algorithm then uses the text surrounding image links to retrieve images with high probability of a particular dictionary sense. An object classifier is trained on the resulting sense-specific images. We evaluate our method on a dataset obtained by searching the web for polysemous words. Category classification experiments show that our dictionarybased approach outperforms baseline methods. 1
4 0.47659031 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures threeway interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. 1
5 0.46676448 4 nips-2008-A Scalable Hierarchical Distributed Language Model
Author: Andriy Mnih, Geoffrey E. Hinton
Abstract: Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models. 1
6 0.46367025 118 nips-2008-Learning Transformational Invariants from Natural Movies
7 0.46327096 28 nips-2008-Asynchronous Distributed Learning of Topic Models
8 0.46205842 138 nips-2008-Modeling human function learning with Gaussian processes
9 0.46126977 200 nips-2008-Robust Kernel Principal Component Analysis
10 0.46111786 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
11 0.46074939 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
12 0.45921093 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
13 0.4575257 219 nips-2008-Spectral Hashing
14 0.45739511 62 nips-2008-Differentiable Sparse Coding
15 0.45713589 66 nips-2008-Dynamic visual attention: searching for coding length increments
16 0.45590103 31 nips-2008-Bayesian Exponential Family PCA
17 0.45520529 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
18 0.4550173 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
19 0.45475534 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences
20 0.45439821 231 nips-2008-Temporal Dynamics of Cognitive Control