jmlr jmlr2012 jmlr2012-55 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio
Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning
Reference: text
sentIndex sentText sentNum sentScore
1 We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. [sent-14, score-0.496]
2 Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. [sent-15, score-0.424]
3 Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning 1. [sent-17, score-0.342]
4 Introduction The restricted Boltzmann machine (RBM) is a probabilistic model that uses a layer of hidden binary variables or units to model the distribution of a visible layer of variables. [sent-18, score-0.352]
5 Then, the RBM is used in one of two ways: either its hidden layer is used to preprocess the input data by replacing it with the representation it provides, or its parameters are used to initialize another model. [sent-26, score-0.235]
6 Moreover, since the RBM is trained in an unsupervised manner, it is blind to the nature of the supervised task that needs to be solved and provides no guarantees that the information extracted by its hidden layer will be useful. [sent-30, score-0.241]
7 We compare different training strategies for the ClassRBM, which rely on discriminative and/or generative learning objectives. [sent-35, score-0.404]
8 Moreover, since the generative learning objective doesn’t allow for the exact computation of the gradient with respect to the ClassRBM’s parameters, we compare the use of different approximations of that gradient. [sent-37, score-0.316]
9 We also address the issue of generative learning on very high dimensional inputs and propose an approach to reduce the computational cost per example of training. [sent-38, score-0.237]
10 The ClassRBM models the joint distribution of an input x and a target class y ∈ {1, . . . , C} using a hidden layer of binary stochastic units h = (h_1, . . . , h_H). [sent-48, score-0.237]
11 When conditioning on the visible layer, we have p(h|y, x) = ∏_j p(h_j|y, x), with p(h_j = 1|y, x) = sigm(c_j + U_{jy} + ∑_i W_{ji} x_i), where sigm(a) = 1/(1 + exp(−a)) is the logistic sigmoid function. [sent-60, score-0.155]
12 When conditioning on the hidden layer, we have p(x|h) = ∏_i p(x_i|h), with p(x_i = 1|h) = sigm(b_i + ∑_j W_{ji} h_j), and p(y|h) = exp(d_y + ∑_j U_{jy} h_j) / ∑_{y*} exp(d_{y*} + ∑_j U_{jy*} h_j). [sent-61, score-0.248]
13 exp(d_y) ∏_{j=1}^{H} (1 + exp(c_j + U_{jy} + ∑_i W_{ji} x_i)) = exp(d_y + ∑_j log(1 + exp(c_j + U_{jy} + ∑_i W_{ji} x_i))) = exp(d_y + ∑_j softplus(c_j + U_{jy} + ∑_i W_{ji} x_i)), where softplus(a) = log(1 + exp(a)); then we can write p(y|x) = ∑_{h_1∈{0,1}} · · · ∑_{h_H∈{0,1}} exp(h^T W x + b^T x + c^T h + d^T e_y + h^T U e_y) / ∑_{y*∈{1,...,C}} ∑_{h_1∈{0,1}} · · · ∑_{h_H∈{0,1}} exp(h^T W x + b^T x + c^T h + d^T e_{y*} + h^T U e_{y*}) [sent-66, score-0.347]
14 = exp(d_y + ∑_j softplus(c_j + U_{jy} + ∑_i W_{ji} x_i)) / ∑_{y*∈{1,...,C}} exp(d_{y*} + ∑_j softplus(c_j + U_{jy*} + ∑_i W_{ji} x_i)). [sent-69, score-0.155]
15 These filters are shared across the different classes, but different classes will make comparisons with different filters by controlling the class-dependent biases U_{jy} in the softplus(c_j + U_{jy} + ∑_i W_{ji} x_i) terms. [sent-79, score-0.295]
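As a concrete illustration of the p(y|x) expression above, here is a minimal NumPy sketch (an illustration, not code from the paper). It assumes W is an (H, D) array of input-to-hidden weights, U an (H, C) array of class-to-hidden weights, c the length-H hidden biases and d the length-C class biases, with x a length-D binary input vector.

```python
import numpy as np

def softplus(a):
    # numerically stable log(1 + exp(a))
    return np.logaddexp(0.0, a)

def class_probabilities(x, W, U, c, d):
    # o_{yj}(x) = c_j + U_{jy} + sum_i W_{ji} x_i, computed for every class at once
    o = c[:, None] + U + W.dot(x)[:, None]          # shape (H, C)
    log_unnorm = d + softplus(o).sum(axis=0)        # d_y + sum_j softplus(o_{yj}(x))
    log_unnorm -= log_unnorm.max()                  # stabilise the softmax
    p = np.exp(log_unnorm)
    return p / p.sum()                              # p(y | x), shape (C,)
```

For prediction, np.argmax(class_probabilities(x, W, U, c, d)) gives the most probable class under the model.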
16 Training Objectives In order to train a ClassRBM to solve a particular classification problem, we can simply define an objective to minimize for all examples in the training set Dtrain = {(xt , yt )}. [sent-82, score-0.232]
17 The next sections describe the different training objectives which will be considered here, starting with the one most commonly used, that is, the generative training objective. [sent-83, score-0.363]
18 1 Generative Training Objective Given that we have a model which defines a value for the joint probability p(y, x), a natural choice for a training objective is the generative training objective L_gen(D_train) = − ∑_{t=1}^{|D_train|} log p(y_t, x_t). (3) [sent-85, score-0.556]
19 This is the most popular training objective for RBMs, and much effort has been put into obtaining better estimates of its gradient (Hinton, 2002; Tieleman, 2008; Tieleman and Hinton, 2009). [sent-86, score-0.168]
20 Indeed, as mentioned previously, computing p(y_t, x_t) for some example (x_t, y_t) is generally intractable, as is computing log p(y_t, x_t) and its gradient with respect to any ClassRBM parameter θ: ∂ log p(y_t, x_t)/∂θ = −E_{h|y_t,x_t}[∂E(y_t, x_t, h)/∂θ] + E_{y,x,h}[∂E(y, x, h)/∂θ]. [sent-87, score-0.564]
21 One which is known to work well in practice is the contrastive divergence estimator (Hinton, 2002). [sent-90, score-0.181]
22 This approximation replaces the expectation by a point estimate at a sample generated after a limited number of Gibbs sampling steps, with the sampler’s initial state for the visible variables initialized at the training example (xt , yt ). [sent-91, score-0.206]
23 We will consider this procedure for now and postpone the consideration of other estimates of the generative gradient to Section 7. [sent-95, score-0.256]
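To make the contrastive divergence estimator concrete, the sketch below approximates the generative gradient with one step of block Gibbs sampling started at the training pair (x_t, y_t). It follows the shape conventions of the earlier snippet and is only an illustration, not the paper's implementation; the returned arrays are ascent directions on log p(y_t, x_t).

```python
import numpy as np

def cd1_gradients(x, y, W, U, b, c, d, rng):
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    onehot = lambda k, n: np.eye(n)[k]
    C = d.shape[0]

    # Positive phase: hidden expectations with (x_t, y_t) clamped
    h_pos = sigmoid(c + U[:, y] + W.dot(x))

    # Negative phase: one step of block Gibbs sampling started at (x_t, y_t)
    h_smp = (rng.random(h_pos.shape) < h_pos).astype(float)
    x_neg = sigmoid(b + W.T.dot(h_smp))              # mean reconstruction of the inputs
    logits = d + U.T.dot(h_smp)
    p_y = np.exp(logits - logits.max()); p_y /= p_y.sum()
    y_neg = rng.choice(C, p=p_y)                     # sampled reconstruction of the class
    h_neg = sigmoid(c + U[:, y_neg] + W.dot(x_neg))

    return {
        "W": np.outer(h_pos, x) - np.outer(h_neg, x_neg),
        "b": x - x_neg,
        "c": h_pos - h_neg,
        "d": onehot(y, C) - onehot(y_neg, C),
        "U": np.outer(h_pos, onehot(y, C)) - np.outer(h_neg, onehot(y_neg, C)),
    }
```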
24 Since we are in a supervised learning setting and are only interested in obtaining a good prediction of the target given the input, it might be more appropriate to ignore this unsupervised part of the generative objective and focus on the supervised part. [sent-99, score-0.27]
25 Doing so is referred to as discriminative training, where the following training objective is used: L_disc(D_train) = − ∑_{t=1}^{|D_train|} log p(y_t|x_t). [sent-100, score-0.254]
26 An important advantage of the discriminative training objective is that it is possible to compute its gradient with respect to the ClassRBM’s parameters exactly. [sent-103, score-0.3]
27 The general form of the gradient for a single example (x_t, y_t) is ∂ log p(y_t|x_t)/∂θ = −E_{h|y_t,x_t}[∂E(y_t, x_t, h)/∂θ] + E_{y,h|x_t}[∂E(y, x_t, h)/∂θ] and, more specifically, we obtain ∂ log p(y_t|x_t)/∂d_y = 1_{y=y_t} − p(y|x_t), ∀ y ∈ {1, . . . , C}, [sent-104, score-0.258]
28 for the target biases d and ∂ log p(y_t|x_t)/∂θ = ∑_j sigm(o_{y_t j}(x_t)) ∂o_{y_t j}(x_t)/∂θ − ∑_{j,y*} sigm(o_{y* j}(x_t)) p(y*|x_t) ∂o_{y* j}(x_t)/∂θ for the other parameters θ ∈ {c, U, W}, where o_{yj}(x) = c_j + ∑_k W_{jk} x_k + U_{jy}. [sent-107, score-0.201]
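These expressions can be transcribed directly; the sketch below computes the exact discriminative gradient for a single example (again an illustration under the same shape conventions, with yt the integer class label of the example).

```python
import numpy as np

def discriminative_gradients(x, yt, W, U, c, d):
    C = d.shape[0]
    o = c[:, None] + U + W.dot(x)[:, None]          # o_{yj}(x), shape (H, C)
    s = 1.0 / (1.0 + np.exp(-o))                    # sigm(o_{yj}(x))
    fe = d + np.logaddexp(0.0, o).sum(axis=0)       # d_y + sum_j softplus(o_{yj}(x))
    p = np.exp(fe - fe.max()); p /= p.sum()         # p(y | x_t)

    e_t = np.zeros(C); e_t[yt] = 1.0
    grad_d = e_t - p                                # 1_{y = y_t} - p(y | x_t)
    grad_c = s[:, yt] - s.dot(p)
    grad_W = np.outer(grad_c, x)                    # because d o_{yj}(x) / d W_{jk} = x_k
    grad_U = np.outer(s[:, yt], e_t) - s * p[None, :]
    return {"W": grad_W, "U": grad_U, "c": grad_c, "d": grad_d}
```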
29 Hybrid Training Objective In order to get an idea of when and why generative training can be better than discriminative training or vice versa, we can look at some of the known theoretical properties of both approaches. [sent-111, score-0.466]
30 However, when the model is misspecified, the discriminative training objective allows the model to reach a better performance for sufficiently large training sets (i. [sent-113, score-0.316]
31 In Liang and Jordan (2008), it is also shown that for models from the general exponential family, parameter estimates based on the generative training objective have smaller variance than discriminative estimates. [sent-116, score-0.464]
32 However, if the model is misspecified, generative estimates will asymptotically yield models with worse performances. [sent-117, score-0.21]
33 These results suggest interpreting the model trained with the generative training objective as being more regularized than the model trained with the discriminative objective. [sent-118, score-0.53]
34 When the ultimate task is classification, adding the generative training objective to the discriminative training objective can be seen as a way to regularize the discriminative training objective. [sent-119, score-0.78]
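A single stochastic gradient step on the hybrid objective can then be sketched by combining the two snippets above, with the generative learning weight α scaling the CD-1 estimate; the parameter dictionary and learning rate lr are illustrative assumptions.

```python
def hybrid_update(params, x, y, alpha, lr, rng):
    # params holds the arrays "W", "U", "b", "c", "d"; reuses the gradient sketches above
    disc = discriminative_gradients(x, y, params["W"], params["U"], params["c"], params["d"])
    gen = cd1_gradients(x, y, params["W"], params["U"], params["b"],
                        params["c"], params["d"], rng)
    for name in params:
        step = alpha * gen[name]
        if name in disc:                 # b only appears in the generative part
            step = step + disc[name]
        params[name] += lr * step
```

Setting alpha to 0 recovers purely discriminative training, while larger values give more weight to the generative, regularizing term.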
35 (2009) proposed a different approach to discriminative training of RBMs, where each class is associated with its own individual RBM, as in a Bayes classifier. [sent-138, score-0.194]
36 (2011) highlight that softplus hidden activation functions tend to yield better performances than logistic-shaped functions (including the hyperbolic tangent). [sent-149, score-0.212]
37 This might be one explanation behind the slightly superior performance of discriminatively trained ClassRBMs, compared to neural networks with hyperbolic tangent hidden units (see Sections 7. [sent-150, score-0.155]
38 However, the main advantage of working in the framework of RBMs is that it provides a natural way to introduce generative learning, which can provide data set-dependent regularization and, as we will see, can be used to extend learning in the semi-supervised setting. [sent-153, score-0.21]
39 As mentioned earlier, a form of generative learning can be introduced in standard neural networks with unsupervised pre-training, simply by using RBMs to initialize the hidden layer weights. [sent-154, score-0.418]
40 The extent to which the neural network is influenced by generative learning is not well controlled, while the hybrid objective provides an explicit handle on the role played by generative learning. [sent-157, score-0.603]
41 1, on a log scale), H (50 to 6000), the generative learning weight α for hybrid training (0 to 0. [sent-167, score-0.395]
42 In general, bigger values of H were found to be more appropriate with more generative learning. [sent-171, score-0.21]
43 We give as a comparison the results of a Gaussian kernel SVM, a random forest classifier and a regular neural network (with random initialization, one hidden layer and hyperbolic tangent hidden activation functions). [sent-183, score-0.301]
44 First, we observe that discriminative training of the ClassRBM outperforms generative training. [sent-184, score-0.404]
45 However, hybrid training appears able to make the best out of both worlds and outperforms the other approaches. [sent-185, score-0.185]
46 They suggest introducing sparsity in the hidden layer of an RBM by setting the hidden-layer biases c to a value that keeps the average conditional expected value of these neurons at an arbitrarily small value, doing so after each iteration through the whole training set. [sent-189, score-0.546]
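The excerpt does not spell out the exact bias update, so the following is only one plausible heuristic (an assumption, not the paper's rule): after each pass through the training set, shift each hidden bias so that the unit's average conditional activation is pulled toward a small target value.

```python
import numpy as np

def adjust_sparsity_biases(c, mean_activation, target=0.01):
    # mean_activation[j] is the average of p(h_j = 1 | y, x) over the training set;
    # the target value 0.01 is an arbitrary illustrative choice
    logit = lambda p: np.log(p) - np.log1p(-p)
    q = np.clip(mean_activation, 1e-6, 1.0 - 1e-6)
    return c + logit(target) - logit(q)      # shifts each unit's mean pre-activation
```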
47 1, on a log scale) and the hidden layer size, testing bigger hidden layer sizes than previously selected. [sent-198, score-0.416]
48 The discriminative power of the hybrid ClassRBM can be better understood by looking at the rows of the weight matrix W, which act as filter features. [sent-205, score-0.255]
49 In practice, we find that the most influential hyper-parameters are the learning rate and the generative learning weight. [sent-208, score-0.21]
50 Conveniently, we also find that the best learning rate value is the same for each value of the generative learning weight we tested. [sent-209, score-0.21]
51 In other words, finding a good learning rate does not require having found the best value for the generative learning weight. [sent-210, score-0.21]
52 Once these two hyper-parameters are set to good values, we also find that a wide range of hidden layer sizes (between 750 and 3000) yields a competitive performance, that is, under 1. [sent-211, score-0.208]
53 Once again, hybrid training of the ClassRBM outperforms the other approaches, including the SVM and neural network classifiers. [sent-246, score-0.185]
54 Notice that here generative training performs better than discriminative training. [sent-247, score-0.404]
55 In order to get a better understanding of how the hybrid ClassRBM solves this classification problem, we looked at the weights connecting each of the classes to the hidden neurons. [sent-265, score-0.216]
56 We see that the ClassRBM does not use strictly non-overlapping sets of neurons for different newsgroups, but shares some of those neurons for newsgroups that are semantically related. [sent-268, score-0.144]
57 3 Variations on the Generative Learning Gradient Estimator In the previous sections, we considered contrastive divergence for estimating the generative learning gradient. [sent-282, score-0.391]
58 Hence, to obtain hybrid training we can simply weight the second summation term by α. [sent-324, score-0.185]
59 However, by computing ∑_i W_{ji} x_i for all j only once and reusing those terms to obtain the terms ∑_{i≠k} W_{ji} x_i for any k, we can obtain a procedure that is still linear in D, as is the CD gradient estimator. [sent-329, score-0.144]
60 Table 4: Comparison of the classification performances using different generative gradient estimators. [sent-340, score-0.286]
61 To perform stochastic gradient descent, a gradient step update is made according to the objective of Equation 7 every time a training example is visited. [sent-341, score-0.214]
62 Another generative gradient estimator for RBMs that has been recently proposed is the Persistent CD (PCD) estimator (Tieleman, 2008). [sent-346, score-0.256]
63 Is the choice of the generative gradient estimator in the hybrid objective crucial for obtaining good classification performances? [sent-354, score-0.439]
64 To answer this question, we have trained ClassRBMs using the hybrid objective on the MNIST and 20 Newsgroup data sets, with the different generative gradient estimators. [sent-355, score-0.472]
65 4 Scaling Up to Large Input Spaces Computing the generative gradient is typically much more computationally expensive than the discriminative gradient. [sent-368, score-0.388]
66 While the computations for estimating the discriminative gradient can take advantage of the sparsity of the input (mainly when multiplying the input with the filters), estimating the generative gradient for all of the estimators in Section 7. [sent-371, score-0.488]
67 It would hence be beneficial to derive a more general generative gradient estimator that would allow us to control more directly its computational cost, and perhaps let us trade a little bit of accuracy for more computational efficiency. [sent-373, score-0.256]
68 In such a setting, we might want to reduce the computational time required by the generative learning objective so that updating the parameters of the ClassRBM for a training example can be done before the next sample is given. [sent-375, score-0.332]
69 At one extreme, discriminative learning is very efficient since we are only modelling the (conditional) distribution of the target variable while, at the other extreme, generative learning is much more expensive because the distribution of the target and all input variables is being modelled. [sent-377, score-0.369]
70 This training objective actually corresponds to a particular type of composite likelihood estimator (Lindsay, 1988; Liang and Jordan, 2008). [sent-385, score-0.155]
71 Here, we in addition propose to approximate the gradients of the log p(y_t, x_S|x_{\S}) terms, ∂ log p(y_t, x_S|x_{\S})/∂θ = −E_{h|y_t,x_t}[∂E(y_t, x_t, h)/∂θ] + E_{y,x_S,h|x_{\S}}[∂E(y, x, h)/∂θ], by using contrastive divergence with one step of Gibbs sampling. [sent-386, score-0.283]
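A sketch of this variant is given below (an illustration of the idea only, not the paper's optimized procedure): the positive phase is unchanged, while the negative phase resamples only the randomly chosen input subset S and the class, leaving the remaining inputs clamped to the data. The gradients are then assembled from h_pos, h_neg, x_neg and y_neg exactly as in the earlier CD-1 sketch.

```python
import numpy as np

def cd1_composite_phases(x, y, W, U, b, c, d, subset_size, rng):
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    D, C = x.shape[0], d.shape[0]
    S = rng.choice(D, size=subset_size, replace=False)   # random input subset

    # Positive phase at the training pair (x_t, y_t)
    h_pos = sigmoid(c + U[:, y] + W.dot(x))
    h_smp = (rng.random(h_pos.shape) < h_pos).astype(float)

    # Negative phase: only x_S and the class are resampled, x_{\S} stays clamped
    x_neg = x.astype(float)
    x_neg[S] = sigmoid(b[S] + W[:, S].T.dot(h_smp))
    logits = d + U.T.dot(h_smp)
    p_y = np.exp(logits - logits.max()); p_y /= p_y.sum()
    y_neg = rng.choice(C, p=p_y)
    h_neg = sigmoid(c + U[:, y_neg] + W.dot(x_neg))
    return h_pos, h_neg, x_neg, y_neg
```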
72 Table 5: Evaluation of the composite likelihood variant of contrastive divergence on the 20 Newsgroups data set, with an input dimensionality of 25247. [sent-393, score-0.241]
73 We experimentally investigated how varying L impacts the performance of ClassRBMs trained using a hybrid objective based on the generative objective of Equation 8. [sent-394, score-0.486]
74 The performance of purely discriminative training in the large vocabulary setting, which is essentially equivalent to setting L = 0, is also improved on. [sent-398, score-0.194]
75 The idea of combining composite likelihood objectives and contrastive divergence has also been explored previously by Asuncion et al. [sent-401, score-0.243]
76 (2010) focused on models for which standard contrastive divergence with Gibbs sampling corresponds to sampling only a single randomly selected variable at each step. [sent-404, score-0.249]
77 In this case, contrastive divergence with one sampling step actually corresponds to a stochastic version of pseudolikelihood (Hyvärinen, 2006). [sent-405, score-0.276]
78 Their work can be understood as an investigation of how to improve contrastive divergence using ideas from composite likelihood objectives. [sent-410, score-0.214]
79 However, for RBMs, block-Gibbs sampling is actually the standard, where we first sample all hidden units and then all input variables in one iteration. [sent-411, score-0.183]
80 What we propose instead, is to apply contrastive divergence to a composite likelihood objective, such that we approximate the gradients on the log p(yt , xS |x\S ) terms. [sent-414, score-0.214]
81 Semi-supervised Learning In certain situations, in addition to a (possibly small) set of labeled training examples Dtrain , even more data can be obtained in the form of an unlabeled training set Dunlab = {(xt )}. [sent-417, score-0.152]
82 Because a ClassRBM is a proper generative model, a very natural notion of consistency in this context is that unlabeled training data have high likelihood under it. [sent-422, score-0.3]
83 The contrastive divergence approximation proceeds slightly differently here. [sent-424, score-0.181]
84 In the latter sampling version, the online training update for this objective can be described as replacing the statement y0 ← yt with y0 ∼ p(y|xt ) in Algorithm 1. [sent-426, score-0.266]
85 Online training by stochastic gradient descent then corresponds to applying two gradient updates: one for the objective L_TYPE and one for the unlabeled-data objective L_unsup. [sent-429, score-0.256]
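For an unlabeled example this amounts to first sampling the missing label from p(y|x_t) and then applying the usual generative update; the short sketch below reuses the earlier snippets, and the weight β of the unlabeled objective is applied by the caller.

```python
def unlabeled_cd1_gradients(x, W, U, b, c, d, rng):
    # y0 ~ p(y | x_t) replaces the clamped label of the supervised update
    p_y = class_probabilities(x, W, U, c, d)
    y0 = rng.choice(p_y.shape[0], p=p_y)
    return cd1_gradients(x, y0, W, U, b, c, d, rng)
```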
86 Model selection covered all the parameters of the hybrid ClassRBM as well as the unsupervised objective weight β of Equation 10, with β = 0. [sent-438, score-0.183]
87 The results are given in Table 6, where we observe that semi-supervised learning consistently improves the performance of the ClassRBM trained based on the hybrid objective. [sent-461, score-0.156]
88 While log-linear models depend much more on the discriminative quality of the features that are fed as input, the ClassRBM can learn useful features through its hidden layer and model non-linear decision boundaries. [sent-470, score-0.34]
89 The conditional distribution of y given h then becomes: p(y|h) = ∏_c p(y_c|h), with p(y_c = 1|h) = sigm(d_c + ∑_j U_{jc} h_j). [sent-480, score-0.157]
90 Running the following message passing procedure to convergence: µ_c ← sigm(d_c + ∑_j U_{jc} τ_j) ∀ c ∈ {1, . . . , C}, [sent-486, score-0.29]
91 τ_j ← sigm(c_j + ∑_c U_{jc} µ_c + ∑_i W_{ji} x_i) ∀ j ∈ {1, . . . , H}. [sent-489, score-0.339]
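The two updates above can be iterated directly; below is a minimal mean-field sketch (the initialisation and the fixed iteration count are arbitrary illustrative choices, not taken from the paper).

```python
import numpy as np

def mean_field_labels(x, W, U, c, d, n_iter=20):
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    wx = W.dot(x)                            # sum_i W_{ji} x_i, computed once
    mu = sigmoid(d)                          # initial guess for the label marginals
    for _ in range(n_iter):
        tau = sigmoid(c + U.dot(mu) + wx)    # tau_j, hidden-unit marginals
        mu = sigmoid(d + U.T.dot(tau))       # mu_c, approximates p(y_c = 1 | x)
    return mu
```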
92 As for learning, the discriminative gradient expression, which is now ∂ log p(y_t|x_t)/∂θ = −E_{h|y_t,x_t}[∂E(y_t, x_t, h)/∂θ] + E_{y,h|x_t}[∂E(y, x_t, h)/∂θ], must also be approximated, specifically the second expectation over y and h. [sent-503, score-0.28]
93 This approach was first described by Welling and Hinton (2002) and is known as mean field contrastive divergence. [sent-509, score-0.139]
94 3, the intractability of discriminative maximum likelihood training can be avoided by using a pseudolikelihood objective − ∑_{c=1}^{C} log p(y_c|y_{\c}, x), for which exact gradients can be computed. [sent-513, score-0.315]
95 As we see, contrastive divergence tends to outperform other approaches for training the ClassRBM, whether mean field or loopy belief propagation is used at test time. [sent-567, score-0.351]
96 The same model selection procedure was used to select the baselines’ hyper-parameters, namely the learning rate (both baselines) and hidden layer size (neural network baseline only). [sent-575, score-0.208]
97 Contrastive divergence and loopy belief propagation were used in this comparison, for discriminative training. [sent-576, score-0.282]
98 In particular, we highlighted the importance of combining generative and discriminative training and we explored the impact of using different generative gradient estimators on the classification performance of the ClassRBM. [sent-608, score-0.66]
99 generative classifiers: A comparison of logistic regression and naive bayes. [sent-814, score-0.21]
100 Generative versus discriminative training of RBMs for classification of fMRI images. [sent-838, score-0.194]
wordName wordTfidf (topN-words)
[('classrbm', 0.612), ('generative', 0.21), ('dtrain', 0.186), ('jc', 0.157), ('rbm', 0.157), ('contrastive', 0.139), ('sigm', 0.133), ('discriminative', 0.132), ('rbms', 0.13), ('pascanu', 0.129), ('hybrid', 0.123), ('yc', 0.121), ('achines', 0.115), ('andel', 0.115), ('oltzmann', 0.115), ('layer', 0.115), ('yt', 0.11), ('jy', 0.106), ('xt', 0.102), ('arochelle', 0.099), ('engio', 0.099), ('estricted', 0.099), ('hidden', 0.093), ('softplus', 0.089), ('mandel', 0.083), ('mnist', 0.08), ('boltzmann', 0.079), ('larochelle', 0.076), ('newsgroups', 0.076), ('lbp', 0.071), ('mf', 0.071), ('tags', 0.068), ('tag', 0.068), ('hinton', 0.068), ('yoshua', 0.067), ('training', 0.062), ('hugo', 0.062), ('geoffrey', 0.061), ('pseudolikelihood', 0.061), ('objective', 0.06), ('deep', 0.06), ('gibbs', 0.06), ('bengio', 0.056), ('lassification', 0.055), ('multitask', 0.055), ('classrbms', 0.053), ('dunlab', 0.053), ('nnet', 0.053), ('tieleman', 0.053), ('welling', 0.053), ('ji', 0.049), ('classi', 0.046), ('gradient', 0.046), ('music', 0.045), ('aroc', 0.044), ('eh', 0.044), ('mturk', 0.044), ('wx', 0.044), ('belief', 0.043), ('divergence', 0.042), ('hh', 0.038), ('cd', 0.037), ('ey', 0.037), ('dy', 0.036), ('cdata', 0.035), ('majmin', 0.035), ('majorminer', 0.035), ('razvan', 0.035), ('uey', 0.035), ('neurons', 0.034), ('xk', 0.034), ('propagation', 0.034), ('sampling', 0.034), ('biases', 0.034), ('andrew', 0.034), ('trained', 0.033), ('ht', 0.033), ('composite', 0.033), ('item', 0.032), ('loopy', 0.031), ('editors', 0.031), ('hc', 0.03), ('mnih', 0.03), ('pcd', 0.03), ('mccallum', 0.03), ('performances', 0.03), ('lters', 0.03), ('newsgroup', 0.029), ('units', 0.029), ('objectives', 0.029), ('exp', 0.029), ('unlabeled', 0.028), ('xs', 0.028), ('oy', 0.027), ('input', 0.027), ('inputs', 0.027), ('druck', 0.027), ('iro', 0.027), ('lgen', 0.027), ('lhybrid', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
2 0.11573008 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
Author: James Bergstra, Yoshua Bengio
Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efÄ?Ĺš cient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to conÄ?Ĺš gure neural networks and deep belief networks. Compared with neural networks conÄ?Ĺš gured by a pure grid search, we Ä?Ĺš nd that random search over the same domain is able to Ä?Ĺš nd models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search Ä?Ĺš nds better models by effectively searching a larger, less promising conÄ?Ĺš guration space. Compared with deep belief networks conÄ?Ĺš gured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional conÄ?Ĺš guration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for conÄ?Ĺš guring algorithms for new data sets. Our analysis casts some light on why recent “High Throughputâ€? methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword
3 0.083059117 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
Author: Michael U. Gutmann, Aapo Hyvärinen
Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image
4 0.079384588 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle
Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning
5 0.065113582 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
6 0.060279042 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
7 0.057281464 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
8 0.048766226 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
9 0.046756946 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
10 0.046407271 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
11 0.041711543 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
12 0.04114088 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
13 0.041138262 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
14 0.040879849 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
15 0.040178291 97 jmlr-2012-Regularization Techniques for Learning with Matrices
16 0.040178191 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
17 0.039642692 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
18 0.035495289 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
19 0.034951322 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
20 0.034368373 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
topicId topicWeight
[(0, -0.174), (1, -0.062), (2, 0.119), (3, -0.048), (4, 0.066), (5, 0.004), (6, 0.101), (7, 0.049), (8, -0.139), (9, -0.004), (10, 0.028), (11, 0.049), (12, 0.012), (13, 0.067), (14, -0.008), (15, 0.176), (16, 0.041), (17, 0.098), (18, 0.184), (19, 0.042), (20, -0.133), (21, -0.101), (22, 0.292), (23, 0.156), (24, 0.03), (25, -0.029), (26, 0.153), (27, -0.032), (28, -0.103), (29, -0.062), (30, -0.072), (31, -0.1), (32, -0.065), (33, 0.007), (34, 0.006), (35, -0.016), (36, 0.072), (37, -0.121), (38, -0.011), (39, -0.099), (40, -0.09), (41, -0.025), (42, 0.087), (43, 0.089), (44, 0.217), (45, 0.075), (46, 0.125), (47, -0.128), (48, -0.043), (49, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.92400342 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
2 0.62657505 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
3 0.59161413 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
5 0.30741224 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
6 0.27285597 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
7 0.26411587 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
8 0.25842068 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
9 0.24777034 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
10 0.24718952 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
11 0.24150869 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
12 0.225986 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
13 0.21771325 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
14 0.214185 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
15 0.20846786 53 jmlr-2012-Jstacs: A Java Framework for Statistical Analysis and Classification of Biological Sequences
16 0.20825376 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
17 0.2009871 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
18 0.19302507 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
19 0.18740986 3 jmlr-2012-A Geometric Approach to Sample Compression
20 0.18346296 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
topicId topicWeight
[(7, 0.02), (14, 0.382), (21, 0.027), (26, 0.055), (27, 0.022), (29, 0.04), (35, 0.046), (49, 0.02), (56, 0.02), (57, 0.01), (64, 0.017), (69, 0.02), (75, 0.062), (79, 0.011), (81, 0.031), (92, 0.056), (96, 0.08)]
simIndex simValue paperId paperTitle
1 0.83282882 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space
Author: Dominik Schnitzer, Arthur Flexer, Markus Schedl, Gerhard Widmer
Abstract: ‘Hubness’ has recently been identified as a general problem of high dimensional data spaces, manifesting itself in the emergence of objects, so-called hubs, which tend to be among the k nearest neighbors of a large number of data items. As a consequence many nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa. The work presented here discusses two classes of methods that try to symmetrize nearest neighbor relations and investigates to what extent they can mitigate the negative effects of hubs. We evaluate local distance scaling and propose a global variant which has the advantage of being easy to approximate for large data sets and of having a probabilistic interpretation. Both local and global approaches are shown to be effective especially for high-dimensional data sets, which are affected by high hubness. Both methods lead to a strong decrease of hubness in these data sets, while at the same time improving properties like classification accuracy. We evaluate the methods on a large number of public machine learning data sets and synthetic data. Finally we present a real-world application where we are able to achieve significantly higher retrieval quality. Keywords: local and global scaling, shared near neighbors, hubness, classification, curse of dimensionality, nearest neighbor relation
same-paper 2 0.73034942 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio
Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning
3 0.37122658 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle
Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning
4 0.3619228 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel
Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor
5 0.33509046 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
Author: Neil D. Lawrence
Abstract: We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian Markov random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting model, which we call maximum entropy unfolding (MEU) is a nonlinear generalization of principal component analysis. We relate the model to Laplacian eigenmaps and isomap. We show that parameter fitting in the locally linear embedding (LLE) is approximate maximum likelihood MEU. We introduce a variant of LLE that performs maximum likelihood exactly: Acyclic LLE (ALLE). We show that MEU and ALLE are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set. Finally the maximum likelihood perspective allows us to introduce a new approach to dimensionality reduction based on L1 regularization of the Gaussian random field via the graphical lasso.
6 0.33405149 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
7 0.32892105 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
8 0.3288638 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
9 0.32679161 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
10 0.32561868 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
11 0.32396626 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
12 0.32377413 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
13 0.32260025 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
14 0.32215163 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
15 0.32168695 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
16 0.32021374 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
17 0.32003656 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
18 0.31892407 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
19 0.31805903 103 jmlr-2012-Sampling Methods for the Nyström Method
20 0.31738064 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods