jmlr jmlr2013 jmlr2013-115 knowledge-graph by maker-knowledge-mining

115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation


Source: pdf

Author: Philémon Brakel, Dirk Stroobandt, Benjamin Schrauwen

Abstract: Imputing missing values in high dimensional time-series is a difficult problem. This paper presents a strategy for training energy-based graphical models for imputation directly, bypassing difficulties probabilistic approaches would face. The training strategy is inspired by recent work on optimization-based learning (Domke, 2012) and allows complex neural models with convolutional and recurrent structures to be trained for imputation tasks. In this work, we use this training strategy to derive learning rules for three substantially different neural architectures. Inference in these models is done by either truncated gradient descent or variational mean-field iterations. In our experiments, we found that the training methods outperform the Contrastive Divergence learning algorithm. Moreover, the training methods can easily handle missing values in the training data itself during learning. We demonstrate the performance of this learning scheme and the three models we introduce on one artificial and two real-world data sets. Keywords: neural networks, energy-based models, time-series, missing values, optimization

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The training strategy is inspired by recent work on optimization-based learning (Domke, 2012) and allows complex neural models with convolutional and recurrent structures to be trained for imputation tasks. [sent-9, score-0.848]

2 Moreover, the training methods can easily handle missing values in the training data itself during learning. [sent-13, score-0.375]

3 It appears that more complicated models are needed to make good predictions about missing values in high dimensional time-series. [sent-23, score-0.408]

4 Examples of neural networks that are able to deal with temporal sequences are recurrent neural networks. [sent-27, score-0.362]

5 Nonetheless, there has been some work on training neural networks for missing value recovery in a discriminative way. [sent-32, score-0.396]

6 (2007) trained autoencoder neural networks to impute missing values in non-temporal data. [sent-34, score-0.457]

7 Gupta and Lam (1996) trained neural networks for missing value imputation by using some of the input dimensions as input and the remaining ones as output. [sent-37, score-0.555]

8 This requires many neural networks to be trained and limits the available datapoints for each network to those without any missing input dimensions. [sent-38, score-0.451]

9 At first sight, generative models appear to be a more natural approach to deal with missing values. [sent-42, score-0.405]

10 Non-parametric models like the Gaussian process latent variable model (Lawrence, 2003) have also been used to develop models for sequential tasks like synthesizing and imputing human motion capture data. [sent-57, score-0.364]

11 Another tractable non-linear dynamical system is based on a combination of a recurrent neural network and the neural autoregressive distribution estimator (Larochelle and Murray, 2011; Boulanger-Lewandowski et al. [sent-81, score-0.384]

12 We will discuss Dynamical Factor Graphs and the generative recurrent neural network model in more detail in Section 7. [sent-83, score-0.362]

13 Overall, however, discriminative energy-based models allow for a broader class of possible models to be applied to missing value imputation while maintaining tractability. [sent-85, score-0.587]

14 In this paper, we extend their approaches to models for time-series and missing value imputation. [sent-95, score-0.358]

15 We show that models can be trained for imputation directly and that the approach is not limited to gradient based optimization. [sent-97, score-0.392]

16 Furthermore, we also show that quite complex models with recurrent dependencies, which would be very difficult to train as probabilistic models, can be learned this way. [sent-99, score-0.426]

17 The second model is a recurrent neural network that is coupled to a set of hidden variables as well. [sent-101, score-0.474]

18 Since Ω will be sampled from some distribution, the actual objective that is minimized during training is the expectation of the sum squared error under a distribution over the missing values, defined for N_data sequences by O = (1/N_data) ∑_{n=1}^{N_data} ∑_Ω P(Ω) ∑_{(i,j)∈Ω} ( V̂_{(n)ij} − V_{(n)ij} )² / 2. [sent-108, score-0.357]

19 The selection of P(Ω) during training is task dependent and should reflect prior knowledge about the structure of the missing values. [sent-110, score-0.322]
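
The objective above can be made concrete with a small sketch that approximates the expectation over P(Ω) by sampling a few random missing-value masks per sequence and averaging the masked squared error. This is an illustrative Python/JAX reconstruction, not the authors' code: the mask probability, the number of mask samples, and the impute_fn placeholder standing in for the model's inference step are all assumptions.

```python
import jax
import jax.numpy as jnp

def masked_sse(v_true, v_pred, mask):
    # Sum of squared errors restricted to the entries marked missing (mask == 1),
    # matching the inner sum over (i, j) in Omega, with the 1/2 factor.
    return 0.5 * jnp.sum(mask * (v_pred - v_true) ** 2)

def imputation_objective(key, sequences, impute_fn, num_mask_samples=4):
    # Monte-Carlo estimate of the expectation over P(Omega): sample random
    # masks per sequence and average the masked squared error over them.
    total = 0.0
    for v in sequences:                       # v has shape (T, d)
        for _ in range(num_mask_samples):
            key, sub = jax.random.split(key)
            mask = (jax.random.uniform(sub, v.shape) < 0.2).astype(v.dtype)
            v_obs = v * (1.0 - mask)          # hide the "missing" entries
            v_hat = impute_fn(v_obs, mask)    # model fills in the masked entries
            total += masked_sse(v, v_hat, mask)
    return total / len(sequences)
```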

20 Most of those loss functions contain a contrastive term that requires one to identify some specific ‘most offending’ points in the energy landscape that may be difficult to find when the energy landscape is non-convex. [sent-136, score-0.472]

21 This leads to an energy function of the form E(V, H), where H are the hidden variables, which need to be marginalized out to obtain the energy value for V. [sent-139, score-0.411]

22 This summation (or integration) over hidden variables can be greatly simplified by designing models such that the hidden variables are conditionally independent given an observation. [sent-141, score-0.407]

23 The first two models we will describe have a tractable free energy function E(V) and we chose to use gradient descent to optimize it. [sent-146, score-0.429]

24 The third model has no tractable free energy E(V) and we chose to use coordinate descent to optimize a variational bound on the free energy instead. [sent-147, score-0.526]

25 This model has the same energy function as the convolutional RBM (Lee et al. [sent-155, score-0.397]

26 While correlations between the visible units are not directly parametrized, the hidden variables allow these correlations to be modelled implicitly. [sent-160, score-0.369]

27 Fortunately, because the hidden units are binary and independent given the output of the function gconv(·), the total free energy can be calculated analytically and efficiently, and is given by E(V) = ∑_{t=1}^{T} ||v_t − b_v||² / (2σ²) − ∑_{t,j} log( 1 + exp( gconv_j(V,t; W) + b_h ) ). [sent-164, score-0.804]

28 The gradient of the free-energy function with respect to the function value gconv_j(V,t; W) is the negative sigmoid function: ∂E(V) / ∂gconv_j(V,t; W) = −( 1 + exp( −gconv_j(V,t; W) − b_h ) )^{−1}. [sent-167, score-0.331]
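
The free energy and its gradient above translate directly into code. The sketch below is a hedged Python/JAX rendering, not the paper's implementation: g is assumed to hold the outputs of gconv(V, t; W) for every time step and hidden feature, and log(1 + exp(·)) is written with softplus for numerical stability.

```python
import jax
import jax.numpy as jnp

def conv_free_energy(v, g, b_v, b_h, sigma):
    # v: (T, d) visible sequence; g: (T, J) values of gconv_j(V, t; W).
    quadratic = jnp.sum((v - b_v) ** 2) / (2.0 * sigma ** 2)
    hidden_term = jnp.sum(jax.nn.softplus(g + b_h))   # sum_{t,j} log(1 + exp(g + b_h))
    return quadratic - hidden_term

# Autodiff recovers the stated gradient: the derivative of the free energy
# with respect to g is the negative logistic sigmoid of (g + b_h).
grad_wrt_g = jax.grad(conv_free_energy, argnums=1)
# grad_wrt_g(v, g, b_v, b_h, sigma) equals -jax.nn.sigmoid(g + b_h)
```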

29 The wavy circles represent the hidden units, the circles filled with dots the visible units and empty circles represent deterministic functions. [sent-174, score-0.341]

30 , 2008) but in our model the units that define the energy are in a separate layer and the visible variables are not independent given the hidden variables. [sent-190, score-0.541]

31 The energy of the model is defined by the following equation: E(V, H) = ∑_{t=1}^{T} [ ||v_t − b_v||² / (2σ²) − h_{t−1}^T W h_t − h_t^T A v_t − h_t^T b_h ], where h_0 is defined to be 0. [sent-200, score-0.322]

32 Note that this energy function is very similar to Equation 3, but the convolution has been replaced with a matrix multiplication and there is an additional term that parametrizes correlations between hidden units at adjacent time steps. [sent-201, score-0.422]
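
As an illustration of the energy just given, the sketch below evaluates E(V, H) over a whole sequence. It is a Python/JAX sketch under assumed shapes (W of size n_h x n_h, A of size n_h x d), not the authors' code.

```python
import jax.numpy as jnp

def rebm_energy(v, h, W, A, b_v, b_h, sigma):
    # v: (T, d) visible sequence; h: (T, n_h) hidden variables; h_0 is taken as 0.
    h_prev = jnp.concatenate([jnp.zeros((1, h.shape[1])), h[:-1]], axis=0)
    quadratic = jnp.sum((v - b_v) ** 2, axis=1) / (2.0 * sigma ** 2)  # (T,)
    temporal = jnp.einsum('ti,ij,tj->t', h_prev, W, h)                # h_{t-1}^T W h_t
    visible = jnp.einsum('tj,ji,ti->t', h, A, v)                      # h_t^T A v_t
    bias = h @ b_h                                                    # h_t^T b_h
    return jnp.sum(quadratic - temporal - visible - bias)
```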

33 Inference: Since the hidden units of the DTBM are not independent of each other, we are not able to use the free energy formulation from Equation 4. [sent-208, score-0.454]

34 Even when values of the visible units are all known, inference for the hidden units is still intractable. [sent-209, score-0.562]

35 Optimizing this bound will lead to values of the variational parameters that approach a mode of the distribution, in a similar way to a minimization of the free energy by means of gradient descent. [sent-232, score-0.378]

36 Ideally, the variational parameters for the hidden units should be updated in an alternating fashion. [sent-235, score-0.344]

37 The odd units will be mutually independent given the visible variables and the even hidden variables and vice versa. [sent-236, score-0.397]
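
A hedged sketch of this alternating update scheme is shown below, assuming chain-coupled binary hidden units with couplings of the same form as the energy shown earlier (W between adjacent time steps, A to the visibles). The exact local fields of the DTBM may differ, so this should be read as a generic mean-field example rather than the paper's algorithm.

```python
import jax
import jax.numpy as jnp

def mean_field_sweep(mu, v, W, A, b_h):
    # mu: (T, n_h) variational parameters; one coordinate-ascent sweep that
    # updates odd time steps first and then even time steps, each group being
    # conditionally independent given the other group and the visibles.
    def local_field(mu):
        mu_prev = jnp.concatenate([jnp.zeros((1, mu.shape[1])), mu[:-1]], axis=0)
        mu_next = jnp.concatenate([mu[1:], jnp.zeros((1, mu.shape[1]))], axis=0)
        return mu_prev @ W + mu_next @ W.T + v @ A.T + b_h

    odd = (jnp.arange(mu.shape[0]) % 2 == 1)[:, None]
    mu = jnp.where(odd, jax.nn.sigmoid(local_field(mu)), mu)    # odd steps
    mu = jnp.where(~odd, jax.nn.sigmoid(local_field(mu)), mu)   # even steps
    return mu
```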

38 The loss gradients of both the models that use gradient descent inference can be computed in a similar way. [sent-243, score-0.362]

39 Backpropagation Through Gradient Descent: To train the models that use gradient descent inference (i.e. [sent-247, score-0.413]

40 , the convolutional and recurrent models), we backpropagated loss gradients through the gradient descent steps like in Domke (2011). [sent-249, score-0.655]

41 Given an input pattern and a set of indices that point to the missing values, a prediction was first obtained by doing K steps of gradient descent with step size λ on the free energy. [sent-250, score-0.445]

42 Note that this procedure is similar to the backpropagation through time procedure for recurrent neural networks. [sent-252, score-0.328]

43 The gradient with respect to the parameters was used to train the models with stochastic gradient descent. [sent-253, score-0.359]
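
The procedure described in the last few sentences, K truncated gradient-descent steps on the free energy followed by backpropagation through those steps, can be sketched with an autodiff framework. The code below is an illustrative Python/JAX reconstruction, not the authors' implementation: the free_energy signature, the step size lam, and K are placeholders, and only the entries flagged as missing are updated during inference.

```python
import jax
import jax.numpy as jnp

def impute_by_gradient_descent(params, v_obs, mask, free_energy, K=5, lam=0.1):
    # Truncated inference: K gradient-descent steps on the free energy,
    # updating only the missing entries (mask == 1).
    dE_dv = jax.grad(lambda v: free_energy(params, v))
    v = v_obs
    for _ in range(K):
        v = v - lam * mask * dE_dv(v)
    return v

def training_loss(params, v_true, v_obs, mask, free_energy):
    # Squared error on the missing entries after truncated inference.
    v_hat = impute_by_gradient_descent(params, v_obs, mask, free_energy)
    return 0.5 * jnp.sum(mask * (v_hat - v_true) ** 2)

# Differentiating the loss with respect to the parameters backpropagates
# through the K unrolled inference steps, as in backpropagation through time.
loss_grad = jax.grad(training_loss, argnums=0)
```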

44 The numbers represent the separate iterations at which both the hidden units and the unknown visible units are updated. [sent-286, score-0.529]

45 While the number of gradient descent steps for the convolutional model and the recurrent neural network model can be very low, the number of mean-field iterations has a more profound influence on the behavior of the model. [sent-288, score-0.753]

46 The last experiment also investigated the robustness of the models when there were not only missing values in the test set but also in the train data. [sent-300, score-0.468]

47 This model is quite similar to the REBM as it also employs separate sets of deterministic recurrent units and stochastic hidden units. [sent-305, score-0.541]

48 This makes it possible to use Contrastive Divergence learning but also renders the model unable to incorporate future information during inference because the information from the recurrent units is considered to be fixed. [sent-308, score-0.48]

49 After this, the models were trained again on both the train 2. [sent-333, score-0.325]

50 The REBM had 200 recurrent units and we used a step size of . [sent-352, score-0.378]

51 For this experiment, the Energy-Based Models that required missing values during training were provided with missing values from the same distribution that was used to select them for evaluation. [sent-373, score-0.591]

52 This makes motion capture reconstruction an interesting task for evaluating more complex models for missing value imputation. [sent-422, score-0.486]

53 , 2007), a Conditional Restricted Boltzmann Machine (CRBM) was trained to impute missing values in motion capture as well. [sent-424, score-0.554]

54 Finally, the RNN-RBM had 200 recurrent units and was trained with 5 CD iterations. [sent-448, score-0.504]

55 Hidden units: 200, 50, 200, 200, 200, 100. Table 3: Parameter settings for training the models on the motion capture data. [sent-462, score-0.421]

56 The CRBM was used in a generative way by conditioning it on the samples it generated at the previous time steps, clamping the observed values and only sampling those that were missing, as was done in the work by Taylor et al. [sent-479, score-0.316]

57 The convolutional and recurrent models clearly outperform the CRBM and nearest neighbour interpolation on the reconstruction of the left leg. [sent-485, score-0.541]

58 For this sequence the markers of the left leg were missing in the region between the vertical striped lines. [sent-543, score-0.367]

59 when reconstructing the markers of the missing leg. [sent-544, score-0.357]

60 Missing Training Data: So far, all experiments were done by training on data without actual missing values; values were only truly unknown during testing. [sent-549, score-0.322]

61 In practice, a useful model for missing value imputation should also be able to deal with actual missing values in the train set. [sent-550, score-0.777]

62 For generative models, missing values in the train data shouldn’t pose a problem because they can be marginalized out. [sent-551, score-0.426]

63 For the models we proposed in this paper, missing values in the train data are easily dealt with. [sent-553, score-0.468]

64 Hidden units: 300, 100, 300, 300, 50. Table 5: Parameter settings for training the models on the robot data. [sent-561, score-0.327]

65 A very similar method has been used in earlier work to train neural networks for classification when missing values are present (Bengio and Gingras, 1996). [sent-564, score-0.41]

66 To see how well our models deal with missing training data, we conducted an additional series of experiments. [sent-565, score-0.411]

67 1 DATA The data we used to investigate the effect of missing training data consists of the measurements of the 24 ultrasound sensors of a SCITOS G5 robot navigating a room (Freire et al. [sent-568, score-0.356]

68 Training: In the first experiment, we trained the models on fully intact training data to get an estimate of the optimal performance the models could achieve on it. [sent-574, score-0.357]

69 To train the models, we selected random batches of 100 frames from the training data and marked an additional set of variables as missing that were not already truly missing in the data. [sent-578, score-0.819]

70 This way, the models never had access to the values that were labelled as missing by the training data mask. [sent-579, score-0.44]

71 In both experiments, the number of dimensions that we pretended to be missing in order to train the models was uniformly sampled from {1, . [sent-581, score-0.5]
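
The masking scheme described here can be sketched as follows. The helper below is illustrative (the per-batch dimension selection and the max_extra bound are assumptions), but it captures the key point: entries pretended to be missing never overlap entries that are truly missing in the training data, so the loss is only ever computed on known values.

```python
import numpy as np

def sample_training_masks(rng, truly_missing, max_extra=10):
    # truly_missing: boolean (T, d) mask of entries genuinely absent from the batch.
    d = truly_missing.shape[1]
    n_extra = rng.integers(1, max_extra + 1)           # uniform in {1, ..., max_extra}
    extra_dims = rng.choice(d, size=n_extra, replace=False)
    pretend = np.zeros_like(truly_missing)
    pretend[:, extra_dims] = True
    hidden_from_model = truly_missing | pretend        # hidden during inference
    scored_in_loss = pretend & ~truly_missing          # scored only on known values
    return hidden_from_model, scored_in_loss
```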

72 The RNN-RBM had 250 recurrent units and was trained with 5 iterations of Contrastive Divergence. [sent-593, score-0.541]

73 We used these settings for all the experiments, regardless of the number of missing training dimensions. [sent-595, score-0.322]

74 Table 6: Results in mean squared error on the wall robot data without missing dimensions during training. [sent-607, score-0.335]

75 Figure 5: Mean square error as a function of the number of missing dimensions in the train data. [sent-608, score-0.411]

76 Results: Table 6 shows the results for the experiment without missing training data. [sent-611, score-0.379]

77 Figure 5 shows the error scores for the three energy-based models as a function of the number of missing input dimensions. [sent-618, score-0.358]

78 The CEBM has more trouble with missing values and no longer learns to reconstruct the data well once about 10 dimensions are missing. [sent-621, score-0.328]

79 This indicates that our training strategy is more suited for missing value imputation than the Contrastive Divergence algorithm. [sent-626, score-0.419]

80 While this comparison is less valid because the two models are not entirely the same, it still suggests that our training method also works better for recurrent architectures. [sent-628, score-0.369]

81 We actually suspect that, compared to other approximate maximum likelihood methods, Contrastive Divergence is still one of the best candidates for missing value imputation because it optimizes the energy landscape locally by pushing up wrong predictions that are near the data itself. [sent-633, score-0.56]

82 When the number of missing values is not too large, inference would start near one of the regions in which the energy landscape has a shape that promotes good predictions. [sent-634, score-0.507]

83 When the number of missing values is too high however, the inference algorithm might start in a region that was not explicitly shaped by the learning algorithm because it was too far away from the data. [sent-635, score-0.339]

84 This might explain the bad performance of the Convolutional RBM when the whole upper body was missing in the motion capture task. [sent-636, score-0.397]

85 Except for the CEBM, our models also have a far more complicated structure and we think that the discriminative training methods are not just more computationally efficient but actually allow us to train models that would otherwise be very difficult to train at all. [sent-640, score-0.494]

86 By constraining the latent states of a DFG to operate under Gaussian noise with fixed covariance, the partition function of the model becomes constant so that a minimization of the energy for a certain state will automatically push up the energy values of all other possible states. [sent-645, score-0.338]

87 Inference for missing values in DFGs is done with gradient descent to find the minimum energy latent state sequence, ignoring the missing values in the gradient computations. [sent-647, score-0.928]

88 • In the CEBM and REBM, energy minimization only takes place with respect to the missing values and not with respect to the latent variables which are marginalized out analytically. [sent-651, score-0.463]

89 The most important difference is that our models only focus on recovery of the missing values given the observed variables and not on reconstructing both. [sent-663, score-0.412]

90 The way the DTBM model dealt with missing training data in Section 6. [sent-664, score-0.354]

91 In this method, missing values are also filled in by updating them as if they are part of a recurrent neural network. [sent-666, score-0.527]

92 An important difference with our work is that in the work by Bengio and Gingras (1996) the goal was not to predict the missing values themselves, but to perform better on classification tasks when missing values in the inputs are present. [sent-667, score-0.538]

93 It would probably not have been feasible to train this model with a more generic approach in which the energy optimization would have to be executed until convergence for every training sample during training. [sent-674, score-0.335]

94 As more hidden units were used, these models became more difficult to optimize and this explains why the optimal number of hidden units was generally lower than for the other models. [sent-684, score-0.653]

95 The gradients of recurrent neural networks are known to be prone to exponential growth or decay and a single bad gradient can lead to a divergence of the learning algorithm. [sent-686, score-0.43]

96 As the performance of parallel computing hardware increases, the computational time required to model dependencies that are as long as the sequence itself might become more comparable to simulating a regular recurrent neural network. [sent-692, score-0.324]

97 Conclusion: We presented a strategy for training models for missing value imputation in high-dimensional time-series. [sent-695, score-0.532]

98 The three models we proposed showed promising performance on concatenated digits inpainting, missing marker restoration for motion capture and imputation of values for robot sensors. [sent-696, score-0.673]

99 Our training methods appear to be more suitable for missing value imputation than Contrastive Divergence learning, given similar model architectures. [sent-697, score-0.451]

100 Furthermore, the models could also handle missing values in the training data itself and seem to be relatively robust to these corruptions. [sent-698, score-0.411]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cebm', 0.335), ('rebm', 0.335), ('rbm', 0.29), ('missing', 0.269), ('dtbm', 0.253), ('recurrent', 0.227), ('convolutional', 0.225), ('units', 0.151), ('energy', 0.14), ('contrastive', 0.136), ('boltzmann', 0.135), ('hidden', 0.131), ('vk', 0.126), ('trained', 0.126), ('chrauwen', 0.118), ('mputation', 0.118), ('troobandt', 0.118), ('train', 0.11), ('rakel', 0.101), ('crbm', 0.1), ('gconv', 0.1), ('imputation', 0.097), ('eries', 0.091), ('models', 0.089), ('ime', 0.084), ('gradient', 0.08), ('motion', 0.08), ('domke', 0.072), ('eld', 0.072), ('inference', 0.07), ('hyper', 0.07), ('backpropagation', 0.07), ('odels', 0.064), ('hinton', 0.064), ('descent', 0.064), ('variational', 0.062), ('markers', 0.062), ('gradients', 0.059), ('jt', 0.059), ('visible', 0.059), ('training', 0.053), ('vt', 0.053), ('bh', 0.051), ('lecun', 0.051), ('reconstructions', 0.049), ('capture', 0.048), ('generative', 0.047), ('dynamical', 0.046), ('bv', 0.046), ('brakel', 0.045), ('dfgs', 0.045), ('larochelle', 0.045), ('discriminative', 0.043), ('temporal', 0.038), ('iterations', 0.037), ('deep', 0.037), ('htt', 0.036), ('leg', 0.036), ('nade', 0.036), ('sequences', 0.035), ('rnn', 0.035), ('dependencies', 0.034), ('fields', 0.034), ('robot', 0.034), ('bengio', 0.034), ('divergence', 0.033), ('frames', 0.033), ('dimensions', 0.032), ('model', 0.032), ('free', 0.032), ('impute', 0.031), ('inpainting', 0.031), ('graphical', 0.031), ('neural', 0.031), ('cd', 0.031), ('digits', 0.029), ('labelled', 0.029), ('differentiation', 0.028), ('variables', 0.028), ('landscape', 0.028), ('tanh', 0.028), ('barbu', 0.027), ('desjardins', 0.027), ('gingras', 0.027), ('marker', 0.027), ('mirowski', 0.027), ('stoyanov', 0.027), ('stroobandt', 0.027), ('sutskever', 0.027), ('tieleman', 0.027), ('trainable', 0.027), ('trouble', 0.027), ('salakhutdinov', 0.027), ('latent', 0.026), ('predictions', 0.026), ('reconstructing', 0.026), ('network', 0.025), ('dimensional', 0.024), ('tractable', 0.024), ('consisted', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

Author: Philémon Brakel, Dirk Stroobandt, Benjamin Schrauwen

Abstract: Imputing missing values in high dimensional time-series is a difficult problem. This paper presents a strategy for training energy-based graphical models for imputation directly, bypassing difficulties probabilistic approaches would face. The training strategy is inspired by recent work on optimization-based learning (Domke, 2012) and allows complex neural models with convolutional and recurrent structures to be trained for imputation tasks. In this work, we use this training strategy to derive learning rules for three substantially different neural architectures. Inference in these models is done by either truncated gradient descent or variational mean-field iterations. In our experiments, we found that the training methods outperform the Contrastive Divergence learning algorithm. Moreover, the training methods can easily handle missing values in the training data itself during learning. We demonstrate the performance of this learning scheme and the three models we introduce on one artificial and two real-world data sets. Keywords: neural networks, energy-based models, time-series, missing values, optimization

2 0.10293677 108 jmlr-2013-Stochastic Variational Inference

Author: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley

Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets. Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics

3 0.082020067 121 jmlr-2013-Variational Inference in Nonconjugate Models

Author: Chong Wang, David M. Blei

Abstract: Mean-field variational methods are widely used for approximate posterior inference in many probabilistic models. In a typical application, mean-field methods approximately compute the posterior with a coordinate-ascent optimization algorithm. When the model is conditionally conjugate, the coordinate updates are easily derived and in closed form. However, many models of interest—like the correlated topic model and Bayesian logistic regression—are nonconjugate. In these models, mean-field methods cannot be directly applied and practitioners have had to develop variational algorithms on a case-by-case basis. In this paper, we develop two generic methods for nonconjugate models, Laplace variational inference and delta method variational inference. Our methods have several advantages: they allow for easily derived variational algorithms with a wide class of nonconjugate models; they extend and unify some of the existing algorithms that have been derived for specific models; and they work well on real-world data sets. We studied our methods on the correlated topic model, Bayesian logistic regression, and hierarchical Bayesian logistic regression. Keywords: variational inference, nonconjugate models, Laplace approximations, the multivariate delta method

4 0.063755132 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone

Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction

5 0.06093419 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju

Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models

6 0.055464033 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models

7 0.055096608 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

8 0.053285867 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris

9 0.049253386 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

10 0.046337757 120 jmlr-2013-Variational Algorithms for Marginal MAP

11 0.043237671 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

12 0.042128608 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

13 0.039270345 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

14 0.038705766 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models

15 0.037857197 15 jmlr-2013-Bayesian Canonical Correlation Analysis

16 0.03672317 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

17 0.035872612 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

18 0.03549229 22 jmlr-2013-Classifying With Confidence From Incomplete Information

19 0.034248475 86 jmlr-2013-Parallel Vector Field Embedding

20 0.033684902 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.181), (1, -0.137), (2, -0.095), (3, -0.002), (4, -0.101), (5, 0.038), (6, 0.065), (7, 0.014), (8, 0.032), (9, -0.011), (10, 0.026), (11, 0.035), (12, -0.036), (13, -0.059), (14, 0.016), (15, -0.048), (16, 0.052), (17, 0.062), (18, -0.074), (19, 0.034), (20, 0.031), (21, -0.047), (22, -0.077), (23, 0.022), (24, -0.092), (25, -0.157), (26, -0.044), (27, -0.024), (28, 0.11), (29, 0.035), (30, -0.016), (31, -0.029), (32, -0.146), (33, 0.077), (34, 0.173), (35, -0.026), (36, 0.147), (37, -0.073), (38, 0.114), (39, -0.209), (40, 0.149), (41, 0.072), (42, -0.052), (43, -0.049), (44, 0.168), (45, -0.008), (46, -0.052), (47, 0.159), (48, 0.084), (49, -0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93822551 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

Author: Philémon Brakel, Dirk Stroobandt, Benjamin Schrauwen

Abstract: Imputing missing values in high dimensional time-series is a difficult problem. This paper presents a strategy for training energy-based graphical models for imputation directly, bypassing difficulties probabilistic approaches would face. The training strategy is inspired by recent work on optimization-based learning (Domke, 2012) and allows complex neural models with convolutional and recurrent structures to be trained for imputation tasks. In this work, we use this training strategy to derive learning rules for three substantially different neural architectures. Inference in these models is done by either truncated gradient descent or variational mean-field iterations. In our experiments, we found that the training methods outperform the Contrastive Divergence learning algorithm. Moreover, the training methods can easily handle missing values in the training data itself during learning. We demonstrate the performance of this learning scheme and the three models we introduce on one artificial and two real-world data sets. Keywords: neural networks, energy-based models, time-series, missing values, optimization

2 0.45105878 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

Author: Lisha Chen, Andreas Buja

Abstract: Multidimensional scaling (MDS) is the art of reconstructing pointsets (embeddings) from pairwise distance data, and as such it is at the basis of several approaches to nonlinear dimension reduction and manifold learning. At present, MDS lacks a unifying methodology as it consists of a discrete collection of proposals that differ in their optimization criteria, called “stress functions”. To correct this situation we propose (1) to embed many of the extant stress functions in a parametric family of stress functions, and (2) to replace the ad hoc choice among discrete proposals with a principled parameter selection method. This methodology yields the following benefits and problem solutions: (a) It provides guidance in tailoring stress functions to a given data situation, responding to the fact that no single stress function dominates all others across all data situations; (b) the methodology enriches the supply of available stress functions; (c) it helps our understanding of stress functions by replacing the comparison of discrete proposals with a characterization of the effect of parameters on embeddings; (d) it builds a bridge to graph drawing, which is the related but not identical art of constructing embeddings from graphs. Keywords: multidimensional scaling, force-directed layout, cluster analysis, clustering strength, unsupervised learning, Box-Cox transformations

3 0.38569728 15 jmlr-2013-Bayesian Canonical Correlation Analysis

Author: Arto Klami, Seppo Virtanen, Samuel Kaski

Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation

4 0.37662593 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

Author: Anastasios Roussos, Stavros Theodorakis, Vassilis Pitsikalis, Petros Maragos

Abstract: We propose the novel approach of dynamic affine-invariant shape-appearance model (Aff-SAM) and employ it for handshape classification and sign recognition in sign language (SL) videos. AffSAM offers a compact and descriptive representation of hand configurations as well as regularized model-fitting, assisting hand tracking and extracting handshape features. We construct SA images representing the hand’s shape and appearance without landmark points. We model the variation of the images by linear combinations of eigenimages followed by affine transformations, accounting for 3D hand pose changes and improving model’s compactness. We also incorporate static and dynamic handshape priors, offering robustness in occlusions, which occur often in signing. The approach includes an affine signer adaptation component at the visual level, without requiring training from scratch a new singer-specific model. We rather employ a short development data set to adapt the models for a new signer. Experiments on the Boston-University-400 continuous SL corpus demonstrate improvements on handshape classification when compared to other feature extraction approaches. Supplementary evaluations of sign recognition experiments, are conducted on a multi-signer, 100-sign data set, from the Greek sign language lemmas corpus. These explore the fusion with movement cues as well as signer adaptation of Aff-SAM to multiple signers providing promising results. Keywords: affine-invariant shape-appearance model, landmarks-free shape representation, static and dynamic priors, feature extraction, handshape classification

5 0.34854329 108 jmlr-2013-Stochastic Variational Inference

Author: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley

Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets. Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics

6 0.33988777 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models

7 0.33927006 22 jmlr-2013-Classifying With Confidence From Incomplete Information

8 0.33492628 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

9 0.32919693 82 jmlr-2013-Optimally Fuzzy Temporal Memory

10 0.32649976 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

11 0.31266183 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris

12 0.30967873 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

13 0.30764782 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization

14 0.3035478 121 jmlr-2013-Variational Inference in Nonconjugate Models

15 0.28528956 106 jmlr-2013-Stationary-Sparse Causality Network Learning

16 0.28176591 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

17 0.26915869 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

18 0.26402444 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

19 0.26398033 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

20 0.26244208 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.021), (5, 0.104), (6, 0.046), (10, 0.064), (20, 0.023), (23, 0.048), (44, 0.01), (68, 0.019), (70, 0.013), (75, 0.5), (85, 0.012), (87, 0.015), (93, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99014306 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes

Author: Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen, Aki Vehtari

Abstract: The GPstuff toolbox is a versatile collection of Gaussian process models and computational tools required for Bayesian inference. The tools include, among others, various inference methods, sparse approximations and model assessment methods. Keywords: Gaussian process, Bayesian hierarchical model, nonparametric Bayes

2 0.96681666 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

Author: Lisha Chen, Andreas Buja

Abstract: Multidimensional scaling (MDS) is the art of reconstructing pointsets (embeddings) from pairwise distance data, and as such it is at the basis of several approaches to nonlinear dimension reduction and manifold learning. At present, MDS lacks a unifying methodology as it consists of a discrete collection of proposals that differ in their optimization criteria, called “stress functions”. To correct this situation we propose (1) to embed many of the extant stress functions in a parametric family of stress functions, and (2) to replace the ad hoc choice among discrete proposals with a principled parameter selection method. This methodology yields the following benefits and problem solutions: (a) It provides guidance in tailoring stress functions to a given data situation, responding to the fact that no single stress function dominates all others across all data situations; (b) the methodology enriches the supply of available stress functions; (c) it helps our understanding of stress functions by replacing the comparison of discrete proposals with a characterization of the effect of parameters on embeddings; (d) it builds a bridge to graph drawing, which is the related but not identical art of constructing embeddings from graphs. Keywords: multidimensional scaling, force-directed layout, cluster analysis, clustering strength, unsupervised learning, Box-Cox transformations

same-paper 3 0.91331416 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

Author: Philémon Brakel, Dirk Stroobandt, Benjamin Schrauwen

Abstract: Imputing missing values in high dimensional time-series is a difficult problem. This paper presents a strategy for training energy-based graphical models for imputation directly, bypassing difficulties probabilistic approaches would face. The training strategy is inspired by recent work on optimization-based learning (Domke, 2012) and allows complex neural models with convolutional and recurrent structures to be trained for imputation tasks. In this work, we use this training strategy to derive learning rules for three substantially different neural architectures. Inference in these models is done by either truncated gradient descent or variational mean-field iterations. In our experiments, we found that the training methods outperform the Contrastive Divergence learning algorithm. Moreover, the training methods can easily handle missing values in the training data itself during learning. We demonstrate the performance of this learning scheme and the three models we introduce on one artificial and two real-world data sets. Keywords: neural networks, energy-based models, time-series, missing values, optimization

4 0.88529724 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

Author: Wei Pan, Xiaotong Shen, Binghui Liu

Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)

5 0.83441526 21 jmlr-2013-Classifier Selection using the Predicate Depth

Author: Ran Gilad-Bachrach, Christopher J.C. Burges

Abstract: Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic. We present an alternative approach by defining a median of the distribution, which we show is both more robust, and has good generalization guarantees. We present algorithms to approximate this median. One contribution of this work is an efficient method for approximating the Tukey median. The Tukey median, which is often used for data visualization and outlier detection, is a special case of the family of medians we define: however, computing it exactly is exponentially slow in the dimension. Our algorithm approximates such medians in polynomial time while making weaker assumptions than those required by previous work. Keywords: classification, estimation, median, Tukey depth

6 0.68019611 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

7 0.61212736 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

8 0.58480334 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

9 0.56323391 120 jmlr-2013-Variational Algorithms for Marginal MAP

10 0.52924043 86 jmlr-2013-Parallel Vector Field Embedding

11 0.52362007 108 jmlr-2013-Stochastic Variational Inference

12 0.52074116 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

13 0.51114029 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

14 0.50142235 32 jmlr-2013-Differential Privacy for Functions and Functional Data

15 0.49439859 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

16 0.4908708 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

17 0.49016809 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

18 0.4854345 59 jmlr-2013-Large-scale SVD and Manifold Learning

19 0.48368782 22 jmlr-2013-Classifying With Confidence From Incomplete Information

20 0.48246974 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features