jmlr jmlr2011 jmlr2011-96 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture
Reference: text
sentIndex sentText sentNum sentScore
1 The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. [sent-12, score-0.894]
2 We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. [sent-14, score-0.32]
3 The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. [sent-17, score-0.801]
4 Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture 1. [sent-22, score-0.39]
5 Introduction The simplest time series models, and the earliest studied, contain no hidden variables. [sent-23, score-0.192]
6 More powerful models, such as the popular hidden Markov model (HMM), introduce a hidden (or latent) state variable that controls the dependence of the current observation on the history of observations. [sent-34, score-0.384]
7 To model N bits of information about the past, they require 2^N hidden states. [sent-36, score-0.192]
8 Linear dynamical systems (LDS) have a continuous, and therefore componential, hidden state, but in order to make inference in these models tractable, the relationship between latent and visible variables is constrained to be linear. [sent-45, score-0.732]
9 However, the RBM has an efficient, approximate learning algorithm, contrastive divergence (CD) (Hinton, 2002), that has been shown to scale well to large problems. [sent-54, score-0.145]
10 A resurgence in the study of deep architectures has been sparked by the discovery that deep networks can be initialized by greedy unsupervised learning of each layer (Hinton et al. [sent-72, score-0.201]
11 Markerless systems permit motion capture in more natural environments. [sent-82, score-0.382]
12 Recent advances in motion capture technology have fueled interest in the analysis and synthesis of motion data for computer animation and tracking. [sent-85, score-0.676]
13 Several large motion capture data repositories are available, and people are very good at detecting anomalies in data that is generated from a model, so it is easy to judge the relative generative ability of two models. [sent-87, score-0.356]
14 In the following discussion, we briefly review related work in mocap-driven motion synthesis. [sent-90, score-0.259]
15 A variety of methods have been developed to exploit the plethora of high-quality motion sequences available for animation. [sent-95, score-0.259]
16 1 Concatenation Methods: Perhaps the simplest way to generate new motion sequences based on data is to sensibly concatenate short examples from a motion database to meet sparse user-specified constraints (Tanco and Hilton, 2000; Arikan and Forsyth, 2002; Kovar et al.). [sent-99, score-0.518]
17 The obvious benefit of concatenation approaches is the high-quality motion that is produced. [sent-104, score-0.259]
18 However, the “synthesized” motions are restricted to content already in the database, and therefore many resources must be devoted to capturing all desired content. [sent-105, score-0.17]
19 3 Transforming Existing Motion: Another method is to transform motion in the training data to new sequences by learning to adjust its style or other characteristics (Urtasun et al.). [sent-115, score-0.38]
20 Such approaches have produced impressive results given user-supplied motion content but we seek more powerful methods that can synthesize both style and content. [sent-119, score-0.419]
21 Physics-based approaches, for example, estimate physical parameters from motion capture data (Liu et al.). [sent-123, score-0.32]
22 5 Generative Models: Data from modern motion capture systems is high-dimensional and contains complex nonlinear relationships among the components of each observation, which is typically a series of joint angles with respect to some skeletal structure. [sent-129, score-0.363]
23 Brand and Hertzmann (2000) model style and content of human motion with hidden Markov models (HMMs) whose emission distributions depend on stylistic parameters learned directly from the data. [sent-132, score-0.562]
24 Their approach permits sampling of novel sequences from the model and applying new styles to existing content. [sent-133, score-0.146]
25 This problem has been addressed by applying piecewise-linear models to synthesize motion (Pavlovic et al.). [sent-136, score-0.308]
26 This model has been shown to discover interesting structure in motion data and permit synthesis of simple actions. [sent-145, score-0.377]
27 Conditional Restricted Boltzmann Machines We have emphasized that models with distributed hidden state are necessary for efficiently modeling complex time series. [sent-152, score-0.192]
28 But using distributed representations for hidden state in directed models of time series (Bayes nets) makes inference difficult in all but the simplest models (HMMs and linear dynamical systems). [sent-153, score-0.324]
29 In this section, we first review the RBM and then propose a simple extension to capture temporal dependencies yet maintain its most important computational properties: simple, exact inference and efficient approximate learning using the contrastive divergence algorithm. [sent-155, score-0.255]
30 It has a layer of visible units fully connected to a layer of hidden units but no connections within a layer. [sent-158, score-1.54]
31 This bi-partite structure ensures that the hidden units are conditionally independent given a setting of the visible units and vice-versa. [sent-159, score-1.341]
32 To make the distinction between visible and hidden units clear, we use v_i to denote the state of visible unit i and h_j to denote the state of hidden unit j. [sent-161, score-1.702]
33 We also distinguish biases on the visible units, a_i, from biases on the hidden units, b_j. [sent-162, score-0.655]
34 The RBM assigns a probability to any joint setting of the visible units, v, and hidden units, h: p(v, h) = exp(−E(v, h)) / Z, (1) where Z is a normalization constant called the partition function, whose name comes from statistical physics. [sent-163, score-0.663]
35 Figure 1: a) A Boltzmann machine. b) A Boltzmann machine partitioned into visible (shaded) and hidden units. [sent-164, score-0.573]
36 When both the visible and the hidden units are binary with states 1 and 0, the energy function is E(v, h) = −∑_{ij} W_{ij} v_i h_j − ∑_i a_i v_i − ∑_j b_j h_j. [sent-167, score-1.202]
37 Here Z = ∑_{v′,h′} exp(−E(v′, h′)) sums over all joint configurations. Marginalizing over the hidden units in Equation 1 and maximizing the likelihood leads to a very simple maximum likelihood weight update rule: ∆W_{ij} ∝ ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model. (2) [sent-169, score-0.756]
38 Because of the conditional independence properties of the RBM, we can easily obtain an unbiased sample of ⟨v_i h_j⟩_data by clamping the visible units to a vector in the training data set, and sampling the hidden units in parallel according to p(h_j = 1 | v) = 1 / (1 + exp(−b_j − ∑_i W_{ij} v_i)). (3) [sent-171, score-1.510]
39 This is repeated for each vector in a representative “mini-batch” from the training set to obtain an empirical estimate of ⟨v_i h_j⟩_data. Computing ⟨v_i h_j⟩_model requires us to obtain unbiased samples from the joint distribution p(v, h). [sent-172, score-0.314]
40 Empirical evidence suggests that rather than running the Gibbs sampler to convergence, learning works well if we replace Equation 2 with ∆W_{ij} ∝ ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_recon, (5) where the second expectation is with respect to the distribution of “reconstructed” data. [sent-176, score-0.229]
41 The reconstruction is obtained by starting with a data vector on the visible units and alternating K times between sampling all of the hidden units using Equation 3 and all of the visible units using Equation 4. [sent-177, score-2.141]
42 The learning rules for the biases are just simplified versions of Equation 5: ∆a_i ∝ ⟨v_i⟩_data − ⟨v_i⟩_recon, ∆b_j ∝ ⟨h_j⟩_data − ⟨h_j⟩_recon. (6) [sent-178, score-0.319]
43 We use the notation CD-K to denote contrastive divergence using K full steps of alternating Gibbs sampling after first inferring the states of the hidden units for a data vector from the training set. [sent-180, score-0.8]
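To make the CD-K procedure concrete, here is a minimal sketch of one update for a binary RBM following Equations 3-6. It is written in plain NumPy with illustrative dimensions and learning rate; the function and variable names are ours, not those of the paper's released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v_data, W, a, b, K=1, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-K update for a binary RBM on a mini-batch v_data of shape (batch, V)."""
    # Positive phase: p(h=1|v) for the data (Equation 3).
    h_prob_data = sigmoid(v_data @ W + b)
    pos_stats = v_data.T @ h_prob_data

    # K steps of alternating Gibbs sampling starting from the data.
    h_sample = (rng.random(h_prob_data.shape) < h_prob_data).astype(float)
    for _ in range(K):
        v_prob = sigmoid(h_sample @ W.T + a)            # p(v=1|h) (Equation 4)
        v_sample = (rng.random(v_prob.shape) < v_prob).astype(float)
        h_prob = sigmoid(v_sample @ W + b)              # p(h=1|v) for the reconstruction
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)

    # Negative ("reconstruction") statistics and parameter updates (Equations 5-6).
    neg_stats = v_sample.T @ h_prob
    W += lr * (pos_stats - neg_stats) / v_data.shape[0]
    a += lr * (v_data - v_sample).mean(axis=0)
    b += lr * (h_prob_data - h_prob).mean(axis=0)
    return W, a, b

# Toy usage with random binary data.
rng = np.random.default_rng(0)
V, H = 20, 10
W = 0.01 * rng.standard_normal((V, H)); a = np.zeros(V); b = np.zeros(H)
batch = (rng.random((100, V)) < 0.5).astype(float)
W, a, b = cd_k_update(batch, W, a, b, K=1)
```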
44 2 RBMs with Real-Valued Observations Typically, RBMs use stochastic binary units for both the visible data and hidden variables, but for many applications the observed data is non-binary. [sent-183, score-0.957]
45 When the data is close to binary (e.g., modeling handwritten digits) we can normalize the data and use the real-valued probabilities of the binary visible units in place of their activations. [sent-186, score-0.765]
46 When we use mean-field logistic units to model data that is very non-binary (e.g., modeling patches of natural images), it is difficult to obtain sharp predictions for intermediate values, and so it is more desirable to use units that match the distribution of the data. [sent-187, score-0.384] [sent-189, score-0.384]
48 Fortunately, the stochastic binary units of RBMs can be generalized to any distribution that falls in the exponential family (Welling et al.). [sent-190, score-0.384]
49 This includes multinomial units, Poisson units and linear, real-valued units that have Gaussian noise (Freund and Haussler, 1992). [sent-192, score-0.768]
50 For real-valued data (e.g., mocap), we use a modified RBM with binary logistic hidden units and real-valued Gaussian visible units. [sent-195, score-0.957]
51 The joint probability of v and h follows the form of Equation 1, where the energy function is now E(v, h) = ∑_i (v_i − a_i)² / (2σ_i²) − ∑_{ij} W_{ij} (v_i / σ_i) h_j − ∑_j b_j h_j, [sent-196, score-0.155]
52 where a_i is the bias of visible unit i, b_j is the bias of hidden unit j, and σ_i is the standard deviation of the Gaussian noise for visible unit i. [sent-197, score-1.077]
53 The symmetric weight, W_{ij}, connects visible unit i to hidden unit j. [sent-198, score-0.655]
54 Any setting of the hidden units makes a linear contribution to the mean of each visible unit: p(v_i | h) = N(a_i + σ_i ∑_j W_{ij} h_j, σ_i²). [sent-199, score-0.957]
55 The hidden units remain binary, with activation probabilities p(h_j = 1 | v) = 1 / (1 + exp(−b_j − ∑_i W_{ij} v_i / σ_i)). Given the hidden units, the distribution of each visible unit is defined by a parabolic log likelihood function that makes extreme values very improbable. [sent-201, score-0.614]
56 For any setting of the parameters, the gradient of the quadratic term with respect to a visible unit will always overwhelm the gradient due to the weighted input from the binary hidden units, provided the value v_i of a visible unit is far enough from its bias, a_i. [sent-202, score-1.51]
57 Conveniently, the contrastive divergence learning rules remain the same as in an RBM with binary visible units. [sent-203, score-0.526]
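As a concrete illustration of these two conditionals, here is a small NumPy sketch of sampling the hidden and visible units of a Gaussian-binary RBM; the variable names are ours, and fixing σ_i = 1 after standardizing each data dimension is an assumption of the sketch, not a claim about the paper's settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs_given_v(v, W, b, sigma):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i W_ij v_i / sigma_i) for Gaussian visibles."""
    return sigmoid((v / sigma) @ W + b)

def sample_v_given_h(h, W, a, sigma, rng):
    """p(v_i | h) = N(a_i + sigma_i * sum_j W_ij h_j, sigma_i^2)."""
    mean = a + sigma * (h @ W.T)
    return mean + sigma * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
V, H = 6, 4
W = 0.01 * rng.standard_normal((V, H))
a, b = np.zeros(V), np.zeros(H)
sigma = np.ones(V)                      # unit noise after standardizing each dimension
v = rng.standard_normal((3, V))         # a mini-batch of real-valued "frames"
h_prob = hidden_probs_given_v(v, W, b, sigma)
v_recon = sample_v_given_h((h_prob > 0.5).astype(float), W, a, sigma, rng)
```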
58 We can model temporal dependencies by treating the visible variables in the previous time slice(s) as additional fixed inputs. [sent-210, score-0.381]
59 We add two types of directed connections: autoregressive connections from the past N configurations (time steps) of the visible units to the current visible configuration, and connections from the past M configurations of the visible units to the current hidden configuration. [sent-211, score-2.319]
60 The autoregressive weights can model linear, temporally local structure very well, leaving the hidden units to model nonlinear, higher-level structure. [sent-213, score-0.65]
61 In modeling motion capture with higher frame rates, we have found that a good rule of thumb is to set N = F/10 where F is the frame rate of the data (in frames per second). [sent-217, score-0.487]
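One convenient reading of the CRBM is that, with the past N frames held fixed, the directed connections simply turn into data-dependent ("dynamic") biases on the current visible and hidden units, after which the machine behaves like the RBM above. A sketch under that reading (NumPy; names and shapes are illustrative, not the paper's code):

```python
import numpy as np

def crbm_dynamic_biases(v_past, A, B, a, b):
    """Effective biases for the current frame given the concatenated past N frames.
    v_past: shape (N*V,); A: autoregressive weights (N*V, V); B: past-to-hidden (N*V, H)."""
    a_dyn = a + v_past @ A   # shifts the mean of each current visible unit
    b_dyn = b + v_past @ B   # shifts the input to each current hidden unit
    return a_dyn, b_dyn

# With the dynamic biases in hand, inference and CD learning proceed exactly as in
# the static (Gaussian-binary) RBM of the previous section.
rng = np.random.default_rng(0)
V, H, N = 8, 16, 3
A = 0.01 * rng.standard_normal((N * V, V))
B = 0.01 * rng.standard_normal((N * V, H))
a, b = np.zeros(V), np.zeros(H)
a_dyn, b_dyn = crbm_dynamic_biases(rng.standard_normal(N * V), A, B, a, b)
```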
62 We ignore uncertainty in the past hidden states. (Footnote 1: "…, although more computationally demanding to train, are significantly better generative models (Salakhutdinov and Murray, 2008).") [sent-231, score-0.228]
63 Another way to ensure that the energy surface is low only around the training set is to eliminate the partition function and replace it with a term that limits the volume of the input space over which the energy surface can take low value (Ranzato et al.). [sent-241, score-0.174]
64 Using both a contrastive term and sparsity, as we have done here, is a two-fold approach to sculpting energy surfaces. [sent-244, score-0.171]
65 To implement sparsity, we maintained a damped “average activation” estimate for each hidden unit. [sent-245, score-0.192]
66 This estimate was updated to 0.9 times its previous value plus 0.1 times the average activation of the hidden units while the visible units were clamped to the data. [sent-250, score-1.47]
67 Parameters not connected to the hidden units (e.g., autoregressive weights and visible biases) were unaffected by the sparsity term. [sent-256, score-0.455]
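A minimal sketch of this sparsity mechanism (the damping factor, target activation, and penalty strength below are illustrative values, not the settings used in the paper):

```python
import numpy as np

def update_activation_estimate(q_old, h_prob_data, damping=0.9):
    """Damped running estimate of each hidden unit's average activation on the data."""
    return damping * q_old + (1.0 - damping) * h_prob_data.mean(axis=0)

def sparsity_penalty_gradient(q, target=0.1, strength=0.01):
    """Extra gradient term nudging average activations toward a small target; it is
    added only to the hidden biases and visible-to-hidden weights, leaving the
    autoregressive weights and visible biases untouched."""
    return -strength * (q - target)

rng = np.random.default_rng(0)
q = np.full(16, 0.1)                       # H = 16 hidden units
h_prob_data = rng.random((100, 16))        # stand-in for p(h|v) on a mini-batch
q = update_activation_estimate(q, h_prob_data)
extra_bias_update = sparsity_penalty_gradient(q)
```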
68 A single-layer CRBM with 1200 hidden units and N = 12 was trained for 200 epochs on data for 10 different walking styles, with the parameters being updated after every 100 training cases. [sent-261, score-0.62]
69 In addition to the real-valued mocap data, the hidden units received additive input from a “one-hot” encoding of the matching style label through another matrix of weights. [sent-263, score-0.763]
70 After training the model, we generated motion by initializing with 12 frames of training data and holding the label units clamped to the style matching the initialization. [sent-265, score-0.91]
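The one-hot label mechanism amounts to one extra additive term in the hidden units' input, and during generation the label vector is simply held fixed. A sketch (the matrix name W_label and all dimensions are our own assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs_with_label(v_t, v_past, label, W, B, W_label, b):
    """p(h=1 | current frame, past frames, clamped one-hot style label)."""
    return sigmoid(v_t @ W + v_past @ B + label @ W_label + b)

rng = np.random.default_rng(0)
V, H, N, S = 8, 1200, 12, 10                      # S = number of walking styles
W = 0.01 * rng.standard_normal((V, H))
B = 0.01 * rng.standard_normal((N * V, H))
W_label = 0.01 * rng.standard_normal((S, H))
b = np.zeros(H)
label = np.eye(S)[3]                              # clamp to style 3 throughout generation
p_h = hidden_probs_with_label(rng.standard_normal(V),
                              rng.standard_normal(N * V), label, W, B, W_label, b)
```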
71 With a single layer we could generate high-quality motion for 9 of the 10 styles (see the supplemental videos); however, the model failed to produce good generation of the old-man style. [sent-266, score-0.445]
72 In examining the activity of the hidden units over time while clamped to training data, we observed that the model devotes most of its hidden capacity to capturing the more “active” styles as it pays a higher cost for failing to model more pronounced frame-to-frame changes. [sent-268, score-0.978]
73 We also learned a deeper network by first training a CRBM with 600 binary hidden units and real-valued visible units and then training a higher-level CRBM with 600 binary hidden and 600 binary visible units. [sent-270, score-2.002]
74 The data for training the higher-level CRBM consisted of the activation probabilities of the hidden units of the first CRBM while driven by the training data. [sent-272, score-0.738]
75 The second-level CRBM effectively replaces the prior over the first layer of hidden units, p(h_t | v_{t−1}, …, v_{t−N}). [sent-276, score-0.342]
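Concretely, the second layer's training set is just the first layer's mean-field hidden activations computed while the model is clamped to the data. A sketch (illustrative shapes and names; it assumes the first-layer parameters have already been trained):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer1_hidden_probs(frames, t, W, B, b, N):
    """Mean-field activations of the first CRBM layer for frame t, given N past frames."""
    v_t, v_past = frames[t], frames[t - N:t].ravel()
    return sigmoid(v_t @ W + v_past @ B + b)

rng = np.random.default_rng(0)
T, V, H1, N = 100, 8, 600, 12
frames = rng.standard_normal((T, V))              # stand-in for preprocessed mocap frames
W = 0.01 * rng.standard_normal((V, H1))
B = 0.01 * rng.standard_normal((N * V, H1))
b = np.zeros(H1)
# These probabilities become the "visible" data of the higher-level CRBM, which is
# then trained with the same contrastive divergence procedure sketched earlier.
layer2_data = np.stack([layer1_hidden_probs(frames, t, W, B, b, N) for t in range(N, T)])
```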
76 All of the models we have presented can generate motion at least as fast as 60 fps (i.e., in real time). [sent-280, score-0.259]
77 In practice, this is slightly optimistic since larger and more complex data sets will require more hidden units, and learning and inference are also linear in the number of hidden units. [sent-284, score-0.433]
78 We derived the contrastive divergence (CD) learning rules for CRBMs and showed how CRBMs can be stacked to form conditional deep belief nets. [sent-306, score-0.208]
79 A major criticism of contrastive divergence learning is that by “pulling up” on the energy of individual reconstructed data points, the algorithm fails to visit regions far away from the training data. [sent-316, score-0.299]
80 In Section 4 we extended the CRBM to permit context units to modulate the existing pairwise interactions. [sent-325, score-0.446]
81 We demonstrated that the resulting model could capture several different motion styles, as well as transition and blend naturally between them. [sent-328, score-0.32]
82 Specifically, we aim to make our representation of motion invariant to rotation about the gravitational vertical (which we will simply call the vertical) and translation in the ground-plane. [sent-339, score-0.386]
83 1 Original Representation: Data from a motion capture system typically consists of the 3D Cartesian coordinates of 15-30 virtual markers (usually representing joint centres) for a series of discrete time-steps, which we call frames. [sent-342, score-0.32]
84 Euler angles describe a one, two or three dof orientation by a sequence of rotations about axes in the global or local coordinate system. [sent-349, score-0.148]
85 Euler angles do not permit distances between rotations to be directly computed, nor do they support interpolation or optimization, since the orientation space is highly nonlinear. [sent-351, score-0.161]
86 The exponential map parameterization is also known as “axis-angle” representation since it consists of a three-element vector, whose direction specifies an axis of rotation and whose magnitude specifies the angle by which to rotate about this axis. [sent-356, score-0.247]
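For concreteness, an exponential-map (axis-angle) vector can be turned into a 3 × 3 rotation matrix with Rodrigues' formula; this is the standard construction, sketched here in NumPy rather than taken from the paper's code.

```python
import numpy as np

def expmap_to_rotation(r, eps=1e-8):
    """Rodrigues' formula: axis-angle vector r of shape (3,) -> 3x3 rotation matrix."""
    theta = np.linalg.norm(r)
    if theta < eps:
        return np.eye(3)                      # near-zero rotation
    k = r / theta                             # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])          # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

# Example: rotate by 90 degrees about the z-axis.
R = expmap_to_rotation(np.array([0.0, 0.0, np.pi / 2]))
```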
87 The orientation of the root does not need to be converted to exponential maps since we build an alternative representation in the following section which requires the orientation to be expressed as a 3 × 3 rotation matrix. [sent-370, score-0.187]
88 Measuring the angle that the dorsoventral axis makes with the vertical gives us a measure of pitch: φ_t = cos⁻¹( (u_t · w0) / (||u_t|| ||w0||) ) = cos⁻¹( (u_t · w0) / ||u_t|| ), since w0 is a unit vector. [sent-375, score-0.321]
89 Similarly, measuring the angle that the lateral axis makes with the gravitational vertical gives us a measure of roll: cos⁻¹( (v_t · w0) / (||v_t|| ||w0||) ). [sent-376, score-0.29]
90 Projecting u_t into the ground-plane provides a measure of yaw, or rotation about the vertical: θ_t = tan⁻¹(u_t^y / u_t^x), where u_t^x and u_t^y are the first two components of the vector u_t. [sent-378, score-0.237]
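These three angles are easy to compute from the dorsoventral axis u_t, the lateral axis v_t, and the vertical. The sketch below assumes the vertical is the world z-axis and uses atan2 for the yaw; both are choices of the sketch rather than statements from the text.

```python
import numpy as np

def pitch_roll_yaw(u, v, w0=np.array([0.0, 0.0, 1.0])):
    """Pitch and roll: angles of the dorsoventral (u) and lateral (v) axes with the
    vertical w0. Yaw: direction of u projected into the ground plane."""
    pitch = np.arccos(np.dot(u, w0) / (np.linalg.norm(u) * np.linalg.norm(w0)))
    roll = np.arccos(np.dot(v, w0) / (np.linalg.norm(v) * np.linalg.norm(w0)))
    yaw = np.arctan2(u[1], u[0])    # atan2 avoids the quadrant ambiguity of tan^-1(y/x)
    return pitch, roll, yaw

u = np.array([0.2, 0.1, 0.97])      # roughly upright dorsoventral axis
v = np.array([0.95, 0.05, 0.1])     # roughly horizontal lateral axis
print(pitch_roll_yaw(u, v))
```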
91 4 Conversion to Incremental Changes: We represent the rotation about the vertical, as well as translations in the ground plane, by their incremental changes (forward differences) and not their absolute values: θ̇_t = θ_{t+1} − θ_t, ẋ_t = x_{t+1} − x_t, ẏ_t = y_{t+1} − y_t. [sent-382, score-0.147]
92 We can represent velocity in the ground-plane by its magnitude, α_t = √(ẋ_t² + ẏ_t²), and its angle with respect to the x-axis, β_t = tan⁻¹(ẏ_t / ẋ_t). [sent-385, score-0.172]
93 The velocity is then expressed with respect to the orientation about the vertical, θ_t, in both a forward and a lateral component: γ̇_t = α_t cos(θ_t − β_t), ξ̇_t = α_t sin(θ_t − β_t), where we have used “dot” notation to imply that these quantities are incremental values. [sent-387, score-0.189]
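A short sketch of this conversion, from absolute ground-plane position and yaw to incremental, body-centred quantities (NumPy; names are ours):

```python
import numpy as np

def incremental_representation(x, y, theta):
    """Forward differences of ground-plane position and yaw, then rotation of the
    velocity into body-centred forward/lateral components."""
    dx, dy, dtheta = np.diff(x), np.diff(y), np.diff(theta)
    alpha = np.hypot(dx, dy)                 # speed in the ground plane
    beta = np.arctan2(dy, dx)                # direction of travel w.r.t. the x-axis
    forward = alpha * np.cos(theta[:-1] - beta)
    lateral = alpha * np.sin(theta[:-1] - beta)
    return dtheta, forward, lateral

t = np.linspace(0, 2 * np.pi, 100)
x, y, theta = np.cos(t), np.sin(t), t + np.pi / 2   # walking in a circle, facing along it
dtheta, forward, lateral = incremental_representation(x, y, theta)
```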
94 While training a CRBM, we replace v_{i,t} in Equations 10, 11, and 13 by its expected value, and we also use the expected value of v_{i,t} when computing the probability of activation of the hidden units (Equation 8). [sent-397, score-0.694]
95 However, to compute each of the K reconstructions of the data (Equation 9), we use stochastically chosen binary values of the hidden units. [sent-398, score-0.284]
96 This prevents the hidden activities from transmitting an unbounded amount of information from the data to the reconstruction (Teh and Hinton, 2001). [sent-399, score-0.192]
97 We take this approach because the reconstruction of the data depends on the binary choices made when selecting hidden state. [sent-401, score-0.192]
98 Thus, when we infer the hiddens from the reconstructed data, the probabilities are highly correlated with the binary hidden states inferred from the data. [sent-402, score-0.237]
99 So we make similar approximations during generation: using stochastically chosen binary values of the hidden units but the expected values of the reconstructed visible units. [sent-405, score-1.039]
100 As a further step to reduce noise, on the final iteration of Gibbs sampling, we use the real-valued probabilities of the hidden units when updating the visible units. [sent-406, score-0.957]
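Put together, the reconstruction step during training (and the analogous step during generation) looks roughly like the loop below. It assumes unit noise on the Gaussian visible units and that the dynamic biases from the past frames have already been computed, so it is a sketch of the described choices rather than the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_reconstruction(v_t, W, a_dyn, b_dyn, K, rng, final_meanfield=True):
    """K alternating updates: stochastic binary hidden samples throughout, expected
    (mean) values for the visible units, and real-valued hidden probabilities on the
    final pass to reduce sampling noise."""
    v = v_t.copy()
    for k in range(K):
        p_h = sigmoid(v @ W + b_dyn)
        use_mean_h = final_meanfield and (k == K - 1)
        h = p_h if use_mean_h else (rng.random(p_h.shape) < p_h).astype(float)
        v = a_dyn + h @ W.T               # expected value of the Gaussian visibles (sigma = 1)
    return v

rng = np.random.default_rng(0)
V, H = 8, 16
W = 0.01 * rng.standard_normal((V, H))
a_dyn, b_dyn = np.zeros(V), np.zeros(H)   # dynamic biases already include the past frames
v_recon = gibbs_reconstruction(rng.standard_normal(V), W, a_dyn, b_dyn, K=2, rng=rng)
```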
wordName wordTfidf (topN-words)
[('units', 0.384), ('visible', 0.381), ('crbm', 0.275), ('motion', 0.259), ('rbm', 0.22), ('hidden', 0.192), ('hinton', 0.179), ('inton', 0.129), ('boltzmann', 0.129), ('styles', 0.111), ('mocap', 0.11), ('contrastive', 0.106), ('enerating', 0.097), ('fcrbm', 0.097), ('tate', 0.097), ('rbms', 0.096), ('vi', 0.09), ('euler', 0.083), ('istributed', 0.083), ('axis', 0.078), ('style', 0.077), ('rotation', 0.075), ('layer', 0.075), ('activation', 0.074), ('autoregressive', 0.074), ('eries', 0.074), ('ime', 0.068), ('taylor', 0.067), ('energy', 0.065), ('jhf', 0.065), ('wlzf', 0.065), ('igh', 0.063), ('imensional', 0.063), ('roweis', 0.063), ('deep', 0.063), ('hmms', 0.062), ('permit', 0.062), ('capture', 0.061), ('frame', 0.06), ('odels', 0.058), ('orientation', 0.056), ('synthesis', 0.056), ('clamped', 0.055), ('reconstructions', 0.055), ('ut', 0.054), ('wo', 0.054), ('salakhutdinov', 0.053), ('cd', 0.053), ('vertical', 0.052), ('angle', 0.051), ('inference', 0.049), ('crbms', 0.049), ('dof', 0.049), ('hertzmann', 0.049), ('recon', 0.049), ('synthesize', 0.049), ('velocity', 0.049), ('wiv', 0.049), ('connections', 0.049), ('frames', 0.047), ('ranzato', 0.045), ('reconstructed', 0.045), ('training', 0.044), ('directed', 0.044), ('gibbs', 0.043), ('angles', 0.043), ('parameterization', 0.043), ('cos', 0.043), ('animation', 0.041), ('brand', 0.041), ('rtt', 0.041), ('lateral', 0.041), ('tieleman', 0.041), ('motions', 0.041), ('aligns', 0.041), ('biases', 0.041), ('unit', 0.041), ('dynamical', 0.039), ('latent', 0.039), ('divergence', 0.039), ('layers', 0.038), ('nets', 0.037), ('ais', 0.037), ('salient', 0.037), ('stochastically', 0.037), ('wi', 0.036), ('dynamics', 0.036), ('generative', 0.036), ('yt', 0.036), ('sampling', 0.035), ('singularities', 0.034), ('orientations', 0.034), ('vt', 0.034), ('content', 0.034), ('restricted', 0.034), ('tan', 0.033), ('abandon', 0.032), ('componential', 0.032), ('courant', 0.032), ('dorsoventral', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999893 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
2 0.36159843 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
Author: Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge
Abstract: Statistical models of natural images provide an important tool for researchers in the fields of machine learning and computational neuroscience. The canonical measure to quantitatively assess and compare the performance of statistical models is given by the likelihood. One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. Analyses of these models, however, have often been limited to qualitative analyses based on samples due to the computationally intractable nature of their likelihood. Motivated by these circumstances, the present article introduces a consistent estimator for the likelihood of deep belief networks which is computationally tractable and simple to apply in practice. Using this estimator, we quantitatively investigate a deep belief network for natural image patches and compare its performance to the performance of other models for natural image patches. We find that the deep belief network is outperformed with respect to the likelihood even by very simple mixture models. Keywords: deep belief network, restricted Boltzmann machine, likelihood estimation, natural image statistics, potential log-likelihood
3 0.12037653 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
4 0.065219767 54 jmlr-2011-Learning Latent Tree Graphical Models
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P; index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning c 2011 Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar and Alan S. Willsky. C HOI , TAN , A NANDKUMAR AND W ILLSKY
5 0.061879702 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
6 0.056750238 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
7 0.046799734 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
8 0.041274782 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
9 0.040289246 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
10 0.037488468 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
11 0.036635779 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
12 0.034676231 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
13 0.034483958 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
14 0.033968188 58 jmlr-2011-Learning from Partial Labels
15 0.032945022 14 jmlr-2011-Better Algorithms for Benign Bandits
16 0.032906197 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
17 0.032573882 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
18 0.032436013 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
19 0.032213748 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
20 0.030775957 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
topicId topicWeight
[(0, 0.2), (1, -0.09), (2, -0.142), (3, -0.08), (4, -0.291), (5, 0.012), (6, 0.292), (7, -0.435), (8, -0.009), (9, 0.12), (10, 0.117), (11, 0.026), (12, -0.046), (13, -0.014), (14, 0.099), (15, 0.027), (16, -0.086), (17, -0.002), (18, -0.005), (19, 0.027), (20, -0.018), (21, -0.062), (22, -0.016), (23, -0.056), (24, -0.056), (25, -0.045), (26, 0.066), (27, 0.086), (28, 0.067), (29, -0.052), (30, 0.006), (31, 0.021), (32, -0.007), (33, -0.07), (34, -0.07), (35, -0.051), (36, 0.033), (37, 0.046), (38, -0.07), (39, -0.028), (40, -0.0), (41, -0.035), (42, -0.059), (43, -0.052), (44, -0.023), (45, -0.046), (46, 0.105), (47, 0.026), (48, 0.05), (49, -0.103)]
simIndex simValue paperId paperTitle
same-paper 1 0.96441114 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
2 0.90629536 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
3 0.55476916 48 jmlr-2011-Kernel Analysis of Deep Networks
4 0.28263342 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
5 0.24155974 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
Author: Lars Omlor, Martin A. Giese
Abstract: Blind source separation problems emerge in many applications, where signals can be modeled as superpositions of multiple sources. Many popular applications of blind source separation are based on linear instantaneous mixture models. If specific invariance properties are known about the sources, for example, translation or rotation invariance, the simple linear model can be extended by inclusion of the corresponding transformations. When the sources are invariant against translations (spatial displacements or time shifts) the resulting model is called an anechoic mixing model. We present a new algorithmic framework for the solution of anechoic problems in arbitrary dimensions. This framework is derived from stochastic time-frequency analysis in general, and the marginal properties of the Wigner-Ville spectrum in particular. The method reduces the general anechoic problem to a set of anechoic problems with non-negativity constraints and a phase retrieval problem. The first type of subproblem can be solved by existing algorithms, for example by an appropriate modification of non-negative matrix factorization (NMF). The second subproblem is solved by established phase retrieval methods. We discuss and compare implementations of this new algorithmic framework for several example problems with synthetic and real-world data, including music streams, natural 2D images, human motion trajectories and two-dimensional shapes. Keywords: blind source separation, anechoic mixtures, time-frequency transformations, linear canonical transform, Wigner-Ville spectrum
6 0.23132442 54 jmlr-2011-Learning Latent Tree Graphical Models
7 0.22346272 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
8 0.20460322 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
9 0.20242228 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
10 0.2020193 61 jmlr-2011-Logistic Stick-Breaking Process
11 0.16714819 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
12 0.16402274 36 jmlr-2011-Generalized TD Learning
13 0.16148327 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
14 0.15767868 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
15 0.14776868 12 jmlr-2011-Bayesian Co-Training
16 0.14716269 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
17 0.14072022 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
18 0.13937651 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
19 0.13785438 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
20 0.1349753 23 jmlr-2011-DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model
topicId topicWeight
[(4, 0.045), (9, 0.02), (10, 0.021), (11, 0.011), (24, 0.049), (31, 0.408), (32, 0.026), (41, 0.019), (45, 0.012), (60, 0.081), (71, 0.013), (73, 0.038), (78, 0.054), (84, 0.074), (90, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.96335173 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
2 0.95776683 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
3 0.95550883 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
Author: David M. Blei, Peter I. Frazier
Abstract: We develop the distance dependent Chinese restaurant process, a flexible class of distributions over partitions that allows for dependencies between the elements. This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies arising from time, space, and network connectivity. We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both fully observed and latent mixture settings. We study its empirical performance with three text corpora. We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data and network data. We also show that the distance dependent CRP representation of the traditional CRP mixture leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation. Keywords: Chinese restaurant processes, Bayesian nonparametrics
4 0.94881994 95 jmlr-2011-Training SVMs Without Offset
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We develop, analyze, and test a training algorithm for support vector machine classifiers without offset. Key features of this algorithm are a new, statistically motivated stopping criterion, new warm start options, and a set of inexpensive working set selection strategies that significantly reduce the number of iterations. For these working set strategies, we establish convergence rates that, not surprisingly, coincide with the best known rates for SVMs with offset. We further conduct various experiments that investigate both the run time behavior and the performed iterations of the new training algorithm. It turns out, that the new algorithm needs significantly less iterations and also runs substantially faster than standard training algorithms for SVMs with offset. Keywords: support vector machines, decomposition algorithms
5 0.94605953 101 jmlr-2011-Variable Sparsity Kernel Learning
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: paper1 This presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls , s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over product of simplices and has a computational complexity of O m2 ntot log nmax /ε2 where m is no. training data points, nmax , ntot are the maximum no. kernels in any group, total no. kernels respectively and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order. c 2011 Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath and Sankaran Raman. A FLALO , B EN -TAL , B HATTACHARYYA , NATH AND R AMAN
6 0.77777153 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
7 0.77114367 36 jmlr-2011-Generalized TD Learning
8 0.76282614 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
9 0.75310367 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
10 0.74860609 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
11 0.74478859 12 jmlr-2011-Bayesian Co-Training
12 0.73376834 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data
13 0.72892135 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
14 0.72284776 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
15 0.72229242 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
16 0.72169816 48 jmlr-2011-Kernel Analysis of Deep Networks
17 0.72140467 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
18 0.72049224 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
19 0.71949553 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
20 0.71267259 16 jmlr-2011-Clustering Algorithms for Chains