nips nips2005 nips2005-80 knowledge-graph by maker-knowledge-mining

80 nips-2005-Gaussian Process Dynamical Models

Source: pdf

Author: Jack Wang, Aaron Hertzmann, David M. Blei

Abstract: This paper introduces Gaussian Process Dynamical Models (GPDM) for nonlinear time series analysis. A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space. We marginalize out the model parameters in closed-form, using Gaussian Process (GP) priors for both the dynamics and the observation mappings. This results in a nonparametric model for dynamical systems that accounts for uncertainty in the model. We demonstrate the approach on human motion capture data in which each pose is 62-dimensional. Despite the use of small data sets, the GPDM learns an effective representation of the nonlinear dynamics in these spaces. Webpage: http://www.dgp.toronto.edu/∼ jmwang/gpdm/ 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space. [sent-8, score-0.79]

2 We marginalize out the model parameters in closed-form, using Gaussian Process (GP) priors for both the dynamics and the observation mappings. [sent-9, score-0.246]

3 This results in a nonparametric model for dynamical systems that accounts for uncertainty in the model. [sent-10, score-0.258]

4 We demonstrate the approach on human motion capture data in which each pose is 62-dimensional. [sent-11, score-0.228]

5 Despite the use of small data sets, the GPDM learns an effective representation of the nonlinear dynamics in these spaces. [sent-12, score-0.228]

6 In contrast, existing nonlinear models can model complex dynamics, but may require large training sets to learn accurate MAP models. [sent-18, score-0.171]

7 In this paper we investigate learning nonlinear dynamical models for high-dimensional datasets. [sent-19, score-0.3]

8 We take a Bayesian approach to modeling dynamics, averaging over dynamics parameters rather than estimating them. [sent-20, score-0.171]

9 Inspired by the fact that averaging over nonlinear regression models leads to a Gaussian Process (GP) model, we show that integrating over parameters in nonlinear dynamical systems can also be performed in closed-form. [sent-21, score-0.38]

10 The resulting Gaussian Process Dynamical Model (GPDM) is fully deﬁned by a set of lowdimensional representations of the training data, with both dynamics and observation mappings learned from GP regression. [sent-22, score-0.304]

11 As a natural consequence of GP regression, the GPDM removes the need to select many parameters associated with function approximators while retaining the expressiveness of nonlinear dynamics and observation. [sent-23, score-0.228]

12 Our work is motivated by modeling human motion for video-based people tracking and data-driven animation. [sent-24, score-0.214]

13 Bayesian people tracking requires dynamical models in the form of transition densities in order to specify prediction distributions over new poses at each time instant (e. [sent-25, score-0.42]

14 , [11, 14]); similarly, data-driven computer animation requires prior distributions over poses and motion (e. [sent-27, score-0.227]

15 An individual human pose is typically parameterized with more than 60 parameters. [sent-30, score-0.124]

16 Despite the large state space, the space of activity-speciﬁc human poses and motions has a much smaller intrinsic dimensionality; in our experiments with walking and golf swings, 3 dimensions often sufﬁce. [sent-31, score-0.362]

17 Because the mapping ¯ parameters A and B have been marginalized over, all latent coordinates X = [x1 , . [sent-36, score-0.565]

18 In Kernel Dynamical Modeling [12], linear dynamics are kernelized to model nonlinear systems, but a density function over data is not produced. [sent-49, score-0.315]

19 Supervised learning with GP regression has been used to model dynamics for a variety of applications [3, 7, 13]. [sent-50, score-0.175]

20 These methods model dynamics directly in observation space, which is impractical for the high-dimensionality of motion capture data. [sent-51, score-0.279]

21 Our approach is most directly inspired by the unsupervised Gaussian Process Latent Variable Model (GPLVM) [5], which models the joint distribution of the observed data and their corresponding representation in a low dimensional latent space. [sent-52, score-0.399]

22 However, the GPLVM is not a dynamical model; it assumes that data are generated independently. [sent-54, score-0.197]

23 Accordingly it does not respect temporal continuity of the data, nor does it model the dynamics in the latent space. [sent-55, score-0.551]

24 Here we augment the GPLVM with a latent dynamical model. [sent-56, score-0.573]

25 The result is a Bayesian generalization of subspace dynamical models to nonlinear latent mappings and dynamics. [sent-57, score-0.732]

26 2 Gaussian Process Dynamics The Gaussian Process Dynamical Model (GPDM) comprises a mapping from a latent space to the data space, and a dynamical model in the latent space (Figure 1). [sent-58, score-1.056]

27 The GPDM is obtained by marginalizing out the parameters of the two mappings, and optimizing the latent coordinates of training data. [sent-60, score-0.537]

28 While linear mappings have been used extensively in auto-regressive models, here we consider the nonlinear case for which f and g are linear combinations of basis functions: f (x; A) ai φi (x) (3) bj ψj (x) = (4) i g(x; B) = j for weights A = [a1 , a2 , . [sent-70, score-0.238]

29 In order to ﬁt the parameters of this model to training data, one must select an appropriate number of basis functions, and one must ensure that there is enough data to constrain the shape of each basis function. [sent-77, score-0.132]

30 The elements of kernel matrix are deﬁned by a kernel function, (KY )i,j = kY (xi , xj ). [sent-87, score-0.192]

31 For the latent mapping, X → Y, we currently use the RBF kernel kY (x, x ) = β1 exp − β2 ||x − x ||2 2 −1 + β3 δx,x . [sent-88, score-0.472]

32 The dynamic mapping on the latent coordinates X is conceptually similar, but more subtle. [sent-96, score-0.575]

33 1 As above, we form the joint probability density over the latent coordinates and the dynamics weights A in (3). [sent-97, score-0.669]

34 We model dynamics using both the RBF kernel of the form of Eqn. [sent-110, score-0.271]

35 (10) 2 The kernel corresponds to representing g as the sum of a linear term and RBF terms. [sent-112, score-0.131]

36 The inclusion of the linear term is motivated by the fact that linear dynamical models, such as Conceptually, we would like to model each pair (xt , xt+1 ) as a training pair for regression with g. [sent-113, score-0.335]

37 It should be noted that, due to the nonlinear dynamical mapping in (3), the joint distribution of the latent coordinates is not Gaussian. [sent-125, score-0.815]

38 One can also see this in (9) since xt variables occur inside the kernel matrix, as well as outside of it. [sent-127, score-0.371]

39 Together, the priors, the latent mapping, and the dynamics deﬁne a generative model for time-series observations (Figure 1(b)): ¯ ¯ p(X, Y, α, β) = p(Y|X, β) p(X|¯) p(¯) p(β) . [sent-130, score-0.551]

40 Each sequence has associated latent coordinates X1 , . [sent-136, score-0.551]

41 For the latent mapping g we can conceptually concatenate all sequences within the GP likelihood (Eqn. [sent-140, score-0.529]

42 A similar concatenation applies for the dynamics, but omitting the ﬁrst frame of each sequence from Xout , and omitting the ﬁnal frame of each sequence from the kernel matrix KX . [sent-142, score-0.324]

43 That is, if we learn from a sequence Y1 , and then infer the latent coordinates for a new sequence Y2 , then the joint likelihood entails full kernel matrices KX and KY formed from both sequences. [sent-144, score-0.776]

44 For example, a second-order dynamical model, xt = f (xt−1 , xt−2 ; A) + nx,t (12) may be used to explicitly model the dependence of the prediction on two past frames (or on velocity). [sent-147, score-0.576]

45 Using Euler integration with time-step ∆t, we have xt = xt−1 + vt−1 ∆t. [sent-149, score-0.275]

46 The dynamics likelihood p(X | α) can then be written by redeﬁning Xout = [x2 − x1 , . [sent-150, score-0.174]

47 ¯ Figure 2 shows a GPDM 3D latent space learned from a human motion capture data comprising three walk cycles. [sent-160, score-0.643]

48 Each pose was deﬁned by 56 Euler angles for joints, 3 global (torso) pose angles, and 3 global (torso) translational velocities. [sent-161, score-0.162]

49 For learning, the data was mean-subtracted, and the latent coordinates were initialized with PCA. [sent-162, score-0.525]

50 We used 3D latent spaces for all experiments shown here. [sent-164, score-0.376]

51 Using 2D latent spaces leads to intersecting latent trajectories. [sent-165, score-0.752]

52 2(a) shows a 3D SGPLVM learned from walking data. [sent-168, score-0.145]

53 Note that the latent trajectories are not smooth; there are numerous cases where consecutive poses in the walking sequence are relatively far apart in the latent space. [sent-169, score-1.049]

54 2(b) shows that the GPDM produces a much smoother conﬁguration of latent positions. [sent-171, score-0.403]

55 Here the GPDM arranges the latent positions roughly in the shape of a saddle. [sent-172, score-0.418]

56 This shows the conﬁdence with which the model reconstructs a pose ¯ from latent positions x. [sent-176, score-0.526]

57 2(d) shows 25 fair samples from the latent dynamics of the GPDM. [sent-179, score-0.557]

58 As noted above, because we marginalize over the weights of the dynamic mapping, A, the distribution over a pose sequence cannot be factored into a sequence of low-order Markov transitions (Fig. [sent-181, score-0.235]

59 1 Mean Prediction Sequences For both 3D people tracking and computer animation, it is desirable to generate new motions efﬁciently. [sent-187, score-0.134]

60 In particular, we set the latent position at each time-step to be the most-likely (mean) point given the previous step: xt = µX (˜t−1 ). [sent-190, score-0.651]

61 On our data sets, mean-prediction trajectories from the GPDM with an RBF or linear+RBF kernel for dynamics usually produce sequences that roughly follow the training data (e. [sent-196, score-0.382]

62 We also found that meanprediction motions are often very close to the mean obtained from the HMC sampler; by (a) (b) (d) (c) (e) Figure 2: Models learned from a walking sequence of 2. [sent-200, score-0.266]

63 The latent positions learned with a GPLVM (a) and a GPDM (b) are shown in blue. [sent-202, score-0.477]

64 (c) - log variance for reconstruction shows regions of latent space that are reconstructed with high conﬁdence. [sent-204, score-0.435]

65 (e) A GPDM of walk data learned with RBF+linear kernel dynamics. [sent-206, score-0.216]

66 The poses were reconstructed from points on the optimized trajectory. [sent-208, score-0.168]

67 (b) The GPDM model was learned from 3 swings of a golf club, using a 2nd order RBF kernel for dynamics. [sent-212, score-0.294]

68 The two plots show 2D orthogonal projections of the 3D latent space. [sent-213, score-0.376]

69 Compared to the RBF kernels, mean-prediction motions generated from GPDMs with the linear kernel often deviate from the original data (e. [sent-215, score-0.197]

70 Figure 3(b) shows a 3D GPDM learned from three swings of a golf club. [sent-218, score-0.171]

71 Thus, it is sometimes desirable to optimize a particular motion under the GPDM, which often reduces drift of the mean-prediction mo- (a) (b) Figure 4: GPDM from walk sequence with missing data learned with (a) a RBF+linear kernel for dynamics, and (b) a linear kernel for dynamics. [sent-222, score-0.561]

72 The likelihood p(X | X, α) of the new sequence X is then optimized directly ¯ (holding the latent positions of the previously learned latent positions, X, and hyperparameters, α, ﬁxed). [sent-227, score-0.961]

73 To see why optimization generates motion close to the traing data, note ¯ 2 that the variance of pose xt+1 is determined by σX (˜t ), which will be lower when xt is ˜ x ˜ nearer the training data. [sent-228, score-0.501]

74 Consequently, the likelihood of xt+1 can be increased by moving ˜ xt closer to the training data. [sent-229, score-0.342]

75 2(e) shows an optimized walk sequence initialized from the mean-prediction. [sent-232, score-0.172]

76 3 Forecasting We performed a simple experiment to compare the predictive power of the GPDM to a linear dynamical system, implemented as a GPDM with linear kernel in the latent space and RBF latent mapping. [sent-234, score-1.115]

77 We trained each model on the ﬁrst 130 frames of the 60Hz walking sequence (corresponding to 2 cycles), and tested on the remaining 23 frames. [sent-235, score-0.22]

78 From each test frame mean-prediction was used to predict the pose 8 frames ahead, and then the RMS pose error was computed against ground truth. [sent-236, score-0.241]

79 97 Due to the nonlinear nature of the walking dynamics in latent space, the RBF and Linear+RBF kernels outperform the linear kernel. [sent-243, score-0.756]

80 4 Missing Data The GPDM model can also handle incomplete data (a common problem with human motion capture sequences). [sent-246, score-0.174]

81 (16)), but with the terms corresponding to missing poses yt removed. [sent-248, score-0.24]

82 The latent coordinates for missing data are initialized by cubic spline interpolation from the 3D PCA initialization of observations. [sent-249, score-0.604]

83 , 10–15 frames of the 157-frame walk sequence used in Fig. [sent-252, score-0.168]

84 The problem lies with the difﬁculty in initializing the missing latent positions sufﬁciently close to the training data. [sent-254, score-0.565]

85 Reducing sampling density effectively increases uncertainty in the reconstruction process so that the probability density over the latent space falls off more smoothly from the data. [sent-256, score-0.549]

86 We then restart the learning with the entire data set, but with the kernel hyperparameters ﬁxed. [sent-257, score-0.145]

87 In doing so, the dynamics terms in the objective function exert more inﬂuence over the latent coordinates of the training data, and a smooth model is learned. [sent-258, score-0.712]

88 With 50 missing frames of the 157-frame walk sequence, this optimization produces mod- els (Fig. [sent-259, score-0.216]

89 The linear kernel is able to pull the latent coordinates onto a cylinder (Fig. [sent-262, score-0.627]

90 4 produce estimates of the missing poses that are visually indistinguishable from the ground truth. [sent-265, score-0.186]

91 For example, if we hold xt ﬁxed during optimization, then it is unlikely that the optimizer will make much adjustment to xt+1 or xt−1 . [sent-270, score-0.275]

92 Speciﬁcally, consider a dynamical model of the form vt = f (xt−1 , vt−1 ). [sent-272, score-0.281]

93 Since adjacent time-steps are related only by the velocity vt ≈ (xt − xt−1 )/∆t, we can handle irregularly-sampled datapoints by adjusting the timestep ∆t, possibly using a different ∆t at each step. [sent-273, score-0.146]

94 It would be straightforward to include a control signal ut in the dynamics f (xt , ut ). [sent-275, score-0.148]

95 It would also be interesting to explore uncertainty in latent variable estimation (e. [sent-276, score-0.41]

96 Our use of maximum likelihood latent coordinates is motivated by Lawrence’s observation that model uncertainty and latent coordinate uncertainty are interchangeable when learning PCA [5]. [sent-279, score-0.993]

97 However, in some applications, uncertainty about latent coordinates may be highly structured (e. [sent-280, score-0.53]

98 Gaussian process latent variable models for visualisation of high dimensional data. [sent-328, score-0.435]

99 Interactive control of avatars animated with human motion data. [sent-341, score-0.123]

100 Dynamical modeling with kernels for nonlinear time series e prediction. [sent-379, score-0.134]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gpdm', 0.621), ('latent', 0.376), ('xt', 0.275), ('dynamical', 0.197), ('kx', 0.191), ('rbf', 0.154), ('dynamics', 0.148), ('coordinates', 0.12), ('poses', 0.107), ('xout', 0.1), ('kernel', 0.096), ('ky', 0.095), ('gplvm', 0.087), ('gp', 0.086), ('walking', 0.086), ('pose', 0.081), ('nonlinear', 0.08), ('motion', 0.08), ('missing', 0.079), ('motions', 0.066), ('walk', 0.061), ('golf', 0.06), ('hmc', 0.06), ('learned', 0.059), ('vt', 0.057), ('mappings', 0.056), ('sequence', 0.055), ('yt', 0.054), ('ln', 0.052), ('gaussian', 0.052), ('sgplvm', 0.052), ('swings', 0.052), ('frames', 0.052), ('velocity', 0.052), ('trajectories', 0.049), ('hyperparameters', 0.049), ('sequences', 0.048), ('entails', 0.048), ('marginalize', 0.044), ('tracking', 0.044), ('human', 0.043), ('mapping', 0.042), ('positions', 0.042), ('training', 0.041), ('xn', 0.041), ('animation', 0.04), ('club', 0.04), ('gpdms', 0.04), ('leithead', 0.04), ('solak', 0.04), ('torso', 0.04), ('da', 0.04), ('comprises', 0.038), ('conceptually', 0.037), ('timestep', 0.037), ('process', 0.036), ('linear', 0.035), ('hertzmann', 0.035), ('reconstructed', 0.034), ('switching', 0.034), ('uncertainty', 0.034), ('tr', 0.034), ('fair', 0.033), ('isotropic', 0.032), ('basis', 0.032), ('euler', 0.032), ('omitting', 0.032), ('autoregressive', 0.032), ('fleet', 0.032), ('kernels', 0.031), ('ahead', 0.03), ('initialized', 0.029), ('nips', 0.029), ('smoothly', 0.028), ('frame', 0.027), ('model', 0.027), ('optimized', 0.027), ('priors', 0.027), ('depict', 0.027), ('marginalized', 0.027), ('initializing', 0.027), ('smoother', 0.027), ('likelihood', 0.026), ('rasmussen', 0.025), ('density', 0.025), ('prediction', 0.025), ('reconstruction', 0.025), ('bayesian', 0.025), ('optimization', 0.024), ('capture', 0.024), ('people', 0.024), ('graphics', 0.024), ('sampler', 0.024), ('modeling', 0.023), ('inverse', 0.023), ('yn', 0.023), ('models', 0.023), ('july', 0.023), ('lawrence', 0.023), ('green', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 80 nips-2005-Gaussian Process Dynamical Models

Author: Jack Wang, Aaron Hertzmann, David M. Blei

2 0.33195642 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

Author: Aaron Shon, Keith Grochow, Aaron Hertzmann, Rajesh P. Rao

Abstract: We propose an algorithm that uses Gaussian process regression to learn common hidden structure shared between corresponding sets of heterogenous observations. The observation spaces are linked via a single, reduced-dimensionality latent variable space. We present results from two datasets demonstrating the algorithms’s ability to synthesize novel data from learned correspondences. We ﬁrst show that the method can learn the nonlinear mapping between corresponding views of objects, ﬁlling in missing data as needed to synthesize novel views. We then show that the method can learn a mapping between human degrees of freedom and robotic degrees of freedom for a humanoid robot, allowing robotic imitation of human poses from motion capture data. 1

3 0.2464903 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models

Author: Purnamrita Sarkar, Andrew W. Moore

Abstract: This paper explores two aspects of social network modeling. First, we generalize a successful static model of relationships into a dynamic model that accounts for friendships drifting over time. Second, we show how to make it tractable to learn such models from data, even as the number of entities n gets large. The generalized model associates each entity with a point in p-dimensional Euclidian latent space. The points can move as time progresses but large moves in latent space are improbable. Observed links between entities are more likely if the entities are close in latent space. We show how to make such a model tractable (subquadratic in the number of entities) by the use of appropriate kernel functions for similarity in latent space; the use of low dimensional kd-trees; a new efﬁcient dynamic adaptation of multidimensional scaling for a ﬁrst pass of approximate projection of entities into latent space; and an efﬁcient conjugate gradient update rule for non-linear local optimization in which amortized time per entity during an update is O(log n). We use both synthetic and real-world data on upto 11,000 entities which indicate linear scaling in computation time and improved performance over four alternative approaches. We also illustrate the system operating on twelve years of NIPS co-publication data. We present a detailed version of this work in [1]. 1

4 0.21138854 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy

Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reﬂecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target speciﬁcation and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1

5 0.13206856 45 nips-2005-Conditional Visual Tracking in Kernel Space

Author: Cristian Sminchisescu, Atul Kanujia, Zhiguo Li, Dimitris Metaxas

Abstract: We present a conditional temporal probabilistic framework for reconstructing 3D human motion in monocular video based on descriptors encoding image silhouette observations. For computational efÄ?Ĺš ciency we restrict visual inference to low-dimensional kernel induced non-linear state spaces. Our methodology (kBME) combines kernel PCA-based non-linear dimensionality reduction (kPCA) and Conditional Bayesian Mixture of Experts (BME) in order to learn complex multivalued predictors between observations and model hidden states. This is necessary for accurate, inverse, visual perception inferences, where several probable, distant 3D solutions exist due to noise or the uncertainty of monocular perspective projection. Low-dimensional models are appropriate because many visual processes exhibit strong non-linear correlations in both the image observations and the target, hidden state variables. The learned predictors are temporally combined within a conditional graphical model in order to allow a principled propagation of uncertainty. We study several predictors and empirically show that the proposed algorithm positively compares with techniques based on regression, Kernel Dependency Estimation (KDE) or PCA alone, and gives results competitive to those of high-dimensional mixture predictors at a fraction of their computational cost. We show that the method successfully reconstructs the complex 3D motion of humans in real monocular video sequences. 1 Introduction and Related Work We consider the problem of inferring 3D articulated human motion from monocular video. This research topic has applications for scene understanding including human-computer interfaces, markerless human motion capture, entertainment and surveillance. A monocular approach is relevant because in real-world settings the human body parts are rarely completely observed even when using multiple cameras. This is due to occlusions form other people or objects in the scene. A robust system has to necessarily deal with incomplete, ambiguous and uncertain measurements. Methods for 3D human motion reconstruction can be classiÄ?Ĺš ed as generative and discriminative. They both require a state representation, namely a 3D human model with kinematics (joint angles) or shape (surfaces or joint positions) and they both use a set of image features as observations for state inference. The computational goal in both cases is the conditional distribution for the model state given image observations. Generative model-based approaches [6, 16, 14, 13] have been demonstrated to Ä?Ĺš‚exibly reconstruct complex unknown human motions and to naturally handle problem constraints. However it is difÄ?Ĺš cult to construct reliable observation likelihoods due to the complexity of modeling human appearance. This varies widely due to different clothing and deformation, body proportions or lighting conditions. Besides being somewhat indirect, the generative approach further imposes strict conditional independence assumptions on the temporal observations given the states in order to ensure computational tractability. Due to these factors inference is expensive and produces highly multimodal state distributions [6, 16, 13]. Generative inference algorithms require complex annealing schedules [6, 13] or systematic non-linear search for local optima [16] in order to ensure continuing tracking. These difÄ?Ĺš culties motivate the advent of a complementary class of discriminative algorithms [10, 12, 18, 2], that approximate the state conditional directly, in order to simplify inference. However, inverse, observation-to-state multivalued mappings are difÄ?Ĺš cult to learn (see e.g. Ä?Ĺš g. 1a) and a probabilistic temporal setting is necessary. In an earlier paper [15] we introduced a probabilistic discriminative framework for human motion reconstruction. Because the method operates in the originally selected state and observation spaces that can be task generic, therefore redundant and often high-dimensional, inference is more expensive and can be less robust. To summarize, reconstructing 3D human motion in a Figure 1: (a, Left) Example of 180o ambiguity in predicting 3D human poses from silhouette image features (center). It is essential that multiple plausible solutions (e.g. F 1 and F2 ) are correctly represented and tracked over time. A single state predictor will either average the distant solutions or zig-zag between them, see also tables 1 and 2. (b, Right) A conditional chain model. The local distributions p(yt |ytĂ˘ˆ’1 , zt ) or p(yt |zt ) are learned as in Ä?Ĺš g. 2. For inference, the predicted local state conditional is recursively combined with the Ä?Ĺš ltered prior c.f . (1). conditional temporal framework poses the following difÄ?Ĺš culties: (i) The mapping between temporal observations and states is multivalued (i.e. the local conditional distributions to be learned are multimodal), therefore it cannot be accurately represented using global function approximations. (ii) Human models have multivariate, high-dimensional continuous states of 50 or more human joint angles. The temporal state conditionals are multimodal which makes efÄ?Ĺš cient Kalman Ä?Ĺš ltering algorithms inapplicable. General inference methods (particle Ä?Ĺš lters, mixtures) have to be used instead, but these are expensive for high-dimensional models (e.g. when reconstructing the motion of several people that operate in a joint state space). (iii) The components of the human state and of the silhouette observation vector exhibit strong correlations, because many repetitive human activities like walking or running have low intrinsic dimensionality. It appears wasteful to work with high-dimensional states of 50+ joint angles. Even if the space were truly high-dimensional, predicting correlated state dimensions independently may still be suboptimal. In this paper we present a conditional temporal estimation algorithm that restricts visual inference to low-dimensional, kernel induced state spaces. To exploit correlations among observations and among state variables, we model the local, temporal conditional distributions using ideas from Kernel PCA [11, 19] and conditional mixture modeling [7, 5], here adapted to produce multiple probabilistic predictions. The corresponding predictor is referred to as a Conditional Bayesian Mixture of Low-dimensional Kernel-Induced Experts (kBME). By integrating it within a conditional graphical model framework (Ä?Ĺš g. 1b), we can exploit temporal constraints probabilistically. We demonstrate that this methodology is effective for reconstructing the 3D motion of multiple people in monocular video. Our contribution w.r.t. [15] is a probabilistic conditional inference framework that operates over a non-linear, kernel-induced low-dimensional state spaces, and a set of experiments (on both real and artiÄ?Ĺš cial image sequences) that show how the proposed framework positively compares with powerful predictors based on KDE, PCA, or with the high-dimensional models of [15] at a fraction of their cost. 2 Probabilistic Inference in a Kernel Induced State Space We work with conditional graphical models with a chain structure [9], as shown in Ä?Ĺš g. 1b, These have continuous temporal states yt , t = 1 . . . T , observations zt . For compactness, we denote joint states Yt = (y1 , y2 , . . . , yt ) or joint observations Zt = (z1 , . . . , zt ). Learning and inference are based on local conditionals: p(yt |zt ) and p(yt |ytĂ˘ˆ’1 , zt ), with yt and zt being low-dimensional, kernel induced representations of some initial model having state xt and observation rt . We obtain zt , yt from rt , xt using kernel PCA [11, 19]. Inference is performed in a low-dimensional, non-linear, kernel induced latent state space (see Ä?Ĺš g. 1b and Ä?Ĺš g. 2 and (1)). For display or error reporting, we compute the original conditional p(x|r), or a temporally Ä?Ĺš ltered version p(xt |Rt ), Rt = (r1 , r2 , . . . , rt ), using a learned pre-image state map [3]. 2.1 Density Propagation for Continuous Conditional Chains For online Ä?Ĺš ltering, we compute the optimal distribution p(yt |Zt ) for the state yt , conditioned by observations Zt up to time t. The Ä?Ĺš ltered density can be recursively derived as: p(yt |Zt ) = p(yt |ytĂ˘ˆ’1 , zt )p(ytĂ˘ˆ’1 |ZtĂ˘ˆ’1 ) (1) ytĂ˘ˆ’1 We compute using a conditional mixture for p(yt |ytĂ˘ˆ’1 , zt ) (a Bayesian mixture of experts c.f . Ă‚Â§2.2) and the prior p(ytĂ˘ˆ’1 |ZtĂ˘ˆ’1 ), each having, say M components. We integrate M 2 pairwise products of Gaussians analytically. The means of the expanded posterior are clustered and the centers are used to initialize a reduced M -component Kullback-Leibler approximation that is reÄ?Ĺš ned using gradient descent [15]. The propagation rule (1) is similar to the one used for discrete state labels [9], but here we work with multivariate continuous state spaces and represent the local multimodal state conditionals using kBME (Ä?Ĺš g. 2), and not log-linear models [9] (these would require intractable normalization). This complex continuous model rules out inference based on Kalman Ä?Ĺš ltering or dynamic programming [9]. 2.2 Learning Bayesian Mixtures over Kernel Induced State Spaces (kBME) In order to model conditional mappings between low-dimensional non-linear spaces we rely on kernel dimensionality reduction and conditional mixture predictors. The authors of KDE [19] propose a powerful structured unimodal predictor. This works by decorrelating the output using kernel PCA and learning a ridge regressor between the input and each decorrelated output dimension. Our procedure is also based on kernel PCA but takes into account the structure of the studied visual problem where both inputs and outputs are likely to be low-dimensional and the mapping between them multivalued. The output variables xi are projected onto the column vectors of the principal space in order to obtain their principal coordinates y i . A z Ă˘ˆˆ P(Fr ) O p(y|z) kP CA ĂŽĹšr (r) Ă˘Š‚ Fr O / y Ă˘ˆˆ P(Fx ) O QQQ QQQ QQQ kP CA QQQ Q( ĂŽĹšx (x) Ă˘Š‚ Fx x Ă˘‰ˆ PreImage(y) O ĂŽĹšr ĂŽĹšx r Ă˘ˆˆ R Ă˘Š‚ Rr x Ă˘ˆˆ X Ă˘Š‚ Rx p(x|r) Ă˘‰ˆ p(x|y) Figure 2: The learned low-dimensional predictor, kBME, for computing p(x|r) Ă˘‰Ä„ p(xt |rt ), Ă˘ˆ€t. (We similarly learn p(xt |xtĂ˘ˆ’1 , rt ), with input (x, r) instead of r Ă˘€“ here we illustrate only p(x|r) for clarity.) The input r and the output x are decorrelated using Kernel PCA to obtain z and y respectively. The kernels used for the input and output are ĂŽĹš r and ĂŽĹšx , with induced feature spaces Fr and Fx , respectively. Their principal subspaces obtained by kernel PCA are denoted by P(Fr ) and P(Fx ), respectively. A conditional Bayesian mixture of experts p(y|z) is learned using the low-dimensional representation (z, y). Using learned local conditionals of the form p(yt |zt ) or p(yt |ytĂ˘ˆ’1 , zt ), temporal inference can be efÄ?Ĺš ciently performed in a low-dimensional kernel induced state space (see e.g. (1) and Ä?Ĺš g. 1b). For visualization and error measurement, the Ä?Ĺš ltered density, e.g. p(yt |Zt ), can be mapped back to p(xt |Rt ) using the pre-image c.f . (3). similar procedure is performed on the inputs ri to obtain zi . In order to relate the reduced feature spaces of z and y (P(Fr ) and P(Fx )), we estimate a probability distribution over mappings from training pairs (zi , yi ). We use a conditional Bayesian mixture of experts (BME) [7, 5] in order to account for ambiguity when mapping similar, possibly identical reduced feature inputs to very different feature outputs, as common in our problem (Ä?Ĺš g. 1a). This gives a model that is a conditional mixture of low-dimensional kernel-induced experts (kBME): M g(z|ĂŽÂ´ j )N (y|Wj z, ĂŽĹ j ) p(y|z) = (2) j=1 where g(z|ĂŽÂ´ j ) is a softmax function parameterized by ĂŽÂ´ j and (Wj , ĂŽĹ j ) are the parameters and the output covariance of expert j, here a linear regressor. As in many Bayesian settings [17, 5], the weights of the experts and of the gates, Wj and ĂŽÂ´ j , are controlled by hierarchical priors, typically Gaussians with 0 mean, and having inverse variance hyperparameters controlled by a second level of Gamma distributions. We learn this model using a double-loop EM and employ ML-II type approximations [8, 17] with greedy (weight) subset selection [17, 15]. Finally, the kBME algorithm requires the computation of pre-images in order to recover the state distribution x from itĂ˘€™s image y Ă˘ˆˆ P(Fx ). This is a closed form computation for polynomial kernels of odd degree. For more general kernels optimization or learning (regression based) methods are necessary [3]. Following [3, 19], we use a sparse Bayesian kernel regressor to learn the pre-image. This is based on training data (xi , yi ): p(x|y) = N (x|AĂŽĹšy (y), Ă˘„Ĺš) (3) with parameters and covariances (A, Ă˘„Ĺš). Since temporal inference is performed in the low-dimensional kernel induced state space, the pre-image function needs to be calculated only for visualizing results or for the purpose of error reporting. Propagating the result from the reduced feature space P(Fx ) to the output space X pro- duces a Gaussian mixture with M elements, having coefÄ?Ĺš cients g(z|ĂŽÂ´ j ) and components N (x|AĂŽĹšy (Wj z), AJĂŽĹšy ĂŽĹ j JĂŽĹšy A + Ă˘„Ĺš), where JĂŽĹšy is the Jacobian of the mapping ĂŽĹšy . 3 Experiments We run experiments on both real image sequences (Ä?Ĺš g. 5 and Ä?Ĺš g. 6) and on sequences where silhouettes were artiÄ?Ĺš cially rendered. The prediction error is reported in degrees (for mixture of experts, this is w.r.t. the most probable one, but see also Ä?Ĺš g. 4a), and normalized per joint angle, per frame. The models are learned using standard cross-validation. Pre-images are learned using kernel regressors and have average error 1.7o . Training Set and Model State Representation: For training we gather pairs of 3D human poses together with their image projections, here silhouettes, using the graphics package Maya. We use realistically rendered computer graphics human surface models which we animate using human motion capture [1]. Our original human representation (x) is based on articulated skeletons with spherical joints and has 56 skeletal d.o.f. including global translation. The database consists of 8000 samples of human activities including walking, running, turns, jumps, gestures in conversations, quarreling and pantomime. Image Descriptors: We work with image silhouettes obtained using statistical background subtraction (with foreground and background models). Silhouettes are informative for pose estimation although prone to ambiguities (e.g. the left / right limb assignment in side views) or occasional lack of observability of some of the d.o.f. (e.g. 180o ambiguities in the global azimuthal orientation for frontal views, e.g. Ä?Ĺš g. 1a). These are multiplied by intrinsic forward / backward monocular ambiguities [16]. As observations r, we use shape contexts extracted on the silhouette [4] (5 radial, 12 angular bins, size range 1/8 to 3 on log scale). The features are computed at different scales and sizes for points sampled on the silhouette. To work in a common coordinate system, we cluster all features in the training set into K = 50 clusters. To compute the representation of a new shape feature (a point on the silhouette), we Ă˘€˜projectĂ˘€™ onto the common basis by (inverse distance) weighted voting into the cluster centers. To obtain the representation (r) for a new silhouette we regularly sample 200 points on it and add all their feature vectors into a feature histogram. Because the representation uses overlapping features of the observation the elements of the descriptor are not independent. However, a conditional temporal framework (Ä?Ĺš g. 1b) Ä?Ĺš‚exibly accommodates this. For experiments, we use Gaussian kernels for the joint angle feature space and dot product kernels for the observation feature space. We learn state conditionals for p(yt |zt ) and p(yt |ytĂ˘ˆ’1 , zt ) using 6 dimensions for the joint angle kernel induced state space and 25 dimensions for the observation induced feature space, respectively. In Ä?Ĺš g. 3b) we show an evaluation of the efÄ?Ĺš cacy of our kBME predictor for different dimensions in the joint angle kernel induced state space (the observation feature space dimension is here 50). On the analyzed dancing sequence, that involves complex motions of the arms and the legs, the non-linear model signiÄ?Ĺš cantly outperforms alternative PCA methods and gives good predictions for compact, low-dimensional models.1 In tables 1 and 2, as well as Ä?Ĺš g. 4, we perform quantitative experiments on artiÄ?Ĺš cially rendered silhouettes. 3D ground truth joint angles are available and this allows a more 1 Running times: On a Pentium 4 PC (3 GHz, 2 GB RAM), a full dimensional BME model with 5 experts takes 802s to train p(xt |xtĂ˘ˆ’1 , rt ), whereas a kBME (including the pre-image) takes 95s to train p(yt |ytĂ˘ˆ’1 , zt ). The prediction time is 13.7s for BME and 8.7s (including the pre-image cost 1.04s) for kBME. The integration in (1) takes 2.67s for BME and 0.31s for kBME. The speed-up for kBME is signiÄ?Ĺš cant and likely to increase with original models having higher dimensionality. Prediction Error Number of Clusters 100 1000 100 10 1 1 2 3 4 5 6 7 8 Degree of Multimodality kBME KDE_RVM PCA_BME PCA_RVM 10 1 0 20 40 Number of Dimensions 60 Figure 3: (a, Left) Analysis of Ă˘€˜multimodalityĂ˘€™ for a training set. The input zt dimension is 25, the output yt dimension is 6, both reduced using kPCA. We cluster independently in (ytĂ˘ˆ’1 , zt ) and yt using many clusters (2100) to simulate small input perturbations and we histogram the yt clusters falling within each cluster in (ytĂ˘ˆ’1 , zt ). This gives intuition on the degree of ambiguity in modeling p(yt |ytĂ˘ˆ’1 , zt ), for small perturbations in the input. (b, Right) Evaluation of dimensionality reduction methods for an artiÄ?Ĺš cial dancing sequence (models trained on 300 samples). The kBME is our model Ă‚Â§2.2, whereas the KDE-RVM is a KDE model learned with a Relevance Vector Machine (RVM) [17] feature space map. PCA-BME and PCA-RVM are models where the mappings between feature spaces (obtained using PCA) is learned using a BME and a RVM. The non-linearity is signiÄ?Ĺš cant. Kernel-based methods outperform PCA and give low prediction error for 5-6d models. systematic evaluation. Notice that the kernelized low-dimensional models generally outperform the PCA ones. At the same time, they give results competitive to the ones of high-dimensional BME predictors, while being lower-dimensional and therefore signiÄ?Ĺš cantly less expensive for inference, e.g. the integral in (1). In Ä?Ĺš g. 5 and Ä?Ĺš g. 6 we show human motion reconstruction results for two real image sequences. Fig. 5 shows the good quality reconstruction of a person performing an agile jump. (Given the missing observations in a side view, 3D inference for the occluded body parts would not be possible without using prior knowledge!) For this sequence we do inference using conditionals having 5 modes and reduced 6d states. We initialize tracking using p(yt |zt ), whereas for inference we use p(yt |ytĂ˘ˆ’1 , zt ) within (1). In the second sequence in Ä?Ĺš g. 6, we simultaneously reconstruct the motion of two people mimicking domestic activities, namely washing a window and picking an object. Here we do inference over a product, 12-dimensional state space consisting of the joint 6d state of each person. We obtain good 3D reconstruction results, using only 5 hypotheses. Notice however, that the results are not perfect, there are small errors in the elbow and the bending of the knee for the subject at the l.h.s., and in the different wrist orientations for the subject at the r.h.s. This reÄ?Ĺš‚ects the bias of our training set. Walk and turn Conversation Run and turn left KDE-RR 10.46 7.95 5.22 RVM 4.95 4.96 5.02 KDE-RVM 7.57 6.31 6.25 BME 4.27 4.15 5.01 kBME 4.69 4.79 4.92 Table 1: Comparison of average joint angle prediction error for different models. All kPCA-based models use 6 output dimensions. Testing is done on 100 video frames for each sequence, the inputs are artiÄ?Ĺš cially generated silhouettes, not in the training set. 3D joint angle ground truth is used for evaluation. KDE-RR is a KDE model with ridge regression (RR) for the feature space mapping, KDE-RVM uses an RVM. BME uses a Bayesian mixture of experts with no dimensionality reduction. kBME is our proposed model. kPCAbased methods use kernel regressors to compute pre-images. Expert Prediction Frequency Ă˘ˆ’ Closest to Ground truth Frequency Ă˘ˆ’ Close to ground truth 30 25 20 15 10 5 0 1 2 3 4 Expert Number 14 10 8 6 4 2 0 5 1st Probable Prev Output 2nd Probable Prev Output 3rd Probable Prev Output 4th Probable Prev Output 5th Probable Prev Output 12 1 2 3 4 Current Expert 5 Figure 4: (a, Left) Histogram showing the accuracy of various expert predictors: how many times the expert ranked as the k-th most probable by the model (horizontal axis) is closest to the ground truth. The model is consistent (the most probable expert indeed is the most accurate most frequently), but occasionally less probable experts are better. (b, Right) Histograms show the dynamics of p(yt |ytĂ˘ˆ’1 , zt ), i.e. how the probability mass is redistributed among experts between two successive time steps, in a conversation sequence. Walk and turn back Run and turn KDE-RR 7.59 17.7 RVM 6.9 16.8 KDE-RVM 7.15 16.08 BME 3.6 8.2 kBME 3.72 8.01 Table 2: Joint angle prediction error computed for two complex sequences with walks, runs and turns, thus more ambiguity (100 frames). Models have 6 state dimensions. Unimodal predictors average competing solutions. kBME has signiÄ?Ĺš cantly lower error. Figure 5: Reconstruction of a jump (selected frames). Top: original image sequence. Middle: extracted silhouettes. Bottom: 3D reconstruction seen from a synthetic viewpoint. 4 Conclusion We have presented a probabilistic framework for conditional inference in latent kernelinduced low-dimensional state spaces. Our approach has the following properties: (a) Figure 6: Reconstructing the activities of 2 people operating in an 12-d state space (each person has its own 6d state). Top: original image sequence. Bottom: 3D reconstruction seen from a synthetic viewpoint. Accounts for non-linear correlations among input or output variables, by using kernel nonlinear dimensionality reduction (kPCA); (b) Learns probability distributions over mappings between low-dimensional state spaces using conditional Bayesian mixture of experts, as required for accurate prediction. In the resulting low-dimensional kBME predictor ambiguities and multiple solutions common in visual, inverse perception problems are accurately represented. (c) Works in a continuous, conditional temporal probabilistic setting and offers a formal management of uncertainty. We show comparisons that demonstrate how the proposed approach outperforms regression, PCA or KDE alone for reconstructing the 3D human motion in monocular video. Future work we will investigate scaling aspects for large training sets and alternative structured prediction methods. References [1] CMU Human Motion DataBase. Online at http://mocap.cs.cmu.edu/search.html, 2003. [2] A. Agarwal and B. Triggs. 3d human pose from silhouettes by Relevance Vector Regression. In CVPR, 2004. [3] G. Bakir, J. Weston, and B. Scholkopf. Learning to Ä?Ĺš nd pre-images. In NIPS, 2004. [4] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24, 2002. [5] C. Bishop and M. Svensen. Bayesian mixtures of experts. In UAI, 2003. [6] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In CVPR, 2000. [7] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, (6):181Ă˘€“214, 1994. [8] D. Mackay. Bayesian interpolation. Neural Computation, 4(5):720Ă˘€“736, 1992. [9] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, 2000. [10] R. Rosales and S. Sclaroff. Learning Body Pose Via Specialized Maps. In NIPS, 2002. [11] B. SchĂ‚Â¨ lkopf, A. Smola, and K. MĂ‚Â¨ ller. Nonlinear component analysis as a kernel eigenvalue o u problem. Neural Computation, 10:1299Ă˘€“1319, 1998. [12] G. Shakhnarovich, P. Viola, and T. Darrell. Fast Pose Estimation with Parameter Sensitive Hashing. In ICCV, 2003. [13] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard. Tracking Loose-limbed People. In CVPR, 2004. [14] C. Sminchisescu and A. Jepson. Generative Modeling for Continuous Non-Linearly Embedded Visual Inference. In ICML, pages 759Ă˘€“766, Banff, 2004. [15] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative Density Propagation for 3D Human Motion Estimation. In CVPR, 2005. [16] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D Human Tracking. In CVPR, volume 1, pages 69Ă˘€“76, Madison, 2003. [17] M. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. JMLR, 2001. [18] C. Tomasi, S. Petrov, and A. Sastry. 3d tracking = classiÄ?Ĺš cation + interpolation. In ICCV, 2003. [19] J. Weston, O. Chapelle, A. Elisseeff, B. Scholkopf, and V. Vapnik. Kernel dependency estimation. In NIPS, 2002.

6 0.12967639 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

7 0.11151282 98 nips-2005-Infinite latent feature models and the Indian buffet process

8 0.1089641 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

9 0.10353869 50 nips-2005-Convex Neural Networks

10 0.096473396 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

11 0.094517969 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

12 0.08502984 30 nips-2005-Assessing Approximations for Gaussian Process Classification

13 0.074019276 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

14 0.072568297 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

15 0.070483498 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

16 0.068401106 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

17 0.064490855 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

18 0.062652916 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

19 0.061478686 52 nips-2005-Correlated Topic Models

20 0.061091524 102 nips-2005-Kernelized Infomax Clustering

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.225), (1, 0.024), (2, 0.012), (3, 0.11), (4, 0.127), (5, -0.317), (6, 0.247), (7, -0.089), (8, 0.09), (9, 0.142), (10, 0.008), (11, 0.248), (12, -0.309), (13, -0.013), (14, 0.062), (15, 0.043), (16, 0.038), (17, -0.107), (18, 0.128), (19, -0.026), (20, 0.103), (21, -0.032), (22, 0.092), (23, 0.003), (24, -0.053), (25, 0.05), (26, 0.013), (27, -0.098), (28, 0.056), (29, -0.038), (30, -0.095), (31, 0.0), (32, 0.032), (33, -0.034), (34, -0.038), (35, 0.062), (36, -0.022), (37, -0.049), (38, 0.034), (39, -0.072), (40, 0.029), (41, 0.048), (42, -0.055), (43, -0.033), (44, 0.002), (45, -0.104), (46, 0.004), (47, -0.005), (48, 0.078), (49, -0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96809268 80 nips-2005-Gaussian Process Dynamical Models

Author: Jack Wang, Aaron Hertzmann, David M. Blei

2 0.84414279 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

Author: Aaron Shon, Keith Grochow, Aaron Hertzmann, Rajesh P. Rao

3 0.70869964 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models

Author: Purnamrita Sarkar, Andrew W. Moore

4 0.56876016 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy

5 0.45960128 45 nips-2005-Conditional Visual Tracking in Kernel Space

Author: Cristian Sminchisescu, Atul Kanujia, Zhiguo Li, Dimitris Metaxas

6 0.3978194 81 nips-2005-Gaussian Processes for Multiuser Detection in CDMA receivers

7 0.38931164 35 nips-2005-Bayesian model learning in human visual perception

8 0.3862344 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

9 0.3829996 73 nips-2005-Fast biped walking with a reflexive controller and real-time policy searching

10 0.38225999 98 nips-2005-Infinite latent feature models and the Indian buffet process

11 0.34185609 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

12 0.30853435 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

13 0.30620554 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

14 0.30059776 50 nips-2005-Convex Neural Networks

15 0.29968315 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters

16 0.29801363 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

17 0.29359865 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

18 0.29350665 156 nips-2005-Prediction and Change Detection

19 0.28669751 30 nips-2005-Assessing Approximations for Gaussian Process Classification

20 0.28229889 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.027), (10, 0.034), (27, 0.022), (31, 0.037), (34, 0.06), (41, 0.014), (50, 0.011), (55, 0.018), (65, 0.015), (69, 0.092), (73, 0.02), (77, 0.012), (88, 0.506), (91, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99081933 80 nips-2005-Gaussian Process Dynamical Models

Author: Jack Wang, Aaron Hertzmann, David M. Blei

2 0.98749834 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

Author: Vinod Nair, Geoffrey E. Hinton

Abstract: We describe a generative model for handwritten digits that uses two pairs of opposing springs whose stiffnesses are controlled by a motor program. We show how neural networks can be trained to infer the motor programs required to accurately reconstruct the MNIST digits. The inferred motor programs can be used directly for digit classiﬁcation, but they can also be used in other ways. By adding noise to the motor program inferred from an MNIST image we can generate a large set of very different images of the same class, thus enlarging the training set available to other methods. We can also use the motor programs as additional, highly informative outputs which reduce overﬁtting when training a feed-forward classiﬁer. 1 Overview The idea that patterns can be recognized by ﬁguring out how they were generated has been around for at least half a century [1, 2] and one of the ﬁrst proposed applications was the recognition of handwriting using a generative model that involved pairs of opposing springs [3, 4]. The “analysis-by-synthesis” approach is attractive because the true generative model should provide the most natural way to characterize a class of patterns. The handwritten 2’s in ﬁgure 1, for example, are very variable when viewed as pixels but they have very similar motor programs. Despite its obvious merits, analysis-by-synthesis has had few successes, partly because it is computationally expensive to invert non-linear generative models and partly because the underlying parameters of the generative model are unknown for most large data sets. For example, the only source of information about how the MNIST digits were drawn is the images themselves. We describe a simple generative model in which a pen is controlled by two pairs of opposing springs whose stiffnesses are speciﬁed by a motor program. If the sequence of stiffnesses is speciﬁed correctly, the model can produce images which look very like the MNIST digits. Using a separate network for each digit class, we show that backpropagation can be used to learn a “recognition” network that maps images to the motor programs required to produce them. An interesting aspect of this learning is that the network creates its own training data, so it does not require the training images to be labelled with motor programs. Each recognition network starts with a single example of a motor program and grows an “island of competence” around this example, progressively extending the region over which it can map small changes in the image to the corresponding small changes in the motor program (see ﬁgure 2). Figure 1: An MNIST image of a 2 and the additional images that can be generated by inferring the motor program and then adding random noise to it. The pixels are very different, but they are all clearly twos. Fairly good digit recognition can be achieved by using the 10 recognition networks to ﬁnd 10 motor programs for a test image and then scoring each motor program by its squared error in reconstructing the image. The 10 scores are then fed into a softmax classiﬁer. Recognition can be improved by using PCA to model the distribution of motor trajectories for each class and using the distance of a motor trajectory from the relevant PCA hyperplane as an additional score. Each recognition network is solving a difﬁcult global search problem in which the correct motor program must be found by a single, “open-loop” pass through the network. More accurate recognition can be achieved by using this open-loop global search to initialize an iterative, closed-loop local search which uses the error in the reconstructed image to revise the motor program. This requires reconstruction errors in pixel space to be mapped to corrections in the space of spring stiffnesses. We cannot backpropagate errors through the generative model because it is just a hand-coded computer program. So we learn “generative” networks, one per digit class, that emulate the generator. After learning, backpropagation through these generative networks is used to convert pixel reconstruction errors into stiffness corrections. Our ﬁnal system gives 1.82% error on the MNIST test set which is similar to the 1.7% achieved by a very different generative approach [5] but worse than the 1.53% produced by the best backpropagation networks or the 1.4% produced by support vector machines [6]. It is much worse than the 0.4% produced by convolutional neural networks that use cleverly enhanced training sets [7]. Recognition of test images is quite slow because it uses ten different recognition networks followed by iterative local search. There is, however, a much more efﬁcient way to make use of our ability to extract motor programs. They can be treated as additional output labels when using backpropagation to train a single, multilayer, discriminative neural network. These additional labels act as a very informative regularizer that reduces the error rate from 1.53% to 1.27% in a network with two hidden layers of 500 units each. This is a new method of improving performance that can be used in conjunction with other tricks such as preprocessing the images, enhancing the training set or using convolutional neural nets [8, 7]. 2 A simple generative model for drawing digits The generative model uses two pairs of opposing springs at right angles. One end of each spring is attached to a frictionless horizontal or vertical rail that is 39 pixels from the center of the image. The other end is attached to a “pen” that has signiﬁcant mass. The springs themselves are weightless and have zero rest length. The pen starts at the equilibrium position deﬁned by the initial stiffnesses of the four springs. It then follows a trajectory that is determined by the stiffness of each spring at each of the 16 subsequent time steps in the motor program. The mass is large compared with the rate at which the stiffnesses change, so the system is typically far from equilibrium as it follows the smooth trajectory. On each time step, the momentum is multiplied by 0.9 to simulate viscosity. A coarse-grain trajectory is computed by using one step of forward integration for each time step in the motor program, so it contains 17 points. The code is at www.cs.toronto.edu/∼ hinton/code. Figure 2: The training data for each class-speciﬁc recognition network is produced by adding noise to motor programs that are inferred from MNIST images using the current parameters of the recognition network. To initiate this process, the biases of the output units are set by hand so that they represent a prototypical motor program for the class. Given a coarse-grain trajectory, we need a way of assigning an intensity to each pixel. We tried various methods until we hand-evolved one that was able to reproduce the MNIST images fairly accurately, but we suspect that many other methods would be just as good. For each point on the coarse trajectory, we share two units of ink between the the four closest pixels using bilinear interpolation. We also use linear interpolation to add three ﬁne-grain trajectory points between every pair of coarse-grain points. These ﬁne-grain points also contribute ink to the pixels using bilinear interpolation, but the amount of ink they contribute is zero if they are less than one pixel apart and rises linearly to the same amount as the coarse-grain points if they are more than two pixels apart. This generates a thin skeleton with a fairly uniform ink density. To ﬂesh-out the skeleton, we use two “ink parameters”, a a a a a, b, to specify a 3 × 3 kernel of the form b(1 + a)[ 12 , a , 12 ; a , 1 − a, a ; 12 , a , 12 ] which 6 6 6 6 is convolved with the image four times. Finally, the pixel intensities are clipped to lie in the interval [0,1]. The matlab code is at www.cs.toronto.edu/∼ hinton/code. The values of 2a and b/1.5 are additional, logistic outputs of the recognition networks1 . 3 Training the recognition networks The obvious way to learn a recognition network is to use a training set in which the inputs are images and the target outputs are the motor programs that were used to generate those images. If we knew the distribution over motor programs for a given digit class, we could easily produce such a set by running the generator. Unfortunately, the distribution over motor programs is exactly what we want to learn from the data, so we need a way to train 1 We can add all sorts of parameters to the hand-coded generative model and then get the recognition networks to learn to extract the appropriate values for each image. The global mass and viscosity as well as the spacing of the rails that hold the springs can be learned. We can even implement afﬁnelike transformations by attaching the four springs to endpoints whose eight coordinates are given by the recognition networks. These extra parameters make the learning slower and, for the normalized digits, they do not improve discrimination, probably because they help the wrong digit models as much as the right one. the recognition network without knowing this distribution in advance. Generating scribbles from random motor programs will not work because the capacity of the network will be wasted on irrelevant images that are far from the real data. Figure 2 shows how a single, prototype motor program can be used to initialize a learning process that creates its own training data. The prototype consists of a sequence of 4 × 17 spring stiffnesses that are used to set the biases on 68 of the 70 logistic output units of the recognition net. If the weights coming from the 400 hidden units are initially very small, the recognition net will then output a motor program that is a close approximation to the prototype, whatever the input image. Some random noise is then added to this motor program and it is used to generate a training image. So initially, all of the generated training images are very similar to the one produced by the prototype. The recognition net will therefore devote its capacity to modeling the way in which small changes in these images map to small changes in the motor program. Images in the MNIST training set that are close to the prototype will then be given their correct motor programs. This will tend to stretch the distribution of motor programs produced by the network along the directions that correspond to the manifold on which the digits lie. As time goes by, the generated training set will expand along the manifold for that digit class until all of the MNIST training images of that class are well modelled by the recognition network. It takes about 10 hours in matlab on a 3 GHz Xeon to train each recognition network. We use minibatches of size 100, momentum of 0.9, and adaptive learning rates on each connection that increase additively when the sign of the gradient agrees with the sign of the previous weight change and decrease multiplicatively when the signs disagree [9]. The net is generating its own training data, so the objective function is always changing which makes it inadvisable to use optimization methods that go as far as they can in a carefully chosen direction. Figures 3 and 4 show some examples of how well the recognition nets perform after training. Nearly all models achieve an average squared pixel error of less than 15 per image on their validation set (pixel intensities are between 0 and 1 with a preponderance of extreme values). The inferred motor programs are clearly good enough to capture the diverse handwriting styles in the data. They are not good enough, however, to give classiﬁcation performance comparable to the state-of-the-art on the MNIST database. So we added a series of enhancements to the basic system to improve the classiﬁcation accuracy. 4 Enhancements to the basic system Extra strokes in ones and sevens. One limitation of the basic system is that it draws digits using only a single stroke (i.e. the trajectory is a single, unbroken curve). But when people draw digits, they often add extra strokes to them. Two of the most common examples are the dash at the bottom of ones, and the dash through the middle of sevens (see examples in ﬁgure 5). About 2.2% of ones and 13% of sevens in the MNIST training set are dashed and not modelling the dashes reduces classiﬁcation accuracy signiﬁcantly. We model dashed ones and sevens by augmenting their basic motor programs with another motor program to draw the dash. For example, a dashed seven is generated by ﬁrst drawing an ordinary seven using the motor program computed by the seven model, and then drawing the dash with a motor program computed by a separate neural network that models only dashes. Dashes in ones and sevens are modeled with two different networks. Their training proceeds the same way as with the other models, except now there are only 50 hidden units and the training set contains only the dashed cases of the digit. (Separating these cases from the rest of the MNIST training set is easy because they can be quickly spotted by looking at the difference between the images and their reconstructions by the dashless digit model.) The net takes the entire image of a digit as input, and computes the motor program for just the dash. When reconstructing an unlabelled image as say, a seven, we compute both Figure 3: Examples of validation set images reconstructed by their corresponding model. In each case the original image is on the left and the reconstruction is on the right. Superimposed on the original image is the pen trajectory. the dashed and dashless versions of seven and pick the one with the lower squared pixel error to be that image’s reconstruction as a seven. Figure 5 shows examples of images reconstructed using the extra stroke. Local search. When reconstructing an image in its own class, a digit model often produces a sensible, overall approximation of the image. However, some of the ﬁner details of the reconstruction may be slightly wrong and need to be ﬁxed up by an iterative local search that adjusts the motor program to reduce the reconstruction error. We ﬁrst approximate the graphics model with a neural network that contains a single hidden layer of 500 logistic units. We train one such generative network for each of the ten digits and for the dashed version of ones and sevens (for a total of 12 nets). The motor programs used for training are obtained by adding noise to the motor programs inferred from the training data by the relevant, fully trained recognition network. The images produced from these motor programs by the graphics model are used as the targets for the supervised learning of each generative network. Given these targets, the weight updates are computed in the same way as for the recognition networks. Figure 4: To model 4’s we use a single smooth trajectory, but turn off the ink for timesteps 9 and 10. For images in which the pen does not need to leave the paper, the recognition net ﬁnds a trajectory in which points 8 and 11 are close together so that points 9 and 10 are not needed. For 5’s we leave the top until last and turn off the ink for timesteps 13 and 14. Figure 5: Examples of dashed ones and sevens reconstructed using a second stroke. The pen trajectory for the dash is shown in blue, superimposed on the original image. Initial squared pixel error = 33.8 10 iterations, error = 15.2 20 iterations, error = 10.5 30 iterations, error = 9.3 Figure 6: An example of how local search improves the detailed registration of the trajectory found by the correct model. After 30 iterations, the squared pixel error is less than a third of its initial value. Once the generative network is trained, we can use it to iteratively improve the initial motor program computed by the recognition network for an image. The main steps in one iteration are: 1) compute the error between the image and the reconstruction generated from the current motor program by the graphics model; 2) backpropagate the reconstruction error through the generative network to calculate its gradient with respect to the motor program; 3) compute a new motor program by taking a step along the direction of steepest descent plus 0.5 times the previous step. Figure 6 shows an example of how local search improves the reconstruction by the correct model. Local search is usually less effective at improving the ﬁts of the wrong models, so it eliminates about 20% of the classiﬁcation errors on the validation set. PCA model of the image residuals. The sum of squared pixel errors is not the best way of comparing an image with its reconstruction, because it treats the residual pixel errors as independent and zero-mean Gaussian distributed, which they are not. By modelling the structure in the residual vectors, we can get a better estimate of the conditional probability of the image given the motor program. For each digit class, we construct a PCA model of the image residual vectors for the training images. Then, given a test image, we project the image residual vector produced by each inferred motor program onto the relevant PCA hyperplane and compute the squared distance between the residual and its projection. This gives ten scores for the image that measure the quality of its reconstructions by the digit models. We don’t discard the old sum of squared pixel errors as they are still useful for classifying most images correctly. Instead, all twenty scores are used as inputs to the classiﬁer, which decides how to combine both types of scores to achieve high classiﬁcation accuracy. PCA model of trajectories. Classifying an image by comparing its reconstruction errors for the different digit models tacitly relies on the assumption that the incorrect models will reconstruct the image poorly. Since the models have only been trained on images in their Squared error = 24.9, Shape prior score = 31.5 Squared error = 15.0, Shape prior score = 104.2 Figure 7: Reconstruction of a two image by the two model (left box) and by the three model (right box), with the pen trajectory superimposed on the original image. The three model sharply bends the bottom of its trajectory to better explain the ink, but the trajectory prior for three penalizes it with a high score. The two model has a higher squared error, but a much lower prior score, which allows the classiﬁer to correctly label the image. own class, they often do reconstruct images from other classes poorly, but occasionally they ﬁt an image from another class well. For example, ﬁgure 7 shows how the three model reconstructs a two image better than the two model by generating a highly contorted three. This problem becomes even more pronounced with local search which sometimes contorts the wrong model to ﬁt the image really well. The solution is to learn a PCA model of the trajectories that a digit model infers from images in its own class. Given a test image, the trajectory computed by each digit model is scored by its squared distance from the relevant PCA hyperplane. These 10 “prior” scores are then given to the classiﬁer along with the 20 “likelihood” scores described above. The prior scores eliminate many classiﬁcation mistakes such as the one in ﬁgure 7. 5 Classiﬁcation results To classify a test image, we apply multinomial logistic regression to the 30 scores – i.e. we use a neural network with no hidden units, 10 softmax output units and a cross-entropy error. The net is trained by gradient descent using the scores for the validation set images. To illustrate the gain in classiﬁcation accuracy achieved by the enhancements explained above, table 1 gives the percent error on the validation set as each enhancement is added to the system. Together, the enhancements almost halve the number of mistakes. Enhancements None 1 1, 2 1, 2, 3 1, 2, 3, 4 Validation set % error 4.43 3.84 3.01 2.67 2.28 Test set % error 1.82 Table 1: The gain in classiﬁcation accuracy on the validation set as the following enhancements are added: 1) extra stroke for dashed ones and sevens, 2) local search, 3) PCA model of image residual, and 4) PCA trajectory prior. To avoid using the test set for model selection, the performance on the ofﬁcial test set was only measured for the ﬁnal system. 6 Discussion After training a single neural network to output both the class label and the motor program for all classes (as described in section 1) we tried ignoring the label output and classifying the test images by using the cost, under 10 different PCA models, of the trajectory deﬁned by the inferred motor program. Each PCA model was ﬁtted to the trajectories extracted from the training images for a given class. This gave 1.80% errors which is as good as the 1.82% we got using the 10 separate recognition networks and local search. This is quite surprising because the motor programs produced by the single network were simpliﬁed to make them all have the same dimensionality and they produced signiﬁcantly poorer reconstructions. By only using the 10 digit-speciﬁc recognition nets to create the motor programs for the training data, we get much faster recognition of test data because at test time we can use a single recognition network for all classes. It also means we do not need to trade-off prior scores against image residual scores because there is only one image residual. The ability to extract motor programs could also be used to enhance the training set. [7] shows that error rates can be halved by using smooth vector distortion ﬁelds to create extra training data. They argue that these ﬁelds simulate “uncontrolled oscillations of the hand muscles dampened by inertia”. Motor noise may be better modelled by adding noise to an actual motor program as shown in ﬁgure 1. Notice that this produces a wide variety of non-blurry images and it can also change the topology. The techniques we have used for extracting motor programs from digit images may be applicable to speech. There are excellent generative models that can produce almost perfect speech if they are given the right formant parameters [10]. Using one of these generative models we may be able to train a large number of specialized recognition networks to extract formant parameters from speech without requiring labeled training data. Once this has been done, labeled data would be available for training a single feed-forward network that could recover accurate formant parameters which could be used for real-time recognition. Acknowledgements We thank Steve Isard, David MacKay and Allan Jepson for helpful discussions. This research was funded by NSERC, CFI and OIT. GEH is a fellow of the Canadian Institute for Advanced Research and holds a Canada Research Chair in machine learning. References [1] D. M. MacKay. Mindlike behaviour in artefacts. British Journal for Philosophy of Science, 2:105–121, 1951. [2] M. Halle and K. Stevens. Speech recognition: A model and a program for research. IRE Transactions on Information Theory, IT-8 (2):155–159, 1962. [3] Murray Eden. Handwriting and pattern recognition. IRE Transactions on Information Theory, IT-8 (2):160–166, 1962. [4] J.M. Hollerbach. An oscillation theory of handwriting. Biological Cybernetics, 39:139–156, 1981. [5] G. Mayraz and G. E. Hinton. Recognizing hand-written digits using hierarchical products of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:189–197, 2001. [6] D. Decoste and B. Schoelkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002. [7] Patrice Y. Simard, Dave Steinkraus, and John Platt. Best practice for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recogntion (ICDAR), IEEE Computer Society, Los Alamitos, pages 958–962, 2003. [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. [9] A. Jacobs R. Increased Rates of Convergence Through Learning Rate Adaptation. Technical Report: UM-CS-1987-117. University of Massachusetts, Amherst, MA, 1987. [10] W. Holmes, J. Holmes, and M. Judd. Extension of the bandwith of the jsru parallel-formant synthesizer for high quality synthesis of male and female speech. In Proceedings of ICASSP 90 (1), pages 313–316, 1990.

3 0.987131 19 nips-2005-Active Learning for Misspecified Models

Author: Masashi Sugiyama

Abstract: Active learning is the problem in supervised learning to design the locations of training input points so that the generalization error is minimized. Existing active learning methods often assume that the model used for learning is correctly speciﬁed, i.e., the learning target function can be expressed by the model at hand. In many practical situations, however, this assumption may not be fulﬁlled. In this paper, we ﬁrst show that the existing active learning method can be theoretically justiﬁed under slightly weaker condition: the model does not have to be correctly speciﬁed, but slightly misspeciﬁed models are also allowed. However, it turns out that the weakened condition is still restrictive in practice. To cope with this problem, we propose an alternative active learning method which can be theoretically justiﬁed for a wider class of misspeciﬁed models. Thus, the proposed method has a broader range of applications than the existing method. Numerical studies show that the proposed active learning method is robust against the misspeciﬁcation of models and is thus reliable. 1 Introduction and Problem Formulation Let us discuss the regression problem of learning a real-valued function Ê from training examples ´Ü Ý µ ´Ü µ · ¯ Ý Ò ´Üµ deﬁned on ½ where ¯ Ò ½ are i.i.d. noise with mean zero and unknown variance ¾. We use the following linear regression model for learning. ´Ü µ ´µ Ô ½ « ³ ´Ü µ where ³ Ü Ô ½ are ﬁxed linearly independent functions and are parameters to be learned. ´ µ « ´«½ «¾ « Ô µ We evaluate the goodness of the learned function Ü by the expected squared test error over test input points and noise (i.e., the generalization error). When the test input points are drawn independently from a distribution with density ÔØ Ü , the generalization error is expressed as ´ µ ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø where ¯ denotes the expectation over the noise ¯ Ò Ô ´Üµ is known1. ½. In the following, we suppose that Ø In a standard setting of regression, the training input points are provided from the environment, i.e., Ü Ò ½ independently follow the distribution with density ÔØ Ü . On the other hand, in some cases, the training input points can be designed by users. In such cases, it is expected that the accuracy of the learning result can be improved if the training input points are chosen appropriately, e.g., by densely locating training input points in the regions of high uncertainty. ´ µ Active learning—also referred to as experimental design—is the problem of optimizing the location of training input points so that the generalization error is minimized. In active learning research, it is often assumed that the regression model is correctly speciﬁed [2, 1, 3], i.e., the learning target function Ü can be expressed by the model. In practice, however, this assumption is often violated. ´ µ In this paper, we ﬁrst show that the existing active learning method can still be theoretically justiﬁed when the model is approximately correct in a strong sense. Then we propose an alternative active learning method which can also be theoretically justiﬁed for approximately correct models, but the condition on the approximate correctness of the models is weaker than that for the existing method. Thus, the proposed method has a wider range of applications. In the following, we suppose that the training input points Ü Ò ½ are independently drawn from a user-deﬁned distribution with density ÔÜ Ü , and discuss the problem of ﬁnding the optimal density function. ´µ 2 Existing Active Learning Method The generalization error deﬁned by Eq.(1) can be decomposed as ·Î is the (squared) bias term and Î is the variance term given by where ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø Î and ¯ ´Üµ ¯ ´Üµ ¾ Ô ´Üµ Ü Ø A standard way to learn the parameters in the regression model (1) is the ordinary leastsquares learning, i.e., parameter vector « is determined as follows. « ÇÄË It is known that «ÇÄË is given by Ö« Ò Ñ « ÇÄË where Ä ÇÄË ´ µ ½ Ò ´Ü µ Ý ½ Ä ÇÄË ³ ´Ü µ ¾ Ý and Ý ´Ý½ Ý¾ Ý Ò µ Let ÇÄË , ÇÄË and ÎÇÄË be , and Î for the learned function obtained by the ordinary least-squares learning, respectively. Then the following proposition holds. 1 In some application domains such as web page analysis or bioinformatics, a large number of unlabeled samples—input points without output values independently drawn from the distribution with density ÔØ ´Üµ—are easily gathered. In such cases, a reasonably good estimate of ÔØ ´Üµ may be obtained by some standard density estimation method. Therefore, the assumption that ÔØ ´Üµ is known may not be so restrictive. Proposition 1 ([2, 1, 3]) Suppose that the model is correctly speciﬁed, i.e., the learning target function Ü is expressed as ´µ Ô ´Ü µ Then ½ «£ ³ ´Üµ and ÎÇÄË are expressed as ÇÄË ¼ ÇÄË and Î ¾ ÇÄË Â ÇÄË where ØÖ´ÍÄ Â ÇÄË ÇÄË Ä ÇÄË µ ³ ´Üµ³ ´ÜµÔ ´Üµ Ü Í and Ø Therefore, for the correctly speciﬁed model (1), the generalization error as ÇÄË ¾ ÇÄË is expressed Â ÇÄË Based on this expression, the existing active learning method determines the location of training input points Ü Ò ½ (or the training input density ÔÜ Ü ) so that ÂÇÄË is minimized [2, 1, 3]. ´ µ 3 Analysis of Existing Method under Misspeciﬁcation of Models In this section, we investigate the validity of the existing active learning method for misspeciﬁed models. ´ µ Suppose the model does not exactly include the learning target function Ü , but it approximately includes it, i.e., for a scalar Æ such that Æ is small, Ü is expressed as ´ µ ´Ü µ ´Üµ · ÆÖ´Üµ where ´Üµ is the orthogonal projection of ´Üµ onto the span of residual Ö´Üµ is orthogonal to ³ ´Üµ ½ : Ô Ô ´Üµ ½ «£ ³ ´Üµ Ö´Üµ³ ´ÜµÔ ´Üµ Ü and In this case, the bias term Ø ¼ for ³ ´Üµ ½¾ Ô and the ½ Ô is expressed as ¾ ´ ´Üµ ´Üµµ¾ Ô ´Üµ Ü is constant which does not depend on the training input density Ô ´Üµ, we subtract ¯ ´Üµ ´Üµ Ô ´Üµ Ü · where Ø Ø Since in the following discussion. Ü Then we have the following lemma2 . Lemma 2 For the approximately correct model (3), we have ÇÄË ÇÄË Î ÇÄË where 2 Þ Æ ¾ ÍÄ ¾Â Ö ÇÄË Þ Ä Þ Ç ´Ò ½ µ ´Ö´Ü½µ Ö´Ü¾µ Ö ÇÄË Ö Ô Ö ´Ü Proofs of lemmas are provided in an extended version [6]. Ò µµ Ç ´Æ ¾ µ Note that the asymptotic order in Eq.(1) is in probability since ÎÇÄË is a random variable that includes Ü Ò ½ . The above lemma implies that ½ Ó ´Ò ¾ µ Therefore, the existing active learning method of minimizing Â is still justiﬁed if Æ ½ ¾ µ. However, when Æ Ó ´Ò ½ µ, the existing method may not work well because ¾ Ó ´Ò the bias term is not smaller than the variance term Î , so it can not be ÇÄË ¾ · Ó ´Ò ½µ Â ÇÄË if Æ Ô Ô ÇÄË Ô Ô ÇÄË ÇÄË neglected. 4 New Active Learning Method In this section, we propose a new active learning method based on the weighted leastsquares learning. 4.1 Weighted Least-Squares Learning When the model is correctly speciﬁed, «ÇÄË is an unbiased estimator of «£ . However, for misspeciﬁed models, «ÇÄË is generally biased even asymptotically if Æ ÇÔ . ´½µ The bias of «ÇÄË is actually caused by the covariate shift [5]—the training input density ÔÜ Ü is different from the test input density ÔØ Ü . For correctly speciﬁed models, inﬂuence of the covariate shift can be ignored, as the existing active learning method does. However, for misspeciﬁed models, we should explicitly cope with the covariate shift. ´µ ´ µ Under the covariate shift, it is known that the following weighted least-squares learning is [5]. asymptotically unbiased even if Æ ÇÔ ´½µ Ô ´Ü µ Ô ´Ü µ ½ Ò Ö« Ò Ñ « Ï ÄË ¾ ´Ü µ Ý Ø Ü Asymptotic unbiasedness of «Ï ÄË would be intuitively understood by the following identity, which is similar in spirit to importance sampling: ´Üµ ´Üµ ¾ Ô ´Ü µ Ü ´Üµ ´Üµ Ø ´µ ¾ Ô ´Üµ Ô ´Ü µ Ü Ô ´Üµ Ø Ü Ü In the following, we assume that ÔÜ Ü is strictly positive for all Ü. Let matrix with the -th diagonal element be the diagonal Ô ´Ü µ Ô ´Ü µ Ø Ü Then it can be conﬁrmed that «Ï ÄË is given by « Ä Ï ÄË Ï ÄË Ý where Ä ´ Ï ÄË µ ½ 4.2 Active Learning Based on Weighted Least-Squares Learning Let Ï ÄË , Ï ÄË and ÎÏ ÄË be , and Î for the learned function obtained by the above weighted least-squares learning, respectively. Then we have the following lemma. Lemma 3 For the approximately correct model (3), we have Ï ÄË Î Æ ¾ ÍÄ ¾Â Ï ÄË where Ï ÄË Ï ÄË Â Ï ÄË Þ Ä Þ Ç ´Ò ½ µ Ö Ï ÄË Ö Ô Ô ØÖ´ÍÄ Ï ÄË Ä Ï ÄË Ç ´Æ ¾ Ò ½ µ µ This lemma implies that ¾ Â · Ó ´Ò ½µ ´½µ if Æ ÓÔ Based on this expression, we propose determining the training input density ÔÜ ÂÏ ÄË is minimized. Ï ÄË Ï ÄË Ô ´Üµ so that ´½µ The use of the proposed criterion ÂÏ ÄË can be theoretically justiﬁed when Æ ÓÔ , ½ while the existing criterion ÂÇÄË requires Æ ÓÔ Ò ¾ . Therefore, the proposed method has a wider range of applications. The effect of this extension is experimentally investigated in the next section. ´ 5 µ Numerical Examples We evaluate the usefulness of the proposed active learning method through experiments. Toy Data Set: setting. We ﬁrst illustrate how the proposed method works under a controlled ½ ´µ ´µ ½ · · ½¼¼ ´µ Let and the learning target function Ü be Ü Ü Ü¾ ÆÜ¿. Let Ò ½¼¼ be i.i.d. Gaussian noise with mean zero and standard deviation and ¯ . Let ÔØ Ü ½ be the Gaussian density with mean and standard deviation , which is assumed to be known here. Let Ô and the basis functions be ³ Ü Ü ½ for . Let us consider the following three cases. Æ , where each case corresponds to “correctly speciﬁed”, “approximately correct”, and “misspeciﬁed” (see Figure 1). We choose the training input density ÔÜ Ü from the Gaussian density with mean and standard , where deviation ¼¾ ¿ ´µ ¼ ¼ ¼¼ ¼ ¼ ¼ ½¼ ´µ ¼ ¼¿ ½¾¿ ¼¾ ¾ We compare the accuracy of the following three methods: (A) Proposed active learning criterion + WLS learning : The training input density is determined so that ÂÏ ÄË is minimized. Following the determined input density, training input points Ü ½¼¼ are created and corresponding output values Ý ½¼¼ ½ ½ are observed. Then WLS learning is used for estimating the parameters. (B) Existing active learning criterion + OLS learning [2, 1, 3]: The training input density is determined so that ÂÇÄË is minimized. OLS learning is used for estimating the parameters. (C) Passive learning + OLS learning: The test input density ÔØ Ü is used as the training input density. OLS learning is used for estimating the parameters. ´ µ First, we evaluate the accuracy of ÂÏ ÄË and ÂÇÄË as approximations of Ï ÄË and ÇÄË . The means and standard deviations of Ï ÄË , ÂÏ ÄË , ÇÄË , and ÂÇÄË over runs are (“correctly depicted as functions of in Figure 2. These graphs show that when Æ speciﬁed”), both ÂÏ ÄË and ÂÇÄË give accurate estimates of Ï ÄË and ÇÄË . When Æ (“approximately correct”), ÂÏ ÄË again works well, while ÂÇÄË tends to be negatively biased for large . This result is surprising since as illustrated in Figure 1, the learning target functions with Æ and Æ are visually quite similar. Therefore, it intuitively seems that the result of Æ is not much different from that of Æ . However, the simulation result shows that this slight difference makes ÂÇÄË unreliable. (“misspeciﬁed”), ÂÏ ÄË is still reasonably accurate, while ÂÇÄË is heavily When Æ biased. ½¼¼ ¼ ¼¼ ¼ ¼ ¼¼ ¼¼ ¼ These results show that as an approximation of the generalization error, ÂÏ ÄË is more robust against the misspeciﬁcation of models than ÂÇÄË , which is in good agreement with the theoretical analyses given in Section 3 and Section 4. Learning target function f(x) 8 δ=0 δ=0.04 δ=0.5 6 Table 1: The means and standard deviations of the generalization error for Toy data set. The best method and comparable ones by the t-test at the are described with boldface. signiﬁcance level The value of method (B) for Æ is extremely large but it is not a typo. 4 ± 2 0 −1.5 −1 −0.5 0 0.5 1 1.5 2 Input density functions 1.5 ¼ pt(x) Æ ¼ ½ ¦¼ ¼ px(x) 1 0.5 0 −1.5 −1 −0.5 0 0.5 1 1.5 2 Figure 1: Learning target function and input density functions. ¼ Æ (A) (B) (C) ¼¼ Æ −3 −3 −3 G−WLS 12 4 3 G−WLS 5 4 ¼ x 10 6 5 ½¼¿. “misspeciﬁed” x 10 G−WLS ¼ ¦¼ ¼ ¿¼¿ ¦ ½ ¦½ ½ ¿ ¾ ¦ ½ ¾¿ ¾ ¾¦¼ ¿ “approximately correct” x 10 6 Æ All values in the table are multiplied by Æ “correctly speciﬁed” ¦¼ ¼ ¾ ¼¦¼ ½¿ ¼¼ Æ ¾ ¼¾ ¦ ¼ ¼ 3 10 8 6 0.8 1.2 1.6 2 0.07 2.4 J−WLS 0.06 0.8 1.2 1.6 2 0.07 2.4 0.8 1.2 1.6 2 0.07 J−WLS 0.06 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 2.4 J−WLS 0.06 0.8 −3 x 10 1.2 1.6 2 2.4 G−OLS 5 0.03 0.8 −3 x 10 1.2 1.6 2 3 1.2 1.6 2 1.6 2.4 2 G−OLS 0.4 4 3 0.8 0.5 G−OLS 5 4 2.4 0.3 0.2 0.1 2 2 0.8 1.2 1.6 2 0.06 2.4 J−OLS 0.8 1.2 1.6 2 0.06 2.4 0.8 1.2 0.06 J−OLS 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 0.02 0.02 2.4 J−OLS 0.8 1.2 1.6 c 2 2.4 0.03 0.02 0.8 Figure 2: The means and error bars of functions of . 1.2 1.6 c Ï ÄË , 2 Â Ï ÄË 2.4 , 0.8 ÇÄË 1.2 1.6 c , and ÂÇÄË over 2 2.4 ½¼¼ runs as In Table 1, the mean and standard deviation of the generalization error obtained by each method is described. When Æ , the existing method (B) works better than the proposed method (A). Actually, in this case, training input densities that approximately minimize Ï ÄË and ÇÄË were found by ÂÏ ÄË and ÂÇÄË . Therefore, the difference of the errors is caused by the difference of WLS and OLS: WLS generally has larger variance than OLS. Since bias is zero for both WLS and OLS if Æ , OLS would be more accurate than WLS. Although the proposed method (A) is outperformed by the existing method (B), it still works better than the passive learning scheme (C). When Æ and Æ the proposed method (A) gives signiﬁcantly smaller errors than other methods. ¼ ¼ ¼¼ ¼ Overall, we found that for all three cases, the proposed method (A) works reasonably well and outperforms the passive learning scheme (C). On the other hand, the existing method (B) works excellently in the correctly speciﬁed case, although it tends to perform poorly once the correctness of the model is violated. Therefore, the proposed method (A) is found to be robust against the misspeciﬁcation of models and thus it is reliable. Table 2: The means and standard deviations of the test error for DELVE data sets. All values in the table are multiplied by ¿. Bank-8fm Bank-8fh Bank-8nm Bank-8nh (A) ¼ ¿½ ¦ ¼ ¼ ¾ ½¼ ¦ ¼ ¼ ¾ ¦ ½ ¾¼ ¿ ¦ ½ ½½ (B) ¦ ¦ ¦ ¦ (C) ¦ ¦ ¦ ¦ ½¼ ¼ ¼¼ ¼¿ ¼¼ ¾ ¾½ ¼ ¼ ¾ ¾¼ ¼ ¼ Kin-8fm Kin-8fh ½ ¦¼ ¼ ½ ¦¼ ¼ ½ ¼¦¼ ¼ (A) (B) (C) ¾ ½ ¼ ¿ ½ ½¿ ¾ ¿ ½¿ ¿ ½¿ Kin-8nm ¼¦¼ ½ ¿ ¦ ¼ ½¿ ¾ ¦¼ ¾ Kin-8nh ¿ ¦¼ ¼ ¿ ¼¦ ¼ ¼ ¿ ¦¼ ½ ¼ ¾¦¼ ¼ ¼ ¦¼ ¼ ¼ ½¦¼ ¼ (A)/(C) (B)/(C) (C)/(C) 1.2 1.1 1 0.9 Bank−8fm Bank−8fh Bank−8nm Bank−8nh Kin−8fm Kin−8fh Kin−8nm Kin−8nh Figure 3: Mean relative performance of (A) and (B) compared with (C). For each run, the test errors of (A) and (B) are normalized by the test error of (C), and then the values are averaged over runs. Note that the error bars were reasonably small so they were omitted. ½¼¼ Realistic Data Set: Here we use eight practical data sets provided by DELVE [4]: Bank8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, and Kin-8nh. Each data set includes samples, consisting of -dimensional input and -dimensional output values. For convenience, every attribute is normalized into . ½¾ ¼ ½℄ ½¾ ½ Suppose we are given all input points (i.e., unlabeled samples). Note that output values are unknown. From the pool of unlabeled samples, we choose Ò input points Ü ½¼¼¼ for training and observe the corresponding output values Ý ½¼¼¼. The ½ ½ task is to predict the output values of all unlabeled samples. ½¼¼¼ In this experiment, the test input density independent Gaussian density. Ô ´Üµ and Ø ´¾ ¾ ÅÄ Ô ´Üµ is unknown. Ø µ ÜÔ Ü ¾ ÅÄ So we estimate it using the ¾ ´¾¾ µ¡ ÅÄ where Å Ä are the maximum likelihood estimates of the mean and standard ÅÄ and the basis functions be deviation obtained from all unlabeled samples. Let Ô where Ø ³ ´Üµ ¼ ½ ÜÔ Ü Ø ¾ ¡ ¾ ¼ for ½¾ ¼ are template points randomly chosen from the pool of unlabeled samples. ´µ We select the training input density ÔÜ Ü from the independent Gaussian density with mean Å Ä and standard deviation Å Ä , where ¼ ¼ ¼ ¾ In this simulation, we can not create the training input points in an arbitrary location because we only have samples. Therefore, we ﬁrst create temporary input points following the determined training input density, and then choose the input points from the pool of unlabeled samples that are closest to the temporary input points. For each data set, we repeat this simulation times, by changing the template points Ø ¼ ½ in each run. ½¾ ½¼¼ ½¼¼ The means and standard deviations of the test error over runs are described in Table 2. The proposed method (A) outperforms the existing method (B) for ﬁve data sets, while it is outperformed by (B) for the other three data sets. We conjecture that the model used for learning is almost correct in these three data sets. This result implies that the proposed method (A) is slightly better than the existing method (B). Figure 3 depicts the relative performance of the proposed method (A) and the existing method (B) compared with the passive learning scheme (C). This shows that (A) outperforms (C) for all eight data sets, while (B) is comparable or is outperformed by (C) for ﬁve data sets. Therefore, the proposed method (A) is overall shown to work better than other schemes. 6 Conclusions We argued that active learning is essentially the situation under the covariate shift—the training input density is different from the test input density. When the model used for learning is correctly speciﬁed, the covariate shift does not matter. However, for misspeciﬁed models, we have to explicitly cope with the covariate shift. In this paper, we proposed a new active learning method based on the weighted least-squares learning. The numerical study showed that the existing method works better than the proposed method if model is correctly speciﬁed. However, the existing method tends to perform poorly once the correctness of the model is violated. On the other hand, the proposed method overall worked reasonably well and it consistently outperformed the passive learning scheme. Therefore, the proposed method would be robust against the misspeciﬁcation of models and thus it is reliable. The proposed method can be theoretically justiﬁed if the model is approximately correct in a weak sense. However, it is no longer valid for totally misspeciﬁed models. A natural future direction would be therefore to devise an active learning method which has theoretical guarantee with totally misspeciﬁed models. It is also important to notice that when the model is totally misspeciﬁed, even learning with optimal training input points would not be successful anyway. In such cases, it is of course important to carry out model selection. In active learning research—including the present paper, however, the location of training input points are designed for a single model at hand. That is, the model should have been chosen before performing active learning. Devising a method for simultaneously optimizing models and the location of training input points would be a more important and promising future direction. Acknowledgments: The author would like to thank MEXT (Grant-in-Aid for Young Scientists 17700142) for partial ﬁnancial support. References [1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artiﬁcial Intelligence Research, 4:129–145, 1996. [2] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972. [3] K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17–26, 2000. [4] C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996. [5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000. [6] M. Sugiyama. Active learning for misspeciﬁed models. Technical report, Department of Computer Science, Tokyo Institute of Technology, 2005.

4 0.98300081 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: We propose a probabilistic model based on Independent Component Analysis for learning multiple related tasks. In our model the task parameters are assumed to be generated from independent sources which account for the relatedness of the tasks. We use Laplace distributions to model hidden sources which makes it possible to identify the hidden, independent components instead of just modeling correlations. Furthermore, our model enjoys a sparsity property which makes it both parsimonious and robust. We also propose efﬁcient algorithms for both empirical Bayes method and point estimation. Our experimental results on two multi-label text classiﬁcation data sets show that the proposed approach is promising.

5 0.87512195 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

Author: Edward Meeds, Simon Osindero

Abstract: We present an inﬁnite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian speciﬁcation of the effective ‘gating network’ for the different experts. 1

6 0.84176564 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

7 0.83749318 136 nips-2005-Noise and the two-thirds power Law

8 0.8221184 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

9 0.82030386 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models

10 0.80571461 37 nips-2005-Benchmarking Non-Parametric Statistical Tests

11 0.80534858 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

12 0.79996181 35 nips-2005-Bayesian model learning in human visual perception

13 0.79601592 161 nips-2005-Radial Basis Function Network for Multi-task Learning

14 0.79302597 48 nips-2005-Context as Filtering

15 0.79045242 45 nips-2005-Conditional Visual Tracking in Kernel Space

16 0.78625643 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

17 0.7824859 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

18 0.78067619 171 nips-2005-Searching for Character Models

19 0.77489436 30 nips-2005-Assessing Approximations for Gaussian Process Classification

20 0.76546717 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity