nips nips2001 nips2001-108 knowledge-graph by maker-knowledge-mining

108 nips-2001-Learning Body Pose via Specialized Maps


Source: pdf

Author: Rómer Rosales, Stan Sclaroff

Abstract: A nonlinear supervised learning model, the Specialized Mappings Architecture (SMA), is described and applied to the estimation of human body pose from monocular images. The SMA consists of several specialized forward mapping functions and an inverse mapping function. Each specialized function maps certain domains of the input space (image features) onto the output space (body pose parameters). The key algorithmic problems faced are those of learning the specialized domains and mapping functions in an optimal way, as well as performing inference given inputs and knowledge of the inverse function. Solutions to these problems employ the EM algorithm and alternating choices of conditional independence assumptions. Performance of the approach is evaluated with synthetic and real video sequences of human motion. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: A nonlinear supervised learning model, the Specialized Mappings Architecture (SMA), is described and applied to the estimation of human body pose from monocular images. [sent-5, score-0.879]

2 The SMA consists of several specialized forward mapping functions and an inverse mapping function. [sent-6, score-0.913]

3 Each specialized function maps certain domains of the input space (image features) onto the output space (body pose parameters). [sent-7, score-0.83]

4 The key algorithmic problems faced are those of learning the specialized domains and mapping functions in an optimal way, as well as performing inference given inputs and knowledge of the inverse function. [sent-8, score-0.924]

5 Solutions to these problems employ the EM algorithm and alternating choices of conditional independence assumptions. [sent-9, score-0.149]

6 Performance of the approach is evaluated with synthetic and real video sequences of human motion. [sent-10, score-0.189]

7 1 Introduction: In everyday life, humans can easily estimate body part locations (body pose) from relatively low-resolution images of the projected 3D world (e.g. [sent-11, score-0.542]

8 However, body pose estimation is a very difficult computer vision problem. [sent-14, score-0.672]

9 It is believed that humans employ extensive prior knowledge about human body structure and motion in this task [10]. [sent-15, score-0.717]

10 Assuming this, we consider how a computer might learn the underlying structure and thereby infer body pose. [sent-16, score-0.474]

11 In computer vision, this task is usually posed as a tracking problem. [sent-17, score-0.169]

12 Typically, models comprised of 2D or 3D geometric primitives are designed for tracking a specific articulated body [13, 5, 2, 15]. [sent-18, score-0.732]

13 At each frame, these models are fitted to the image to optimize some cost function. [sent-19, score-0.033]

14 Careful manual placement of the model on the first frame is required, and tracking in subsequent frames tends to be sensitive to errors in initialization and numerical drift. [sent-20, score-0.33]

15 Generally, these systems cannot recover from tracking errors in the middle of a sequence. [sent-21, score-0.143]

16 To address these weaknesses, more complex dynamic models have been proposed [14, 13,9]; these methods learn a prior over some specific motion (such as walking). [sent-22, score-0.203]

17 This strong prior, however, substantially limits the generality of the motions that can be tracked. [sent-23, score-0.084]

18 Departing from the aforementioned tracking paradigm, in [8] a Gaussian probability model was learned for short human motion sequences. [sent-24, score-0.447]

19 In [17] dynamic programming was used to calculate the best global labeling according to the learned joint probability density function of the position and velocity of body features. [sent-25, score-0.464]

20 Still, in these approaches, the joint locations, correspondences, or model initialization must be provided by hand. [sent-26, score-0.066]

21 In [1], the manifold of human body dynamics was modeled via a hidden Markov model and learned via entropic minimization. [sent-27, score-0.618]

22 Although the approach presented here can be used to model dynamics, we argue that when general human motion dynamics are intended to be learned, the amount of training data, model complexity, and computational resources required are impractical. [sent-29, score-0.316]

23 As a consequence, models with large priors towards specific motions (e.g. [sent-30, score-0.129]

24 In this paper we describe a non-linear supervised learning algorithm, the Specialized Maps Architecture (SMA), for recovering articulated body pose from single monocular images. [sent-33, score-0.819]

25 This approach avoids the need for initialization and tracking per se, and reduces the above mentioned disadvantages. [sent-34, score-0.209]

26 2 Specialized Maps: There are at least two key characteristics of the problem we are trying to solve which make it different from other supervised learning problems. [sent-35, score-0.087]

27 We are trying to learn unknown probabilistic maps from the input space to the output space, but we have access to the map (in general probabilistic) from outputs to inputs. [sent-37, score-0.227]

28 In our pose estimation problem, it is easy to see how we can artificially, using computer graphics (CG), produce some visual features (e.g. [sent-38, score-0.39]

29 Second, it is one-to-many: one input can be associated with more than one output. [sent-41, score-0.044]

30 Features obtained from silhouettes (and many other visual features) are ambiguous. [sent-42, score-0.129]

31 This last observation precludes the use of standard algorithms for supervised learning that fit a single mapping function to the data. [sent-44, score-0.147]

32 Given input and output spaces $\Re^c$ and $\Re^t$, and the inverse function $\zeta : \Re^t \to \Re^c$, we describe a solution for these supervised learning problems. [sent-45, score-0.308]

33 Our approach consists of generating a series of $M$ functions $\phi_k : \Re^c \to \Re^t$. [sent-46, score-0.087]

34 Each of these functions is specialized to map only certain inputs (for a specialized sub-domain) better than others. [sent-47, score-1.042]

35 For example, each sub-domain can be a region of the input space. [sent-48, score-0.044]

36 However, the specialized sub-domain of $\phi_k$ can be more general than just a connected region in the input space. [sent-49, score-0.501]

37 Several other learning models use a similar concept of fitting surfaces to the observed data by splitting the input space into several regions and approximating simpler functions in these regions (e.g. [sent-50, score-0.095]

38 However, in these approaches, the inverse map is not incorporated in the estimation algorithm because it is not considered in the problem definition and the forward model is usually more complex, making inference and learning more difficult. [sent-53, score-0.534]

39 We propose to determine the specialized domains and functions using an approximate EM algorithm and to perform inference using, in an alternating fashion, the conditional independence assumptions specified by the forward and inverse models. [sent-55, score-1.081]
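
To make these pieces concrete, here is a minimal sketch of how the SMA components named above could be organized, assuming linear specialized maps and treating the inverse function as a user-supplied callable; the class and attribute names are illustrative, not the authors' implementation.

```python
import numpy as np

class SpecializedMapsArchitecture:
    """Minimal container for SMA components (illustrative names, not the authors' code).

    - phi_k: M specialized forward maps from features (R^c) to pose (R^t); here
      assumed linear, phi_k(v) = W_k @ v + b_k, as a simplification.
    - lam[k]: prior probability P(y = k) that map k generated a data point.
    - Sigma[k]: error covariance of map k.
    - zeta: the known inverse function (pose -> features), e.g. a CG renderer
      plus feature extraction, supplied by the user.
    """

    def __init__(self, num_maps, in_dim, out_dim, zeta, rng=None):
        rng = np.random.default_rng(rng)
        self.M = num_maps
        self.W = [0.01 * rng.standard_normal((out_dim, in_dim)) for _ in range(num_maps)]
        self.b = [np.zeros(out_dim) for _ in range(num_maps)]
        self.lam = np.full(num_maps, 1.0 / num_maps)             # P(y = k)
        self.Sigma = [np.eye(out_dim) for _ in range(num_maps)]  # error covariances
        self.zeta = zeta                                         # inverse map h -> x

    def phi(self, k, v):
        """Output of specialized map k for feature vector v."""
        return self.W[k] @ v + self.b[k]
```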

40 3 Probabilistic Model: Let the training set of output-input observations be $Z = \{z_i = (\psi_i, v_i)\}$. [sent-59, score-0.029]

41 Let $y_i \in \{1, \ldots, M\}$ denote labels for the specialized functions; $y_i$ can be thought of as the function number used to map data point $i$, and thus $M$ is the number of specialized mapping functions. [sent-78, score-1.101]

42 The parameters are $\theta = (\theta_1, \ldots, \theta_M, \lambda)$, where $\theta_k$ represents the parameters of mapping function $k$ and $\lambda = (\lambda_1, \ldots, \lambda_M)$, with $\lambda_k$ representing $P(y_i = k \mid \theta)$: the prior probability that the mapping function with label $k$ will be used to map an unknown point. [sent-82, score-0.195]

43 Using Bayes' rule and assuming independence of observations given $\theta$, we have the log-probability of our data given the model, $\log p(Z \mid \theta)$, which we want to maximize: $\arg\max_\theta \sum_i \log \sum_k p(\psi_i \mid v_i, y_i = k, \theta)\, P(y_i = k \mid \theta)\, p(v_i)$ (1), where we used the independence assumption $p(v \mid \theta) = p(v)$. [sent-84, score-0.092]
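
Under the Gaussian error model introduced below for $p(\psi_i \mid v_i, y_i = k, \theta)$, the objective of Eq. (1) can be evaluated as in this sketch, up to the constant $\sum_i \log p(v_i)$ term; the Gaussian form and the container from the previous sketch are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sma_log_likelihood(V, Psi, sma):
    """Log-likelihood of Eq. (1), up to the constant sum of log p(v_i).

    V:   (N, c) array of input features v_i.
    Psi: (N, t) array of output poses psi_i.
    sma: a SpecializedMapsArchitecture-like object (see earlier sketch); a Gaussian
         error model p(psi | v, y=k) = N(psi; phi_k(v), Sigma_k) is assumed.
    """
    total = 0.0
    for v, psi in zip(V, Psi):
        # mixture over the M specialized maps, weighted by the priors lambda_k
        per_map = [
            sma.lam[k] * multivariate_normal.pdf(psi, mean=sma.phi(k, v), cov=sma.Sigma[k])
            for k in range(sma.M)
        ]
        total += np.log(np.sum(per_map) + 1e-300)  # guard against underflow
    return total
```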

44 However, there exist practical approximate optimization procedures, one of which is Expectation Maximization (EM) [3, 4, 12]. [sent-87, score-0.03]

45 3.1 Learning: The EM algorithm is well known; therefore we only provide the derivations specific to SMAs. [sent-89, score-0.045]

46 The E-step consists of finding $P(y = k \mid z, \theta) = \hat{P}(y)$. [sent-90, score-0.103]

47 For the implementation described in this paper we use $\mathcal{N}(\psi_i; \phi_k(v_i, \theta_k), \Sigma_k)$, where $\theta_k$ are the parameters of the $k$-th specialized function and $\Sigma_k$ is the error covariance of specialized function $k$. [sent-92, score-0.914]
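
A sketch of the corresponding E-step, computing the responsibility of each specialized function for each training pair under the Gaussian model just stated (same illustrative container as before).

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(V, Psi, sma):
    """E-step: responsibilities P(y_i = k | psi_i, v_i, theta) for each point and map.

    Assumes the Gaussian error model N(psi_i; phi_k(v_i), Sigma_k) named in the text;
    `sma` is the illustrative container from the earlier sketch.
    """
    N = len(V)
    R = np.zeros((N, sma.M))
    for i, (v, psi) in enumerate(zip(V, Psi)):
        for k in range(sma.M):
            R[i, k] = sma.lam[k] * multivariate_normal.pdf(
                psi, mean=sma.phi(k, v), cov=sma.Sigma[k])
        R[i] /= R[i].sum() + 1e-300   # normalize over the M maps
    return R
```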

48 One way to interpret this choice is that the error in estimating $\psi$, once we know which specialized function to use, follows a Gaussian distribution whose mean is the output of the specialized function and whose covariance is map dependent. [sent-93, score-1.037]

49 The M-step consists of finding $\theta^{(t)} = \arg\max_{\theta} E_{\hat{P}^{(t)}}[\log p(Z, y \mid \theta)]$. [sent-96, score-0.103]

50 To keep the formulation general, we have not defined the form of the specialized functions $\phi_k$. [sent-99, score-0.508]

51 Whether or not we can find a closed-form solution for the update of $\theta_k$ depends on the form of $\phi_k$. [sent-100, score-0.032]

52 In case $\phi_k$ yields a quadratic form, a closed-form update exists. [sent-102, score-0.032]
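
For the quadratic case just mentioned, e.g. specialized functions that are linear in their parameters, the M-step reduces to a responsibility-weighted least-squares problem per map, sketched below; the linear form, the ridge term, and the re-estimation formulas for $\lambda_k$ and $\Sigma_k$ are assumptions for illustration, not necessarily the paper's exact updates.

```python
import numpy as np

def m_step(V, Psi, R, sma, ridge=1e-6):
    """M-step for linear specialized maps phi_k(v) = W_k v + b_k (assumed form).

    R is the (N, M) responsibility matrix from the E-step. Each map solves a
    responsibility-weighted least-squares problem; lambda_k and Sigma_k are
    re-estimated from the same weights.
    """
    N, c = V.shape
    t = Psi.shape[1]
    Vb = np.hstack([V, np.ones((N, 1))])          # append bias column
    for k in range(sma.M):
        w = R[:, k]
        # weighted normal equations: (Vb' diag(w) Vb) A = Vb' diag(w) Psi
        A = np.linalg.solve(Vb.T @ (w[:, None] * Vb) + ridge * np.eye(c + 1),
                            Vb.T @ (w[:, None] * Psi))
        sma.W[k], sma.b[k] = A[:c].T, A[c]
        resid = Psi - Vb @ A
        sma.Sigma[k] = (resid.T @ (w[:, None] * resid)) / (w.sum() + 1e-12) \
                       + ridge * np.eye(t)
        sma.lam[k] = w.sum() / N
    return sma
```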

53 4 Inference: Once learning is accomplished, each specialized function maps (with different levels of accuracy) the input space. [sent-105, score-0.565]

54 If $p(h \mid x, y) = \mathcal{N}(h; \phi_y(x), \Sigma_y)$, the MAP estimate involves finding the maximum in a mixture of Gaussians. [sent-107, score-0.067]

55 However, no closed-form solution exists; moreover, we have not incorporated the potentially useful knowledge of the inverse function $\zeta$. [sent-108, score-0.216]

56 4.1 MAP by Using the Inverse Function $\zeta$: Access to a forward kinematics function $\zeta$ (called here the inverse function) allows us to formulate a different inference algorithm. [sent-110, score-0.483]

57 We are again interested in finding an optimal h* given an input x (e.g. [sent-111, score-0.111]

58 , an optimal body pose given features taken from an image). [sent-113, score-0.629]

59 5 Approximate Inference Using $\zeta$: Let us assume that we can approximate $\sum_y p(h \mid y, x) P(y)$ by a set of samples generated according to $p(h \mid y, x) P(y)$ and a kernel function $K(h, h_s)$. [sent-118, score-0.03]

60 An approximation to $\sum_y p(h \mid y, x) P(y)$ is formally built as $\frac{1}{S} \sum_{s=1}^{S} K(h, h_s)$, with the normalizing condition that $K(\cdot, h_s)$ integrates to one for any given $h_s$. [sent-123, score-0.03]

61 After some simple manipulations, this can be reduced to the following equivalent discrete optimization problem whose goal is to find the most likely sample $s^*$ (Eq. 9), where the last equivalence used the assumption $p(x \mid h) = \mathcal{N}(x; \zeta(h), \Sigma_\zeta)$. [sent-126, score-0.033]

62 If $K(h, h_s) = \mathcal{N}(h; h_s, \Sigma_{spl})$, we have $\hat{h} = \arg\max_h p(x \mid h) \sum_{s=1}^{S} \mathcal{N}(h; h_s, \Sigma_{spl})$. [sent-127, score-0.184]
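
A sketch of the sample-based inference just described: draw pose samples $h_s$ from the forward mixture, then score each sample by how well the inverse function maps it back to the observed features, $p(x \mid h_s) = \mathcal{N}(x; \zeta(h_s), \Sigma_\zeta)$. Sampling around the map means $\phi_k(x)$ and using an isotropic $\Sigma_\zeta$ are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_based_inference(x, sma, num_samples=200, sigma_zeta=1.0, rng=None):
    """Pick the most likely pose sample s* by scoring p(x | h_s) = N(x; zeta(h_s), Sigma_zeta).

    Samples are drawn from the forward mixture sum_k P(y=k) N(h; phi_k(x), Sigma_k);
    sigma_zeta (isotropic covariance for the inverse-function likelihood) is assumed.
    """
    rng = np.random.default_rng(rng)
    ks = rng.choice(sma.M, size=num_samples, p=sma.lam)   # pick one map per sample
    samples = np.array([rng.multivariate_normal(sma.phi(k, x), sma.Sigma[k]) for k in ks])
    cov = sigma_zeta * np.eye(len(x))
    scores = np.array([multivariate_normal.pdf(x, mean=sma.zeta(h), cov=cov)
                       for h in samples])
    return samples[np.argmax(scores)]                      # pose corresponding to s*
```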

63 5.1 A Deterministic Approximation Based on the Functions' Mean Output: The structure of inference in the SMA and the choice of probabilities $p(h \mid x, y)$ allow us to construct a new approximation that is considerably less expensive to compute and is deterministic. [sent-131, score-0.127]

64 Intuitively, the idea consists of asking each of the specialized functions $\phi_k$ what its most likely estimate for h is, given the observed input x. [sent-132, score-0.645]

65 The opinions of each of these specialized functions are then evaluated using our distribution $p(x \mid h)$, similar to the above sampling method. [sent-133, score-0.508]

66 Thus, by considering the means $\phi_k(x)$, we would be considering the most likely output of each specialized function. [sent-135, score-0.536]
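
The mean-output (MO) inference described here then reduces to scoring only the M means $\phi_k(x)$ under $p(x \mid h)$, a deterministic and much cheaper variant of the sampling scheme above. A minimal sketch, assuming the same illustrative container and an isotropic covariance for the inverse-function likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mean_output_inference(x, sma, sigma_zeta=1.0):
    """Mean-output (MO) inference: ask each specialized map for its mean estimate
    phi_k(x), then keep the hypothesis whose rendering zeta(h_k) best explains x."""
    cov = sigma_zeta * np.eye(len(x))
    hypotheses = [sma.phi(k, x) for k in range(sma.M)]   # one hypothesis per map
    scores = [multivariate_normal.pdf(x, mean=sma.zeta(h), cov=cov) for h in hypotheses]
    return hypotheses[int(np.argmax(scores))]
```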

67 We use Fig. 1(b) to illustrate the mean-output (MO) approximate inference process. [sent-138, score-0.157]

68 When generating an estimate of body pose, denoted $\hat{h}$, given an input x (the gray point with a dark contour in the lower plane), the SMA generates a series of output hypotheses $\{\hat{h}_k\}$, one from each specialized function. [sent-139, score-0.543]

69 5.2 Bayesian Inference: Note that in many cases, there may be no need to provide only a point estimate in terms of a most likely output h. [sent-143, score-0.079]

70 In fact we could instead use the whole distribution found in the inference process. [sent-144, score-0.127]

71 We can show that, using the above choices for K, we can respectively obtain the following. [sent-145, score-0.031]

72 $p(h \mid x) = \frac{1}{S} \sum_{s=1}^{S} \mathcal{N}(x; \zeta(h_s), \Sigma_\zeta)$ (11) and $p(h \mid x) = \mathcal{N}(h; h_s, \Sigma_{spl}) \sum_{s=1}^{S} \mathcal{N}(x; \zeta(h), \Sigma_\zeta)$ (12). 6 Experiments: The described architecture was tested using a computer graphics rendering as our inverse function $\zeta$. [sent-146, score-0.338]
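
For the Bayesian variant, rather than keeping only the best hypothesis, one can keep all samples with weights proportional to $p(x \mid \zeta(h_s))$, giving a discrete approximation to $p(h \mid x)$ in the spirit of Eq. (11); the isotropic covariance below is again an assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_over_samples(x, samples, zeta, sigma_zeta=1.0):
    """Discrete approximation to p(h | x): one weight per pose sample h_s,
    proportional to N(x; zeta(h_s), Sigma_zeta), normalized to sum to one."""
    cov = sigma_zeta * np.eye(len(x))
    w = np.array([multivariate_normal.pdf(x, mean=zeta(h), cov=cov) for h in samples])
    return w / (w.sum() + 1e-300)
```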

73 7,000 frames of human body poses obtained through motion capture. [sent-148, score-0.786]

74 The output consisted of 20 2D marker positions (i.e. [sent-149, score-0.147]

75 , 3D markers projected to the image plane using a perspective model) but linearly encoded by 8 real values using Principal Component Analysis (PCA). [sent-151, score-0.094]

76 The input (visual features) consisted of 7 real-valued Hu moments computed on synthetically generated silhouettes of the articulated figure. [sent-152, score-0.284]
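
As a rough illustration of the feature pipeline described here, the 7 Hu moments of a binary silhouette can be computed with OpenCV as below; the log-magnitude scaling is a common normalization and an assumption, not necessarily the paper's exact choice.

```python
import cv2
import numpy as np

def silhouette_hu_moments(mask):
    """Compute the 7 Hu moments of a binary silhouette image.

    mask: 2D uint8 array, nonzero on the figure. Returns a length-7 feature vector.
    The signed-log scaling below is a common normalization (an assumption here).
    """
    m = cv2.moments(mask.astype(np.uint8), binaryImage=True)
    hu = cv2.HuMoments(m).flatten()                      # 7 raw Hu invariants
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)   # compress dynamic range

# example on a toy silhouette
silhouette = np.zeros((64, 64), dtype=np.uint8)
silhouette[20:50, 25:40] = 1
features = silhouette_hu_moments(silhouette)             # shape (7,)
```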

77 For training/testing we generated 120,000 data points: our 3D poses from motion capture were projected to 16 views along the view-sphere equator. [sent-153, score-0.238]

78 The only free parameter in this test, related to the given SMA, was the number of specialized functions used; this was set to 15. [sent-155, score-0.508]

79 Due to space limitations, in this paper we show results using the mean-output inference algorithm only; readers are referred to http://cs-people. [sent-157, score-0.127]

80 Fig. 2 (left) shows the reconstruction obtained in several single images coming from three different artificial sequences. [sent-161, score-0.131]

81 The agreement between reconstruction and observation is easy to perceive for all sequences. [sent-162, score-0.129]

82 Note that for self-occluding configurations, reconstruction is harder, but still the estimate is close to ground-truth. [sent-163, score-0.103]

83 No human intervention nor pose initialization was required. [sent-164, score-0.368]

84 Fig. 2 (right) shows the average marker error and variance per body orientation as a percentage of body height. [sent-166, score-0.966]

85 This intuitively agrees with the notion that at those angles (side views), there is less visibility of the body parts. [sent-168, score-0.447]

86 By choosing poses at random from the training set, the RMSE was 17% of body height. [sent-170, score-0.522]
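
The error metric reported here, average marker error as a percentage of body height, can be computed roughly as in this sketch; approximating body height by the vertical extent of the ground-truth markers is an assumption, since the summary does not say how height was measured.

```python
import numpy as np

def marker_error_percent_height(pred, gt):
    """Mean 2D marker error as a percentage of body height.

    pred, gt: (F, 20, 2) arrays of predicted / ground-truth marker positions
    for F frames. Body height per frame is approximated by the vertical extent
    of the ground-truth markers (an assumption for illustration).
    """
    err = np.linalg.norm(pred - gt, axis=-1).mean(axis=1)        # (F,) mean marker error
    height = gt[..., 1].max(axis=1) - gt[..., 1].min(axis=1)     # (F,) vertical extent
    return 100.0 * (err / np.maximum(height, 1e-9)).mean()
```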

87 Figure 2: Left: Example reconstruction of several test sequences with CG-generated silhouettes. [sent-181, score-0.131]

88 Each set consists of input images and reconstruction (every 5th frame). [sent-182, score-0.211]

89 Fig. 3 shows examples of system performance with real segmented visual data, obtained from observing a human subject. [sent-190, score-0.243]

90 Note that even though the characteristics of the segmented body differ from the ones used for training, good performance is still achieved. [sent-192, score-0.464]

91 Most reconstructions are visually close to what can be thought of as the right pose reconstruction. [sent-193, score-0.172]

92 A learning algorithm was developed for this architecture using ideas from ML estimation and latent variable models. [sent-196, score-0.108]

93 Inference was based on the possibility of alternately using different sets of conditional independence assumptions specified by the forward and inverse models. [sent-197, score-0.326]

94 The incorporation of the inverse function in the model allows for simpler forward models. [sent-198, score-0.251]

95 For example, the inverse function is an architectural alternative to the gating networks of Mixtures of Experts [11]. [sent-199, score-0.156]

96 SMA advantages for body pose estimation include: no iterative methods for inference are used. Figure 3: Reconstruction obtained from observing a human subject (every 10th frame). [sent-200, score-0.955]

97 Bayesian reconstruction of 3d human motion from single-camera video. [sent-243, score-0.366]

98 Visual perception of biological motion and a model for its analysis. [sent-252, score-0.133]

99 A view of the em algorithm that justifies incremental , sparse, and other variants. [sent-264, score-0.051]

100 Specialized mappings and the estimation of body pose from a single image. [sent-284, score-0.701]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('specialized', 0.457), ('body', 0.423), ('sma', 0.307), ('xlh', 0.195), ('pose', 0.172), ('hlx', 0.167), ('inverse', 0.156), ('tracking', 0.143), ('yi', 0.133), ('motion', 0.133), ('human', 0.13), ('inference', 0.127), ('articulated', 0.121), ('reconstruction', 0.103), ('forward', 0.095), ('hs', 0.092), ('hly', 0.084), ('motions', 0.084), ('silhouettes', 0.084), ('map', 0.077), ('poses', 0.07), ('finding', 0.067), ('marker', 0.066), ('initialization', 0.066), ('maps', 0.064), ('supervised', 0.062), ('graphics', 0.062), ('mapping', 0.059), ('bk', 0.058), ('cvpr', 0.058), ('architecture', 0.057), ('boston', 0.056), ('argmaxhp', 0.056), ('lvi', 0.056), ('lyp', 0.056), ('rosales', 0.056), ('sclaroff', 0.056), ('spl', 0.056), ('mappings', 0.055), ('frame', 0.054), ('em', 0.051), ('functions', 0.051), ('estimation', 0.051), ('argmaxp', 0.048), ('domains', 0.047), ('ak', 0.047), ('independence', 0.046), ('output', 0.046), ('visual', 0.045), ('specific', 0.045), ('input', 0.044), ('kinematics', 0.044), ('alternating', 0.043), ('monocular', 0.041), ('segmented', 0.041), ('walking', 0.041), ('learned', 0.041), ('rendering', 0.037), ('manual', 0.037), ('access', 0.036), ('consists', 0.036), ('projected', 0.035), ('consisted', 0.035), ('zi', 0.034), ('features', 0.034), ('likely', 0.033), ('image', 0.033), ('closed', 0.032), ('choices', 0.031), ('humans', 0.031), ('video', 0.031), ('jordan', 0.03), ('approximate', 0.03), ('frames', 0.03), ('experts', 0.03), ('contour', 0.03), ('conditional', 0.029), ('percentage', 0.029), ('training', 0.029), ('incorporated', 0.028), ('sequences', 0.028), ('images', 0.028), ('observing', 0.027), ('algorithmic', 0.027), ('quantitative', 0.027), ('plane', 0.026), ('observation', 0.026), ('computer', 0.026), ('trying', 0.025), ('relatively', 0.025), ('iterative', 0.025), ('learn', 0.025), ('approaches', 0.025), ('orientation', 0.025), ('formulate', 0.025), ('dynamics', 0.024), ('visibility', 0.024), ('asking', 0.024), ('illustrating', 0.024), ('sallans', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 108 nips-2001-Learning Body Pose via Specialized Maps

Author: Rómer Rosales, Stan Sclaroff

Abstract: A nonlinear supervised learning model, the Specialized Mappings Architecture (SMA), is described and applied to the estimation of human body pose from monocular images. The SMA consists of several specialized forward mapping functions and an inverse mapping function. Each specialized function maps certain domains of the input space (image features) onto the output space (body pose parameters). The key algorithmic problems faced are those of learning the specialized domains and mapping functions in an optimal way, as well as performing inference given inputs and knowledge of the inverse function. Solutions to these problems employ the EM algorithm and alternating choices of conditional independence assumptions. Performance of the approach is evaluated with synthetic and real video sequences of human motion. 1

2 0.1593339 193 nips-2001-Unsupervised Learning of Human Motion Models

Author: Yang Song, Luis Goncalves, Pietro Perona

Abstract: This paper presents an unsupervised learning algorithm that can derive the probabilistic dependence structure of parts of an object (a moving human body in our examples) automatically from unlabeled data. The distinguished part of this work is that it is based on unlabeled data, i.e., the training features include both useful foreground parts and background clutter and the correspondence between the parts and detected features are unknown. We use decomposable triangulated graphs to depict the probabilistic independence of parts, but the unsupervised technique is not limited to this type of graph. In the new approach, labeling of the data (part assignments) is taken as hidden variables and the EM algorithm is applied. A greedy algorithm is developed to select parts and to search for the optimal structure based on the differential entropy of these variables. The success of our algorithm is demonstrated by applying it to generate models of human motion automatically from unlabeled real image sequences.

3 0.073092803 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex

Author: Yun Gao, Michael J. Black, Elie Bienenstock, Shy Shoham, John P. Donoghue

Abstract: Statistical learning and probabilistic inference techniques are used to infer the hand position of a subject from multi-electrode recordings of neural activity in motor cortex. First, an array of electrodes provides training data of neural firing conditioned on hand kinematics. We learn a nonparametric representation of this firing activity using a Bayesian model and rigorously compare it with previous models using cross-validation. Second, we infer a posterior probability distribution over hand motion conditioned on a sequence of neural test data using Bayesian inference. The learned firing models of multiple cells are used to define a nonGaussian likelihood term which is combined with a prior probability for the kinematics. A particle filtering method is used to represent, update, and propagate the posterior distribution over time. The approach is compared with traditional linear filtering methods; the results suggest that it may be appropriate for neural prosthetic applications.

4 0.072558507 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

Author: Brendan J. Frey, Anitha Kannan, Nebojsa Jojic

Abstract: Factor analysis and principal components analysis can be used to model linear relationships between observed variables and linearly map high-dimensional data to a lower-dimensional hidden space. In factor analysis, the observations are modeled as a linear combination of normally distributed hidden variables. We describe a nonlinear generalization of factor analysis , called

5 0.066504732 84 nips-2001-Global Coordination of Local Linear Models

Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton

Abstract: High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. 1 Manifold Learning Consider an ensemble of images, each of which contains a face against a neutral background. Each image can be represented by a point in the high dimensional vector space of pixel intensities. This representation, however, does not exploit the strong correlations between pixels of the same image, nor does it support many useful operations for reasoning about faces. If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions. To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. Given a setting of the levers and dials, the box generates an image of a face. Given an image of a face, the box deduces the appropriate setting of the levers and dials. In this paper, we describe a fairly general way to construct such a box automatically from an ensemble of high-dimensional vectors. We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections). 2 Local Linear Models The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear. Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood[3, 6, 8]. For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA)[5]. 
The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. MFAs parameterize a joint distribution over observed and hidden variables: (1) where the observed variable represents the high dimensional data; the discrete hidden variable indexes different neighborhoods on the manifold; and the continuous hidden variables represent low dimensional local coordinates. The model assumes that data is sampled from different neighborhoods on the manifold with prior probabilities, and that within each neighborhood, the data's local coordinates are normally distributed.

6 0.064830825 46 nips-2001-Categorization by Learning and Combining Object Parts

7 0.063785784 163 nips-2001-Risk Sensitive Particle Filters

8 0.062771358 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

9 0.062363908 43 nips-2001-Bayesian time series classification

10 0.056163784 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models

11 0.055778459 190 nips-2001-Thin Junction Trees

12 0.055266794 54 nips-2001-Contextual Modulation of Target Saliency

13 0.05442534 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

14 0.052716203 115 nips-2001-Linear-time inference in Hierarchical HMMs

15 0.051966578 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates

16 0.051824577 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

17 0.050794587 3 nips-2001-ACh, Uncertainty, and Cortical Inference

18 0.05016087 179 nips-2001-Tempo tracking and rhythm quantization by sequential Monte Carlo

19 0.048832659 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

20 0.047198638 170 nips-2001-Spectral Kernel Methods for Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.187), (1, -0.027), (2, -0.027), (3, -0.048), (4, -0.099), (5, -0.053), (6, -0.065), (7, 0.043), (8, 0.017), (9, 0.009), (10, 0.034), (11, 0.037), (12, 0.075), (13, -0.057), (14, 0.012), (15, -0.016), (16, -0.036), (17, -0.054), (18, -0.04), (19, 0.06), (20, -0.047), (21, 0.049), (22, -0.058), (23, -0.008), (24, -0.043), (25, -0.023), (26, 0.087), (27, -0.007), (28, 0.046), (29, -0.051), (30, 0.128), (31, 0.04), (32, 0.112), (33, -0.061), (34, 0.078), (35, 0.064), (36, -0.038), (37, -0.019), (38, -0.089), (39, 0.034), (40, -0.039), (41, 0.057), (42, 0.086), (43, -0.162), (44, 0.081), (45, -0.215), (46, -0.058), (47, -0.189), (48, -0.036), (49, 0.129)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93880659 108 nips-2001-Learning Body Pose via Specialized Maps

Author: Rómer Rosales, Stan Sclaroff

Abstract: A nonlinear supervised learning model, the Specialized Mappings Architecture (SMA), is described and applied to the estimation of human body pose from monocular images. The SMA consists of several specialized forward mapping functions and an inverse mapping function. Each specialized function maps certain domains of the input space (image features) onto the output space (body pose parameters). The key algorithmic problems faced are those of learning the specialized domains and mapping functions in an optimal way, as well as performing inference given inputs and knowledge of the inverse function. Solutions to these problems employ the EM algorithm and alternating choices of conditional independence assumptions. Performance of the approach is evaluated with synthetic and real video sequences of human motion. 1

2 0.79651976 193 nips-2001-Unsupervised Learning of Human Motion Models

Author: Yang Song, Luis Goncalves, Pietro Perona

Abstract: This paper presents an unsupervised learning algorithm that can derive the probabilistic dependence structure of parts of an object (a moving human body in our examples) automatically from unlabeled data. The distinguished part of this work is that it is based on unlabeled data, i.e., the training features include both useful foreground parts and background clutter and the correspondence between the parts and detected features are unknown. We use decomposable triangulated graphs to depict the probabilistic independence of parts, but the unsupervised technique is not limited to this type of graph. In the new approach, labeling of the data (part assignments) is taken as hidden variables and the EM algorithm is applied. A greedy algorithm is developed to select parts and to search for the optimal structure based on the differential entropy of these variables. The success of our algorithm is demonstrated by applying it to generate models of human motion automatically from unlabeled real image sequences.

3 0.59132093 158 nips-2001-Receptive field structure of flow detectors for heading perception

Author: J. A. Beintema, M. Lappe, Alexander C. Berg

Abstract: Observer translation relative to the world creates image flow that expands from the observer's direction of translation (heading) from which the observer can recover heading direction. Yet, the image flow is often more complex, depending on rotation of the eye, scene layout and translation velocity. A number of models [1-4] have been proposed on how the human visual system extracts heading from flow in a neurophysiologically plausible way. These models represent heading by a set of neurons that respond to large image flow patterns and receive input from motion sensed at different image locations. We analysed these models to determine the exact receptive field of these heading detectors. We find most models predict that, contrary to widespread belief, the contributing motion sensors have a preferred motion directed circularly rather than radially around the detector's preferred heading. Moreover, the results suggest looking for more refined structure within the circular flow, such as bi-circularity or local motion-opponency.

4 0.4152315 190 nips-2001-Thin Junction Trees

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.

5 0.39923388 91 nips-2001-Improvisation and Learning

Author: Judy A. Franklin

Abstract: This article presents a 2-phase computational learning model and application. As a demonstration, a system has been built, called CHIME for Computer Human Interacting Musical Entity. In phase 1 of training, recurrent back-propagation trains the machine to reproduce 3 jazz melodies. The recurrent network is expanded and is further trained in phase 2 with a reinforcement learning algorithm and a critique produced by a set of basic rules for jazz improvisation. After each phase CHIME can interactively improvise with a human in real time. 1 Foundations Jazz improvisation is the creation of a jazz melody in real time. Charlie Parker, Dizzy Gillespie, Miles Davis, John Coltrane, Charles Mingus, Thelonious Monk, and Sonny Rollins et al. were the founders of bebop and post bop jazz [9] where drummers, bassists, and pianists keep the beat and maintain harmonic structure. Other players improvise over this structure and even take turns improvising for 4 bars at a time. This is called trading fours. Meanwhile, artificial neural networks have been used in computer music [4, 12]. In particular, the work of (Todd [11]) is the basis for phase 1 of CHIME, a novice machine improvisor that learns to trade fours. Firstly, a recurrent network is trained with back-propagation to play three jazz melodies by Sonny Rollins [1], as described in Section 2. Phase 2 uses actor-critic reinforcement learning and is described in Section 3. This section is on jazz basics. 1.1 Basics: Chords, the ii-V-I Chord Progression and Scales The harmonic structure mentioned above is a series of chords that may be reprated and that are often grouped into standard subsequences. A chord is a group of notes played simultaneously. In the chromatic scale, C-Db-D-Eb-E-F-Gb-G-Ab-A-Bb-B-C, notes are separated by a half step. A flat (b) note is a half step below the original note; a sharp (#) is a half above. Two half steps are a whole step. Two whole steps are a major third. Three half steps are a minor third. A major triad (chord) is the first or tonic note, then the note a major third up, then the note a minor third up. When F is the tonic, F major triad is F-A-C. A minor triad (chord) is the tonic ¡ www.cs.smith.edu/˜jfrankli then a minor third, then a major third. F minor triad is F-Ab-C. The diminished triad is the tonic, then a minor third, then a minor third. F diminished triad is F-Ab-Cb. An augmented triad is the tonic, then a major third, then a major third. The F augmented triad is F-A-Db. A third added to the top of a triad forms a seventh chord. A major triad plus a major third is the major seventh chord. F-A-C-E is the F major seventh chord (Fmaj7). A minor triad plus a minor third is a minor seventh chord. For F it is F-Ab-C-Eb (Fm7). A major triad plus a minor third is a dominant seventh chord. For F it is F-A-C-Eb (F7). These three types of chords are used heavily in jazz harmony. Notice that each note in the chromatic scales can be the tonic note for any of these types of chords. A scale, a subset of the chromatic scale, is characterized by note intervals. Let W be a whole step and H be a half. The chromatic scale is HHHHHHHHHHHH. The major scale or ionian mode is WWHWWWH. F major scale is F-G-A-Bb-C-D-E-F. The notes in a scale are degrees; E is the seventh degree of F major. The first, third, fifth, and seventh notes of a major scale are the major seventh chord. The first, third, fifth, and seventh notes of other modes produce the minor seventh and dominant seventh chords. 
Roman numerals represent scale degrees and their seventh chords. Upper case implies major or dominant seventh and lower case implies minor seventh [9]. The major seventh chord starting at the scale tonic is the I (one) chord. G is the second degree of F major, and G-Bb-D-F is Gm7, the ii chord, with respect to F. The ii-V-I progression is prevalent in jazz [9], and for F it is Gm7-C7-Fmaj7. The minor ii-V-i progression is obtained using diminished and augmented triads, their seventh chords, and the aeolian mode. Seventh chords can be extended by adding major or minor thirds, e.g. Fmaj9, Fmaj11, Fmaj13, Gm9, Gm11, and Gm13. Any extension can be raised or lowered by 1 step [9] to obtain, e.g. Fmaj7#11, C7#9, C7b9, C7#11. Most jazz compositions are either the 12 bar blues or sectional forms (e.g. ABAB, ABAC, or AABA) [8]. The 3 Rollins songs are 12 bar blues. “Blue 7” has a simple blues form. In “Solid” and “Tenor Madness”, Rollins adds bebop variations to the blues form [1]. ii-V-I and VI-II-V-I progressions are added and G7+9 substitutes for the VI and F7+9 for the V (see section 1.2 below); the II-V in the last bar provides the turnaround to the I of the first bar to foster smooth repetition of the form. The result is at left and in Roman numeral notation Bb7 Bb7 Bb7 Bb7 I I I I Eb7 Eb7 Bb7 G7+9 IV IV I VI at right: Cm7 F7 Bb7 G7+9 C7 F7+9 ii V I VI II V 1.2 Scale Substitutions and Rules for Reinforcement Learning First note that the theory and rules derived in this subsection are used in Phase 2, to be described in Section 3. They are presented here since they derive from the jazz basics immediately preceding. One way a novice improvisor can play is to associate one scale with each chord and choose notes from that scale when the chord is presented in the musical score. Therefore, Rule 1 is that an improvisor may choose notes from a “standard” scale associated with a chord. Next, the 4th degree of the scale is often avoided on a major or dominant seventh chord (Rule 3), unless the player can resolve its dissonance. The major 7th is an avoid note on a dominant seventh chord (Rule 4) since a dominant seventh chord and its scale contain the flat 7th, not the major 7th. Rule 2 contains many notes that can be added. A brief rationale is given next. The C7 in Gm7-C7-Fmaj7 may be replaced by a C7#11, a C7+ chord, or a C7b9b5 or C7alt chord [9]. The scales for C7+ and C7#11 make available the raised fourth (flat 5), and flat 6 (flat 13) for improvising. The C7b9b5 and C7alt (C7+9) chords and their scales make available the flat9, raised 9, flat5 and raised 5 [1]. These substitutions provide the notes of Rule 2. These rules (used in phase 2) are stated below, using for reinforcement values very bad (-1.0), bad (-0.5), a little bad (-0.25), ok (0.25), good (0.5), and very good (1.0). The rules are discussed further in Section 4. The Rule Set: 1) Any note in the scale associated with the chord is ok (except as noted in rule 3). 2) On a dominant seventh, hip notes 9, flat9, #9, #11, 13 or flat13 are very good. One hip note 2 times in a row is a little bad. 2 hip notes more than 2 times in a row is a little bad. 3) If the chord is a dominant seventh chord, a natural 4th note is bad. 4) If the chord is a dominant seventh chord, a natural 7th is very bad. 5) A rest is good unless it is held for more than 2 16th notes and then it is very bad. 6) Any note played longer than 1 beat (4 16th notes) is very bad. 7) If two consecutive notes match the human’s, that is good. 
2 CHIME Phase 1 In Phase 1, supervised learning is used to train a recurrent network to reproduce the three Sonny Rollins melodies. 2.1 Network Details and Training The recurrent network’s output units are linear. The hidden units are nonlinear (logistic function). Todd [11] used a Jordan recurrent network [6] for classical melody learning and generation. In CHIME, a Jordan net is also used, with the addition of the chord as input (Figure 1. 24 of the 26 outputs are notes (2 chromatic octaves), the 25th is a rest, and the 26th indicates a new note. The output with the highest value above a threshold is the next note, including the rest output. The new note output indicates if this is a new note, or if it is the same note being held for another time step ( note resolution). ¥£ ¡ ¦¤¢  The 12 chord inputs (12 notes in a chromatic scale), are 1 or 0. A chord is represented as its first, third, fifth, and seventh notes and it “wraps around” within the 12 inputs. E.g., the Fm7 chord F-Ab-C-Eb is represented as C, Eb, F, Ab or 100101001000. One plan input per song enables distinguishing between songs. The 26 context inputs use eligibility traces, giving the hidden units a decaying history of notes played. CHIME (as did Todd) uses teacher forcing [13], wherein the target outputs for the previous step are used as inputs (so erroneous outputs are not used as inputs). Todd used from 8 to 15 hidden units; CHIME uses 50. The learning rate is 0.075 (Todd used 0.05). The eligibility rate is 0.9 (Todd used 0.8). Differences in values perhaps reflect contrasting styles of the songs and available computing power. Todd used 15 output units and assumed a rest when all note units are “turned off.” CHIME uses 24 output note units (2 octaves). Long rests in the Rollins tunes require a dedicated output unit for a rest. Without it, the note outputs learned to turn off all the time. Below are results of four representative experiments. In all experiments, 15,000 presentations of the songs were made. Each song has 192 16th note events. All songs are played at a fixed tempo. Weights are initialized to small random values. The squared error is the average squared error over one complete presentation of the song. “Finessing” the network may improve these values. The songs are easily recognized however, and an exact match could impair the network’s ability to improvise. Figure 2 shows the results for “Solid.” Experiment 1. Song: Blue Seven. Squared error starts at 185, decreases to 2.67. Experiment 2. Song: Tenor Madness. Squared error starts at 218, decreases to 1.05. Experiment 3. Song: Solid. Squared error starts at 184, decreases to 3.77. Experiment 4. Song: All three songs: Squared error starts at 185, decreases to 36. Figure 1: Jordan recurrent net with addition of chord input 2.2 Phase 1 Human Computer Interaction in Real Time In trading fours with the trained network, human note events are brought in via the MIDI interface [7]. Four bars of human notes are recorded then given, one note event at a time to the context inputs (replacing the recurrent inputs). The plan inputs are all 1. The chord inputs follow the “Solid” form. The machine generates its four bars and they are played in real time. Then the human plays again, etc. An accompaniment (drums, bass, and piano), produced by Band-in-a-Box software (PG Music), keeps the beat and provides chords for the human. Figure 3 shows an interaction. The machine’s improvisations are in the second and fourth lines. In bar 5 the flat 9 of the Eb7 appears; the E. 
This note is used on the Eb7 and Bb7 chords by Rollins in “Blue 7”, as a “passing tone.” D is played in bar 5 on the Eb7. D is the natural 7 over Eb7 (with its flat 7) but is a note that Rollins uses heavily in all three songs, and once over the Eb7. It may be a response to the rest and the Bb played by the human in bar 1. D follows both a rest and a Bb in many places in “Tenor Madness” and “Solid.” In bar 6, the long G and the Ab (the third then fourth of Eb7) figure prominently in “Solid.” At the beginning of bar 7 is the 2-note sequence Ab-E that appears in exactly the same place in the song “Blue 7.” The focus of bars 7 and 8 is jumping between the 3rd and 4th of Bb7. At the end of bar 8 the machine plays the flat 9 (Ab) then the flat 3 (Bb), of G7+9. In bars 13-16 the tones are longer, as are the human’s in bars 9-12. The tones are the 5th, the root, the 3rd, the root, the flat 7, the 3rd, the 7th, and the raised fourth. Except for the last 2, these are chord tones. 3 CHIME Phase 2 In Phase 2, the network is expanded and trained by reinforcement learning to improvise according to the rules of Section 1.2 and using its knowledge of the Sonny Rollins songs. 3.1 The Expanded Network Figure 4 shows the phase 2 network. The same inputs plus 26 human inputs brings the total to 68. The weights obtained in phase 1 initialize this network. The plan and chord weights Figure 2: At left “Solid” played by a human; at right the song reproduced by the ANN. are the same. The weights connecting context units to the hidden layer are halved. The same weights, halved, connect the 26 human inputs to the hidden layer. Each output unit gets the 100 hidden units’ outputs as input. The original 50 weights are halved and used as initial values of the two sets of 50 hidden unit weights to the output unit. 3.2 SSR and Critic Algorithms Using actor-critic reinforcement learning ([2, 10, 13]), the actor chooses the next note to play. The critic receives a “raw” reinforcement signal from the critique made by the . A rules of Section 1.2. For output j, the SSR (actor) computes mean Gaussian distribution with mean and standard deviation chooses the output . is generated, the critic modifies and produces . is further modified by a self-scaling algorithm that tracks, via moving average, the maximum and minimum reinforcement and uses them to scale the signal to produce .

6 0.39879566 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

7 0.39680398 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

8 0.38871729 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network

9 0.37859148 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task

10 0.37714291 151 nips-2001-Probabilistic principles in unsupervised learning of visual structure: human data and a model

11 0.36021739 154 nips-2001-Products of Gaussians

12 0.35857585 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering

13 0.35846156 17 nips-2001-A Quantitative Model of Counterfactual Reasoning

14 0.35119531 78 nips-2001-Fragment Completion in Humans and Machines

15 0.35027808 177 nips-2001-Switch Packet Arbitration via Queue-Learning

16 0.33591196 84 nips-2001-Global Coordination of Local Linear Models

17 0.32954299 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images

18 0.32592991 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks

19 0.32533973 114 nips-2001-Learning from Infinite Data in Finite Time

20 0.32312459 68 nips-2001-Entropy and Inference, Revisited


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.019), (17, 0.029), (19, 0.026), (27, 0.093), (30, 0.076), (36, 0.011), (38, 0.018), (59, 0.429), (72, 0.047), (79, 0.035), (83, 0.016), (91, 0.122)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90477449 168 nips-2001-Sequential Noise Compensation by Sequential Monte Carlo Method

Author: K. Yao, S. Nakamura

Abstract: We present a sequential Monte Carlo method applied to additive noise compensation for robust speech recognition in time-varying noise. The method generates a set of samples according to the prior distribution given by clean speech models and noise prior evolved from previous estimation. An explicit model representing noise effects on speech features is used, so that an extended Kalman filter is constructed for each sample, generating the updated continuous state estimate as the estimation of the noise parameter, and prediction likelihood for weighting each sample. Minimum mean square error (MMSE) inference of the time-varying noise parameter is carried out over these samples by fusion the estimation of samples according to their weights. A residual resampling selection step and a Metropolis-Hastings smoothing step are used to improve calculation efficiency. Experiments were conducted on speech recognition in simulated non-stationary noises, where noise power changed artificially, and highly non-stationary Machinegun noise. In all the experiments carried out, we observed that the method can have significant recognition performance improvement, over that achieved by noise compensation with stationary noise assumption. 1

same-paper 2 0.87094605 108 nips-2001-Learning Body Pose via Specialized Maps

Author: Rómer Rosales, Stan Sclaroff

Abstract: A nonlinear supervised learning model, the Specialized Mappings Architecture (SMA), is described and applied to the estimation of human body pose from monocular images. The SMA consists of several specialized forward mapping functions and an inverse mapping function. Each specialized function maps certain domains of the input space (image features) onto the output space (body pose parameters). The key algorithmic problems faced are those of learning the specialized domains and mapping functions in an optimal way, as well as performing inference given inputs and knowledge of the inverse function. Solutions to these problems employ the EM algorithm and alternating choices of conditional independence assumptions. Performance of the approach is evaluated with synthetic and real video sequences of human motion. 1

3 0.85086364 164 nips-2001-Sampling Techniques for Kernel Methods

Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf

Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like in expectation.

4 0.81997544 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

Author: Lehel Csató, Manfred Opper, Ole Winther

Abstract: The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka’s expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.

5 0.56805873 74 nips-2001-Face Recognition Using Kernel Methods

Author: Ming-Hsuan Yang

Abstract: Principal Component Analysis and Fisher Linear Discriminant methods have demonstrated their success in face detection, recognition, and tracking. The representation in these subspace methods is based on second order statistics of the image set, and does not address higher order statistical dependencies such as the relationships among three or more pixels. Recently Higher Order Statistics and Independent Component Analysis (ICA) have been used as informative low dimensional representations for visual recognition. In this paper, we investigate the use of Kernel Principal Component Analysis and Kernel Fisher Linear Discriminant for learning low dimensional representations for face recognition, which we call Kernel Eigenface and Kernel Fisherface methods. While Eigenface and Fisherface methods aim to find projection directions based on the second order correlation of samples, Kernel Eigenface and Kernel Fisherface methods provide generalizations which take higher order correlations into account. We compare the performance of kernel methods with Eigenface, Fisherface and ICA-based methods for face recognition with variation in pose, scale, lighting and expression. Experimental results show that kernel methods provide better representations and achieve lower error rates for face recognition. 1 Motivation and Approach Subspace methods have been applied successfully in numerous visual recognition tasks such as face localization, face recognition, 3D object recognition, and tracking. In particular, Principal Component Analysis (PCA) [20] [13] ,and Fisher Linear Discriminant (FLD) methods [6] have been applied to face recognition with impressive results. While PCA aims to extract a subspace in which the variance is maximized (or the reconstruction error is minimized), some unwanted variations (due to lighting, facial expressions, viewing points, etc.) may be retained (See [8] for examples). It has been observed that in face recognition the variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to the changes in face identity [1]. Therefore, while the PCA projections are optimal in a correlation sense (or for reconstruction

6 0.56758541 102 nips-2001-KLD-Sampling: Adaptive Particle Filters

7 0.55402923 154 nips-2001-Products of Gaussians

8 0.54775518 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

9 0.53733057 71 nips-2001-Estimating the Reliability of ICA Projections

10 0.53526282 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition

11 0.52951294 179 nips-2001-Tempo tracking and rhythm quantization by sequential Monte Carlo

12 0.52647138 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources

13 0.5224033 155 nips-2001-Quantizing Density Estimators

14 0.51459897 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

15 0.5141626 84 nips-2001-Global Coordination of Local Linear Models

16 0.5091182 46 nips-2001-Categorization by Learning and Combining Object Parts

17 0.50905579 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

18 0.50854236 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation

19 0.5072161 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

20 0.5055213 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation